My hackweek project was influenced by a confluence of issues:
i) For some time now we have known that the VBA project structure has some influence on OLE controls stored in a Microsoft document. Note: it seems the VBA project is created in this scenario regardless of whether any macro code is inserted. When exporting documents with such controls we can run into problems, ranging from controls that are unusable to worrying and annoying nag dialogs (warning that “Controls may not be activated” or some such thing). Typically this isn't an issue when roundtripping (e.g. reading in a Microsoft document and saving it back again in the same format). The problem is normally seen when exporting or converting a native LibreOffice format to a Microsoft format. In some cases, like PowerPoint export, we cannot export controls at all (roundtrip is the exception). Although our binary filters do read the VBA project, a fair amount of data is skipped, ignored or not read at all. It seems that to export OLE controls successfully we need to synthesise a VBA project (regardless of whether any macros are present), and to have a chance at doing that we need to understand the VBA structure more fully.
ii) There are some binary dumping tools for LibreOffice that dump the content of various document formats, but we don't have anything that dumps the VBA-related records.
iii) libgsf at one time had a program in its test suite that extracted VBA modules from Microsoft documents; however, it only worked for Excel, and libgsf isn't exactly universally available. In the absence of that VBA extractor it would be nice to be able to easily access the Module content without having to fire up LibreOffice.
iv) I wanted to learn Python (in a previous Hackweek about 2 years ago I had my first and only exposure to Python programming, and I have not used Python since).
So here is what I did:
a) I implemented the VBA compression & decompression algorithms. The need for the decompression algorithm is clear: a lot of the streams in the VBA container (including the code Modules) are compressed. But being able to compress the streams will also allow a project structure to be manually constructed and inserted into a document for testing with Excel, Word etc.
b) Wrote some simple compress.py & decompress.py command-line tools to allow standalone streams to be compressed/decompressed.
c) Wrote a vbadump.py tool. This tool searches for and locates the VBA project in a Microsoft compound OLE document. The 'dir' stream is the key information stream in a VBA project and is stored using the VBA compression algorithm mentioned above. The vbadump tool decompresses, parses and prints all the records contained in that stream. Information from the dir stream is also used to identify, locate, extract and decompress the Module streams. Currently the tool does not parse or dump any records or information for Userforms.
d) Updated and refactored 'oletool.py' (from the previous Hackweek) to split out some functionality needed by the vbadump tool, and also fixed some bugs.
e) Created a new module vbahelper.py used by vbadump.py, decompress.py & compress.py
f) In the previously mentioned Hackweek I had started a tool to inspect/modify the content of OLE compound documents (the format used for Word, Excel etc. binary docs) in a manner not unlike zip. I failed in that attempt, but the branch is still there, so for the remaining Hackweek time I tried to get to grips with it. Unfortunately I only managed to fix one minor bug.
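To give an idea of what the decompression side of item a) involves, here is a minimal sketch based on the publicly documented VBA (MS-OVBA) compression scheme: a signature byte, then chunks made of flag bytes, literal bytes and 16-bit copy tokens. It is not the code from vbahelper.py; the names `decompress` and `copy_token_params` are made up for illustration, and error handling is deliberately thin.

```python
import struct


def copy_token_params(pos_in_chunk):
    """Return (length_mask, offset_shift) for a copy token.

    The number of bits used for the offset grows with the number of
    bytes already decompressed in the current chunk (minimum 4 bits).
    """
    bit_count = max((pos_in_chunk - 1).bit_length(), 4)
    return 0xFFFF >> bit_count, 16 - bit_count


def decompress(data):
    """Decompress a VBA compressed container (illustrative sketch)."""
    if not data or data[0] != 0x01:
        raise ValueError("missing container signature byte 0x01")
    out = bytearray()
    pos = 1
    while pos + 2 <= len(data):
        (header,) = struct.unpack_from("<H", data, pos)
        if (header >> 12) & 0x7 != 0b011:
            raise ValueError("bad chunk signature")
        # Low 12 bits hold (chunk size - 3); the size includes the header.
        chunk_end = min(pos + (header & 0x0FFF) + 3, len(data))
        pos += 2
        if not header & 0x8000:
            # Raw chunk: the payload is stored uncompressed.
            out += data[pos:chunk_end]
            pos = chunk_end
            continue
        chunk_start = len(out)
        while pos < chunk_end:
            flags = data[pos]
            pos += 1
            for bit in range(8):
                if pos >= chunk_end:
                    break
                if flags & (1 << bit):
                    # Copy token: repeat bytes already written to the output.
                    (token,) = struct.unpack_from("<H", data, pos)
                    pos += 2
                    mask, shift = copy_token_params(len(out) - chunk_start)
                    length = (token & mask) + 3
                    offset = (token >> shift) + 1
                    for _ in range(length):
                        out.append(out[-offset])
                else:
                    # Literal token: one byte copied as-is.
                    out.append(data[pos])
                    pos += 1
        pos = chunk_end
    return bytes(out)


# One literal 'a' followed by a copy token that repeats it 72 times.
sample = bytes.fromhex("0103b002614500")
print(decompress(sample))  # 73 times the letter 'a'
```

The copy loop works byte by byte on purpose: a copy token may reference bytes that the same token is still producing, which is how a single literal can expand into a long run.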
bits & pieces
The git log
The files: vbadump.py, compress.py, decompress.py, vbahelper.py (note: due to dependencies these files need to be run from the mso_dumper tree)
sample output of
./vbadump.py Suse-puzzler.xls
Sample Module1 ( uncompressed VBA module to compress )
Sample use of compress/decompress:
cat Module1 | ./compress.py > Module1.deflate
cat Module1 | ./compress.py | ./decompress.py > Module1.roundtrip # and Module1.roundtrip should match Module1
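For a feel of what walking the decompressed 'dir' stream looks like, here is a rough sketch. It is not how vbadump.py is written. It assumes the generic record layout from the MS-OVBA documentation (a 2-byte id, a 4-byte size, then the data) and the known quirk that the PROJECTVERSION record (id 0x0009) states a size of 4 but is followed by 6 bytes. The helper name `iter_dir_records` and the sample bytes are invented for illustration.

```python
import struct


def iter_dir_records(stream):
    """Yield (record_id, data) pairs from a decompressed 'dir' stream."""
    pos = 0
    while pos + 6 <= len(stream):
        rec_id, size = struct.unpack_from("<HI", stream, pos)
        pos += 6
        if rec_id == 0x0009:
            # PROJECTVERSION: the size field says 4, but a 4-byte major
            # and a 2-byte minor version follow (assumption from the spec).
            size = 6
        yield rec_id, stream[pos:pos + size]
        pos += size


# Hand-built sample: PROJECTSYSKIND, PROJECTVERSION, PROJECTNAME.
sample_dir = (
    struct.pack("<HII", 0x0001, 4, 1)
    + struct.pack("<HIIH", 0x0009, 4, 0x5A3B1C2D, 7)
    + struct.pack("<HI", 0x0004, 10) + b"VBAProject"
)

for rec_id, data in iter_dir_records(sample_dir):
    print("0x%04x" % rec_id, data)
```

A real dumper would map each id to a named record and decode the payload (strings, code page, module offsets); this loop only shows the framing.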