whole-program-llvm Improvements

whole-program-llvm is a set of scripts that I started in order to compile programs and libraries into single LLVM bitcode files. The scripts stand in for a compiler (either clang or gcc) and compile each file in a program or library twice: once as the build system intended and once as LLVM bitcode. Another script, extract-bc, can then be run over the final binary to link the LLVM bitcode files into a single large bitcode file. This is very useful for performing whole-program analysis (hence the name of the scripts).
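To make the "compile twice" idea concrete, here is a minimal sketch of how a wrapper might derive the second (bitcode) command from the one the build system issued. The function name and the .bc output naming are my own illustration, not what the scripts actually do, and the sketch assumes a simple compile-only command with a separate -o argument.

```python
import os


def bitcode_command(cmd):
    """Derive a bitcode-producing command from a compile command.

    Assumptions (hypothetical, for illustration only): cmd[0] is the
    compiler, and the output is given as a separate "-o <file>" pair.
    """
    out = list(cmd)                      # never mutate the original command
    idx = out.index("-o")                # raises ValueError if -o is absent
    base, _ = os.path.splitext(out[idx + 1])
    out[idx + 1] = base + ".bc"          # illustrative naming scheme
    out.insert(1, "-emit-llvm")          # ask clang for LLVM bitcode output
    return out


original = ["clang", "-O2", "-c", "foo.c", "-o", "foo.o"]
bitcode = bitcode_command(original)
# The wrapper would run `original` first, then `bitcode`.
```

The original command is left untouched so the build system sees exactly the object file it asked for; the bitcode compile is a side effect.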

Recently, contributions by Ian Mason and Bruno Dutertre made whole-program-llvm work on OS X. Users have requested this many times, and I just never had the time to figure out the details. whole-program-llvm uses some details of ELF executables on Linux to do its work. In particular, it stores generated bitcode files inside ELF object files as a special section during compilation. It takes this slightly odd approach to be robust against strange build systems. Some build systems move intermediate object files around and delete temporary directories, among other seemingly odd things. Keeping the bitcode files attached to actual build artifacts (i.e., object files) means that we will end up with all of the bitcode relevant to the executable being built. When all of the object files are linked together, the system linker combines all of the ELF sections of the same name from each input object file. This lets us pull out the contents of a single ELF section from the final binary and just link everything together. Unfortunately, relying on ELF details does not work on OS X, which uses Mach-O binaries instead. Ian and Bruno added code paths to the scripts to use the equivalent Mach-O tools on OS X (mostly fancy invocations of otool, I believe).
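The section trick is easier to see in a toy model. The sketch below does not touch real ELF or Mach-O files: it models an object file as a dictionary of section names to bytes, and models the linker as concatenating same-named sections in input order. The section name and the newline-terminated record format are assumptions made for the illustration; what a record actually contains is abstracted away.

```python
SECTION = ".llvm_bc"  # illustrative section name (an assumption)


def make_object(name, record):
    """Model an object file: ordinary code plus one bitcode record."""
    return {
        ".text": b"<code for " + name.encode() + b">",
        SECTION: record.encode() + b"\n",   # one newline-terminated record
    }


def link(objects):
    """Model the linker: concatenate same-named sections in input order."""
    merged = {}
    for obj in objects:
        for section, payload in obj.items():
            merged[section] = merged.get(section, b"") + payload
    return merged


def extract_records(binary):
    """Model extraction: read back every record from the merged section."""
    return binary.get(SECTION, b"").decode().splitlines()


objects = [make_object("a.o", "bitcode-for-a"), make_object("b.o", "bitcode-for-b")]
records = extract_records(link(objects))
```

Because each record travels inside its object file, it does not matter where the build system moves that object before the link: whatever ends up in the final binary brings its record along.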

Their contributions also made the scripts robust enough to compile the entire FreeBSD base system and kernel. If you ever wanted to run a static analysis over an entire OS, this is your chance. Robustness is an issue with these scripts because they must intercept compiler arguments and tweak them a bit, but, more importantly, recognize a few special arguments. Part of this is recognizing input file names to know which files to generate bitcode from. To do that, the scripts need to know which command line flags take arguments, since those arguments must not be mistaken for input files. Unfortunately, gcc-compatible compilers (like clang) support a great many flags. Not all of these flags are documented. Furthermore, these compilers are somewhat robust against nonsensical combinations of flags. While that is useful for users, it makes writing a compatible driver program difficult, since the documentation does not tell the whole story about which flags are allowed together and what they precisely mean. This comes up frequently with autotools and libtool-based build systems, which seem to throw together random combinations of compiler flags until they get a working binary. Ian and Bruno taught whole-program-llvm about enough of the compiler flags found in the wild to compile all of FreeBSD base.
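As a rough illustration of the flag problem, here is a small sketch that picks input files out of a compiler command line. The flag table is deliberately tiny and the function name is hypothetical; the real scripts handle far more flags, and this version only handles flags whose argument is a separate word (it would not, for example, know about every joined or undocumented form).

```python
import os

# A small, incomplete sample of gcc/clang flags that consume the next word.
FLAGS_WITH_ARG = {
    "-o", "-I", "-D", "-L", "-x", "-include", "-isystem",
    "-MF", "-MT", "-MQ", "-Xlinker",
}

# Extensions treated as compilable inputs in this sketch.
SOURCE_EXTS = {".c", ".cc", ".cpp", ".cxx", ".m", ".s", ".S"}


def find_input_files(args):
    """Return the arguments that look like source files."""
    inputs = []
    it = iter(args)
    for arg in it:
        if arg in FLAGS_WITH_ARG:
            next(it, None)        # skip the flag's argument, e.g. the "foo.o" in "-o foo.o"
            continue
        if arg.startswith("-"):
            continue              # any other flag, including joined forms like -Iinclude
        if os.path.splitext(arg)[1] in SOURCE_EXTS:
            inputs.append(arg)
    return inputs


args = ["-O2", "-I", "include", "-include", "config.c", "-c", "foo.c",
        "-o", "foo.o", "-MF", "foo.d"]
inputs = find_input_files(args)
```

Note the `-include config.c` pair: without the flag table, config.c would be misread as a second input file. Every flag missing from the table is a potential bug of exactly that kind, which is why teaching the scripts about flags found in the wild matters so much.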

I hope people continue to find these scripts useful – now on OS X!