Probably one of the least understood executable file formats is Mach-O, short for the Mach Object File Format. It was originally designed for NeXTSTEP systems and later adopted by macOS. When researching this file format we found that a lot of the documentation was misleading, and the documents we did find useful were pretty old, which prompted us to write this post. We’ve mirrored two docs (one and two) we found particularly helpful, although a little outdated. This post will not cover Universal binaries, otherwise known as Fat Mach-O binaries, which can hold multiple Mach-O objects for different CPU architectures.
While there are limited tools for investigating Mach-O files, there are some important ones you will want in your toolbox. One of the most useful tools you can use for parsing and visualizing a Mach-O file is MachOExplorer. otool is also a very useful native tool when investigating these files. Finally, bintriage is an interface to our custom debug library and our command line utility for investigating these files.
Intro / Background
The Mach-O file format on macOS started with NeXTSTEP, which was built on CMU’s Mach kernel design, and with Steve Jobs winning his way back into Apple’s good graces. Apple’s later transition to Intel processors popularized multi-architecture universal binaries, which will be the subject of a follow-up post. Out of that history came the general Mach-O, Intel 64-bit executable binary format for macOS. That Mach-O executable file is the modern executable on macOS (inside all Apps), and will be the subject of our deep dive today.
High-Level Overview of the File Structure
Mach-O files themselves are very easy to parse, as almost every field is 4 bytes or a multiple of 4 bytes in length. Only larger abstract structures within the file format have offsets and lengths which must be observed, and even fewer data structures have required alignments. If I were to draw a diagram of the file structure based on my interactions and understanding of critical components, it would look like:
The header is extremely important. It provides data such as the magic bytes, the cpu type, and the subcpu type, which indicate the architecture and exact CPU model the binary is built for. There is also the type, a 4 byte field that says whether this is an object file, a dynamic library (dylib), or an executable Mach-O file. Next are the ncmd and cmdsz fields (named ncmds and sizeofcmds in Apple’s loader.h), indicating the number of load commands and the total size of the load commands in bytes. Finally there is the flags field, which is 4 bytes of bitwise flags specifying special linker options. In the 64-bit header, a reserved 4 byte field follows the flags and ends the header.
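To make the header layout concrete, here is a minimal sketch in Python that unpacks the eight 4 byte fields of a 64-bit header. It assumes a little-endian, 64-bit Mach-O (magic 0xFEEDFACF) and uses the field names from Apple's loader.h; the function name `parse_header_64` and the small `FILE_TYPES` table are our own illustrative choices, not part of any library.

```python
import struct

MH_MAGIC_64 = 0xFEEDFACF        # magic for a 64-bit Mach-O
CPU_TYPE_X86_64 = 0x01000007    # CPU_TYPE_X86 (7) | CPU_ARCH_ABI64 (0x01000000)

# Only the three file types discussed in this post; others exist.
FILE_TYPES = {0x1: "MH_OBJECT", 0x2: "MH_EXECUTE", 0x6: "MH_DYLIB"}

# mach_header_64: eight little-endian uint32 fields, 32 bytes in total.
HEADER_64 = struct.Struct("<8I")


def parse_header_64(data):
    """Return the mach_header_64 fields of a little-endian 64-bit Mach-O."""
    if len(data) < HEADER_64.size:
        raise ValueError("need at least 32 bytes for a 64-bit Mach-O header")
    (magic, cputype, cpusubtype, filetype,
     ncmds, sizeofcmds, flags, reserved) = HEADER_64.unpack_from(data, 0)
    if magic != MH_MAGIC_64:
        raise ValueError("not a little-endian 64-bit Mach-O: magic=%#x" % magic)
    return {
        "cputype": cputype,
        "cpusubtype": cpusubtype,
        # Fall back to the raw number for file types not in our small table.
        "filetype": FILE_TYPES.get(filetype, hex(filetype)),
        "ncmds": ncmds,            # number of load commands
        "sizeofcmds": sizeofcmds,  # total size of all load commands, in bytes
        "flags": flags,
        "reserved": reserved,
    }
```

Reading the first 32 bytes of a thin 64-bit binary and passing them to `parse_header_64` should be enough to see the fields. A Fat / Universal binary starts with a different magic value and would be rejected by this sketch.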
Probably the most important part of the Mach-O file structure is the set of load commands. Our debug library parses all of the critical structures required to rebuild a Mach-O file, but there are still several load commands we don’t parse and instead copy over directly. Each load command structure varies and needs to be parsed differently based on its type. Some load commands are self-contained, such as those that tell the loader which dynamic libraries to load, whereas other load commands reference other structures of data contained within the file, such as sections and tables that get loaded into segments. These are what truly represent how the file is mapped into virtual memory.
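Every load command begins with the same two 4 byte fields, the command type and the command size, which is what lets a parser skip or copy commands it doesn't understand. The following is a minimal sketch of walking them, assuming a little-endian 64-bit file whose load commands start right after the 32 byte header. The helper name `iter_load_commands` is hypothetical, and `LC_NAMES` lists only a handful of the command types defined in Apple's loader.h.

```python
import struct

HEADER_64_SIZE = 32  # size of mach_header_64

# A few well-known load command types; many more exist.
LC_NAMES = {
    0x2: "LC_SYMTAB",
    0xB: "LC_DYSYMTAB",
    0xC: "LC_LOAD_DYLIB",
    0x19: "LC_SEGMENT_64",
}


def iter_load_commands(data):
    """Yield (cmd, file_offset, raw_bytes) for each load command."""
    # ncmds and sizeofcmds sit at byte offsets 16 and 20 of the header.
    ncmds, sizeofcmds = struct.unpack_from("<II", data, 16)
    offset = HEADER_64_SIZE
    end = HEADER_64_SIZE + sizeofcmds
    if end > len(data):
        raise ValueError("load commands run past the end of the data")
    for _ in range(ncmds):
        if offset + 8 > end:
            raise ValueError("load command header runs past sizeofcmds")
        # Every load command starts with its type and its total size.
        cmd, cmdsize = struct.unpack_from("<II", data, offset)
        if cmdsize < 8 or offset + cmdsize > end:
            raise ValueError("bad cmdsize %d at offset %d" % (cmdsize, offset))
        yield cmd, offset, data[offset:offset + cmdsize]
        offset += cmdsize
```

A caller could look each `cmd` up in `LC_NAMES`, parse the types it knows, and copy the raw bytes of everything else unchanged, which mirrors the parse-or-copy approach described above.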
Segments with Sections
Following the load commands are the segments they describe, each with its respective sections. This is where a lot of the machine executable code, variables, and pointers to various functions from program code come from. Sections from the __TEXT segment and __DATA segment will almost always be present, while other segments and sections may be optional, depending on the libraries the binary calls and how it was compiled.
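As a rough illustration of how a segment and its sections are described, the sketch below unpacks one LC_SEGMENT_64 load command and the section_64 records that immediately follow it inside the same command. It assumes little-endian data and the 72 byte and 80 byte layouts from Apple's loader.h; `parse_segment_64` and the dictionary keys it returns are names we made up for this example.

```python
import struct

LC_SEGMENT_64 = 0x19

# segment_command_64: cmd, cmdsize, segname[16], vmaddr, vmsize, fileoff,
# filesize, maxprot, initprot, nsects, flags (72 bytes).
SEGMENT_64 = struct.Struct("<II16sQQQQiiII")

# section_64: sectname[16], segname[16], addr, size, offset, align, reloff,
# nreloc, flags, reserved1, reserved2, reserved3 (80 bytes).
SECTION_64 = struct.Struct("<16s16sQQ8I")


def _name(raw):
    """Decode a fixed-width, NUL-padded 16 byte name."""
    return raw.split(b"\x00", 1)[0].decode("ascii", "replace")


def parse_segment_64(cmd_bytes):
    """Parse the raw bytes of one LC_SEGMENT_64 load command."""
    (cmd, cmdsize, segname, vmaddr, vmsize, fileoff, filesize,
     maxprot, initprot, nsects, flags) = SEGMENT_64.unpack_from(cmd_bytes, 0)
    if cmd != LC_SEGMENT_64:
        raise ValueError("not an LC_SEGMENT_64 command: %#x" % cmd)
    if SEGMENT_64.size + nsects * SECTION_64.size > min(cmdsize, len(cmd_bytes)):
        raise ValueError("section headers do not fit inside the command")
    sections = []
    for i in range(nsects):
        # Section headers are packed back to back after the segment header.
        fields = SECTION_64.unpack_from(cmd_bytes, SEGMENT_64.size + i * SECTION_64.size)
        sections.append({
            "sectname": _name(fields[0]),
            "segname": _name(fields[1]),
            "addr": fields[2],    # virtual address of the section
            "size": fields[3],    # size in bytes
            "offset": fields[4],  # file offset of the section contents
        })
    return {
        "segname": _name(segname),
        "vmaddr": vmaddr, "vmsize": vmsize,
        "fileoff": fileoff, "filesize": filesize,
        "sections": sections,
    }
```

The `fileoff`/`filesize` pair says which bytes of the file belong to the segment, while `vmaddr`/`vmsize` say where they land in virtual memory, which is the mapping the load commands exist to describe.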
Some segments are only present in certain Mach-O binary types, depending on whether the corresponding load commands are in place. An example of an optional segment would be a __DWARF segment and its sections, which are used in debugging. We will likely write more on DWARF sections in a later post as there is often similar optional debug data present in ELF files.
The __LINKEDIT segment is the final segment in most Mach-O binaries and contains a number of critical components, such as the dynamic loader info, the symbol table, the string table, the dynamic symbol table, the digital signature, and potentially even more properties. This is one of the most important segments and is unfortunately left out of a lot of documentation, but make no mistake, this segment is critical for the binaries to run.
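To show how a load command points at data stored in this segment, here is a small sketch that follows an LC_SYMTAB command to the symbol table and string table. It assumes little-endian data and the 64-bit nlist_64 entry layout from Apple's nlist.h; `read_symbols` is a hypothetical helper, and it returns only each symbol's name and value.

```python
import struct

LC_SYMTAB = 0x2

# symtab_command: cmd, cmdsize, symoff, nsyms, stroff, strsize (24 bytes).
SYMTAB_CMD = struct.Struct("<6I")

# nlist_64: n_strx (uint32), n_type (uint8), n_sect (uint8),
# n_desc (uint16), n_value (uint64) = 16 bytes.
NLIST_64 = struct.Struct("<IBBHQ")


def read_symbols(data, symtab_cmd):
    """Return [(name, value), ...] for the symbols an LC_SYMTAB points at.

    `data` is the whole file; `symtab_cmd` is the raw load command bytes.
    """
    cmd, cmdsize, symoff, nsyms, stroff, strsize = SYMTAB_CMD.unpack_from(symtab_cmd, 0)
    if cmd != LC_SYMTAB:
        raise ValueError("not an LC_SYMTAB command: %#x" % cmd)
    if symoff + nsyms * NLIST_64.size > len(data) or stroff + strsize > len(data):
        raise ValueError("symbol or string table runs past the end of the file")
    strtab = data[stroff:stroff + strsize]
    symbols = []
    for i in range(nsyms):
        n_strx, n_type, n_sect, n_desc, n_value = NLIST_64.unpack_from(
            data, symoff + i * NLIST_64.size)
        # n_strx is a byte index into the string table; names are NUL-terminated.
        end = strtab.find(b"\x00", n_strx)
        if end == -1:
            end = len(strtab)
        symbols.append((strtab[n_strx:end].decode("utf-8", "replace"), n_value))
    return symbols
```

The offsets in the command (`symoff`, `stroff`) are file offsets, and in a linked binary they normally fall inside the byte range covered by __LINKEDIT, which is one reason a rebuilt file has to keep that segment and these offsets consistent.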
Those are the major parts of the Mach-O file format as I understood them and worked with them to recreate working executable files. Stay tuned for a follow-up post covering Fat / Universal Mach-O binaries, which act as a wrapper for multiple Mach-O binaries of different architectures. I hope this information is useful to people, and if there are questions or discussions, let’s talk about them in the comments and forums! If I missed anything major please let me know, and pull requests are definitely welcome on our Binject projects!