- The paper introduces ELLF, a novel ELF-based format that embeds minimal compilation metadata to render binary disassembly decidable.
- It details a compiler-agnostic, oracle-driven methodology for recovering instruction boundaries, pointers, control-flow graphs, and memory structures with minimal overhead.
- Empirical evaluation on 200+ programs shows reliable symbolic lifting, a 27% binary size increase, and performance identical to baseline binaries.
Motivation and Problem Statement
Traditional binary executable formats, such as ELF, are foundational for software distribution but are inherently restrictive for post-compilation analysis and modification. The crux of the problem is that binary disassembly is undecidable: one cannot reliably recover instruction boundaries, control flow, or memory structure without information lost during compilation. Existing solutions like DWARF debugging information or symbol tables are insufficient, both for making disassembly decidable and for enabling robust, trustworthy binary rewriting or instrumentation. Moreover, distributing binaries with rich debug metadata poses confidentiality and security risks for closed-source vendors.
ELLF: Concept and Technical Approach
The paper introduces the Executable, Linkable, and Liftable Format (ELLF), an ELF-compatible binary format augmented with minimal compilation metadata. ELLF aims to create a distribution format that sits between stripped binaries (maximal obfuscation, minimal analyzability) and open source (maximal transparency). The ELLF format includes only the metadata strictly necessary to render the following key binary analysis problems decidable:
- Instruction boundary recovery: Deciding which bytes are instructions.
- Pointer disambiguation: Precisely identifying pointers and their targets.
- Control-flow graph (CFG) construction: Accurately recovering basic block and function boundaries.
- Stack and data memory structure recovery: Identifying the boundaries of independent stack objects and global variables.
The ELLF approach relies on "oracle" metadata, extracted solely at build time, that is minimal yet sufficient to make the above recovery tasks decidable. These oracles are:
- Instruction Oracle: Set of valid instruction entry addresses.
- Pointer Oracle: Operand-level annotation of pointers in text and data segments.
- Text Oracle: Mapping of addresses to function entry, function exit, or basic block entry.
- Stack Oracle: Fine-grained stack object boundaries per function.
- Data Oracle: Object boundaries for global data.
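The five oracles above can be pictured as a small record of lookup tables. The sketch below is illustrative only: the class name `Oracles`, the field layouts, and the helper `instruction_spans` are assumptions made for this example and do not reflect the on-disk ELLF encoding, which this summary does not specify. It shows why an Instruction Oracle turns boundary recovery into a lookup rather than a heuristic search.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Set, Tuple


class TextKind(Enum):
    """Roles the Text Oracle can assign to a code address."""
    FUNCTION_ENTRY = "function_entry"
    FUNCTION_EXIT = "function_exit"
    BLOCK_ENTRY = "block_entry"


@dataclass
class Oracles:
    """Hypothetical in-memory view of the five oracles (not the real format)."""
    # Instruction Oracle: set of valid instruction entry addresses.
    instructions: Set[int] = field(default_factory=set)
    # Pointer Oracle: location of a pointer-valued operand or data word -> target.
    pointers: Dict[int, int] = field(default_factory=dict)
    # Text Oracle: code address -> function entry, function exit, or block entry.
    text: Dict[int, TextKind] = field(default_factory=dict)
    # Stack Oracle: function entry -> list of (frame offset, size) objects.
    stack: Dict[int, List[Tuple[int, int]]] = field(default_factory=dict)
    # Data Oracle: list of (start address, size) for global objects.
    data: List[Tuple[int, int]] = field(default_factory=list)


def instruction_spans(oracles: Oracles, text_start: int, text_end: int):
    """Split [text_start, text_end) into half-open spans, one per oracle entry.

    Simplification: any padding after an instruction is attributed to that
    instruction's span, because only entry addresses are known here.
    """
    starts = sorted(a for a in oracles.instructions if text_start <= a < text_end)
    ends = starts[1:] + [text_end]
    return list(zip(starts, ends))


if __name__ == "__main__":
    demo = Oracles(instructions={0x1000, 0x1001, 0x1004})
    for start, end in instruction_spans(demo, 0x1000, 0x1008):
        print(hex(start), hex(end))
```

With the entry addresses given, no decoding guess is needed to decide where an instruction starts; a lifter only has to decode each span.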
The format and extraction pipeline are compiler/toolchain-agnostic, designed for transparent integration with standard compilation (LLVM/clang, for the prototype), and impose minimal overhead.
Implementation Details
Metadata is collected at various compilation stages using established and custom compiler/linker switches:
- Instruction boundaries and basic block ranges are derived from LLVM's -fbasic-block-address-map output.
- Relocation information is obtained directly from object files and linker map files, sidestepping fragility in the final executable.
- Stack and global variable boundaries rely partly on DWARF, but ELLF tolerates DWARF incompleteness by coalescing regions conservatively.
- Exception support (CFI and LSDA metadata) is preserved by mapping exception table references to ELLF section labels.
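The summary does not say how ELLF coalesces regions when DWARF is incomplete. One plausible reading, assumed here purely for illustration, is that object ranges which overlap are merged into a single larger object, so a rewriter never splits bytes that might belong together. The function name `coalesce` and the half-open `(start, end)` representation are hypothetical.

```python
from typing import List, Tuple


def coalesce(ranges: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Merge overlapping half-open (start, end) ranges into disjoint regions.

    Assumption: treating overlapping objects as one object is the
    conservative choice; this is not confirmed as ELLF's actual rule.
    """
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(ranges):
        if merged and start < merged[-1][1]:
            # Overlaps the previous region: extend it instead of splitting.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


if __name__ == "__main__":
    # Two overlapping stack objects collapse into one; the others stay separate.
    print(coalesce([(0, 8), (4, 16), (16, 24), (32, 40)]))
    # [(0, 16), (16, 24), (32, 40)]
```

Merging loses precision (two objects become one) but never claims a boundary that the metadata cannot justify.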
The final ELLF metadata is injected as a separate ELF section and is significantly more compact than DWARF (17% of DWARF's size on average, with a 27% overall binary size increase compared to DWARF's 158%).
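The reported size figures are mutually consistent, as a quick back-of-the-envelope check shows. The sketch assumes that the 158% and 27% figures are overheads relative to the same baseline binary and that the averages compose multiplicatively, which need not hold exactly for per-program averages. The function name is hypothetical.

```python
def implied_size_overhead(dwarf_overhead: float, metadata_ratio: float) -> float:
    """Overhead implied by metadata that is `metadata_ratio` times DWARF's size."""
    return dwarf_overhead * metadata_ratio


if __name__ == "__main__":
    # DWARF adds 158% to the binary; ELLF metadata is 17% of DWARF's size.
    overhead = implied_size_overhead(1.58, 0.17)
    print(f"{overhead:.1%}")  # 26.9%
```

The result, about 26.9%, rounds to the 27% increase quoted above.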
Empirical Evaluation
ELLF was evaluated on ~200 programs from the LLVM test suite, as well as larger real-world systems. The methodology included:
- Compiling to ELLF binaries.
- Disassembling with an ELLF-aware lifter, producing fully symbolized recompilable assembly.
- Recompiling to executable binaries and comparing functional correctness and performance with the original.
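The final comparison step can be sketched as a simple differential check. The paper's actual harness is not described in this summary, so the code below is an assumption: it treats two executables as behaviorally equivalent on one input if they produce the same exit code, standard output, and standard error. The function name `same_behavior` and the stand-in commands are hypothetical.

```python
import subprocess
import sys
from typing import Sequence


def same_behavior(original_cmd: Sequence[str],
                  recompiled_cmd: Sequence[str],
                  stdin_data: bytes = b"",
                  timeout: float = 10.0) -> bool:
    """Run both commands on the same input and compare observable results."""
    a = subprocess.run(original_cmd, input=stdin_data,
                       capture_output=True, timeout=timeout)
    b = subprocess.run(recompiled_cmd, input=stdin_data,
                       capture_output=True, timeout=timeout)
    return (a.returncode, a.stdout, a.stderr) == (b.returncode, b.stdout, b.stderr)


if __name__ == "__main__":
    # Stand-ins for an original and a lifted-and-recompiled binary.
    original = [sys.executable, "-c", "print(6 * 7)"]
    recompiled = [sys.executable, "-c", "print(42)"]
    print(same_behavior(original, recompiled))
```

A check like this only covers the inputs it is run on; timing-dependent or nondeterministic programs would need a looser comparison.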
Key empirical results:
- All programs except one idiosyncratic case (frame_layout) were lifted and recompiled without behavioral change; that exception was due to a rare linker pattern, not loss of metadata.
- The ELLF metadata increased binary size by an average of 27%, far lower than DWARF's overhead.
- The runtime performance is identical (within measurement noise) to baseline binaries; no statistically significant slowdowns or speedups were observed.
- The metadata generation increases compilation time by an average of 57%, the majority of which is spent in standard compiler/linker stages.
Comparison to Prior Work
Prevailing approaches to disassembler ground truth or binary rewriting, such as Ramblr, Egalito, or BOLT, either over-approximate (accepting unsoundness), attempt to instrument compilers directly (fragile, hard to maintain), or require the presence of full symbol/debug information. ELLF differs critically in that:
- Disassembly is rendered decidable: there is no heuristic or probabilistic inference at any lifting stage, and no soundness compromise is required.
- The format is strictly minimal, omitting any extraneous or proprietary information (e.g., source-level variable names, types).
- It is robust across optimization levels and complex binary constructs, including jump tables and exception handling.
Practical and Theoretical Implications
The implications of ELLF are significant for binary analysis, instrumentation, and secure software supply chain practices:
- For security and program analysis tools, ELLF enables robust automation of patching, instrumentation (e.g., fuzzing hooks), bounds checking, and diversification without the risk of disassembly errors or ambiguity.
- For software publishers, ELLF allows distribution of analyzable and maintainable binaries without source-level disclosure, balancing transparency and IP protection.
- In reverse engineering and decompilation, ELLF does not make recovery of high-level, source-like representations easier; the mapping from structured binary to human-readable code, type inference, variable recovery, and control flow structuring remain hard open problems.
- For platform security and reliability (e.g., as discussed within the OpenSSF SIG), ELLF provides a tractable means for ecosystem-wide improvement of updateability, maintainability, and forensic analyzability of distributed software artifacts.
Future Directions
Promising avenues for research and adoption include:
- Cross-platform extension: While ELLF was developed for ELF, the approach is directly portable to PE and Mach-O formats.
- Support for handwritten/inline assembly: While not the primary goal, future assembler enhancements could match ELLF's metadata guarantees for non-compiler-generated code regions.
- Integration with supply chain assurance and attestation frameworks, capitalizing on ELLF's balance of verifiability and confidentiality.
- Refinement of metadata compaction and handling of rare linker patterns (e.g., more thorough support for ambiguous section merging and less common switch lowering constructs).
Conclusion
This work rigorously formalizes and implements a principled middle ground between black-box binaries and open source: ELLF, an ELF-based binary format augmented with strictly sufficient metadata to render all relevant binary analysis and rewriting tasks decidable. The authors demonstrate that ELLF enables correct, complete, and efficient lifting, rewriting, and recompilation across a wide range of realistic programs, effectively closing the gap between the needs of binary analysis and the constraints of software confidentiality. This approach establishes a strong foundation for future systems in trustworthy binary distribution and automated software maintenance, impacting both research and software engineering practices going forward (2604.19628).