Adding Compilation Metadata To Binaries To Make Disassembly Decidable

Published 21 Apr 2026 in cs.CR and cs.PL | (2604.19628v1)

Abstract: The binary executable format is the standard method for distributing and executing software. Yet, it is also as opaque a representation of software as can be. If the binary format were augmented with metadata that provides security-relevant information, such as which data is intended by the compiler to be executable instructions, or how memory regions are expected to be bounded, that would dramatically improve the safety and maintainability of software. In this paper, we propose a binary format that is a middle ground between a stripped black-box binary and open source. We provide a tool that generates metadata capturing the compiler's intent and inserts it into the binary. This metadata enables lifting to a correct and recompilable higher-level representation and makes analysis and instrumentation more reliable. Our evaluation shows that adding metadata does not affect runtime behavior or performance. Compared to DWARF, our metadata is roughly 17% of its size. We validate correctness by compiling a comprehensive set of real-world C and C++ binaries and demonstrating that they can be lifted, instrumented, and recompiled without altering their behavior.

Summary

  • The paper introduces ELLF, a novel ELF-based format that embeds minimal compilation metadata to render binary disassembly decidable.
  • It details a compiler-agnostic, oracle-driven methodology for recovering instruction boundaries, pointers, control-flow graphs, and memory structures with minimal overhead.
  • Empirical evaluation on roughly 200 programs shows reliable symbolic lifting, a 27% average binary size increase, and runtime performance indistinguishable from baseline binaries.

Decidable Disassembly via ELLF: Embedding Compilation Metadata in Binaries

Motivation and Problem Statement

Traditional binary executable formats, such as ELF, are foundational for software distribution but are inherently restrictive for post-compilation analysis and modification. The crux of the problem is that binary disassembly is undecidable: instruction boundaries, control flow, and memory structure cannot be reliably recovered once the information that defines them has been discarded during compilation. Existing sources of metadata, such as DWARF debugging information or symbol tables, do not make disassembly decidable, nor do they enable robust, trustworthy binary rewriting or instrumentation. Moreover, distributing binaries with rich debug metadata poses confidentiality and security risks for closed-source vendors.

ELLF: Concept and Technical Approach

The paper introduces the Executable, Linkable, and Liftable Format (ELLF), an ELF-compatible binary format augmented with minimal compilation metadata. ELLF aims to be a distribution format that sits between stripped binaries (maximal opacity, minimal analyzability) and open source (maximal transparency). It includes only the metadata strictly necessary to render the following key binary analysis problems decidable:

  • Instruction boundary recovery: Deciding which bytes are instructions.
  • Pointer disambiguation: Precisely identifying pointers and their targets.
  • Control-flow graph (CFG) construction: Accurately recovering basic block and function boundaries.
  • Stack and data memory structure recovery: Identifying the boundaries of independent stack objects and global variables.

The ELLF approach relies on "oracle" metadata, extracted solely at build time, that is minimal yet sufficient to make the recovery tasks above decidable. The oracles are:

  • Instruction Oracle: Set of valid instruction entry addresses.
  • Pointer Oracle: Operand-level annotation of pointers in text and data segments.
  • Text Oracle: Mapping of addresses to function entry, function exit, or basic block entry.
  • Stack Oracle: Fine-grained stack object boundaries per function.
  • Data Oracle: Object boundaries for global data.
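One way to picture the five oracles is as a small set of lookup tables that a lifter consults instead of guessing. The sketch below is purely illustrative: the class name `Oracles`, the field names, and the in-memory layout are assumptions made for this example and say nothing about how ELLF actually encodes its metadata section.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class TextKind(Enum):
    """Roles the Text Oracle can assign to a code address."""
    FUNCTION_ENTRY = "function_entry"
    FUNCTION_EXIT = "function_exit"
    BLOCK_ENTRY = "block_entry"


@dataclass
class Oracles:
    """Hypothetical in-memory view of the five oracles (not ELLF's real encoding)."""
    # Instruction Oracle: addresses at which a valid instruction starts.
    instruction_starts: set[int] = field(default_factory=set)
    # Pointer Oracle: address of a pointer-holding operand or data word -> target.
    pointers: dict[int, int] = field(default_factory=dict)
    # Text Oracle: code address -> role of that address.
    text_kinds: dict[int, TextKind] = field(default_factory=dict)
    # Stack Oracle: function entry -> list of (frame offset, size) objects.
    stack_objects: dict[int, list[tuple[int, int]]] = field(default_factory=dict)
    # Data Oracle: (start address, size) of each global object.
    data_objects: list[tuple[int, int]] = field(default_factory=list)

    def is_instruction_start(self, addr: int) -> bool:
        """Answer by lookup rather than by heuristic decoding."""
        return addr in self.instruction_starts

    def pointer_target(self, addr: int) -> Optional[int]:
        """Return the target if the value stored at addr is a pointer, else None."""
        return self.pointers.get(addr)

    def data_object_containing(self, addr: int) -> Optional[tuple[int, int]]:
        """Return the (start, size) of the global object covering addr, if any."""
        for start, size in self.data_objects:
            if start <= addr < start + size:
                return (start, size)
        return None
```

With tables like these, a question such as "is the byte at 0x1001 the start of an instruction?" becomes a set membership test, which is the sense in which the metadata turns an undecidable inference problem into a decidable lookup.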

The format and extraction pipeline are designed to be compiler- and toolchain-agnostic and to integrate transparently with a standard build; the prototype targets LLVM/clang. The added metadata has no measurable effect on runtime performance.

Implementation Details

Metadata is collected at various compilation stages using established and custom compiler/linker switches:

  • Instruction boundaries and basic block ranges are derived from LLVM's -fbasic-block-address-map output.
  • Relocation information is obtained directly from object files and linker map files, sidestepping fragility in the final executable.
  • Stack and global variable boundaries rely partly on DWARF, but ELLF tolerates DWARF incompleteness by coalescing regions conservatively.
  • Exception support (CFI and LSDA metadata) is preserved by mapping exception table references to ELLF section labels.
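The summary does not spell out how regions are coalesced when DWARF is incomplete. The sketch below shows one possible conservative policy, under stated assumptions: objects are given as non-negative `(offset, size)` pairs within a frame, overlapping objects are merged because they cannot be separated safely, and bytes that no object claims are folded into the preceding region so that every byte of the frame belongs to exactly one region. The function name `coalesce_regions` and this gap-handling rule are illustrative choices, not the paper's algorithm.

```python
def coalesce_regions(objects, frame_size):
    """Merge possibly incomplete (offset, size) objects into regions that
    exactly partition [0, frame_size). Returns half-open (start, end) pairs."""
    if frame_size <= 0:
        return []

    # Clip each object to the frame and sort by start offset.
    intervals = sorted(
        (max(start, 0), min(start + size, frame_size))
        for start, size in objects
        if size > 0
    )

    merged = []
    for start, end in intervals:
        if start >= end:
            continue  # object lies entirely outside the frame
        if merged and start < merged[-1][1]:
            # Overlap: treat both objects as a single region.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    if not merged:
        # Nothing is known about this frame: keep it as one region.
        return [(0, frame_size)]

    # Absorb unexplained gaps so that no byte is left unassigned.
    merged[0][0] = 0
    for i in range(len(merged) - 1):
        merged[i][1] = merged[i + 1][0]
    merged[-1][1] = frame_size
    return [tuple(region) for region in merged]
```

For example, `coalesce_regions([(0, 4), (2, 4), (16, 8)], 32)` merges the two overlapping objects and absorbs the gaps, yielding `[(0, 16), (16, 32)]`. The result is coarser than the true layout, but it never splits a real object in two, which is the property a rewriter needs.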

The final ELLF metadata is injected as a separate ELF section and is significantly more compact than DWARF: on average it is 17% of DWARF's size, increasing overall binary size by 27%, versus 158% for DWARF.
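The size figures can be cross-checked with a line of arithmetic. The check below assumes that both percentages are measured against the same stripped baseline and that the reported averages can be multiplied directly; averages of ratios need not compose this way, so it is a consistency check rather than a derivation.

```python
# Reported averages, expressed as fractions of the baseline binary size.
dwarf_increase = 1.58        # DWARF adds 158% to the binary
ellf_to_dwarf_ratio = 0.17   # ELLF metadata is about 17% of DWARF's size

# Implied size increase from ELLF metadata alone.
ellf_increase = dwarf_increase * ellf_to_dwarf_ratio
print(f"{ellf_increase:.1%}")  # prints 26.9%
```

The implied figure of roughly 27% matches the reported overall increase.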

Empirical Evaluation

ELLF was evaluated on ~200 programs from the LLVM test suite, as well as larger real-world systems. The methodology included:

  • Compiling to ELLF binaries.
  • Disassembling with an ELLF-aware lifter, producing fully symbolized recompilable assembly.
  • Recompiling to executable binaries and comparing functional correctness and performance with the original.
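The final comparison step can be approximated with a simple differential test. The sketch below is an assumption about what "functional correctness" might mean in practice: it runs two commands on the same input and treats identical exit code, stdout, and stderr as equivalent behavior. The function name `same_behavior` and this equivalence criterion are hypothetical; the paper's actual harness is not described in this summary.

```python
import subprocess


def same_behavior(original_cmd, recompiled_cmd, stdin_data=b"", timeout=60):
    """Run both commands on the same input and compare observable behavior.

    Each command is an argument list such as ["./prog", "--flag"].
    Returns True when exit code, stdout, and stderr all match.
    """
    results = []
    for cmd in (original_cmd, recompiled_cmd):
        completed = subprocess.run(
            cmd,
            input=stdin_data,
            capture_output=True,
            timeout=timeout,
        )
        results.append((completed.returncode, completed.stdout, completed.stderr))
    return results[0] == results[1]
```

A call might look like `same_behavior(["./prog.orig", "input.txt"], ["./prog.relifted", "input.txt"])`, where both paths are placeholders. Programs with nondeterministic output (timestamps, addresses, thread interleavings) would need their output normalized before comparison.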

Key empirical results:

  • All programs except one idiosyncratic case (frame_layout) were lifted and recompiled without behavioral change; that exception was due to a rare linker pattern, not loss of metadata.
  • The ELLF metadata increased binary size by an average of 27%, far lower than DWARF's overhead.
  • The runtime performance is identical (within measurement noise) to baseline binaries; no statistically significant slowdowns or speedups were observed.
  • The metadata generation increases compilation time by an average of 57%, the majority of which is spent in standard compiler/linker stages.

Comparison to Prior Work

Prevailing approaches to disassembler ground truth or binary rewriting, such as Ramblr, Egalito, or BOLT, either over-approximate (accepting unsoundness), attempt to instrument compilers directly (fragile, hard to maintain), or require the presence of full symbol/debug information. ELLF differs critically in that:

  • Disassembly is rendered decidable: there is no heuristic or probabilistic inference at any lifting stage, and no soundness compromise is required.
  • The format is strictly minimal, omitting any extraneous or proprietary information (e.g., source-level variable names, types).
  • It is robust across optimization levels and complex binary constructs, including jump tables and exception handling.

Practical and Theoretical Implications

The implications of ELLF are significant for binary analysis, instrumentation, and secure software supply chain practices:

  • For security and program analysis tools, ELLF enables robust automation of patching, instrumentation (e.g., fuzzing hooks), bounds checking, and diversification without the risk of disassembly errors or ambiguity.
  • For software publishers, ELLF allows distribution of analyzable and maintainable binaries without source-level disclosure, balancing transparency and IP protection.
  • In reverse engineering and decompilation, ELLF does not make recovery of high-level, source-like representations easier; the mapping from structured binary to human-readable code, type inference, variable recovery, and control flow structuring remain hard open problems.
  • For platform security and reliability (e.g., as discussed within the OpenSSF SIG), ELLF provides a tractable means for ecosystem-wide improvement of updateability, maintainability, and forensic analyzability of distributed software artifacts.

Future Directions

Promising avenues for research and adoption include:

  • Cross-platform extension: While ELLF was developed for ELF, the approach should in principle carry over to the PE and Mach-O formats.
  • Support for handwritten/inline assembly: While not the primary goal, future assembler enhancements could match ELLF's metadata guarantees for non-compiler-generated code regions.
  • Integration with supply chain assurance and attestation frameworks, capitalizing on ELLF's balance of verifiability and confidentiality.
  • Refinement of metadata compaction and handling of rare linker patterns (e.g., more thorough support for ambiguous section merging and less common switch lowering constructs).

Conclusion

This work proposes and implements a middle ground between black-box binaries and open source: ELLF, an ELF-based binary format augmented with just enough metadata to make instruction, pointer, control-flow, and memory-structure recovery decidable. The authors demonstrate that ELLF binaries can be lifted, rewritten, and recompiled without behavioral change across a wide range of realistic programs, narrowing the gap between the needs of binary analysis and the constraints of software confidentiality. This approach lays a foundation for future systems in trustworthy binary distribution and automated software maintenance (2604.19628).
