- The paper introduces BinSub, a binary type inference method based on algebraic subtyping that achieves a 63x improvement in average runtime over Retypd.
- It recasts Retypd's type system in the algebraic subtyping framework developed for ML-like languages, recovering high-level types from machine code with comparable precision.
- Evaluation on 1568 functions from the ALLSTAR dataset shows that BinSub improves efficiency while keeping accuracy comparable to the baseline.
BinSub: The Simple Essence of Polymorphic Type Inference for Machine Code
BinSub is a binary type inference algorithm built on algebraic subtyping, a recent advance in type inference for ML-like languages. The paper addresses the efficient recovery of high-level type information from binaries, which has historically been expensive and complex because machine code carries no explicit type information.
Core Contributions and Methodology
BinSub's primary contribution is a large improvement in runtime performance at a precision comparable to previous work. The key insights and approach are as follows:
- Algebraic Subtyping Framework: BinSub leverages the algebraic subtyping framework to provide expressive and efficient binary type inference. By recognizing the alignment between algebraic subtyping capabilities and the type system requirements for binary analysis, the algorithm simplifies the traditionally complex and computationally expensive task of type inference in binaries.
- Optimized Type System: The type system in BinSub is equipped with essential features such as subtyping, contravariance of pointer stores, recursive types, and polymorphism. By refactoring Retypd's system within algebraic subtyping, BinSub avoids the inefficiencies and complexities of earlier approaches. Notably, the algorithm achieves a 63x improvement in average runtime over Retypd while maintaining similar precision.
- Formalization and Validation: The paper formalizes BinSub's type system and demonstrates its expressiveness and correctness through a translation of Retypd constraints. This translation ensures that any subtyping judgements derived in Retypd can be equivalently represented and derived in BinSub.
- Empirical Evaluation: BinSub was implemented in angr and evaluated against a Retypd implementation in the same framework on the ALLSTAR reverse engineering dataset. The comparison shows a considerable performance improvement without sacrificing precision, supporting BinSub as a practical and scalable solution for binary type inference.
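The contravariance of pointer stores listed above can be illustrated with a minimal sketch. It assumes a hypothetical primitive hierarchy (`int32` and `uint32` below `int`, below `top`) and a pointer type with separate load and store components; the class and function names are illustrative and are not taken from BinSub or Retypd.

```python
from dataclasses import dataclass

# Hypothetical primitive hierarchy: each name maps to its direct supertypes.
PRIM_SUPERS = {
    "int32": {"int"},
    "uint32": {"int"},
    "int": {"top"},
    "top": set(),
}


@dataclass(frozen=True)
class Prim:
    name: str


@dataclass(frozen=True)
class Ptr:
    load: object   # type read through the pointer (covariant)
    store: object  # type written through the pointer (contravariant)


def prim_subtype(a: str, b: str) -> bool:
    """True if primitive `a` is `b` or reaches `b` through its supertypes."""
    if a == b:
        return True
    return any(prim_subtype(s, b) for s in PRIM_SUPERS[a])


def is_subtype(s, t) -> bool:
    """Structural subtype check for the toy types defined above."""
    if isinstance(t, Prim) and t.name == "top":
        return True
    if isinstance(s, Prim) and isinstance(t, Prim):
        return prim_subtype(s.name, t.name)
    if isinstance(s, Ptr) and isinstance(t, Ptr):
        # Loads keep the direction of the check; stores reverse it.
        return is_subtype(s.load, t.load) and is_subtype(t.store, s.store)
    return False


int32, int_ = Prim("int32"), Prim("int")
print(is_subtype(Ptr(int32, int_), Ptr(int_, int32)))  # True
print(is_subtype(Ptr(int_, int32), Ptr(int32, int_)))  # False
```

A pointer that accepts stores of any `int` can stand in for one that only needs to accept `int32` stores, which is why the store component is compared in the reverse direction.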
Implementation and Evaluation
Implementation
BinSub's implementation within the angr framework consists of four steps:
- Constraint Generation: Constraints are generated from the intermediate representation (IR) of binary functions.
- Bi-unification and Coalescing: Bi-unification solves the subtyping constraints by recording lower and upper bounds on type variables. Coalescing then substitutes those bounds for the variables, yielding types that no longer carry separate constraints.
- Type Simplification: BinSub minimizes the coalesced types using automata-based techniques inspired by MLSub.
- Type Lowering: The final step involves lowering BinSub types to C types, where recursive types and pointers are handled using specific heuristics to generate succinct and accurate type representations.
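The bi-unification and coalescing steps above can be sketched in the style of algebraic subtyping solvers. This is a simplified illustration, not BinSub's code: the `Var`, `Prim`, and `Ptr` classes and the `constrain` and `coalesce` functions are hypothetical names, primitives are only compared for equality, and recursive occurrences are rendered by variable name rather than as a proper recursive type.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Prim:
    name: str


@dataclass(eq=False)  # variables are compared and hashed by identity
class Var:
    name: str
    lower: list = field(default_factory=list)  # types flowing into the variable
    upper: list = field(default_factory=list)  # types the variable flows into


@dataclass(frozen=True)
class Ptr:
    load: object
    store: object


def constrain(lhs, rhs, seen=None):
    """Record the constraint lhs <: rhs, propagating through variable bounds."""
    seen = set() if seen is None else seen
    if (lhs, rhs) in seen:  # already handled; guarantees termination on cycles
        return
    seen.add((lhs, rhs))
    if isinstance(lhs, Prim) and isinstance(rhs, Prim):
        if lhs.name != rhs.name:
            raise TypeError(f"{lhs.name} is not a subtype of {rhs.name}")
    elif isinstance(lhs, Ptr) and isinstance(rhs, Ptr):
        constrain(lhs.load, rhs.load, seen)    # loads are covariant
        constrain(rhs.store, lhs.store, seen)  # stores are contravariant
    elif isinstance(lhs, Var):
        lhs.upper.append(rhs)
        for lb in list(lhs.lower):
            constrain(lb, rhs, seen)
    elif isinstance(rhs, Var):
        rhs.lower.append(lhs)
        for ub in list(rhs.upper):
            constrain(lhs, ub, seen)
    else:
        raise TypeError(f"cannot constrain {lhs} <: {rhs}")


def coalesce(ty, positive=True, in_progress=frozenset()):
    """Replace variables by their bounds: lower bounds (joined with |) in
    positive positions, upper bounds (joined with &) in negative positions."""
    if isinstance(ty, Prim):
        return ty.name
    if isinstance(ty, Ptr):
        load = coalesce(ty.load, positive, in_progress)
        store = coalesce(ty.store, not positive, in_progress)
        return f"ptr(load: {load}, store: {store})"
    if (ty, positive) in in_progress:
        return ty.name  # recursive occurrence, left as a variable name
    bounds = ty.lower if positive else ty.upper
    nested = in_progress | {(ty, positive)}
    parts = list(dict.fromkeys(coalesce(b, positive, nested) for b in bounds))
    if not parts:
        return ty.name  # unconstrained variable
    return (" | " if positive else " & ").join(parts)


# A parameter p is dereferenced, and the loaded value is used as an int32.
p, a, s = Var("p"), Var("a"), Var("s")
constrain(p, Ptr(a, s))
constrain(a, Prim("int32"))
print(coalesce(p, positive=False))  # ptr(load: int32, store: s)
```

The parameter is coalesced at negative polarity because it is an input: its inferred type is what the function requires of it, which comes from its upper bounds.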
Empirical Results
The empirical evaluation of BinSub was conducted on a dataset of 1568 functions sampled from the ALLSTAR dataset. The notable findings include:
- Type Inference Precision: BinSub and Typehoon, the Retypd-based baseline in angr, achieved comparable type distances to the ground truth, indicating that BinSub does not compromise on accuracy.
- Runtime Performance: BinSub achieved a 63x reduction in average runtime compared to Typehoon.
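A speedup in average runtime is a ratio of mean runtimes, which can differ from the mean of per-function speedups. The sketch below shows the difference on made-up timings; the function names and the numbers are purely illustrative and are not measurements from the paper.

```python
from statistics import mean


def speedup_of_averages(baseline_secs, new_secs):
    """Ratio of mean runtimes: mean(baseline) / mean(new)."""
    return mean(baseline_secs) / mean(new_secs)


def average_of_speedups(baseline_secs, new_secs):
    """Mean of the per-function ratios baseline / new."""
    return mean(b / n for b, n in zip(baseline_secs, new_secs))


# Hypothetical per-function runtimes in seconds (not data from the paper).
baseline = [8.0, 1.0]
new = [0.25, 0.5]
print(speedup_of_averages(baseline, new))  # 12.0
print(average_of_speedups(baseline, new))  # 17.0
```

The first figure is dominated by the slowest functions, so it reflects the total time saved across the dataset rather than the typical per-function gain.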
Implications and Future Work
BinSub makes binary type inference substantially faster without giving up precision. Reducing binary type inference to algebraic subtyping also opens the way to bringing further advances in high-level type inference directly into binary analysis.
Potential future work includes:
- Enhanced Type Simplification: Further refinement of the type simplification rules could produce more compact C type representations.
- Improved Type Lowering Heuristics: More precise rules for distinguishing pointers from integers during lowering could improve precision.
In conclusion, BinSub offers an efficient and precise approach to type inference for binary code, connecting established high-level type inference methods with the demands of binary analysis.