BinSub: The Simple Essence of Polymorphic Type Inference for Machine Code (2409.01841v1)

Published 3 Sep 2024 in cs.PL

Abstract: Recovering high-level type information in binaries is a key task in reverse engineering and binary analysis. Binaries contain very little explicit type information. The structure of binary code is incredibly flexible allowing for ad-hoc subtyping and polymorphism. Prior work has shown that precise type inference on binary code requires expressive subtyping and polymorphism. Implementations of these type system features in a binary type inference algorithm have thus-far been too inefficient to achieve widespread adoption. Recent advances in traditional type inference have achieved simple and efficient principal type inference in an ML like language with subtyping and polymorphism through the framework of algebraic subtyping. BinSub, a new binary type inference algorithm, recognizes the connection between algebraic subtyping and the type system features required to analyze binaries effectively. Using this connection, BinSub achieves simple, precise, and efficient binary type inference. We show that BinSub maintains a similar precision to prior work, while achieving a 63x improvement in average runtime for 1568 functions. We also present a formalization of BinSub and show that BinSub's type system maintains the expressiveness of prior work.

Summary

The paper introduces a novel binary type inference method based on algebraic subtyping that achieves a 63x runtime improvement over prior approaches.
It recasts Retypd's type system in the algebraic subtyping framework developed for ML-like languages, recovering high-level types from machine code with comparable precision.
Evaluation on 1568 functions from the ALLSTAR dataset shows that BinSub greatly reduces average runtime while keeping precision similar to prior work.

BinSub: The Simple Essence of Polymorphic Type Inference for Machine Code

BinSub is a binary type inference algorithm built on algebraic subtyping, a framework from recent work on type inference for ML-like languages. The paper addresses the efficient recovery of high-level type information from binaries, a task that has historically been expensive and complex because machine code carries very little explicit type information.

Core Contributions and Methodology

BinSub's primary contribution is to maintain precision comparable to prior work while significantly improving runtime performance. Its key insights and approach are as follows:

Algebraic Subtyping Framework: BinSub leverages the algebraic subtyping framework to provide expressive and efficient binary type inference. By recognizing the alignment between algebraic subtyping capabilities and the type system requirements for binary analysis, the algorithm simplifies the traditionally complex and computationally expensive task of type inference in binaries.
Optimized Type System: The type system in BinSub is equipped with essential features such as subtyping, contravariance of pointer stores, recursive types, and polymorphism. By refactoring Retypd's system within algebraic subtyping, BinSub avoids the inefficiencies and complexities of earlier approaches. Notably, the algorithm achieves a 63x improvement in average runtime over Retypd while maintaining similar precision.
Formalization and Validation: The paper formalizes BinSub's type system and demonstrates its expressiveness and correctness through a translation of Retypd constraints. This translation ensures that any subtyping judgements derived in Retypd can be equivalently represented and derived in BinSub.
Empirical Evaluation: BinSub's implementation was evaluated against Retypd using angr on the ALLSTAR reverse engineering dataset. The comparison shows a considerable performance improvement without sacrificing precision, supporting BinSub as a practical and scalable solution for binary type inference.
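
To make the "contravariance of pointer stores" feature concrete, the following is a minimal sketch of a subtyping check in which a pointer type carries a covariant load component and a contravariant store component. It is an illustration only, not BinSub's or Retypd's actual definitions: the names `Prim`, `Ptr`, `is_subtype`, and the toy primitive chain `nat <: int <: num` are all assumptions made for this example.

```python
from dataclasses import dataclass

# Hypothetical toy chain of primitive types: nat <: int <: num.
PRIM_ORDER = ["nat", "int", "num"]


@dataclass(frozen=True)
class Prim:
    name: str


@dataclass(frozen=True)
class Ptr:
    load: object   # type of values read through the pointer (covariant)
    store: object  # type of values written through the pointer (contravariant)


def is_subtype(a, b):
    """Return True if a <: b under the toy rules of this sketch."""
    if isinstance(a, Prim) and isinstance(b, Prim):
        return PRIM_ORDER.index(a.name) <= PRIM_ORDER.index(b.name)
    if isinstance(a, Ptr) and isinstance(b, Ptr):
        # Loads compare in the same direction, stores in the opposite one.
        return is_subtype(a.load, b.load) and is_subtype(b.store, a.store)
    return False


# A pointer that yields `nat` and accepts any `num` can stand in for one
# that yields `int` and accepts `int`, but not the other way around.
flexible = Ptr(load=Prim("nat"), store=Prim("num"))
strict = Ptr(load=Prim("int"), store=Prim("int"))
flexible_fits_strict = is_subtype(flexible, strict)
strict_fits_flexible = is_subtype(strict, flexible)
```

The store component flips direction because a pointer used in place of another must accept at least everything the original could be asked to store.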

Implementation and Evaluation

Implementation

BinSub's implementation within the angr framework involves four steps:

Constraint Generation: Constraints are generated from the intermediate representation (IR) of binary functions.
Bi-unification and Coalescing: The algorithm performs bi-unification and then type coalescing, which substitutes each type variable's lower and upper bounds to produce types with no remaining constraints.
Type Simplification: Using automata-based minimization techniques inspired by MLSub, BinSub simplifies types effectively.
Type Lowering: The final step involves lowering BinSub types to C types, where recursive types and pointers are handled using specific heuristics to generate succinct and accurate type representations.
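
The bi-unification and coalescing steps above can be sketched in the style commonly used for algebraic subtyping, where each type variable accumulates lower and upper bounds. This is a simplified illustration under stated assumptions, not BinSub's implementation: the names `Var`, `Ptr`, `constrain`, and `coalesce` are hypothetical, primitives are only subtypes of themselves, recursive occurrences are printed as the bare variable name rather than as a proper recursive type, and the constraint `s <: l` (stored values can be read back) is an assumed modelling rule.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Prim:
    name: str


@dataclass(eq=False)  # variables compare and hash by identity
class Var:
    name: str
    lower: list = field(default_factory=list)  # types known to flow into the variable
    upper: list = field(default_factory=list)  # types the variable must fit into


@dataclass(frozen=True)
class Ptr:
    load: object   # covariant
    store: object  # contravariant


def constrain(lhs, rhs, seen=None):
    """Record the constraint lhs <: rhs, propagating through variable bounds."""
    seen = set() if seen is None else seen
    if (lhs, rhs) in seen:  # already handled; also guarantees termination
        return
    seen.add((lhs, rhs))
    if isinstance(lhs, Prim) and isinstance(rhs, Prim):
        if lhs != rhs:
            raise TypeError(f"cannot constrain {lhs.name} <: {rhs.name}")
    elif isinstance(lhs, Ptr) and isinstance(rhs, Ptr):
        constrain(lhs.load, rhs.load, seen)    # loads: same direction
        constrain(rhs.store, lhs.store, seen)  # stores: flipped direction
    elif isinstance(lhs, Var):
        lhs.upper.append(rhs)
        for lb in list(lhs.lower):
            constrain(lb, rhs, seen)
    elif isinstance(rhs, Var):
        rhs.lower.append(lhs)
        for ub in list(rhs.upper):
            constrain(lhs, ub, seen)
    else:
        raise TypeError("incompatible type shapes")


def coalesce(ty, positive=True, in_progress=frozenset()):
    """Render a type with variable bounds substituted in (union of lower
    bounds in positive positions, intersection of upper bounds in negative)."""
    if isinstance(ty, Prim):
        return ty.name
    if isinstance(ty, Ptr):
        load = coalesce(ty.load, positive, in_progress)
        store = coalesce(ty.store, not positive, in_progress)
        return f"ptr(load: {load}, store: {store})"
    if (ty, positive) in in_progress:  # recursive occurrence (simplified)
        return ty.name
    bounds = ty.lower if positive else ty.upper
    nested = in_progress | {(ty, positive)}
    parts = [coalesce(b, positive, nested) for b in bounds]
    return (" | " if positive else " & ").join([ty.name] + parts)


# Usage: an `int` is stored through pointer p, then a value x is loaded from it.
INT = Prim("int")
l, s, x, p = Var("l"), Var("s"), Var("x"), Var("p")
constrain(Ptr(load=l, store=s), p)    # p points to a cell with load type l, store type s
constrain(s, l)                       # assumed rule: what is stored can be loaded
constrain(p, Ptr(load=x, store=INT))  # the code stores an int and loads x through p
x_type = coalesce(x)                  # "x | int"
```

In this sketch the stored `int` reaches `x` purely through bound propagation, and coalescing then reads the result directly off the recorded bounds of `x`.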

Empirical Results

The empirical evaluation of BinSub was conducted on a dataset of 1568 functions sampled from the ALLSTAR dataset. The notable findings include:

Type Inference Precision: BinSub and Typehoon (angr's Retypd-based type inference engine) achieved comparable type distances to the ground truth, indicating that BinSub does not compromise on accuracy.
Runtime Performance: BinSub achieved a 63x improvement in average runtime over Typehoon, underscoring its efficiency.

Implications and Future Work

BinSub represents a significant advancement in making binary type inference both efficient and precise. The reduction of binary type inference to algebraic subtyping paves the way for future integration of high-level advances in type inference directly into binary analysis.

Potential future work includes:

Enhanced Type Simplification: Further refinement of the type simplification rules could lead to even more optimized C type representations.
Improved Type Lowering Heuristics: Integrating precise rules to differentiate pointers from integers during lowering could enhance precision.

In conclusion, BinSub offers a compelling solution to the longstanding problem of efficient and precise type inference in binary code, thereby bridging the gap between traditional high-level type inference methods and the demands of binary analysis.