Obfuscation-Resilient Binary Code Similarity Analysis using Dominance Enhanced Semantic Graph (2506.06161v1)

Published 6 Jun 2025 in cs.CR and cs.SE

Abstract: Binary code similarity analysis (BCSA) serves as a core technique for binary analysis tasks such as vulnerability detection. While current graph-based BCSA approaches capture substantial semantics and show strong performance, their performance suffers under code obfuscation due to the unstable control flow. To address this issue, we develop ORCAS, an Obfuscation-Resilient BCSA model based on Dominance Enhanced Semantic Graph (DESG). The DESG is an original binary code representation, capturing more binaries' implicit semantics without control flow structure, including inter-instruction relations, inter-basic block relations, and instruction-basic block relations. ORCAS robustly scores semantic similarity across binary functions from different obfuscation options, optimization levels, and instruction set architectures. Extensive evaluation on the BinKit dataset shows ORCAS significantly outperforms eight baselines, achieving an average 12.1% PR-AUC gain when using combined three obfuscation options compared to the state-of-the-art approaches. Furthermore, ORCAS improves recall by up to 43% on an original obfuscated real-world vulnerability dataset, which we released to facilitate future research.

Summary

The paper introduces ORCAS, a BCSA model built on the Dominance Enhanced Semantic Graph (DESG), a representation designed to keep binary code similarity analysis reliable under obfuscation.
It embeds the DESG with a Gated Graph Neural Network and multi-head attention pooling, trained with a margin-based loss, so that similarity rests on relations more stable than control flow graph edges.
On BinKit, ORCAS outperforms eight baselines with an average 12.1% PR-AUC gain when three obfuscation options are combined, and it improves recall by up to 43% on an obfuscated real-world vulnerability dataset.

Obfuscation-Resilient Binary Code Similarity Analysis

Binary Code Similarity Analysis (BCSA) is a core technique in domains such as vulnerability detection, malware identification, and software plagiarism detection. Recent work has tried to make BCSA more robust to code obfuscation, which rewrites control flow to hinder reverse engineering. The paper "Obfuscation-Resilient Binary Code Similarity Analysis using Dominance Enhanced Semantic Graph" addresses this problem with a new BCSA model, ORCAS, built on a novel representation called the Dominance Enhanced Semantic Graph (DESG).

Dominance Enhanced Semantic Graph (DESG)

DESG is the paper's central contribution. It captures implicit semantic relations within binary code without relying on the control flow graph (CFG). It encodes three types of relations: inter-basic block relations (dominance and post-dominance), inter-instruction relations (data and effect), and instruction-basic block relations (containment). CFG edges change substantially under obfuscations such as bogus control flow (BCF) and control flow flattening (FLA), whereas dominance relations remain comparatively stable. That stability is what lets DESG serve as a representation of binary code that holds up under obfuscation.
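
To make the dominance relation concrete, the sketch below computes dominator and post-dominator sets for a toy CFG with the textbook iterative data-flow algorithm, then repeats the computation after adding one bogus block. The block names, the toy graphs, and the helper names (`compute_dominators`, `reverse_cfg`) are illustrative assumptions rather than the paper's implementation, and a single toy case does not show that dominance survives every obfuscation.

```python
def compute_dominators(cfg, entry):
    """Return {block: set of blocks that dominate it}.

    cfg maps each basic block to a list of successor blocks.
    Assumes every block is reachable from `entry`.
    """
    nodes = set(cfg) | {s for succs in cfg.values() for s in succs}
    preds = {n: set() for n in nodes}
    for src, succs in cfg.items():
        for dst in succs:
            preds[dst].add(src)

    # Start from "everything dominates everything" and shrink to a fixpoint.
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if n == entry:
                continue
            if preds[n]:
                new = set.intersection(*(dom[p] for p in preds[n]))
            else:
                new = set()
            new = new | {n}
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def reverse_cfg(cfg):
    """Flip every edge; dominators of the reversed graph are post-dominators."""
    rev = {}
    for src, succs in cfg.items():
        rev.setdefault(src, [])
        for dst in succs:
            rev.setdefault(dst, []).append(src)
    return rev


# Toy diamond CFG: A branches to B and C, which both join at D.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

# Hypothetical obfuscated variant: B gains a branch to a bogus block X.
cfg_obf = {"A": ["B", "C"], "B": ["D", "X"], "C": ["D"], "X": ["D"], "D": []}

dom = compute_dominators(cfg, "A")
dom_obf = compute_dominators(cfg_obf, "A")
pdom = compute_dominators(reverse_cfg(cfg), "D")

for block in sorted(dom):
    print(block, sorted(dom[block]), sorted(dom_obf[block]))
```

In this toy case the edge set changes (B gains a successor and D gains a predecessor), yet every original block keeps the same dominator set, which is the kind of stability the DESG relies on.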

Model Architecture and Training Strategies

ORCAS employs Gated Graph Neural Networks (GGNN) to harness the semantic information encapsulated in DESG. The GGNN uses a multi-head attention-based pooling mechanism, which aids in emphasizing critical nodes and edges during the embedding process. The training process incorporates margin-based pairwise loss with distance-weighted negative sampling, enhancing ORCAS's capability to discern semantic similarities amongst function pairs even under various obfuscation options.
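
The training objective can be sketched as follows. The summary names only "margin-based pairwise loss" and "distance-weighted negative sampling", so the exact formulas here are assumptions borrowed from a common deep metric learning formulation: a hinge on the embedding distance around a boundary `beta` with margin `alpha`, and negatives drawn with probability inversely proportional to the density of pairwise distances on the unit sphere. The parameter values, the clamping bounds, and the function names are hypothetical.

```python
import math
import random


def margin_loss(distance, is_similar, beta=1.2, alpha=0.2):
    """Hinge loss on one pair: pull similar pairs inside beta - alpha,
    push dissimilar pairs beyond beta + alpha."""
    y = 1.0 if is_similar else -1.0
    return max(0.0, alpha + y * (distance - beta))


def distance_weights(distances, dim, lower=0.5, upper=1.9):
    """Sampling probabilities for candidate negatives.

    Assumes unit-norm embeddings in `dim` dimensions, where pairwise
    distances follow q(d) ~ d^(dim-2) * (1 - d^2/4)^((dim-3)/2).
    Weights are proportional to 1/q(d), computed in log space.
    """
    log_w = []
    for d in distances:
        d = min(max(d, lower), upper)  # clamp to keep the logs finite
        log_w.append(-(dim - 2) * math.log(d)
                     - (dim - 3) / 2.0 * math.log(1.0 - d * d / 4.0))
    peak = max(log_w)
    w = [math.exp(v - peak) for v in log_w]
    total = sum(w)
    return [v / total for v in w]


def sample_negative(candidates, distances, dim, rng=random):
    """Pick one negative candidate using the distance-based weights."""
    probs = distance_weights(distances, dim)
    return rng.choices(candidates, weights=probs, k=1)[0]


# Hypothetical anchor-to-candidate distances for three negative functions.
candidates = ["func_a", "func_b", "func_c"]
distances = [0.6, 1.0, 1.4]
probs = distance_weights(distances, dim=8)
negative = sample_negative(candidates, distances, dim=8)

print(probs)
print(margin_loss(0.5, is_similar=True), margin_loss(0.5, is_similar=False))
```

Under these assumptions, closer negatives receive more probability mass than distant ones, so training sees hard negatives more often than uniform sampling would provide.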

Numerical Results and Comparative Analysis

On the BinKit dataset, ORCAS outperforms eight baseline models across metrics including precision-recall AUC (PR-AUC), Recall@k, and Mean Reciprocal Rank (MRR). When three obfuscation options are combined, it achieves an average PR-AUC gain of 12.1% over state-of-the-art approaches. On the obfuscated real-world vulnerability dataset constructed by the authors, recall improves by up to 43%.
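
For readers unfamiliar with the retrieval metrics, the sketch below shows how Recall@k and MRR are typically computed from the rank of the true match in each query's candidate list. The scores and ranks are made-up illustrative values, not results from the paper, and the function names are hypothetical.

```python
def rank_of_match(scores, match_index):
    """1-based rank of the true match: one plus the number of
    candidates with a strictly higher similarity score."""
    target = scores[match_index]
    return 1 + sum(1 for s in scores if s > target)


def recall_at_k(ranks, k):
    """Fraction of queries whose true match appears in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all queries."""
    return sum(1.0 / r for r in ranks) / len(ranks)


# One query with three candidates; the true match is candidate 2.
example_rank = rank_of_match([0.2, 0.9, 0.5], match_index=2)

# Ranks of the true match for four hypothetical queries.
ranks = [1, 3, 2, 10]
print(example_rank)
print(recall_at_k(ranks, 1), recall_at_k(ranks, 5))
print(mean_reciprocal_rank(ranks))
```

Recall@k rewards any hit within the top k equally, while MRR also credits how close to the top the match lands, which is why the two are usually reported together.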

Implications and Future Directions

This paper makes a substantial contribution to BCSA that remains robust under obfuscation. By shifting the focus from control flow structure to dominance relations, ORCAS opens new directions for further research. The obfuscated real-world vulnerability dataset released by the authors should encourage further methods and benchmarks in BCSA. Looking ahead, finer-grained semantic representations and stronger cross-architecture learning could advance the field further.

The implications extend beyond the reported performance gains. DESG suggests a different way to model the semantics of obfuscated code: through relations that obfuscation tends to leave intact, rather than through the control flow edges it rewrites. In practice, this could help identify security vulnerabilities more accurately in obfuscated or otherwise complex software.
