FASER: Binary Code Similarity Search through the use of Intermediate Representations (2310.03605v3)
Abstract: Identifying functions of interest in cross-architecture software is useful whether you are analysing malware, securing the software supply chain or conducting vulnerability research. Cross-architecture binary code similarity search has been explored in numerous studies, which have drawn on a wide range of data sources. These typically include common structures derived from binaries, such as function control flow graphs or binary-level call graphs, the output of the disassembly process, or the outputs of dynamic analysis. One data source that has received less attention is binary intermediate representations. Binary intermediate representations possess two interesting properties: they are cross-architecture by their very nature, and they encode the semantics of a function explicitly to support downstream usage. In this paper we propose Function as a String Encoded Representation (FASER), which combines long document transformers with intermediate representations to create a model capable of cross-architecture function search without the need for manual feature engineering, pre-training or a dynamic analysis step. We compare our approach against a series of baseline approaches on two tasks: a general function search task and a targeted vulnerability search task. Our approach demonstrates strong performance across both tasks, performing better than all baseline approaches.
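To make the function search task concrete, the following is a minimal sketch of how a query function could be ranked against a pool of candidates once each function has been turned into a string. It is not the FASER model: the `embed` function is a toy bag-of-tokens stand-in for a learned long-document transformer encoder, cosine similarity is assumed as the ranking metric, and the IR-like strings, function names, and helper names are all hypothetical and hand-written for illustration.

```python
import math
from collections import Counter


def embed(ir_text: str) -> Counter:
    """Toy stand-in for a learned encoder: counts whitespace-separated tokens."""
    return Counter(ir_text.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query_ir: str, pool: dict[str, str]) -> list[tuple[str, float]]:
    """Rank every function in the pool by similarity to the query, best first."""
    q = embed(query_ir)
    scored = [(name, cosine(q, embed(ir))) for name, ir in pool.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical, hand-written IR-like strings; not real lifter output.
POOL = {
    "arm_memcpy_like": "load r0 load r1 store r1 add r0 const add r1 const cmp jump",
    "x86_strlen_like": "load r0 cmp const jump add r0 const jump",
    "mips_checksum_like": "load r0 xor r1 r0 shift r1 const add r0 const cmp jump",
}
QUERY = "load r0 load r1 store r1 add r0 const add r1 const cmp jump ret"

if __name__ == "__main__":
    for name, score in search(QUERY, POOL):
        print(f"{name}: {score:.3f}")
```

In a learned system the token-count vectors would be replaced by dense embeddings produced by the trained encoder, while the ranking step would stay essentially the same.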