BinaryAI: Binary Software Composition Analysis via Intelligent Binary Source Code Matching (2401.11161v3)
Abstract: While third-party libraries (TPLs) are extensively reused to enhance productivity during software development, they can also introduce security risks such as vulnerability propagation. Software composition analysis (SCA), proposed to identify reused TPLs and thereby reduce such risks, has become an essential procedure within modern DevSecOps. As one of the mainstream SCA techniques, binary-to-source SCA identifies the third-party source projects contained in binary files via binary source code matching, which is a major challenge in reverse engineering because binary and source code exhibit substantial disparities after compilation. Existing binary-to-source SCA techniques rely on basic syntactic features that suffer from redundancy and lack robustness on large-scale TPL datasets, leading to false positives and compromised recall. To mitigate these limitations, we introduce BinaryAI, a novel binary-to-source SCA technique with two-phase binary source code matching that captures both syntactic and semantic code features. First, BinaryAI trains a transformer-based model to produce function-level embeddings and uses them to retrieve similar source functions for each binary function. Then, by applying link-time locality to facilitate function matching, BinaryAI detects the reused TPLs based on the ratio of matched source functions. Our experimental results demonstrate the superior performance of BinaryAI in both binary source code matching and the downstream SCA task. Specifically, our embedding model outperforms the state-of-the-art model CodeCMR, achieving 22.54% recall@1 and 0.34 MRR, compared with 10.75% and 0.17, respectively. Additionally, BinaryAI outperforms all existing binary-to-source SCA tools in TPL detection, increasing precision from 73.36% to 85.84% and recall from 59.81% to 64.98% compared with a well-recognized commercial SCA product.
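To make the retrieval metrics quoted in the abstract concrete, the following is a minimal sketch (not BinaryAI's implementation) of embedding-based retrieval followed by recall@k and MRR, using their standard definitions. The toy 2-D vectors stand in for function-level embeddings, and all function names are hypothetical. The final helper illustrates one possible reading of "ratio of matched source functions": the fraction of a TPL's source functions that were matched, compared against a threshold. Both that denominator and the threshold value are assumptions, since the abstract does not specify them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_sources(query, source_embeddings):
    """Return source indices sorted by descending cosine similarity to the query."""
    scores = [cosine(query, s) for s in source_embeddings]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def recall_at_k(rankings, truths, k=1):
    """Fraction of queries whose true source function appears in the top k."""
    hits = sum(1 for ranking, truth in zip(rankings, truths) if truth in ranking[:k])
    return hits / len(truths)


def mean_reciprocal_rank(rankings, truths):
    """Average of 1 / (1-based rank of the true source function) over all queries."""
    total = sum(1.0 / (ranking.index(truth) + 1)
                for ranking, truth in zip(rankings, truths))
    return total / len(truths)


def detect_tpls(matched, tpl_functions, threshold=0.5):
    """Report TPLs whose matched-function ratio reaches the threshold.

    ASSUMPTION: the ratio is matched functions / all functions of the TPL,
    and 0.5 is an arbitrary illustrative threshold, not a value from the paper.
    """
    detected = []
    for tpl, functions in tpl_functions.items():
        ratio = len(matched & functions) / len(functions)
        if ratio >= threshold:
            detected.append(tpl)
    return sorted(detected)


# Toy data: three "source function" embeddings and three "binary function" queries.
source_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
binary_embeddings = [[0.9, 0.1], [0.6, 0.8], [1.0, 0.9]]
truths = [0, 1, 2]  # index of the correct source function for each binary function

rankings = [rank_sources(q, source_embeddings) for q in binary_embeddings]
recall1 = recall_at_k(rankings, truths, k=1)
mrr = mean_reciprocal_rank(rankings, truths)
print(f"recall@1 = {recall1:.3f}, MRR = {mrr:.3f}")

# Hypothetical TPL function sets and a hypothetical set of matched source functions.
tpl_functions = {
    "zlib": {"inflate", "deflate", "crc32", "adler32"},
    "libpng": {"png_read", "png_write"},
}
matched = {"inflate", "deflate", "crc32", "png_read"}
detected = detect_tpls(matched, tpl_functions, threshold=0.6)
print("detected TPLs:", detected)
```

In this toy run the second query ranks its true source function second, so it counts as a miss for recall@1 but still contributes 1/2 to MRR. This is why the two metrics are reported together.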