BinaryAI: Binary Software Composition Analysis via Intelligent Binary Source Code Matching (2401.11161v3)

Published 20 Jan 2024 in cs.SE

Abstract: While third-party libraries (TPLs) are extensively reused to enhance productivity during software development, they can also introduce security risks such as vulnerability propagation. Software composition analysis (SCA), proposed to identify reused TPLs and thereby reduce such risks, has become an essential procedure within modern DevSecOps. As one of the mainstream SCA techniques, binary-to-source SCA identifies the third-party source projects contained in binary files via binary source code matching, which is a major challenge in reverse engineering because binary and source code exhibit substantial disparities after compilation. Existing binary-to-source SCA techniques rely on basic syntactic features that suffer from redundancy and lack robustness on large-scale TPL datasets, leading to false positives and compromised recall. To mitigate these limitations, we introduce BinaryAI, a novel binary-to-source SCA technique with two-phase binary source code matching that captures both syntactic and semantic code features. First, BinaryAI trains a transformer-based model to produce function-level embeddings and uses them to retrieve similar source functions for each binary function. Then, by applying link-time locality to facilitate function matching, BinaryAI detects the reused TPLs based on the ratio of matched source functions. Our experimental results demonstrate the superior performance of BinaryAI in both binary source code matching and the downstream SCA task. Specifically, our embedding model outperforms the state-of-the-art model CodeCMR, achieving 22.54% recall@1 and 0.34 MRR compared with 10.75% and 0.17, respectively. Additionally, BinaryAI outperforms all existing binary-to-source SCA tools in TPL detection, increasing precision from 73.36% to 85.84% and recall from 59.81% to 64.98% compared with a well-recognized commercial SCA product.
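To make the retrieval metrics and the ratio-based detection criterion in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It ranks candidate source functions by cosine similarity to a binary function's embedding, computes recall@1 and mean reciprocal rank (MRR) in their standard form, and computes the fraction of a TPL's functions that were matched. The toy two-dimensional embeddings, the function names, and all helper names (`rank_sources`, `recall_at_k`, `mean_reciprocal_rank`, `tpl_match_ratio`) are assumptions chosen for illustration; the link-time locality step is not modeled here.

```python
import math
from typing import Dict, List, Sequence, Set


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_sources(query: Sequence[float],
                 source_embs: Dict[str, Sequence[float]]) -> List[str]:
    """Return source function names, most similar to the query first."""
    return sorted(source_embs,
                  key=lambda name: cosine(query, source_embs[name]),
                  reverse=True)


def recall_at_k(rankings: List[List[str]], truths: List[str], k: int = 1) -> float:
    """Fraction of queries whose true match appears in the top k results."""
    hits = sum(1 for ranked, truth in zip(rankings, truths) if truth in ranked[:k])
    return hits / len(truths)


def mean_reciprocal_rank(rankings: List[List[str]], truths: List[str]) -> float:
    """Average of 1/rank of the true match (0 if it is not retrieved)."""
    total = 0.0
    for ranked, truth in zip(rankings, truths):
        if truth in ranked:
            total += 1.0 / (ranked.index(truth) + 1)
    return total / len(truths)


def tpl_match_ratio(matched: Set[str], tpl_functions: Set[str]) -> float:
    """Share of a TPL's source functions that were matched in the binary."""
    if not tpl_functions:
        return 0.0
    return len(matched & tpl_functions) / len(tpl_functions)


# Hypothetical toy data: three source functions and two binary queries.
source_embs = {"inflate": [1.0, 0.0], "deflate": [0.0, 1.0], "crc32": [1.0, 1.0]}
queries = [[0.9, 0.1], [0.6, 0.8]]
truths = ["inflate", "deflate"]

rankings = [rank_sources(q, source_embs) for q in queries]
r1 = recall_at_k(rankings, truths, k=1)
mrr = mean_reciprocal_rank(rankings, truths)
ratio = tpl_match_ratio({"inflate", "crc32"},
                        {"inflate", "deflate", "crc32", "adler32"})
print(r1, mrr, ratio)
```

In this toy setup the first query retrieves its true match at rank 1 and the second at rank 2, so recall@1 is 0.5 and MRR is 0.75. A detector in this style would report a TPL as reused when the match ratio exceeds some threshold; the threshold value is not given in the abstract and is left out of the sketch.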
