
CEBin: A Cost-Effective Framework for Large-Scale Binary Code Similarity Detection (2402.18818v1)

Published 29 Feb 2024 in cs.SE and cs.CR

Abstract: Binary code similarity detection (BCSD) is a fundamental technique for various applications. Many BCSD solutions have been proposed recently, most of them embedding-based, but they have shown limited accuracy and efficiency, especially when the volume of target binaries to search is large. To address this issue, we propose a cost-effective BCSD framework, CEBin, which fuses embedding-based and comparison-based approaches to significantly improve accuracy while minimizing overhead. Specifically, CEBin first uses a refined embedding-based approach to extract features of the target code, which efficiently narrows down the set of candidate similar code and boosts performance. It then uses a comparison-based approach that performs pairwise comparisons on the candidates to capture more nuanced and complex relationships, which greatly improves the accuracy of similarity detection. By bridging the gap between embedding-based and comparison-based approaches, CEBin provides an effective and efficient solution for detecting similar code (including vulnerable code) in large-scale software ecosystems. Experimental results on three well-known datasets demonstrate the superiority of CEBin over existing state-of-the-art (SOTA) baselines. To further evaluate the usefulness of BCSD in the real world, we construct a large-scale vulnerability benchmark, offering the first precise evaluation scheme for assessing BCSD methods on the 1-day vulnerability detection task. CEBin can identify the similar function among millions of candidate functions in just a few seconds and achieves a recall of $85.46\%$ on this more practical but challenging task, which is several orders of magnitude faster and $4.07\times$ better than the best SOTA baseline. Our code is available at https://github.com/Hustcw/CEBin.
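The two-stage idea described in the abstract (cheap embedding-based retrieval to shortlist candidates, followed by a costlier pairwise comparison on the shortlist only) can be illustrated with a minimal sketch. This is not CEBin's implementation: cosine similarity over toy vectors stands in for the learned embedding model, a Jaccard overlap of instruction mnemonics stands in for the learned comparison model, and every function name, vector, and token set below is hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Set

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query_vec: Vector, pool: Dict[str, Vector], k: int) -> List[str]:
    """Stage 1 (embedding-based): shortlist the k pool entries closest to the query."""
    ranked = sorted(pool, key=lambda name: cosine(query_vec, pool[name]), reverse=True)
    return ranked[:k]


def rerank(candidates: List[str], compare: Callable[[str], float]) -> List[str]:
    """Stage 2 (comparison-based): score each shortlisted candidate against the query."""
    return sorted(candidates, key=compare, reverse=True)


def two_stage_search(
    query_vec: Vector,
    pool: Dict[str, Vector],
    compare: Callable[[str], float],
    k: int,
) -> List[str]:
    """Run the expensive comparison only on the k candidates kept by stage 1."""
    return rerank(retrieve_top_k(query_vec, pool, k), compare)


# Toy data (hypothetical): precomputed embeddings and mnemonic sets per function.
pool_vecs: Dict[str, Vector] = {
    "f_a": [1.0, 0.0],
    "f_b": [0.9, 0.1],
    "f_c": [0.0, 1.0],
}
pool_tokens: Dict[str, Set[str]] = {
    "f_a": {"mov", "sub", "ret"},
    "f_b": {"mov", "add", "ret"},
    "f_c": {"push", "call", "pop"},
}
query_vec: Vector = [1.0, 0.05]
query_tokens: Set[str] = {"mov", "add", "ret"}


def jaccard_to_query(name: str) -> float:
    """Stand-in pairwise comparison: Jaccard overlap of mnemonic sets."""
    tokens = pool_tokens[name]
    return len(tokens & query_tokens) / len(tokens | query_tokens)


shortlist = retrieve_top_k(query_vec, pool_vecs, k=2)
result = two_stage_search(query_vec, pool_vecs, jaccard_to_query, k=2)
print(shortlist, result)
```

In this toy run, stage 1 keeps `f_a` and `f_b` and discards `f_c` without ever invoking the pairwise comparison on it; stage 2 then promotes `f_b` above `f_a`. The cost of the expensive comparison scales with the shortlist size `k` rather than with the size of the pool, which is the trade-off the abstract describes.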
