Cross-Inlining Binary Function Similarity Detection (2401.05739v1)
Abstract: Binary function similarity detection plays an important role in a wide range of security applications. Existing works usually assume that the query function and the target function share equal semantics, and they compare the full semantics of both to obtain a similarity score. However, we find that the function mapping is more complex, especially when function inlining happens. In this paper, we systematically investigate cross-inlining binary function similarity detection. We first construct a cross-inlining dataset by compiling 51 projects with 9 compilers and 4 optimization levels for 6 architectures, under 2 inlining flags, which yields two datasets of 216 combinations each. We then construct the cross-inlining function mappings by linking the common source functions in these two datasets. Our analysis of this dataset shows that three cross-inlining patterns occur widely and that existing works perform poorly when detecting cross-inlining binary function similarity. Next, we propose a pattern-based model named CI-Detector for cross-inlining matching. CI-Detector uses the attributed CFG to represent the semantics of binary functions and a GNN to embed binary functions into vectors. CI-Detector trains a separate model for each of the three cross-inlining patterns. Finally, each testing pair is fed to all three models, and the similarities they produce are aggregated into the final similarity. We conduct several experiments to evaluate CI-Detector. Results show that CI-Detector can detect cross-inlining pairs with a precision of 81% and a recall of 97%, which exceeds all state-of-the-art works.
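The dataset arithmetic in the abstract can be checked with a short sketch. It assumes each dataset is the full cross product of compilers, optimization levels, and architectures, with one dataset per inlining flag. The abstract gives only the counts, so every label below (`compiler_0`, `opt_0`, `arch_0`, `inlining_on`, `inlining_off`) is a hypothetical placeholder, not the authors' actual configuration.

```python
from itertools import product

# Hypothetical placeholder labels: the abstract states only the counts
# (9 compilers, 4 optimizations, 6 architectures, 2 inlining flags).
COMPILERS = [f"compiler_{i}" for i in range(9)]
OPT_LEVELS = [f"opt_{i}" for i in range(4)]
ARCHITECTURES = [f"arch_{i}" for i in range(6)]
INLINING_FLAGS = ["inlining_on", "inlining_off"]


def build_settings(inlining_flag):
    """Enumerate every (compiler, optimization, architecture) setting for one flag."""
    return [
        (compiler, opt, arch, inlining_flag)
        for compiler, opt, arch in product(COMPILERS, OPT_LEVELS, ARCHITECTURES)
    ]


# One dataset per inlining flag, each covering the same 9 * 4 * 6 settings.
datasets = {flag: build_settings(flag) for flag in INLINING_FLAGS}

for flag, settings in datasets.items():
    print(flag, len(settings))
```

Under this assumption each flag yields 9 × 4 × 6 = 216 settings, which matches the "two datasets both with 216 combinations" stated in the abstract.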