SimCLF: A Simple Contrastive Learning Framework for Function-level Binary Embeddings (2209.02442v2)
Abstract: Function-level binary code similarity detection is a crucial aspect of cybersecurity. It enables the detection of bugs and patent infringements in released software and plays a pivotal role in preventing supply chain attacks. A practical embedding learning framework relies on the robustness of the assembly code representation and the accuracy of function-pair annotation, which is traditionally accomplished using supervised learning-based frameworks. However, annotating function pairs with accurate labels is difficult, and supervised methods overfit easily and suffer from poor representation robustness. To address these challenges, we propose SimCLF: A Simple Contrastive Learning Framework for Function-level Binary Embeddings. We take an unsupervised learning approach and formulate binary code similarity detection as instance discrimination. SimCLF operates directly on disassembled binary functions and can be implemented with any encoder. It requires no manually annotated information, only augmented data, which is generated using compiler optimization options and code obfuscation techniques. The experimental results demonstrate that SimCLF surpasses the state of the art in accuracy and has a significant advantage in few-shot settings.
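To make the instance-discrimination formulation concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not state the loss, so this assumes an InfoNCE-style objective with in-batch negatives, as is common in contrastive frameworks. The function names, the toy two-dimensional embeddings, and the temperature value are all hypothetical; in practice the embeddings would come from an encoder applied to two variants of the same function (for example, compiled at different optimization levels).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss over a batch (assumed objective, see lead-in).

    anchors[i] and positives[i] are embeddings of two variants of the
    same function; positives[j] with j != i act as negatives for anchors[i].
    The temperature is an illustrative value, not taken from the paper.
    """
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability of picking the matching variant.
        total += log_denominator - logits[i]
    return total / len(anchors)


# Toy embeddings (hypothetical): two functions, two variants each.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # variant i is closest to anchor i
swapped = [[0.1, 0.9], [0.9, 0.1]]   # variant i is closest to the other anchor

loss_aligned = info_nce_loss(anchors, aligned)
loss_swapped = info_nce_loss(anchors, swapped)
```

The loss is small when each function's embedding is closest to its own augmented variant and large when it is closer to a different function, which is the behavior an instance-discrimination objective rewards.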