CLAP: Learning Transferable Binary Code Representations with Natural Language Supervision (2402.16928v1)
Abstract: Binary code representation learning has achieved strong performance in binary analysis tasks. However, existing solutions often transfer poorly, particularly in few-shot and zero-shot scenarios where few or no training samples are available for the task. To address this problem, we present CLAP (Contrastive Language-Assembly Pre-training), which employs natural language supervision to learn better representations of binary code (i.e., assembly code) and improve transferability. At its core, our approach achieves superior transfer learning capabilities by aligning binary code with its semantic explanations (in natural language), resulting in a model that generates better embeddings for binary code. To enable this alignment training, we propose an efficient dataset engine that automatically generates a large and diverse dataset comprising binary code and corresponding natural language explanations. We have generated 195 million pairs of binary code and explanations and trained a prototype of CLAP. Evaluations of CLAP across various downstream binary analysis tasks all demonstrate exceptional performance. Notably, without any task-specific training, CLAP is often competitive with a fully supervised baseline, showing excellent transferability. We release our pre-trained model and code at https://github.com/Hustcw/CLAP.
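The abstract does not spell out the training objective, so the following is only a minimal sketch of what aligning assembly embeddings with explanation embeddings could look like. It assumes a CLIP-style symmetric contrastive (InfoNCE) loss and cosine-similarity matching for zero-shot use. The function names, the temperature of 0.07, and the toy 2-D vectors (stand-ins for real encoder outputs) are all illustrative assumptions, not taken from the CLAP code release.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(asm_embs, text_embs, temperature=0.07):
    """Temperature-scaled cosine similarities; row i compares assembly i with every text."""
    a = [normalize(v) for v in asm_embs]
    t = [normalize(v) for v in text_embs]
    return [[sum(x * y for x, y in zip(ai, tj)) / temperature for tj in t] for ai in a]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(asm_embs, text_embs, temperature=0.07):
    """Average of the assembly-to-text and text-to-assembly losses (assumed CLIP-style)."""
    logits = similarity_matrix(asm_embs, text_embs, temperature)
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))

def zero_shot_classify(asm_emb, label_embs):
    """Return the index of the label embedding most similar to the assembly embedding."""
    scores = similarity_matrix([asm_emb], label_embs, temperature=1.0)[0]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy embeddings: each assembly vector paired with a matching or a mismatched text vector.
asm = [[1.0, 0.0], [0.0, 1.0]]
aligned_text = [[0.9, 0.1], [0.1, 0.9]]
swapped_text = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = symmetric_contrastive_loss(asm, aligned_text)
loss_swapped = symmetric_contrastive_loss(asm, swapped_text)
predicted_label = zero_shot_classify([1.0, 0.2], [[0.0, 1.0], [1.0, 0.0]])
```

Under these assumptions, correctly paired embeddings give a lower loss than swapped pairs, and zero-shot classification reduces to picking the label description whose embedding lies closest to the function's embedding, with no task-specific training step.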
Authors: Hao Wang, Zeyu Gao, Chao Zhang, Zihan Sha, Mingyang Sun, Yuchen Zhou, Wenyu Zhu, Wenju Sun, Han Qiu, Xi Xiao