
TRACED: Execution-aware Pre-training for Source Code (2306.07487v1)

Published 13 Jun 2023 in cs.SE

Abstract: Most existing pre-trained LLMs for source code focus on learning the static code text, typically augmented with static code structures (abstract syntax trees, dependency graphs, etc.). However, program semantics are not fully exposed before actual execution. Without an understanding of program execution, statically pre-trained models fail to comprehensively capture dynamic code properties, such as branch coverage and runtime variable values, and they are consequently less effective at code understanding tasks, such as retrieving semantic clones and detecting software vulnerabilities. To close the gap between the static nature of LLMs and the dynamic characteristics of programs, we introduce TRACED, an execution-aware pre-training strategy for source code. Specifically, we pre-train code LLMs with a combination of source code, executable inputs, and corresponding execution traces. Our goal is to teach code models the complicated execution logic during pre-training, enabling the model to statically estimate dynamic code properties without repeatedly executing code during task-specific fine-tuning. To illustrate the effectiveness of our proposed approach, we fine-tune and evaluate TRACED on three downstream tasks: static execution estimation, clone retrieval, and vulnerability detection. The empirical results show that TRACED relatively improves statically pre-trained code models by 12.4% for complete execution path prediction and by 25.2% for runtime variable value prediction. TRACED also significantly outperforms statically pre-trained models in clone retrieval and vulnerability detection across four public benchmarks.
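The pre-training data described above pairs source code with an executable input and the execution trace it produces. As a minimal sketch of what such a sample might look like, the snippet below logs per-line variable values while running a small function and bundles them with the code and input. The trace format, field names, and `trace_variables` helper are illustrative assumptions, not the paper's actual data layout:

```python
# Hypothetical sketch: build an execution-aware training sample by pairing
# source code with a concrete input and a logged execution trace.
import sys

def trace_variables(func, *args):
    """Record (line_number, local variables) at each executed line of func."""
    trace = []

    def tracer(frame, event, arg):
        # Only log line events inside the target function.
        if event == "line" and frame.f_code is func.__code__:
            trace.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)
    return result, trace

def absolute_sum(xs):
    total = 0
    for x in xs:
        if x < 0:        # which branch executes depends on the concrete input
            x = -x
        total += x
    return total

result, trace = trace_variables(absolute_sum, [3, -4])

# One pre-training sample combines the code text, the input, and the trace,
# so the model can learn to map static code to its dynamic behavior.
sample = {
    "code": "def absolute_sum(xs): ...",   # source text (elided here)
    "input": [3, -4],
    "trace": trace,                        # runtime variable values per line
}
```

The key idea this illustrates is that the trace exposes exactly the dynamic properties the abstract mentions: the branch taken for each element and the evolving runtime values of `total` and `x`.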

Authors (6)
  1. Yangruibo Ding (17 papers)
  2. Ben Steenhoek (1 paper)
  3. Kexin Pei (20 papers)
  4. Gail Kaiser (17 papers)
  5. Wei Le (24 papers)
  6. Baishakhi Ray (88 papers)
Citations (24)