
TransformCode: A Contrastive Learning Framework for Code Embedding via Subtree Transformation (2311.08157v2)

Published 10 Nov 2023 in cs.SE and cs.AI

Abstract: AI has revolutionized software engineering (SE) by enhancing software development efficiency. The advent of pre-trained models (PTMs) leveraging transfer learning has significantly advanced AI for SE. However, existing PTMs that operate on individual code tokens suffer from several limitations: they are costly to train and fine-tune, and they rely heavily on labeled data for fine-tuning on task-specific datasets. In this paper, we present TransformCode, a novel framework that learns code embeddings through contrastive learning. Our framework is encoder-agnostic and language-agnostic: it can leverage any encoder model and handle any programming language. We also propose a novel data-augmentation technique called abstract syntax tree (AST) transformation, which applies syntactic and semantic transformations to the original code snippets to generate more diverse and robust samples for contrastive learning. Our framework has several advantages over existing methods: (1) it is flexible and adaptable, because it can easily be extended to other downstream tasks that require code representation (such as code-clone detection and classification); (2) it is efficient and scalable, because it does not require a large model or a large amount of training data, and it can support any programming language; (3) it is not limited to unsupervised learning, but can also be applied to some supervised learning tasks by incorporating task-specific labels or objectives; and (4) the number of encoder parameters can be adjusted to match the available computing resources. We evaluate our framework on several code-related tasks, and demonstrate its effectiveness and superiority over state-of-the-art methods such as SourcererCC, Code2vec, and InferCode.
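
The abstract pairs AST-based data augmentation with a contrastive objective. The following is a minimal sketch of that general idea only, not the authors' implementation: it assumes a MoCo-style momentum key encoder and an InfoNCE loss (both approaches are cited by the paper), and the transform_ast function is a hypothetical placeholder for the paper's syntactic and semantic AST transformations. The encoder_q, encoder_k, tokenizer, and queue objects are assumed to be supplied by the caller, with encoder_k initialized as a copy of encoder_q.

import torch
import torch.nn.functional as F

def transform_ast(code: str) -> str:
    # Hypothetical augmentation hook: parse `code` into an AST, apply a random
    # semantics-preserving transformation, and unparse back to source text.
    return code  # placeholder

def info_nce(q, k, queue, temperature=0.07):
    # q, k: (batch, dim) L2-normalized embeddings of two views of the same snippet.
    # queue: (queue_size, dim) embeddings of earlier keys, used as negatives.
    pos = torch.einsum("nd,nd->n", q, k).unsqueeze(-1)   # positive logits
    neg = torch.einsum("nd,kd->nk", q, queue)            # negative logits
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)               # positives sit at index 0

def train_step(encoder_q, encoder_k, tokenizer, batch_code, queue, momentum=0.999):
    # Two augmented views of each snippet via the (hypothetical) AST transformation.
    view_q = [transform_ast(c) for c in batch_code]
    view_k = [transform_ast(c) for c in batch_code]
    q = F.normalize(encoder_q(tokenizer(view_q)), dim=-1)
    with torch.no_grad():
        # Momentum update of the key encoder, then encode the key view.
        for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
            p_k.data.mul_(momentum).add_(p_q.data, alpha=1.0 - momentum)
        k = F.normalize(encoder_k(tokenizer(view_k)), dim=-1)
    loss = info_nce(q, k, queue)
    return loss, k.detach()  # caller enqueues k and evicts the oldest keys

Because both views come from semantics-preserving transformations of the same snippet, the loss pulls their embeddings together while pushing them away from queued embeddings of other snippets; any token-level or tree-level encoder can be dropped in, which is what makes such an approach encoder-agnostic.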

References (72)
  1. Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, and M. Zhou, “CodeBERT: A pre-trained model for programming and natural languages,” in Proceedings of the Findings of the Association for Computational Linguistics (EMNLP’20), T. Cohn, Y. He, and Y. Liu, Eds., 2020, pp. 1536–1547.
  2. D. Guo, S. Ren, S. Lu, Z. Feng, D. Tang, S. Liu, L. Zhou, N. Duan, A. Svyatkovskiy, S. Fu, M. Tufano, S. K. Deng, C. B. Clement, D. Drain, N. Sundaresan, J. Yin, D. Jiang, and M. Zhou, “GraphCodeBERT: Pre-training code representations with data flow,” in Proceedings of the 9th International Conference on Learning Representations (ICLR’21), 2021, pp. 1–18.
  3. W. Ahmad, S. Chakraborty, B. Ray, and K.-W. Chang, “Unified pre-training for program understanding and generation,” in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’21), 2021, pp. 2655–2668.
  4. A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS’17), 2017, pp. 6000–6010.
  5. N. D. Q. Bui, H. Le, Y. Wang, J. Li, A. D. Gotmare, and S. C. H. Hoi, “CodeTF: One-stop transformer library for state-of-the-art code LLM,” arXiv, vol. 2306.00029, 2023.
  6. H. Batra, N. S. Punn, S. K. Sonbhadra, and S. Agarwal, “BERT-based sentiment analysis: A software engineering perspective,” in Proceedings of the 32nd International Conference on Database and Expert Systems Applications (DEXA’21), 2021, pp. 138–148.
  7. S. Gao, C. Gao, Y. He, J. Zeng, L. Y. Nie, X. Xia, and M. R. Lyu, “Code structure-guided transformer for source code summarization,” ACM Transactions on Software Engineering and Methodology, vol. 32, no. 1, pp. 1–32, 2023.
  8. W. Ahmad, S. Chakraborty, B. Ray, and K.-W. Chang, “A transformer-based approach for source code summarization,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL’20), 2020, pp. 4998–5007.
  9. Z. Tang, X. Shen, C. Li, J. Ge, L. Huang, Z. Zhu, and B. Luo, “AST-Trans: Code summarization with efficient tree-structured attention,” in Proceedings of the 44th IEEE/ACM International Conference on Software Engineering (ICSE’22), 2022, pp. 150–162.
  10. J. Zhang, X. Wang, H. Zhang, H. Sun, and X. Liu, “Retrieval-based neural source code summarization,” in Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering (ICSE’20), 2020, pp. 1385–1397.
  11. S. Liu, Y. Chen, X. Xie, J. Siow, and Y. Liu, “Retrieval-augmented generation for code summarization via hybrid GNN,” in Proceedings of the 9th International Conference on Learning Representations (ICLR’21), 2021, pp. 1–16.
  12. Y. Wu, D. Zou, S. Dou, W. Yang, D. Xu, and H. Jin, “VulCNN: An image-inspired scalable vulnerability detection system,” in Proceedings of the 44th IEEE/ACM International Conference on Software Engineering (ICSE’22), 2022, pp. 2365–2376.
  13. S. Chakraborty, R. Krishna, Y. Ding, and B. Ray, “Deep learning based vulnerability detection: Are we there yet?” IEEE Transactions on Software Engineering, vol. 48, no. 9, pp. 3280–3296, 2022.
  14. K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” in Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’20), 2020, pp. 9726–9735.
  15. X. Chen, H. Fan, R. B. Girshick, and K. He, “Improved baselines with momentum contrastive learning,” arXiv, vol. 2003.04297, 2020.
  16. Z. Xian, M. Azam, and N. Bouguila, “Statistical modeling using bounded asymmetric Gaussian mixtures: Application to human action and gender recognition,” in Proceedings of the 22nd International Conference on Information Reuse and Integration for Data Science (IRI’21), 2021, pp. 41–48.
  17. Z. Xian, M. Azam, M. Amayri, and N. Bouguila, “Model selection criterion for multivariate bounded asymmetric Gaussian mixture model,” in Proceedings of the 29th European Signal Processing Conference (EUSIPCO’21), 2021, pp. 1436–1440.
  18. C. Fang, Z. Liu, Y. Shi, J. Huang, and Q. Shi, “Functional code clone detection with syntax and semantics fusion learning,” in Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA’20), 2020, pp. 516–527.
  19. N. Mehrotra, N. Agarwal, P. Gupta, S. Anand, D. Lo, and R. Purandare, “Modeling functional similarity in source code with graph-based siamese networks,” IEEE Transactions on Software Engineering, vol. 48, no. 10, pp. 3771–3789, 2022.
  20. C. Liu, Z. Lin, J.-G. Lou, L. Wen, and D. Zhang, “Can neural clone detection generalize to unseen functionalities?” in Proceedings of the 36th IEEE/ACM International Conference on Automated Software Engineering (ASE’21), 2021, pp. 617–629.
  21. M. Hearst, S. Dumais, E. Osuna, J. Platt, and B. Scholkopf, “Support vector machines,” IEEE Intelligent Systems and their Applications, vol. 13, no. 4, pp. 18–28, 1998.
  22. T. Chen and C. Guestrin, “XGBoost: A scalable tree boosting system,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’16), 2016, pp. 785–794.
  23. J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’19), 2019, pp. 4171–4186.
  24. W. Zheng, H. Zhou, M. Li, and J. Wu, “CodeAttention: translating source code to comments by exploiting the code constructs,” Frontiers of Computer Science, vol. 13, no. 3, pp. 565–578, 2019.
  25. L. Mou, G. Li, L. Zhang, T. Wang, and Z. Jin, “Convolutional neural networks over tree structures for programming language processing,” in Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI’16), 2016, pp. 1287–1293.
  26. J. Zhang, X. Wang, H. Zhang, H. Sun, K. Wang, and X. Liu, “A novel neural source code representation based on abstract syntax tree,” in Proceedings of the 41st IEEE/ACM International Conference on Software Engineering (ICSE’19), 2019, pp. 783–794.
  27. U. Alon, M. Zilberstein, O. Levy, and E. Yahav, “Code2vec: Learning distributed representations of code,” Proceedings of the ACM on Programming Languages, vol. 3, no. POPL, pp. 1–29, 2019.
  28. U. Alon, S. Brody, O. Levy, and E. Yahav, “Code2seq: Generating sequences from structured representations of code,” in Proceedings of the 7th International Conference on Learning Representations (ICLR’19), 2019, pp. 1–22.
  29. N. D. Q. Bui, Y. Yu, and L. Jiang, “Treecaps: Tree-based capsule networks for source code processing,” in Proceedings of the 35th AAAI Conference on Artificial Intelligence (AAAI’21), 2021, pp. 30–38.
  30. Y.-F. Ma and M. Li, “The flowing nature matters: Feature learning from the control flow graph of source code for bug localization,” Machine Learning, vol. 111, no. 3, pp. 853–870, 2022.
  31. X. Chen and K. He, “Exploring simple siamese representation learning,” in Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’21), 2021, pp. 15750–15758.
  32. T. Gao, X. Yao, and D. Chen, “SimCSE: Simple contrastive learning of sentence embeddings,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP’21), 2021, pp. 6894–6910.
  33. Z. Wu, Y. Xiong, S. X. Yu, and D. Lin, “Unsupervised feature learning via non-parametric instance discrimination,” in Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’18), 2018, pp. 3733–3742.
  34. M. Ye, X. Zhang, P. C. Yuen, and S.-F. Chang, “Unsupervised embedding learning via invariant and spreading instance feature,” in Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR’19), 2019, pp. 6210–6219.
  35. A. van den Oord, Y. Li, and O. Vinyals, “Representation learning with contrastive predictive coding,” arXiv, vol. 1807.03748, 2018.
  36. Y. Tian, D. Krishnan, and P. Isola, “Contrastive multiview coding,” in Proceedings of the 15th European Conference on Computer Vision (ECCV’20), 2020, pp. 776–794.
  37. T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in Proceedings of the 37th International Conference on Machine Learning (ICML’20), 2020, p. 11.
  38. T. Chen, S. Kornblith, K. Swersky, M. Norouzi, and G. E. Hinton, “Big self-supervised models are strong semi-supervised learners,” in Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS’20), 2020, pp. 22243–22255.
  39. J.-B. Grill, F. Strub, F. Altché, C. Tallec, P. Richemond, E. Buchatskaya, C. Doersch, B. Avila Pires, Z. Guo, M. Gheshlaghi Azar et al., “Bootstrap your own latent: A new approach to self-supervised learning,” in Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS’20), 2020, pp. 21271–21284.
  40. X. Wang, Q. Wu, H. Zhang, C. Lyu, X. Jiang, Z. Zheng, L. Lyu, and S. Hu, “HELoC: Hierarchical contrastive learning of source code representation,” in Proceedings of the 30th IEEE/ACM International Conference on Program Comprehension (ICPC’22), 2022, pp. 354–365.
  41. Z. Chen and M. Martin, “A literature study of embeddings on source code,” arXiv, vol. 1904.03061, 2019.
  42. N. D. Q. Bui, Y. Yu, and L. Jiang, “InferCode: Self-supervised learning of code representations by predicting subtrees,” in Proceedings of the 43rd IEEE/ACM International Conference on Software Engineering (ICSE’21), 2021, pp. 1186–1197.
  43. P. Jain, A. Jain, T. Zhang, P. Abbeel, J. Gonzalez, and I. Stoica, “Contrastive code representation learning,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP’21), 2021, pp. 5954–5971.
  44. E. H. Friend, “Sorting on electronic computer systems,” J. ACM, vol. 3, no. 3, pp. 134–168, 1956.
  45. J. Vig, “A multiscale visualization of attention in the transformer model,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, 2019, pp. 37–42.
  46. P. Shaw, J. Uszkoreit, and A. Vaswani, “Self-attention with relative position representations,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT’18), 2018, pp. 464–468.
  47. E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in Proceedings of the 10th International Conference on Learning Representations (ICLR’22), 2022, pp. 1–26.
  48. R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL’16), 2016, pp. 1715–1725.
  49. M. Schuster and K. Nakajima, “Japanese and Korean voice search,” in Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP’12), 2012, pp. 5149–5152.
  50. T. Kudo, “Subword regularization: Improving neural network translation models with multiple subword candidates,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (ACL’18), 2018, pp. 66–75.
  51. Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A robustly optimized BERT pretraining approach,” arXiv, vol. 1907.11692, 2019.
  52. L. A. Jeni, J. F. Cohn, and F. De La Torre, “Facing imbalanced data–recommendations for the use of performance metrics,” in Proceedings of the Humaine Association Conference on Affective Computing and Intelligent Interaction (ACII’13), 2013, pp. 245–251.
  53. J. Svajlenko, J. F. Islam, I. Keivanloo, C. K. Roy, and M. M. Mia, “Towards a big data curated benchmark of inter-project code clones,” in Proceedings of the 30th IEEE International Conference on Software Maintenance and Evolution (ICSME’14), 2014, pp. 476–480.
  54. W. Wang, G. Li, B. Ma, X. Xia, and Z. Jin, “Detecting code clones with graph neural network and flow-augmented abstract syntax tree,” in Proceedings of the 27th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER’20), 2020, pp. 261–271.
  55. V. Saini, F. Farmahinifarahani, Y. Lu, P. Baldi, and C. V. Lopes, “Oreo: Detection of clones in the twilight zone,” in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE’18), 2018, pp. 354–365.
  56. A. Sheneamer, “CCDLC detection framework-combining clustering with deep learning classification for semantic clones,” in Proceedings of the 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA’18), 2018, pp. 701–706.
  57. A. Sheneamer, H. Hazazi, S. Roy, and J. Kalita, “Schemes for labeling semantic code clones using machine learning,” in Proceedings of the 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA’17), 2017, pp. 981–985.
  58. M. Tufano, C. Watson, G. Bavota, M. Di Penta, M. White, and D. Poshyvanyk, “Deep learning similarities from different representations of source code,” in Proceedings of the 15th International Conference on Mining Software Repositories (MSR’18), 2018, pp. 542–553.
  59. L. Jiang, G. Misherghi, Z. Su, and S. Glondu, “DECKARD: Scalable and accurate tree-based detection of code clones,” in Proceedings of the 29th International Conference on Software Engineering (ICSE’07), 2007, pp. 96–105.
  60. H. Sajnani, V. Saini, J. Svajlenko, C. K. Roy, and C. V. Lopes, “SourcererCC: Scaling code clone detection to big-code,” in Proceedings of the IEEE/ACM 38th International Conference on Software Engineering (ICSE’16), 2016, pp. 1157–1168.
  61. M. White, M. Tufano, C. Vendome, and D. Poshyvanyk, “Deep learning code fragments for code clone detection,” in Proceedings of the 31st IEEE/ACM International Conference on Automated Software Engineering (ASE’16), 2016, pp. 87–98.
  62. H. J. Kang, T. F. Bissyandé, and D. Lo, “Assessing the generalizability of code2vec token embeddings,” in Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering (ASE’19), 2019, pp. 1–12.
  63. L. van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. 86, pp. 2579–2605, 2008.
  64. J. D. Hunter, “Matplotlib: A 2d graphics environment,” Computing in Science & Engineering, vol. 9, no. 3, pp. 90–95, 2007.
  65. Y. Kim, “Convolutional neural networks for sentence classification,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP’14), 2014, pp. 1746–1751.
  66. W. Zaremba and I. Sutskever, “Learning to execute,” arXiv, vol. 1410.4615, 2015.
  67. X. Huo and M. Li, “Enhancing the unified features to locate buggy files by exploiting the sequential nature of source code,” in Proceedings of the 26th International Joint Conference on Artificial Intelligence (IJCAI’17), 2017, pp. 1909–1915.
  68. M. Allamanis, M. Brockschmidt, and M. Khademi, “Learning to represent programs with graphs,” arXiv, vol. 1711.00740, 2017.
  69. M. Tufano, C. Watson, G. Bavota, M. D. Penta, M. White, and D. Poshyvanyk, “Deep learning similarities from different representations of source code,” in Proceedings of the 15th International Conference on Mining Software Repositories (MSR’18), 2018, pp. 542–553.
  70. M. Ou, P. Cui, J. Pei, Z. Zhang, and W. Zhu, “Asymmetric transitivity preserving graph embedding,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’16), 2016, pp. 1105–1114.
  71. Y. Li, R. Zemel, M. Brockschmidt, and D. Tarlow, “Gated graph sequence neural networks,” in Proceedings of the 4th International Conference on Learning Representations (ICLR’16), 2016.
  72. I. Cho, D. Towey, and P. Kar, “Using obfuscators to test compilers: A metamorphic experience,” in Proceedings of the IEEE 47th Annual Computers, Software, and Applications Conference (COMPSAC’23), 2023, pp. 1786–1791.
Authors (5)
  1. Zixiang Xian (2 papers)
  2. Rubing Huang (17 papers)
  3. Dave Towey (19 papers)
  4. Chunrong Fang (71 papers)
  5. Zhenyu Chen (91 papers)
Citations (2)