
Code Representation Learning At Scale (2402.01935v1)

Published 2 Feb 2024 in cs.CL

Abstract: Recent studies have shown that code language models at scale demonstrate significant performance gains on downstream tasks, i.e., code generation. However, most of the existing works on code representation learning train models at a hundred million parameter scale using very limited pretraining corpora. In this work, we fuel code representation learning with a vast amount of code data via a two-stage pretraining scheme. We first train the encoders via a mix that leverages both randomness in masked language modeling and the structure aspect of programming language. We then enhance the representations via contrastive learning with hard negatives and hard positives constructed in an unsupervised manner. We establish an off-the-shelf encoder model that persistently outperforms the existing models on a wide variety of downstream tasks by large margins. To comprehend the factors contributing to successful code representation learning, we conduct detailed ablations and share our findings on (i) a customized and effective token-level denoising scheme for source code; (ii) the importance of hard negatives and hard positives; (iii) how the proposed bimodal contrastive learning boosts the cross-lingual semantic search performance; and (iv) how the pretraining schemes decide how the downstream task performance scales with the model size.


Summary

  • The paper presents CodeSage, a two-stage pretraining scheme that combines identifier deobfuscation with a modified masked language modeling objective to enhance code representations.
  • It employs contrastive learning with strategically chosen hard positives and negatives, leading to superior performance on semantic search and classification tasks.
  • The study demonstrates that tailored token-level denoising and structured training approaches significantly improve model efficiency, offering insights for scaling AI-driven code tools.

Introduction

Artificial intelligence's capacity to comprehend and generate code has progressed markedly with the advent of LLMs. Trained on vast quantities of source code, these models have transformed automated code generation. Despite such advances, most existing models for code representation learning remain at relatively small parameter scales and rely on limited pretraining corpora. Addressing these gaps, the paper presents CodeSage, a two-stage pretraining scheme that combines randomized masked language modeling, structure-aware objectives for programming languages, and contrastive learning with hard negative and hard positive pairs constructed in an unsupervised manner.

Methodology

Central to CodeSage is a two-stage pretraining process aimed at refining code representations. In the first stage, encoders are trained with an identifier deobfuscation (DOBF) objective and a modified masked language modeling (MLM) objective that deviates from the standard 80-10-10 masking heuristic, which suits natural language better than code: randomly replacing tokens can break code structure and semantics. A sketch of such a full-mask corruption scheme is given below. The second stage applies contrastive learning, not over naturally occurring pairs, but over strategically constructed hard positives and hard negatives, so that the model cannot take shortcuts and must learn robust representations.
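To make the masking discussion concrete, here is a minimal sketch of a "full mask" corruption step that replaces selected tokens with [MASK] only, in contrast to the 80-10-10 heuristic. The 15% masking rate, the tensor layout, and the special-token handling are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of full-mask token-level denoising (assumed 15% mask rate).
# Unlike the 80-10-10 scheme, selected positions are never left unchanged or
# swapped with random tokens, so code structure is not silently corrupted.
import torch

def full_mask_corruption(input_ids: torch.Tensor,
                         mask_token_id: int,
                         mask_rate: float = 0.15,
                         special_ids: frozenset = frozenset()):
    """Return (corrupted_ids, labels) for an MLM step over a batch of token ids."""
    labels = input_ids.clone()
    # Only ordinary tokens are eligible for masking (skip special tokens).
    eligible = torch.tensor(
        [[tok not in special_ids for tok in seq] for seq in input_ids.tolist()]
    )
    masked = (torch.rand(input_ids.shape) < mask_rate) & eligible
    corrupted = input_ids.clone()
    corrupted[masked] = mask_token_id   # every selected position becomes [MASK]
    labels[~masked] = -100              # compute the MLM loss only on masked positions
    return corrupted, labels
```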

The paper reports results for CodeSage models at three sizes: small (130M), base (356M), and large (1.3B parameters). Across a wide variety of downstream tasks, these models consistently outperform existing large-scale baselines.

Results and Analysis

The experiments show CodeSage models outperforming baselines by significant margins on semantic search and classification tasks. CodeSage-large is particularly noteworthy, setting new benchmarks on multilingual in-language and cross-language code search and showing strong results on the NL2Code search task.

Detailed ablations suggest that token-level denoising schemes customized for source code improve learning efficiency. They show that the standard 80-10-10 masking strategy leads to suboptimal outcomes, supporting full-mask approaches for code representation learning. Furthermore, hard positives and hard negatives are shown to be crucial to the success of contrastive learning (see the sketch after this paragraph), with larger models benefiting more from such challenging learning objectives.
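The following sketch illustrates one way a contrastive objective can emphasize hard negatives: an InfoNCE-style loss over (anchor, positive) embedding pairs in which more similar in-batch negatives receive larger weight. The temperature value and the similarity-based re-weighting are assumptions for illustration; the paper's exact formulation may differ.

```python
# Illustrative contrastive loss with similarity-weighted ("hard") in-batch negatives.
# Assumes a batch size greater than one; temperature 0.05 is an arbitrary choice.
import torch
import torch.nn.functional as F

def contrastive_loss_with_hard_negatives(anchor: torch.Tensor,
                                         positive: torch.Tensor,
                                         temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE-style loss over a batch of (anchor, positive) embedding pairs."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    sim = anchor @ positive.T / temperature                     # (B, B) scaled cosine similarities
    batch_size = sim.size(0)
    diag = torch.eye(batch_size, dtype=torch.bool, device=sim.device)

    # Up-weight hard (high-similarity) negatives in the denominator; with uniform
    # weights this reduces to the standard in-batch InfoNCE loss.
    neg_weights = torch.softmax(sim.masked_fill(diag, float("-inf")), dim=-1)
    exp_sim = torch.exp(sim)
    pos = exp_sim.diagonal()
    neg = (neg_weights * exp_sim.masked_fill(diag, 0.0)).sum(dim=-1) * (batch_size - 1)
    return -torch.log(pos / (pos + neg)).mean()
```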

Implications and Future Work

The comprehensive study of CodeSage presented in the paper emphasizes the combination of vast pretraining data, diversified learning objectives, and nuanced training strategies to advance code representation learning. It identifies the components critical to the success of such models and examines how their benefits scale with model size. These findings are significant for the development of general-purpose programming language (PL) embedding models, potentially shaping the trajectory of future AI-driven code generation tools.

With the public release of the related code and model artifacts on GitHub and Hugging Face, CodeSage holds the potential to catalyze community-led improvements and innovations in generative AI for code. This work substantially contributes to our understanding of code representation learning and may guide researchers and practitioners in building even more powerful AI-driven coding assistants.
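As a hedged usage sketch of how such a released encoder could be applied to embedding-based code search: the checkpoint name, pooling strategy, and loading flags below are assumptions for illustration only; consult the authors' GitHub and Hugging Face release for the actual model ids and recommended usage.

```python
# Hedged sketch: semantic code search with a released encoder checkpoint.
# MODEL_ID is an assumed placeholder; verify the real checkpoint name and any
# model-specific loading requirements against the official release.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "codesage/codesage-small"  # assumed checkpoint id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)

def embed(texts):
    """Return L2-normalized embeddings for code or natural-language strings."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state            # (B, T, H)
    # Mean-pool over non-padding tokens (pooling strategy assumed, not prescribed).
    mask = batch["attention_mask"].unsqueeze(-1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return F.normalize(pooled, dim=-1)

query = embed(["parse a JSON file and return a dict"])
corpus = embed([
    "def load(path):\n    import json\n    return json.load(open(path))",
    "def add(a, b):\n    return a + b",
])
print(query @ corpus.T)  # cosine similarities; the highest score is the best match
```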