
Code Representation Learning At Scale (2402.01935v1)

Published 2 Feb 2024 in cs.CL

Abstract: Recent studies have shown that code language models at scale demonstrate significant performance gains on downstream tasks, i.e., code generation. However, most of the existing works on code representation learning train models at a hundred million parameter scale using very limited pretraining corpora. In this work, we fuel code representation learning with a vast amount of code data via a two-stage pretraining scheme. We first train the encoders via a mix that leverages both randomness in masked language modeling and the structure aspect of programming language. We then enhance the representations via contrastive learning with hard negatives and hard positives constructed in an unsupervised manner. We establish an off-the-shelf encoder model that persistently outperforms the existing models on a wide variety of downstream tasks by large margins. To comprehend the factors contributing to successful code representation learning, we conduct detailed ablations and share our findings on (i) a customized and effective token-level denoising scheme for source code; (ii) the importance of hard negatives and hard positives; (iii) how the proposed bimodal contrastive learning boosts the cross-lingual semantic search performance; and (iv) how the pretraining schemes decide how the downstream task performance scales with the model size.


Summary

  • The paper presents CodeSage, a two-stage pretraining scheme that combines identifier deobfuscation with a modified masked language modeling objective to enhance code representations.
  • It employs contrastive learning with strategically chosen hard positives and negatives, leading to superior performance on semantic search and classification tasks.
  • The study demonstrates that tailored token-level denoising and structured training approaches significantly improve model efficiency, offering insights for scaling AI-driven code tools.

Introduction

Artificial intelligence's capacity to comprehend and generate code has progressed markedly with the advent of LLMs. Trained on vast quantities of source code, these models have transformed automated code generation. Despite such advances, most existing models for code representation learning remain at relatively small parameter scales and rely on limited pretraining corpora. Addressing these gaps, the paper presents CodeSage, a two-stage pretraining scheme that combines randomized masked language modeling, structure-aware objectives for programming languages, and contrastive learning with hard negative and hard positive pairs constructed in an unsupervised manner.

Methodology

Central to CodeSage is a two-stage pretraining process aimed at refining code representations. In the first stage, encoders are trained with an identifier deobfuscation (DOBF) objective and a modified masked language modeling (MLM) objective that deviates from the standard 80-10-10 masking heuristic, which suits natural language better than code: randomly replacing tokens can break code structure and semantics. A sketch of such a full-mask corruption scheme is given below. The second stage applies contrastive learning, not over naturally occurring pairs, but over strategically constructed hard positives and hard negatives, so that the model cannot take shortcuts and must learn robust representations.
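To make the masking discussion concrete, here is a minimal sketch of a "full mask" corruption step that replaces selected tokens with [MASK] only, in contrast to the 80-10-10 heuristic. The 15% masking rate, the tensor layout, and the special-token handling are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of full-mask token-level denoising (assumed 15% mask rate).
# Unlike the 80-10-10 scheme, selected positions are never left unchanged or
# swapped with random tokens, so code structure is not silently corrupted.
import torch

def full_mask_corruption(input_ids: torch.Tensor,
                         mask_token_id: int,
                         mask_rate: float = 0.15,
                         special_ids: frozenset = frozenset()):
    """Return (corrupted_ids, labels) for an MLM step over a batch of token ids."""
    labels = input_ids.clone()
    # Only ordinary tokens are eligible for masking (skip special tokens).
    eligible = torch.tensor(
        [[tok not in special_ids for tok in seq] for seq in input_ids.tolist()]
    )
    masked = (torch.rand(input_ids.shape) < mask_rate) & eligible
    corrupted = input_ids.clone()
    corrupted[masked] = mask_token_id   # every selected position becomes [MASK]
    labels[~masked] = -100              # compute the MLM loss only on masked positions
    return corrupted, labels
```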

The paper reports results for CodeSage models at three sizes: small (130M), base (356M), and large (1.3B parameters). Across a wide variety of downstream tasks, these models consistently outperform existing large-scale baselines.

Results and Analysis

The experiments show CodeSage models outperforming baselines by significant margins on semantic search and classification tasks. CodeSage-large is particularly noteworthy, setting new benchmarks on multilingual in-language and cross-language code search and showing strong results on the NL2Code search task.

Detailed ablations suggest that token-level denoising schemes customized for source code improve learning efficiency. They show that the standard 80-10-10 masking strategy leads to suboptimal outcomes, supporting full-mask approaches for code representation learning. Furthermore, hard positives and hard negatives are shown to be crucial to the success of contrastive learning (see the sketch after this paragraph), with larger models benefiting more from such challenging learning objectives.
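The following sketch illustrates one way a contrastive objective can emphasize hard negatives: an InfoNCE-style loss over (anchor, positive) embedding pairs in which more similar in-batch negatives receive larger weight. The temperature value and the similarity-based re-weighting are assumptions for illustration; the paper's exact formulation may differ.

```python
# Illustrative contrastive loss with similarity-weighted ("hard") in-batch negatives.
# Assumes a batch size greater than one; temperature 0.05 is an arbitrary choice.
import torch
import torch.nn.functional as F

def contrastive_loss_with_hard_negatives(anchor: torch.Tensor,
                                         positive: torch.Tensor,
                                         temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE-style loss over a batch of (anchor, positive) embedding pairs."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    sim = anchor @ positive.T / temperature                     # (B, B) scaled cosine similarities
    batch_size = sim.size(0)
    diag = torch.eye(batch_size, dtype=torch.bool, device=sim.device)

    # Up-weight hard (high-similarity) negatives in the denominator; with uniform
    # weights this reduces to the standard in-batch InfoNCE loss.
    neg_weights = torch.softmax(sim.masked_fill(diag, float("-inf")), dim=-1)
    exp_sim = torch.exp(sim)
    pos = exp_sim.diagonal()
    neg = (neg_weights * exp_sim.masked_fill(diag, 0.0)).sum(dim=-1) * (batch_size - 1)
    return -torch.log(pos / (pos + neg)).mean()
```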

Implications and Future Work

The comprehensive study of CodeSage presented in the paper emphasizes the combination of vast pretraining data, diversified learning objectives, and nuanced training strategies to advance code representation learning. It identifies the components critical to the success of such models and examines how their benefits scale with model size. These findings are significant for the development of general-purpose programming language (PL) embedding models, potentially shaping the trajectory of future AI-driven code generation tools.

With the public release of the related code and model artifacts on GitHub and Hugging Face, CodeSage holds the potential to catalyze community-led improvements and innovations in generative AI for code. This work substantially contributes to our understanding of code representation learning and may guide researchers and practitioners in building even more powerful AI-driven coding assistants.
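As a hedged usage sketch of how such a released encoder could be applied to embedding-based code search: the checkpoint name, pooling strategy, and loading flags below are assumptions for illustration only; consult the authors' GitHub and Hugging Face release for the actual model ids and recommended usage.

```python
# Hedged sketch: semantic code search with a released encoder checkpoint.
# MODEL_ID is an assumed placeholder; verify the real checkpoint name and any
# model-specific loading requirements against the official release.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "codesage/codesage-small"  # assumed checkpoint id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)

def embed(texts):
    """Return L2-normalized embeddings for code or natural-language strings."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state            # (B, T, H)
    # Mean-pool over non-padding tokens (pooling strategy assumed, not prescribed).
    mask = batch["attention_mask"].unsqueeze(-1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return F.normalize(pooled, dim=-1)

query = embed(["parse a JSON file and return a dict"])
corpus = embed([
    "def load(path):\n    import json\n    return json.load(open(path))",
    "def add(a, b):\n    return a + b",
])
print(query @ corpus.T)  # cosine similarities; the highest score is the best match
```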