Toward Exploring the Code Understanding Capabilities of Pre-trained Code Generation Models (2406.12326v1)

Published 18 Jun 2024 in cs.SE and cs.AI

Abstract: Recently, large code generation models trained in a self-supervised manner on extensive unlabeled programming language data have achieved remarkable success. While these models acquire vast amounts of code knowledge, they perform poorly on code understanding tasks, such as code search and clone detection, because they are trained specifically for generation. Pre-training a larger encoder-only model from scratch on massive code data can improve understanding performance. However, this approach is costly and time-consuming, making it suboptimal. In this paper, we pioneer the transfer of knowledge from pre-trained code generation models to code understanding tasks, significantly reducing training costs. We examine effective strategies for enabling decoder-only models to acquire robust code representations. Furthermore, we introduce CL4D, a contrastive learning method designed to enhance the representation capabilities of decoder-only models. Comprehensive experiments demonstrate that our approach achieves state-of-the-art performance on understanding tasks such as code search and clone detection. Our analysis shows that our method effectively reduces the distance between semantically identical samples in the representation space. These findings suggest the potential for unifying code understanding and generation tasks in a single decoder-only model.
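The abstract does not spell out the CL4D objective, so the following is only a minimal sketch of a generic in-batch contrastive (InfoNCE-style) loss, not the authors' formulation. It illustrates what "reducing the distance between semantically identical samples" can mean in practice. The function names, the temperature value, and the toy 2-D embeddings are all assumptions made for illustration; in a real setup the vectors would come from a decoder-only model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, codes, temperature=0.05):
    """Mean InfoNCE loss over a batch (hypothetical helper, not CL4D itself).

    codes[i] is treated as the positive for queries[i]; every other
    code in the batch acts as an in-batch negative.
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, c) / temperature for c in codes]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(x - m) for x in logits))
        # Negative log-probability of picking the positive.
        total += log_denom - logits[i]
    return total / len(queries)


# Toy embeddings (assumed values, purely illustrative).
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # positives sit close to their queries
swapped = [[0.1, 0.9], [0.9, 0.1]]   # positives sit far from their queries

loss_aligned = info_nce_loss(queries, aligned)
loss_swapped = info_nce_loss(queries, swapped)
```

In this sketch the loss is small when each query lies close to its paired code and far from the others, and large when the pairing is mismatched, which is the behaviour a contrastive objective rewards during training.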

Authors (5)

Jiayi Lin
Yutao Xie
Yue Yu
Yibiao Yang
Lei Zhang
