Better Language Models of Code through Self-Improvement (2304.01228v2)

Published 2 Apr 2023 in cs.CL and cs.AI

Abstract: Pre-trained language models for code (PLMCs) have gained attention in recent research. These models are pre-trained on large-scale datasets using multi-modal objectives. However, fine-tuning them requires extensive supervision and is limited by the size of the dataset provided. We aim to address this limitation by proposing a simple data augmentation framework. Our framework utilizes knowledge gained during the pre-training and fine-tuning stages to generate pseudo data, which is then used as training data for the next step. We incorporate this framework into state-of-the-art language models such as CodeT5, CodeBERT, and UnixCoder. The results show that our framework significantly improves PLMCs' performance on code-related sequence generation tasks, such as code summarization and code generation, in the CodeXGLUE benchmark.
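
The data flow described in the abstract (fine-tune, generate pseudo data, train on it in the next step) can be sketched roughly as follows. This is a toy illustration only, not the authors' implementation: `fine_tune`, `make_pseudo_data`, and `self_improve` are hypothetical names, the "model" is a lookup table standing in for a PLMC such as CodeT5, and the choice to train the next step on pseudo data alone is an assumption that the abstract does not settle.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]        # (source, target), e.g. (code, summary)
Model = Callable[[str], str]  # maps a source sequence to a generated target


def fine_tune(data: List[Pair]) -> Model:
    """Toy stand-in for fine-tuning: memorize the last target seen per source."""
    table: Dict[str, str] = {}
    for source, target in data:
        table[source] = target

    def model(source: str) -> str:
        # Unknown inputs yield an empty prediction in this toy.
        return table.get(source, "")

    return model


def make_pseudo_data(model: Model, data: List[Pair]) -> List[Pair]:
    """Pair every training input with the model's own prediction for it."""
    return [(source, model(source)) for source, _ in data]


def self_improve(data: List[Pair]) -> Model:
    """Fine-tune, generate pseudo data, then train the next step on it.

    Assumption: the next step uses the pseudo data alone; a real setup
    might filter or combine it with the original data.
    """
    first_stage = fine_tune(data)
    pseudo = make_pseudo_data(first_stage, data)
    return fine_tune(pseudo)
```

A call such as `self_improve([("def add(a, b): return a + b", "Add two numbers.")])` returns a model trained on its predecessor's outputs. With a real PLMC the generation step would involve decoding from the fine-tuned network, which this toy does not attempt to model.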

Authors (4)

Hung Quoc To
Nghi D. Q. Bui
Jin Guo
Tien N. Nguyen

Citations (13)
