
Large Language Models are Qualified Benchmark Builders: Rebuilding Pre-Training Datasets for Advancing Code Intelligence Tasks (2504.19444v1)

Published 28 Apr 2025 in cs.SE and cs.CL

Abstract: Pre-trained code models rely heavily on high-quality pre-training data, particularly human-written reference comments that bridge code and natural language. However, these comments often become outdated as software evolves, degrading model performance. LLMs excel at generating high-quality code comments. We investigate whether replacing human-written comments with LLM-generated ones improves pre-training datasets. Since standard metrics cannot assess reference comment quality, we propose two novel reference-free evaluation tasks: code-comment inconsistency detection and semantic code search. Results show that LLM-generated comments are more semantically consistent with code than human-written ones, as confirmed by manual evaluation. Leveraging this finding, we rebuild the CodeSearchNet dataset with LLM-generated comments and re-pre-train CodeT5. Evaluations demonstrate that models trained on LLM-enhanced data outperform those using original human comments in code summarization, generation, and translation tasks. This work validates rebuilding pre-training datasets with LLMs to advance code intelligence, challenging the traditional reliance on human reference comments.

Summary

LLMs as Benchmark Builders for Code Intelligence

The paper explores using LLMs to improve the quality of the pre-training datasets that code intelligence tasks depend on, focusing on replacing human-written code comments with LLM-generated ones. It examines the impact of comment quality on model performance and introduces reference-free metrics for evaluating code-comment consistency.

Evaluation of Code Comments

The effectiveness of code intelligence models depends significantly on the quality of the comments used in pre-training. Comments serve as a crucial interface between programming languages (PLs) and natural languages (NLs), aiding contextual understanding. However, as software evolves, human-written comments often drift out of sync with the code, reducing model efficacy. The paper therefore evaluates whether comments generated by LLMs, which excel at natural language generation, can outperform traditional human-written ones.

Methodology

To enable a robust comparison, the paper considers both reference-based and reference-free evaluation. Traditional metrics such as BLEU, ROUGE, and METEOR measure surface-level similarity to a reference, but they cannot assess the quality of the references themselves and therefore miss the semantic alignment between code and comments. Instead, the paper proposes two reference-free approaches: code-comment inconsistency detection and semantic code search.

  • Code-Comment Inconsistency Detection (CCID): This metric evaluates the semantic coherence between code and its comments, highlighting discrepancies that might confuse developers.
  • Semantic Code Search: Here, comments are used as queries to retrieve the corresponding code snippet, with retrieval effectiveness serving as an indicator of comment quality; a minimal sketch of this evaluation follows below.
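
To make the semantic code search evaluation concrete, here is a minimal sketch assuming a standard embedding-retrieval setup: each comment is embedded as a query, the paired code snippets form the candidate pool, and the mean reciprocal rank (MRR) of the ground-truth snippet serves as the quality signal. The sentence-transformers library and the specific embedding model are illustrative choices, not necessarily the paper's configuration.

```python
# Sketch: score a set of comments by how well they retrieve their own code.
from sentence_transformers import SentenceTransformer, util

def mean_reciprocal_rank(comments, codes,
                         model_name="sentence-transformers/all-MiniLM-L6-v2"):
    """comments[i] is assumed to describe codes[i]; returns MRR over the pool."""
    model = SentenceTransformer(model_name)          # illustrative model choice
    query_emb = model.encode(comments, convert_to_tensor=True)
    code_emb = model.encode(codes, convert_to_tensor=True)
    scores = util.cos_sim(query_emb, code_emb)       # shape: (n_comments, n_codes)
    reciprocal_ranks = []
    for i in range(len(comments)):
        # 1-based rank of the ground-truth snippet among all candidates.
        rank = int((scores[i] > scores[i, i]).sum().item()) + 1
        reciprocal_ranks.append(1.0 / rank)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

comments = ["Return the sum of two integers.", "Reverse a string."]
codes = ["def add(a, b):\n    return a + b", "def rev(s):\n    return s[::-1]"]
print(mean_reciprocal_rank(comments, codes))
```

Under this reading, a comment set that yields a higher MRR describes its code more faithfully, which is exactly the reference-free signal the paper relies on.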

The evaluation reveals that LLM-generated comments demonstrate a lower semantic inconsistency rate and higher utility in code search tasks compared to human references.

Impact on Code Intelligence Tasks

The research further explores the practical implications of using LLM-enhanced datasets for pre-training code models, specifically CodeT5, using the CodeSearchNet dataset rebuilt with LLM-generated comments.
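
The rebuild step itself can be pictured as a single pass over the corpus, as in the sketch below. The `generate_comment` hook is a hypothetical placeholder for whatever LLM actually produces the comments, and the JSONL field names ("code", "docstring") follow the common CodeSearchNet layout; the paper's exact pipeline may differ.

```python
# Sketch: replace human-written docstrings with LLM-generated comments
# in CodeSearchNet-style JSONL files.
import json

def generate_comment(code: str) -> str:
    """Hypothetical hook: call an LLM to summarize `code` in natural language."""
    raise NotImplementedError("plug in an LLM client here")

def rebuild_dataset(in_path: str, out_path: str) -> None:
    with open(in_path) as fin, open(out_path, "w") as fout:
        for line in fin:
            record = json.loads(line)
            # Keep the code unchanged; swap only the natural-language side.
            record["docstring"] = generate_comment(record["code"])
            fout.write(json.dumps(record) + "\n")

# Example usage (paths are illustrative):
# rebuild_dataset("python_train_0.jsonl", "python_train_0.llm.jsonl")
```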

Experimental Findings

Pre-trained models using these rebuilt datasets exhibited better performance across several code intelligence tasks, including:

  • Code Summarization: Models trained with LLM-enhanced data significantly outperformed those using original human reference comments in terms of semantic representation.
  • Code Generation: Higher-quality natural language descriptions improved the generation quality and accuracy of the models, with notable gains in metrics like BLEU and exact match (see the sketch after this list).
  • Code Translation: LLM-generated comments contributed to superior cross-language code generation.
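
For context, here is a minimal sketch of how the two generation metrics mentioned above are commonly computed: corpus-level exact match and averaged sentence-level BLEU. The whitespace tokenization and smoothing choices here are assumptions for illustration; the paper's scoring setup (e.g., tokenizer or CodeBLEU variants) may differ.

```python
# Sketch: exact match and averaged sentence-level BLEU for generated code.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def exact_match(preds, refs):
    """Fraction of predictions identical to the reference after whitespace normalization."""
    hits = sum(" ".join(p.split()) == " ".join(r.split()) for p, r in zip(preds, refs))
    return hits / len(refs)

def avg_bleu(preds, refs):
    """Average sentence-level BLEU over whitespace-tokenized code strings."""
    smooth = SmoothingFunction().method1
    scores = [
        sentence_bleu([r.split()], p.split(), smoothing_function=smooth)
        for p, r in zip(preds, refs)
    ]
    return sum(scores) / len(scores)

preds = ["def add(a, b): return a + b"]
refs = ["def add(a, b): return a + b"]
print(exact_match(preds, refs), avg_bleu(preds, refs))
```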

Interestingly, tasks predominantly rooted in structural code understanding, such as code refinement and clone detection, showed minimal improvement, emphasizing the importance of comment quality in tasks requiring NL understanding.

Practical and Theoretical Implications

The paper suggests reevaluating the reliance on human-written comments for code intelligence tasks and advocates a shift toward LLM-generated alternatives, given their stronger consistency and semantic alignment with code. This opens a path for dataset construction in which LLMs systematically improve the quality of training data, with potential downstream gains in software development and program understanding.

Speculation on Future Developments

Looking forward, integrating LLM-generated comments into broader software development processes could enhance both automated and human-centric aspects of code management. It points toward a future in which automated tools maintain high levels of code quality and documentation with less need for extensive human oversight.

In summary, the research validates the potential of LLMs as benchmark builders that can substantially improve the quality of pre-training datasets for code intelligence tasks, urging a reconsideration of the field's reliance on human reference comments and offering a pathway for using LLMs to enhance code-related tasks.