- The paper presents a method for generating commit messages using CodeBERT, trained on a curated dataset of 345K code-modification and commit-message pairs.
- It restricts the encoder input to the changed lines of code rather than the full diff, improving summarization performance as measured by BLEU-4.
- The research offers actionable insights for automating documentation in software development and reducing developer cognitive load.
CommitBERT: Commit Message Generation Using Pre-Trained Programming Language Model
The paper introduces an approach to automating commit message generation using a pre-trained programming language model, specifically CodeBERT, to bridge the contextual gap between programming and natural language. The core contributions are a dataset of 345K pairs of code modifications and corresponding commit messages spanning six programming languages, and the use of CodeBERT initialization to improve model performance on commit message generation.
Key Contributions and Methodologies
- Dataset Collection and Processing: The authors systematically collected and curated a dataset of code modifications paired with commit messages from GitHub repositories across six languages (Python, PHP, Go, Java, JavaScript, and Ruby). The dataset explicitly distinguishes added from deleted code, improving the contextual signal available for accurate commit message generation (see the diff-processing sketch after this list).
- Utilization of CodeBERT: The research employs CodeBERT as the foundational model for the challenging task of generating coherent commit messages, capitalizing on its pre-training over paired code and natural language to capture the nuanced relationship between code modifications and their summaries (a model-initialization sketch follows the list).
- Input Optimization: By feeding the encoder only the changed lines rather than the complete diff, the authors shorten the input and improve summarization performance. This selective input ensures the model is not burdened with unrelated context and can focus on the relevant modifications.
- Performance Evaluation: The results demonstrate that initializing with CodeBERT, particularly when it is first further pre-trained on a Code-to-NL task, yields higher BLEU-4 scores and lower perplexity across the evaluated programming languages than random initialization and other baselines (a minimal BLEU-4 scoring sketch appears below).
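The following is a minimal, self-contained sketch (not the authors' released code) of the data-processing step: splitting a unified diff into added and deleted lines, then concatenating only those changed lines into the encoder input. The separator token and truncation length are illustrative assumptions.

```python
# Sketch only: the separator token and truncation length below are
# assumptions for illustration, not values taken from the paper.

def split_diff(diff_text: str):
    """Separate a unified diff into added and deleted source lines."""
    added, deleted = [], []
    for line in diff_text.splitlines():
        # Skip the '---'/'+++' file headers so they are not mistaken
        # for deleted/added code lines.
        if line.startswith(("+++", "---")):
            continue
        if line.startswith("+"):
            added.append(line[1:].strip())
        elif line.startswith("-"):
            deleted.append(line[1:].strip())
    return added, deleted


def build_encoder_input(added, deleted, sep="</s>", max_tokens=256):
    """Concatenate only the changed lines: added code, separator, deleted code."""
    text = " ".join(added) + f" {sep} " + " ".join(deleted)
    return " ".join(text.split()[:max_tokens])


diff = """--- a/math_utils.py
+++ b/math_utils.py
-def add(a, b): return a+b
+def add(a: int, b: int) -> int:
+    return a + b
"""
added, deleted = split_diff(diff)
print(build_encoder_input(added, deleted))
```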
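One plausible way to warm-start a sequence-to-sequence model from CodeBERT is shown below, using Hugging Face's generic `EncoderDecoderModel`; this is a hedged approximation, and the authors' actual decoder architecture and training setup may differ. Output from the untrained decoder is meaningless until the model is fine-tuned on (code modification, commit message) pairs.

```python
# Assumption: this uses the transformers library's generic encoder-decoder
# warm-start; the paper's exact decoder configuration may differ.
from transformers import EncoderDecoderModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("microsoft/codebert-base")

# Initialize both encoder and decoder from CodeBERT; the decoder acquires
# randomly initialized cross-attention layers and must be fine-tuned on
# (code modification -> commit message) pairs before it is useful.
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "microsoft/codebert-base", "microsoft/codebert-base"
)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id

inputs = tokenizer("return a + b </s> return a+b", return_tensors="pt")
output_ids = model.generate(inputs.input_ids, max_length=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```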
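BLEU-4 can be reproduced at a small scale with NLTK, as in the sketch below; the tokenization and smoothing choices here are assumptions and need not match the paper's evaluation script.

```python
# Corpus-level BLEU-4 with NLTK; the smoothing method and tokenization are
# illustrative assumptions, not the paper's exact evaluation settings.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Each hypothesis is scored against a list of reference token sequences.
references = [[["fix", "typo", "in", "readme"]]]
hypotheses = [["fix", "readme", "typo"]]

score = corpus_bleu(
    references,
    hypotheses,
    weights=(0.25, 0.25, 0.25, 0.25),  # equal 1- to 4-gram weights = BLEU-4
    smoothing_function=SmoothingFunction().method4,
)
print(f"BLEU-4: {score:.3f}")
```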
Implications and Future Directions
This research has practical implications for software development and collaborative coding environments. By automating the creation of commit messages, such a tool can reduce the cognitive load on developers, letting them focus on code quality and functionality rather than documentation.
From a theoretical perspective, integrating pre-trained models like CodeBERT in cross-domain tasks presents a promising direction for future studies, particularly in harmonizing different domains such as code and natural language. These findings suggest promising applicability in other code-related tasks such as automatic code review or documentation generation.
Future development can explore integrating syntactic analysis to further enhance the understanding of code modifications; the authors note the potential benefit of transforming code into abstract syntax trees before encoding (see the sketch below). Additionally, increasing the diversity of programming languages in the dataset could broaden the model's applicability across programming ecosystems.
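As a hypothetical illustration of that direction (not something implemented in the paper), a changed Python snippet can be parsed into an abstract syntax tree with the standard `ast` module before encoding:

```python
# Hypothetical illustration: parse a changed snippet into an AST; the paper
# only mentions this transformation as a possible future direction.
import ast

snippet = "def add(a, b):\n    return a + b"
tree = ast.parse(snippet)
print(ast.dump(tree, indent=2))  # indent requires Python 3.9+
```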
Conclusion
This paper presents a structured and innovative methodology for commit message generation, leveraging pre-trained models to bridge programming and natural language. Through careful data curation and strategic model initialization, the authors provide a robust framework with considerable practical value for collaborative software development. Future research can explore incorporating more advanced syntactic features to further improve performance and applicability.