Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code (2003.07914v1)

Published 17 Mar 2020 in cs.SE

Abstract: Statistical language modeling techniques have successfully been applied to large source code corpora, yielding a variety of new software development tools, such as tools for code suggestion, improving readability, and API migration. A major issue with these techniques is that code introduces new vocabulary at a far higher rate than natural language, as new identifier names proliferate. Both large vocabularies and out-of-vocabulary issues severely affect Neural Language Models (NLMs) of source code, degrading their performance and rendering them unable to scale. In this paper, we address this issue by: 1) studying how various modelling choices impact the resulting vocabulary on a large-scale corpus of 13,362 projects; 2) presenting an open vocabulary source code NLM that can scale to such a corpus, 100 times larger than in previous work; and 3) showing that such models outperform the state of the art on three distinct code corpora (Java, C, Python). To our knowledge, these are the largest NLMs for code that have been reported. All datasets, code, and trained models used in this work are publicly available.

Authors (5)
  1. Rafael-Michael Karampatsis (3 papers)
  2. Hlib Babii (3 papers)
  3. Romain Robbes (18 papers)
  4. Charles Sutton (74 papers)
  5. Andrea Janes (17 papers)
Citations (206)

Summary

  • The paper proposes an open-vocabulary neural model that uses BPE to efficiently manage dynamic source code vocabularies and reduce OOV token rates.
  • The paper demonstrates superior predictive accuracy and lower cross-entropy in Java, C, and Python codebases compared to state-of-the-art closed-vocabulary models.
  • The paper highlights the potential of adaptable open-vocabulary models for applications in code auto-completion, bug detection, and improved software engineering tools.

Essay on "Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code"

The paper "Big Code != Big Vocabulary: Open-Vocabulary Models for Source Code" addresses a significant challenge in modeling source code with statistical LLMs (SLMs), specifically regarding vocabulary management. Unlike natural language, source code evolves with highly dynamic vocabularies, attributed to the developer-generated identifiers, which render traditional closed-vocabulary models inefficient. This paper proposes the use of open-vocabulary models for better scalability and performance across large codebases.

Overview of the Research

The researchers focus on several pivotal contributions:

  1. Analysis of Vocabulary Design Choices: The paper begins by critically evaluating how different vocabulary design strategies affect language model performance. Choices such as how to handle comments, string literals, and compound identifiers (e.g., camelCase) influence both vocabulary size and out-of-vocabulary (OOV) token rates, with vocabulary size varying by up to three orders of magnitude across modeling choices. Conventional splitting of tokens on underscores and case boundaries falls short unless combined with techniques such as Byte-Pair Encoding (BPE).
  2. Proposal of an Open-Vocabulary Model: A central contribution is a neural language model (NLM) that uses BPE to manage source code vocabularies effectively. By operating on subword units, the model can predict previously unseen tokens and requires a far smaller vocabulary than earlier models. BPE also allows the model to scale to large code corpora, as evidenced by training on datasets 100 times larger than those used in related work (a minimal illustration of identifier splitting and BPE segmentation follows this list).
  3. Evaluation Against State-of-the-Art Models: The proposed model outperforms state-of-the-art n-gram and closed-vocabulary neural models across Java, C, and Python codebases, achieving higher predictive accuracy and lower cross-entropy in code completion tasks. The experiments also show that the model makes effective use of larger training sets, with notable improvements when trained on the full corpus.
  4. Adaptation and Transfer Learning Potential: The model supports a dynamic adaptation strategy for enhancing performance on new code projects, thereby offering potential for integration into various software engineering workflows, such as auto-completion and bug detection. This aligns with recent trends in leveraging pre-training for transfer learning, suggesting broader applicability in software maintenance and development tools.
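To make the preprocessing behind contributions 1 and 2 concrete, the sketch below shows the two steps in miniature: splitting identifiers on underscores and camelCase boundaries, then learning and applying byte-pair encoding so that any unseen identifier can still be segmented into known subwords. This is a minimal illustration under assumed settings; the toy corpus, merge budget, and helper names (split_identifier, learn_bpe, apply_bpe) are illustrative, not the authors' released implementation, and the paper's models are trained on vastly larger corpora with much larger merge budgets.

```python
# Minimal sketch (not the authors' implementation): identifier splitting plus
# byte-pair encoding, as used to build an open vocabulary over source code.
import re
from collections import Counter

def split_identifier(token):
    """Split a code identifier on underscores and camelCase boundaries."""
    parts = re.split(r"_|(?<=[a-z0-9])(?=[A-Z])", token)
    return [p.lower() for p in parts if p]

def learn_bpe(words, num_merges):
    """Learn BPE merge rules from a list of words (characters are the initial symbols)."""
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges

def apply_bpe(word, merges):
    """Segment one word into subword units by replaying the learned merges."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Toy corpus of identifiers; the paper trains on thousands of projects.
corpus = ["getFileName", "readFileToString", "file_name", "getUserName"]
subtokens = [part for token in corpus for part in split_identifier(token)]
merges = learn_bpe(subtokens, num_merges=20)

# An identifier never seen during training is still fully representable:
print([apply_bpe(part, merges) for part in split_identifier("getFileSize")])
```

Because every subword can ultimately fall back to single characters, the segmentation step never produces an out-of-vocabulary token, which is what allows the NLM's vocabulary to stay small while the corpus grows.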

Implications and Future Directions

The implications of this research are substantial, offering numerous avenues for further exploration. By demonstrating that open-vocabulary models with BPE scale to large source code corpora, this work provides a foundation for more sophisticated software engineering tools. Potential applications include code readability enhancement, automated bug detection, and learning semantic associations useful for program synthesis across diverse repositories.

The success of BPE in managing source code vocabulary underscores the value of subword segmentation methods widely used in natural language processing, inviting further cross-disciplinary exploration and integration. Future work could incorporate transformers and other advanced architectures to further improve model capability and efficiency. Moreover, combining these open-vocabulary models with syntax-based models in hybrid systems could yield even better results by leveraging structural information about code.

Conclusion

In addressing the pivotal challenges of vocabulary management for SLMs in source code, this research contributes a valuable framework for the development of scalable, open-vocabulary models. By enabling the efficient handling of expansive code vocabularies, it extends the horizons for future applications of machine learning in software engineering, paving the way for innovations in tool development and code analysis methodologies.