A Systematic Evaluation of Large Language Models of Code
The paper presents a comprehensive evaluation of large language models of code, covering Codex, GPT-J, GPT-Neo, GPT-NeoX-20B, and CodeParrot. Notably, the research introduces PolyCoder, an open-source model based on the GPT-2 architecture with 2.7 billion parameters, trained exclusively on a 249GB code corpus spanning 12 programming languages.
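Because the PolyCoder checkpoints are publicly released, the model can be loaded with standard tooling. Below is a minimal sketch using the Hugging Face transformers library; the Hub identifier and the prompt are assumptions for illustration and should be checked against the official release.

```python
# Sketch: loading a released PolyCoder checkpoint with Hugging Face transformers.
# The model identifier below is an assumed Hub name, not confirmed by the paper;
# verify it against the authors' official release before use.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "NinedayWang/PolyCoder-2.7B"  # assumed identifier for the 2.7B model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompt = "def binary_search(arr, target):"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.2)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```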
Methodology
The research addresses a significant gap in open-source models for code by systematically evaluating both proprietary and open-source LLMs. Because Codex and similar models are not publicly released, the paper benchmarks them against open-source alternatives to compare performance across programming languages.
PolyCoder, the model introduced in the paper, was trained on a multilingual code corpus assembled through a systematic process of data deduplication and filtering, which controls for varying repository sizes and helps ensure data quality. The model architecture and training setup were chosen within the authors' computational constraints, highlighting how resource limitations shape model training and performance.
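The paper's exact preprocessing rules are not reproduced here, but a minimal sketch of this kind of file-level deduplication and filtering, with illustrative thresholds, might look like the following:

```python
# Sketch of file-level deduplication and filtering for a code corpus.
# Thresholds and rules are illustrative assumptions, not the paper's exact pipeline.
import hashlib
import os

MAX_FILE_BYTES = 1_000_000   # skip very large files (assumed threshold)
MAX_LINE_LENGTH = 1000       # skip likely generated/minified files (assumed threshold)

def keep_file(path: str) -> bool:
    """Apply simple quality filters to a single source file."""
    if os.path.getsize(path) > MAX_FILE_BYTES:
        return False
    with open(path, errors="ignore") as f:
        return all(len(line) <= MAX_LINE_LENGTH for line in f)

def deduplicate(paths):
    """Yield one representative path per unique file content (exact-hash dedup)."""
    seen = set()
    for path in paths:
        if not keep_file(path):
            continue
        with open(path, "rb") as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            yield path
```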
Results and Findings
- Performance Comparison:
- Codex outperforms the open-source models in most languages; the exception is C, where PolyCoder achieves lower perplexity than Codex (a perplexity sketch follows this list).
- Open-source models trained on mixed natural-language and code corpora (GPT-Neo, GPT-J) showed competitive results but were less consistent across programming languages than Codex.
- Impact of Data:
- Training on a combination of natural language and code proved beneficial, as evidenced by Codex's performance.
- Training models exclusively on code can result in strong specialization, as seen with PolyCoder's performance in C.
- Model Scaling:
- Larger models, such as GPT-J and GPT-NeoX, generally perform better, but the gains in perplexity and HumanEval pass@k (see the sketch after this list) suggest diminishing returns beyond a certain parameter count in some languages.
- Dataset diversity also matters: although PolyCoder was trained only on code, with no natural-language text, its 12-language corpus allowed it to generalize across programming languages.
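The perplexity comparisons referenced above can be reproduced in a few lines. Below is a minimal sketch of computing token-level perplexity of a code snippet under a causal language model with the transformers library; the checkpoint and snippet are illustrative stand-ins, not the paper's evaluation setup.

```python
# Sketch: token-level perplexity of a code snippet under a causal language model.
# The checkpoint name and snippet are illustrative; any causal LM can be substituted.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "EleutherAI/gpt-neo-125m"  # small stand-in model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

code = 'int main() { printf("hello"); return 0; }'
inputs = tokenizer(code, return_tensors="pt")

with torch.no_grad():
    # With labels provided, the model returns the mean cross-entropy over tokens.
    loss = model(**inputs, labels=inputs["input_ids"]).loss

perplexity = torch.exp(loss).item()
print(f"perplexity: {perplexity:.2f}")
```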
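The HumanEval numbers in the scaling discussion are reported with the standard unbiased pass@k estimator, where n samples are drawn per problem and c of them pass the unit tests. A small sketch of that computation:

```python
# Unbiased pass@k estimator used with HumanEval-style evaluation.
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled completions is correct,
    estimated from n generated samples of which c passed the unit tests."""
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: 200 samples per problem, 17 passed the tests.
print(pass_at_k(n=200, c=17, k=1))    # equals c/n = 0.085
print(pass_at_k(n=200, c=17, k=10))   # higher, since any of 10 samples may pass
```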
Implications and Future Directions
This paper emphasizes the need for more accessible large-scale models to democratize research on code generation. The findings advocate training on diverse datasets that combine program code and related natural-language text to improve performance across multiple languages.
Future work may explore the balance between multilingual training data and single-language optimization, potentially leading to hybrid models that effectively combine code and natural language understanding. As hardware and computational costs remain barriers, optimizing model efficiency and interpretability should also be a priority.
By releasing PolyCoder, the authors contribute to ongoing efforts to provide robust, open-source tools for the research community, aiming to bridge the gap between academic research and proprietary LLM capabilities. This research sets a foundation for future explorations into model architectures and methodologies that support a wider range of programming languages and applications in AI-driven code generation.