
A Systematic Evaluation of Large Language Models of Code (2202.13169v3)

Published 26 Feb 2022 in cs.PL and cs.CL

Abstract: Large language models (LMs) of code have recently shown tremendous promise in completing code and synthesizing code from natural language descriptions. However, the current state-of-the-art code LMs (e.g., Codex (Chen et al., 2021)) are not publicly available, leaving many questions about their model and data design decisions. We aim to fill in some of these blanks through a systematic evaluation of the largest existing models: Codex, GPT-J, GPT-Neo, GPT-NeoX-20B, and CodeParrot, across various programming languages. Although Codex itself is not open-source, we find that existing open-source models do achieve close results in some programming languages, although targeted mainly for natural language modeling. We further identify an important missing piece in the form of a large open-source model trained exclusively on a multi-lingual corpus of code. We release a new model, PolyCoder, with 2.7B parameters based on the GPT-2 architecture, which was trained on 249GB of code across 12 programming languages on a single machine. In the C programming language, PolyCoder outperforms all models including Codex. Our trained models are open-source and publicly available at https://github.com/VHellendoorn/Code-LMs, which enables future research and application in this area.

A Systematic Evaluation of Large Language Models of Code

The paper presents a comprehensive evaluation of large language models on code, spanning Codex, GPT-J, GPT-Neo, GPT-NeoX-20B, and CodeParrot. Notably, the research introduces PolyCoder, an open-source model based on the GPT-2 architecture with 2.7 billion parameters, trained exclusively on a 249GB code corpus spanning 12 programming languages.
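
Because the trained checkpoints are publicly released, the 2.7B model can be exercised as an ordinary causal language model. The sketch below is illustrative only: the Hugging Face identifier is an assumption, and the authoritative artifacts live in the linked GitHub repository.

```python
# Minimal sketch: left-to-right code completion with the released 2.7B
# PolyCoder checkpoint via Hugging Face transformers. The Hub identifier
# below is an assumption; the canonical release is the authors' repository
# at https://github.com/VHellendoorn/Code-LMs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "NinedayWang/PolyCoder-2.7B"  # assumed Hub mirror of the 2.7B model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
model.eval()

prompt = "int binary_search(int *arr, int n, int target) {\n"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64,
                         do_sample=True, temperature=0.2, top_p=0.95)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```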

Methodology

The research addresses a significant gap in open-source models for code by systematically evaluating both proprietary and open-source LLMs. Because Codex and similar models are not publicly accessible, the paper benchmarks them against open-source alternatives to compare performance across programming languages.

PolyCoder, the newly introduced model, was trained on a multilingual code corpus prepared with data deduplication and filtering to account for varying repository and file sizes and to ensure quality. The model architecture and training setup were chosen to fit a modest compute budget (a single machine), illustrating how resource constraints shape model training and performance.
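
The paper's exact preprocessing is not reproduced here, but the deduplication-and-filtering step can be pictured as a single pass over the raw corpus: hash each file, keep one copy per hash, and drop files outside a size window. The thresholds below are placeholders, not the paper's values.

```python
# Minimal sketch of file-level deduplication and size filtering for a code
# corpus. Thresholds and the hashing choice are illustrative placeholders,
# not the exact preprocessing used for PolyCoder.
import hashlib
from pathlib import Path

MIN_BYTES, MAX_BYTES = 100, 1_000_000  # placeholder size window

def clean_corpus(root: str, out_path: str) -> None:
    seen_hashes = set()
    kept = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in Path(root).rglob("*.c"):  # one language shown for brevity
            data = path.read_bytes()
            if not (MIN_BYTES <= len(data) <= MAX_BYTES):
                continue  # drop very small or very large files
            digest = hashlib.sha256(data).hexdigest()
            if digest in seen_hashes:
                continue  # exact duplicate of a file already kept
            seen_hashes.add(digest)
            out.write(data.decode("utf-8", errors="ignore"))
            out.write("\n")  # simple separator; real pipelines use special tokens
            kept += 1
    print(f"kept {kept} unique files")

# clean_corpus("repos/", "train_c.txt")
```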

Results and Findings

  1. Performance Comparison:
    • Codex outperforms all open-source models except in C, where PolyCoder achieves lower perplexity than every model evaluated, including Codex (the perplexity metric is sketched after this list).
    • Open-source models trained on mixed text corpora (GPT-Neo, GPT-J) showed competitive results but lacked consistency across different programming languages compared to Codex.
  2. Impact of Data:
    • Training on a combination of natural and programming languages proved beneficial, as evidenced by Codex's performance.
    • Training models exclusively on code can result in strong specialization, as seen with PolyCoder's performance in C.
  3. Model Scaling:
    • Larger models, like GPT-J and GPT-NeoX, generally perform better, but gains in perplexity and HumanEval metrics suggest diminishing returns beyond a certain parameter size in some languages (the HumanEval pass@k metric is sketched after this list).
    • The importance of a diverse dataset is underscored by PolyCoder's ability to generalize across languages despite its code-only training focus.
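
The perplexity numbers referenced in point 1 are exponentiated average next-token cross-entropies on held-out code. A minimal sketch of that computation, assuming a causal LM and tokenizer loaded as in the earlier snippet (not the paper's exact evaluation splits or normalization):

```python
# Minimal sketch: token-level perplexity of a causal code LM on one source
# file. Illustrates the metric itself, not the paper's exact evaluation setup.
import math
import torch

def code_perplexity(model, tokenizer, source_code: str) -> float:
    # Tokenize and clip to the model's context window.
    enc = tokenizer(source_code, return_tensors="pt", truncation=True,
                    max_length=model.config.max_position_embeddings)
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean
        # next-token cross-entropy over the sequence.
        loss = model(**enc, labels=enc["input_ids"]).loss
    return math.exp(loss.item())

# Usage (model and tokenizer loaded as in the earlier sketch):
# print(code_perplexity(model, tokenizer, open("example.c").read()))
```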
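
The HumanEval results in point 3 rely on the unbiased pass@k estimator of Chen et al. (2021): with n samples drawn per problem and c of them passing the unit tests, pass@k = 1 - C(n-c, k)/C(n, k). A direct implementation follows; the example numbers are arbitrary, not results from the paper.

```python
# Unbiased pass@k estimator (Chen et al., 2021), the metric behind HumanEval.
# n = samples drawn per problem, c = samples that pass all unit tests.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # every size-k subset must contain at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Arbitrary illustration: 200 samples per problem, 37 passing, estimate pass@10.
print(round(pass_at_k(200, 37, 10), 4))
```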

Implications and Future Directions

This paper emphasizes the need for more accessible large-scale models to democratize research capabilities in code generation. The findings advocate for the inclusion of diverse datasets encompassing both programming code and related texts to enhance model efficacy across multiple languages.

Future work may explore the balance between multilingual training data and single-language optimization, potentially leading to hybrid models that effectively combine code and natural language understanding. As hardware and computational costs remain barriers, optimizing model efficiency and interpretability should also be a priority.

By releasing PolyCoder, the authors contribute to ongoing efforts to provide robust, open-source tools for the research community, aiming to bridge the gap between academic research and proprietary LLM capabilities. This research sets a foundation for future explorations into model architectures and methodologies that support a wider range of programming languages and applications in AI-driven code generation.

Authors (4)
  1. Frank F. Xu (27 papers)
  2. Uri Alon (40 papers)
  3. Graham Neubig (342 papers)
  4. Vincent J. Hellendoorn (16 papers)
Citations (545)