Emergent Mind

Evaluating Large Language Models Trained on Code

Published Jul 7, 2021 in cs.LG


We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%. Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. Using this method, we solve 70.2% of our problems with 100 samples per problem. Careful investigation of our model reveals its limitations, including difficulty with docstrings describing long chains of operations and with binding operations to variables. Finally, we discuss the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.


  • Codex is a GPT model fine-tuned on publicly available GitHub code. A distinct production version powers GitHub Copilot, and the research model outperforms GPT-3 and GPT-J at generating Python code.

  • The evaluation uses the pass@k metric, which measures functional correctness against unit tests. On the HumanEval dataset, Codex solves 28.8% of the problems with a single sample and 70.2% with 100 samples per problem.

  • Codex handles many algorithmic tasks well, but it struggles with docstrings that describe long chains of operations and with binding operations to variables. It can also suggest insecure code, which raises security concerns.

  • The paper discusses the broader implications of models like Codex on the software development ecosystem, security concerns, and the labor market, suggesting a future of AI-enhanced coding but recommending careful integration.

Introduction to Codex

Codex is a GPT language model fine-tuned on publicly available code from GitHub, and a distinct production version of it powers GitHub Copilot, an AI pair programmer. The paper assesses the model's ability to write Python code. On the HumanEval dataset, Codex clearly outperforms GPT-3 (0%) and GPT-J (11.4%), solving 28.8% of the problems with a single sample and 70.2% when 100 samples are drawn per problem.

Evaluation Framework

The methodology hinges on the pass@k metric: the probability that at least one of k generated samples for a problem passes that problem's predefined unit tests. This shift from conventional match-based metrics to functional correctness not only aligns with modern software development practices, such as test-driven development, but also raises the benchmarking standard for program synthesis.

Within this framework, HumanEval serves as the test bed. It is a hand-written dataset in which each problem is scored by functional correctness, and its problems probe language comprehension, algorithms, and basic mathematics. Codex's results on it suggest a range of practical use cases.
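A functional-correctness check of this kind can be sketched as follows: a candidate solution counts as correct only if the problem's tests run against it without raising. The helper name `passes_tests` and the `add` example problem are hypothetical, and this is not the paper's harness. Running generated code with `exec` is unsafe outside an isolated environment, so a real harness would also need sandboxing and timeouts.

```python
def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Return True if the candidate defines code that satisfies the tests.

    Both arguments are Python source strings. The tests are plain
    assert statements that reference names defined by the candidate.
    WARNING: exec runs arbitrary code; only use this on trusted input
    or inside a sandbox.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function
        exec(test_src, namespace)       # run the assertions against it
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as a miss.
        return False
    return True


# Hypothetical problem: "Return the sum of a and b."
TESTS = "assert add(1, 2) == 3\nassert add(-1, 1) == 0"
GOOD = "def add(a, b):\n    return a + b"
BAD = "def add(a, b):\n    return a - b"

print(passes_tests(GOOD, TESTS))  # correct candidate
print(passes_tests(BAD, TESTS))   # incorrect candidate
```

Unlike a match-based metric, this check accepts any implementation that behaves correctly, regardless of how closely its text resembles a reference solution.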

Findings and Discussion

Codex performs well at generating standalone Python functions, but its performance degrades on docstrings that describe long chains of operations and on prompts that require binding operations to the correct variables. The authors explored this limitation with synthetic problems assembled from basic building blocks. Performance declined exponentially as the number of chained operations grew, which raises doubts about the model's ability to synthesize larger, system-level programs.
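One way to picture such synthetic problems is to pair each building block with a one-line description, then chain several blocks into a single docstring and a matching reference function. The sketch below is a guess at the general shape of that setup; the three blocks and the name `build_problem` are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Sequence, Tuple

# Each block pairs a natural-language description with its implementation.
Block = Tuple[str, Callable[[str], str]]

# Hypothetical building blocks, chosen only for illustration.
BLOCKS: List[Block] = [
    ("convert the string to lowercase", str.lower),
    ("reverse the string", lambda s: s[::-1]),
    ("remove all spaces", lambda s: s.replace(" ", "")),
]


def build_problem(indices: Sequence[int]) -> Tuple[str, Callable[[str], str]]:
    """Chain the selected blocks into a docstring and a reference solution."""
    chosen = [BLOCKS[i] for i in indices]
    docstring = ", then ".join(desc for desc, _ in chosen) + "."

    def reference(s: str) -> str:
        # Apply each block in the order given by the docstring.
        for _, fn in chosen:
            s = fn(s)
        return s

    return docstring, reference


docstring, reference = build_problem([0, 1, 2])
print(docstring)
print(reference("Ab C"))
```

Lengthening `indices` yields docstrings with longer operation chains, so the chain length can be varied while the difficulty of each individual step stays fixed.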

The study also examines broader implications for safety, security, and economics. On misuse, the discussion finds that the model is not currently capable of generating malicious code autonomously, which eases immediate fears about cybersecurity threats. The model's tendency to suggest insecure code, however, is a significant concern, and its outputs require careful review.

Future Trajectories and Applications

Looking ahead, models like Codex carry both promise and risk. One open direction is scaling up the size and complexity of the tasks they can handle. Another is their effect on the software development ecosystem, from influencing which packages get imported to potentially reshaping documentation and testing practices. Both offer fertile ground for further study.

The economic and labor market ramifications remain speculative at this stage, but they point to the need for a serious discussion of how the workforce can adapt. A clear understanding of Codex's performance, coupled with a granular evaluation of its outputs, could pave the way for more responsible integration into software development pipelines, so that the benefits are realized while the associated risks are mitigated.

In Conclusion

The evaluation of large language models trained on code, with Codex as the leading example, shows how substantially AI could change software development. The findings provide a foundation for improving the model's capabilities and highlight the balance between exploiting AI's potential and managing its implications for security, the economy, and labor. As AI continues to evolve, so will the strategies for integrating it beneficially, with the prospect of turning coding from a primarily human endeavor into a collaboration between humans and machines.
