Evaluating Large Language Models Trained on Code (2107.03374v2)

Published 7 Jul 2021 in cs.LG

Abstract: We introduce Codex, a GPT LLM fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%. Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. Using this method, we solve 70.2% of our problems with 100 samples per problem. Careful investigation of our model reveals its limitations, including difficulty with docstrings describing long chains of operations and with binding operations to variables. Finally, we discuss the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.

Citations (4,107)

Summary

  • The paper introduces Codex, a GPT-based model fine-tuned on GitHub code to synthesize accurate Python functions from docstrings using the HumanEval benchmark.
  • The evaluation employs pass@k metrics with repeated sampling, demonstrating significant performance gains over earlier models.
  • The study emphasizes practical implications including improved program synthesis, test-driven development, and responsible deployment of AI tools.

Evaluating LLMs Trained on Code

This paper presents a comprehensive study of Codex, a GPT-based model fine-tuned on publicly available code from GitHub, aimed at assessing its capabilities in Python code synthesis. The authors introduce Codex as a significant advance in program synthesis, focusing on its ability to generate functionally correct standalone Python functions from docstrings. The paper provides an in-depth evaluation of Codex's performance using a newly released benchmark called HumanEval, which specifically tests functional correctness.

Introduction and Methodology

Codex is a specialized variant of the GPT LLM, specifically trained to handle code syntax and semantics. Unlike predecessors such as GPT-3, Codex benefits from a dataset rich in programming contexts, which enhances its ability to understand and generate code. The model's primary task is transforming natural language docstrings into executable Python functions. The authors introduce HumanEval, a dataset comprising 164 unique programming tasks with corresponding unit tests, to measure Codex's performance in synthesizing correct code solutions.

Figure 1: Pass rates of our models on the HumanEval dataset as a function of model size. When a single sample is generated for each problem, GPT-12B solves no problems, but Codex solves 28.8% of them, and Codex-S solves 37.7%.

The methodology involves repeatedly sampling code from Codex and computing an unbiased estimator of the pass@k metric, which credits functionally correct solutions even when they differ from the reference implementation. By fine-tuning on standalone functions, the authors further enhance Codex into Codex-S, which displays improved performance as measured by its ability to pass rigorous unit tests.
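
The estimator follows the formula pass@k = 1 - C(n-c, k)/C(n, k), where n samples are drawn per problem and c of them pass the unit tests. A minimal, numerically stable sketch is shown below; the function name and the use of NumPy are illustrative rather than a claim about the authors' exact code.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k given n total samples and c correct samples."""
    if n - c < k:
        # Every size-k subset of the samples must contain at least one correct one.
        return 1.0
    # Probability that a random size-k subset contains no correct sample,
    # written as a running product to avoid huge binomial coefficients.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))
```

For k = 1 the estimate reduces to the empirical fraction c/n; for example, pass_at_k(200, 20, 1) is 0.1.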

Evaluation Framework

The paper details the evaluation framework, highlighting the inadequacy of match-based metrics like BLEU for code synthesis, since semantically equivalent programs can be written in many different ways. Instead, the focus is on functional correctness, demonstrated by passing unit tests, which aligns more closely with real-world software development practices such as test-driven development.

Figure 2: Three example problems from the HumanEval dataset, showing varying probabilities of successful code generation by Codex-12B.

The evaluation utilizes the pass@k metric, which measures the probability that at least one of k samples per problem is functionally correct, with the sampling temperature tuned to the number of samples drawn (higher temperatures benefit larger k by increasing diversity). This metric is especially relevant in practical settings where candidate samples must be selected heuristically because computational resources for exhaustive validation are limited.

Results

The paper reports compelling numerical results, demonstrating Codex's proficiency in generating code that achieves high pass rates in HumanEval. Codex-S shows significant improvement over Codex and other models like GPT-J and GPT-Neo, validating the model's fine-tuning approach and data-driven improvements. Additionally, the paper explores the impact of model size and architectural adjustments on performance, confirming robust scaling behaviors akin to other LLMs.

Figure 3: Pass@k against the number of samples (k) for various temperature settings, illustrating how diversity in samples contributes to better performance.

Figure 4: Performance scaling of Codex as a sigmoid function in log-parameters, indicative of efficient incremental learning across model sizes.

The paper also provides insights into ranking strategies for generated solutions, emphasizing mean log-probability as a practical heuristic for selecting the most promising sample when multiple evaluations are impractical.
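
As a concrete illustration, ranking candidates by mean token log-probability can be as simple as the sketch below; the Candidate structure and its fields are assumptions made for illustration, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    code: str
    token_logprobs: list[float]  # per-token log-probabilities reported by the model

def rank_by_mean_logprob(candidates: list[Candidate]) -> list[Candidate]:
    """Order samples so the one with the highest average token log-probability comes first."""
    return sorted(
        candidates,
        key=lambda c: sum(c.token_logprobs) / max(len(c.token_logprobs), 1),
        reverse=True,  # higher mean log-probability = the model is more "confident"
    )
```

In a single-suggestion setting, the top-ranked candidate is the one shown to the user; the paper reports that this heuristic beats random selection, although selecting with an oracle that runs the unit tests remains stronger.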

Broader Implications and Future Directions

The authors discuss the broader implications of deploying Codex, including social, economic, and ethical considerations. Codex reflects advances in productivity tooling, but its use also raises challenges concerning security and bias. The paper stresses the importance of responsible deployment, suggesting safeguards such as careful user interface design, human oversight mechanisms, and output filtering to reduce the risks of bias and over-reliance.

Figure 5: Codex's performance decreases with subtle bugs in prompts, especially with larger model sizes.

The research suggests future developments in AI-driven code synthesis could democratize access to programming resources, enhancing education and expanding opportunities in software development. At the same time, these developments require careful consideration of societal impacts and alignment with human goals, ensuring that advancements truly benefit users without introducing new risks.

Conclusion

The evaluation of Codex underscores the potential of LLMs trained on code to transform program synthesis and software automation. The results affirm the efficacy of data-centric model enhancement, specifically in the context of programming languages. The paper advocates for further exploration into aligning these models with user intent and reducing undesired behaviors, ultimately fostering an environment where technological tools augment human capabilities responsibly.

Explain it Like I'm 14

What this paper is about (in simple terms)

This paper introduces Codex, a computer program based on GPT-style AI that’s trained to write code, mainly in Python. The authors built a new test called HumanEval to check if Codex can write small, correct functions from short descriptions (called docstrings). They show that Codex can solve a good number of problems—especially if you let it try multiple times—and they carefully discuss where it works well, where it struggles, and what the risks are.

What questions the researchers asked

They set out to answer a few big, easy-to-understand questions:

  • If we train an LLM specifically on lots of public code, will it get good at writing code?
  • How can we fairly test whether the code it writes actually works?
  • Do bigger models do better, and does letting the model try multiple times help?
  • Can we pick a good answer without running tests every time?
  • Can fine-tuning (extra training on carefully chosen coding tasks) make it better?
  • What are the limits and risks of using code-writing AI?

How they tested their ideas (methods explained simply)

Think of Codex as a very advanced “auto-complete” for code:

  • It looks at a short English description of what a function should do (the docstring).
  • Then it tries to write the function that matches that description.

Here’s how the researchers checked its abilities:

  • HumanEval: They created a set of 164 small, original programming problems. Each problem has:
    • A function name and description (docstring)
    • Hidden tests (unit tests) that check whether the function truly works
  • Unit tests: Like a checklist. If the function passes all tests, it’s considered correct (a small sketch of this check appears after this list).
  • Safe sandbox: They ran the AI’s code in a safe “sandbox” so bad code couldn’t harm the computer.
  • Multiple tries (“pass@k”): Instead of just one shot, they sometimes let Codex try several times (like throwing multiple darts). If any try passes the tests, it counts as a success. “pass@1” means one try; “pass@100” means up to 100 tries.
  • Sampling temperature: This controls how “adventurous” the AI’s suggestions are. Low temperature = safer, more predictable answers; high temperature = more variety and creativity. Higher temperatures helped when taking many tries because it produced more diverse solutions.
  • Picking a single best suggestion: In real tools, you often show one suggestion. They tried a simple trick—choose the answer with the highest average “confidence” (mean log-probability). This worked better than picking randomly.
  • Extra training (Codex-S): They also fine-tuned Codex on thousands of carefully collected function problems (from programming contests and from real projects’ tests) to make it even better at this specific task.
  • Comparisons: They compared Codex against other models (like GPT-J and GPT-Neo) and a commercial code tool (Tabnine).
  • Beyond coding-from-docstrings: They also trained a model (Codex-D) that writes docstrings from code, to explore safety and explainability.
  • Other datasets: They tested on a tougher dataset (APPS) to see how it handles more complex, full-program problems.
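
To make "run the code and see if it passes the tests" concrete, here is a toy sketch in the spirit of HumanEval. The problem, completion, and tests are invented for illustration and are not taken from the benchmark, and the paper ran such checks inside a dedicated sandbox rather than with a bare exec.

```python
# Hypothetical docstring-style problem (not from HumanEval).
PROMPT = '''
def add_evens(nums):
    """Return the sum of the even numbers in nums."""
'''

# One completion the model might generate for the prompt above.
COMPLETION = """
    return sum(n for n in nums if n % 2 == 0)
"""

# Hidden unit tests that decide functional correctness.
TESTS = """
assert add_evens([1, 2, 3, 4]) == 6
assert add_evens([]) == 0
"""

def passes_tests(prompt: str, completion: str, tests: str) -> bool:
    """Assemble the candidate function with its tests and run them; any exception means failure."""
    namespace: dict = {}
    try:
        exec(prompt + completion + tests, namespace)
        return True
    except Exception:
        return False

print(passes_tests(PROMPT, COMPLETION, TESTS))  # True: this candidate would count toward pass@k
```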

Key terms in everyday language:

  • LLM: A system that predicts what text (or code) comes next—like a smart auto-complete.
  • Unit tests: Automatic checks that confirm the program does what it’s supposed to.
  • Sandbox: A safe, locked environment for running untrusted code.
  • pass@k: “Success if any of k tries works.”
  • Log-probability: The model’s internal measure of how likely it thinks its answer is—used here as a rough “confidence” score.

What they found (main results)

Here are the most important results and why they matter:

  • Codex beats general LLMs at coding: On HumanEval, a 12-billion-parameter Codex solved about 28.8% of problems on the first try (pass@1). Similar-size general models solved almost none.
  • Trying multiple times helps a lot: With 100 tries per problem, Codex solved 70.2% of the problems.
  • Fine-tuning on the exact task helps: Codex-S (trained on many standalone function problems) did even better: 37.7% on first try and up to 77.5% with multiple tries.
  • Bigger models, better results: Performance generally improved as Codex got larger, following smooth scaling patterns.
  • Picking one suggestion wisely helps: Choosing the suggestion with the highest average confidence beat random choice, which is helpful for tools that show a single completion.
  • “BLEU” scores don’t reliably measure correctness: A text-similarity score like BLEU often didn’t match whether the code actually worked. Passing tests is what really matters.
  • On tougher tasks (APPS), Codex needed many tries, and even then sometimes wrote solutions that worked but were too slow. Using public examples in the prompt and filtering candidates through basic checks helped.

Limits and weaknesses (what Codex struggles with)

The paper is clear that Codex isn’t perfect:

  • Needs lots of training code: It learned from a huge amount of code—far more than most humans ever see.
  • Long, complicated instructions: When the docstring described a long chain of steps, performance dropped sharply as the list grew.
  • Mixing up variables and steps: Sometimes it applied the right operation to the wrong variable or forgot a step.
  • Security and safety: Generated code can be wrong or insecure, so you must still review it. They built a sandbox to reduce risk while testing.
  • Misalignment and over-reliance: The model may produce code that looks right but isn’t what you intended. Novices might trust it too much. As models get more capable, this risk can grow.

Why this matters

  • Helpful coding assistant: Codex can speed up coding by suggesting functions, helping with boilerplate, and aiding learning.
  • Better evaluation: The HumanEval benchmark and the pass@k metric show a fairer way to test code-writing AI: run the code and see if it passes tests.
  • Direction for improvement: Fine-tuning on task-specific data greatly boosts performance. Picking one answer using confidence scores also helps.
  • Caution is required: Even powerful AI can make subtle mistakes. Human oversight, good tests, and security practices remain essential.
  • Broader impact: Tools like Codex can change how people learn and write code, but they also raise questions about safety, fairness, and potential misuse. The authors call for careful deployment and further research on reducing risks like misalignment and over-reliance.

Bottom line

Codex shows that training AI on lots of real code can produce a strong coding assistant that solves many programming tasks, especially if you let it try multiple times or fine-tune it on the exact types of problems you care about. But it isn’t a magic fix: its code must be tested and reviewed, it can be misled by complex instructions, and it can produce insecure or incorrect solutions. Used carefully, it can be a powerful aid; used blindly, it can cause problems.
