Calibration and Correctness of Language Models for Code

Published 3 Feb 2024 in cs.SE and cs.LG (arXiv:2402.02047v4)

Abstract: Machine learning models are widely used, but can also often be wrong. Users would benefit from a reliable indication of whether a given output from a given model should be trusted, so that a rational decision can be made about whether to use the output. For example, outputs can be associated with a confidence measure; if this confidence measure is strongly associated with likelihood of correctness, the model is said to be well-calibrated. A well-calibrated confidence measure can serve as a basis for rational, graduated decision-making on how much review and care is needed when using generated code. Calibration has so far been studied mostly in non-generative (e.g., classification) settings, especially in software engineering. However, generated code can quite often be wrong: given generated code, developers must decide whether to use it directly, use it after review of varying intensity, or discard it. Thus, calibration is vital in generative settings. We make several contributions. We develop a framework for evaluating the calibration of code-generating models. We consider several tasks, correctness criteria, datasets, and approaches, and find that, by and large, the generative code models we test are not well-calibrated out of the box. We then show how calibration can be improved using standard methods, such as Platt scaling. Since Platt scaling relies on the prior availability of correctness data, we evaluate its applicability and generalizability in software engineering, discuss settings where it has good potential for practical use, and settings where it does not. Our contributions will lead to better-calibrated decision-making in the current use of code generated by LLMs, and offer a framework for future research to further improve calibration methods for generative models in software engineering.

Summary

  • The paper demonstrates that intrinsic confidence measures, when rescaled with Platt scaling, significantly enhance calibration in code generation tasks.
  • It compares intrinsic versus reflective methods across function synthesis, code completion, and program repair, highlighting differences in consistency.
  • Findings imply that better-calibrated LLMs aid in effective resource allocation for code review and risk mitigation in software development.

Calibration and Correctness of LLMs for Code

Introduction

The integration of generative LLMs into software engineering has become widespread, particularly for tasks such as code completion, code review, and automatic code generation. These models are, however, prone to generating incorrect code, necessitating an effective mechanism for gauging confidence in their predictions. Calibration of LLMs refers to the statistical alignment between predicted confidence levels and actual correctness rates: a well-calibrated model provides confidence scores that closely match the empirical probability of correctness. This paper presents a comprehensive study of the calibration of code-generating LLMs, focusing on the efficacy of various confidence measures and calibration methods.

Background

Calibration Fundamentals

Calibration in machine learning ensures that a model's confidence outputs reflect true correctness probabilities. A model is considered well-calibrated if, for instance, among all instances where it predicts an event with 70% probability, the event actually occurs approximately 70% of the time. This principle is crucial in software development, where poorly calibrated models can lead to inefficient allocation of code-review effort.
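
To make this operational, the following minimal sketch (with illustrative placeholder data, not results from the paper) checks whether outputs predicted with roughly 70% confidence are actually correct about 70% of the time:

```python
import numpy as np

# Illustrative placeholders: model confidences and observed correctness
# (1 = output was correct, 0 = incorrect) for a batch of generations.
confidences = np.array([0.72, 0.68, 0.71, 0.69, 0.70, 0.73, 0.67, 0.71])
correct     = np.array([1,    0,    1,    1,    0,    1,    1,    0])

# For a well-calibrated model, outputs predicted with ~70% confidence
# should be correct ~70% of the time.
in_bin = (confidences >= 0.65) & (confidences < 0.75)
print(f"mean confidence in bin: {confidences[in_bin].mean():.2f}")
print(f"observed accuracy:      {correct[in_bin].mean():.2f}")
```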

Code-Generative Tasks

The paper examines three primary tasks where LLMs are extensively utilized:

  1. Function Synthesis: Generating entire function bodies from natural-language descriptions (docstrings).
  2. Line-Level Code Completion: Predicting the next line or sequence of code from the preceding context.
  3. Program Repair: Transforming buggy code snippets into functionally correct versions.

For each task, both exact-match correctness and test-passing correctness are considered, acknowledging the complexities and potential inadequacies inherent in each approach.
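
As an illustration of the two correctness criteria, here is a hedged sketch (not the paper's evaluation harness): exact match compares normalized text against a reference, while test-passing executes the candidate against unit tests:

```python
import os
import subprocess
import sys
import tempfile

def exact_match(generated: str, reference: str) -> bool:
    """Exact-match correctness: normalized string equality with the reference."""
    norm = lambda s: "\n".join(line.rstrip() for line in s.strip().splitlines())
    return norm(generated) == norm(reference)

def passes_tests(generated: str, test_code: str) -> bool:
    """Test-passing correctness: run the candidate against its unit tests in a
    subprocess. A real harness would also sandbox and resource-limit this."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(generated + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=10)
        ok = result.returncode == 0
    except subprocess.TimeoutExpired:
        ok = False
    finally:
        os.remove(path)
    return ok
```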

Methodology

Datasets and Models

The study uses multiple datasets: DyPyBench for real-world line-level completion, HumanEval and MBPP for function synthesis, and Defects4J and SStubs for program repair. The models evaluated are OpenAI's Codex and GPT-3.5-Turbo-Instruct, as well as the open-source CodeGen2, which vary in capability and scale.

Confidence Measures

  • Intrinsic Measures: Average Token Probability and Total Generated Sequence Probability, derived natively from the model's output probabilities (see the sketch after this list).
  • Reflective Measures: Verbalized Self-Ask and Question-Answering logits, obtained through additional model queries that ask the model to assess its own confidence in the generated output.
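
To make the intrinsic measures concrete, here is a minimal sketch computing both from per-token log-probabilities (as exposed by APIs that return logprobs); the paper's exact definitions and normalizations may differ:

```python
import math

def intrinsic_confidences(token_logprobs: list[float]) -> dict[str, float]:
    """Two intrinsic measures from per-token log-probabilities."""
    # Total generated-sequence probability: product of the per-token
    # probabilities, i.e. exp of the summed log-probabilities.
    total_prob = math.exp(sum(token_logprobs))
    # Average token probability: mean of the per-token probabilities.
    avg_prob = sum(math.exp(lp) for lp in token_logprobs) / len(token_logprobs)
    return {"total_prob": total_prob, "avg_prob": avg_prob}

# Example: log-probabilities for a short four-token completion.
print(intrinsic_confidences([-0.05, -0.60, -0.10, -0.02]))
```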

Calibration Metrics

The calibration performance is quantified using the following metrics (a computational sketch follows the list):

  • Brier Score: Measures the mean squared difference between predicted probabilities and observed outcomes.
  • Expected Calibration Error (ECE): Represents the average deviation of predicted probabilities from true correctness probabilities across binned confidence levels.
  • Skill Score (SS): Compares model performance against a naive (unskilled) predictor.
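
A minimal sketch of the three metrics, using standard definitions (the paper's exact binning and baseline choices may differ). The Skill Score here is the Brier Skill Score relative to a base-rate predictor:

```python
import numpy as np

def brier_score(conf: np.ndarray, correct: np.ndarray) -> float:
    """Mean squared difference between predicted probability and outcome."""
    return float(np.mean((conf - correct) ** 2))

def ece(conf: np.ndarray, correct: np.ndarray, n_bins: int = 10) -> float:
    """Weighted average gap between mean confidence and observed accuracy
    over equal-width confidence bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bins = np.clip(np.digitize(conf, edges[1:-1]), 0, n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            total += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return float(total)

def skill_score(conf: np.ndarray, correct: np.ndarray) -> float:
    """Improvement over an unskilled baseline that always predicts
    the base rate of correctness (higher is better, 1.0 is perfect)."""
    base = np.full_like(conf, correct.mean(), dtype=float)
    return 1.0 - brier_score(conf, correct) / brier_score(base, correct)

# Tiny usage example with placeholder data.
conf = np.array([0.9, 0.8, 0.6, 0.3])
correct = np.array([1, 1, 0, 0])
print(brier_score(conf, correct), ece(conf, correct), skill_score(conf, correct))
```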

Results

Calibration Analysis

The study finds that, without any rescaling, intrinsic measures generally perform slightly better in calibration than reflective ones. However, applying Platt scaling significantly enhances calibration, particularly for intrinsic measures such as Total Probability, bringing models like Codex close to a Skill Score improvement of 0.15 on function synthesis tasks (Figure 1).

Figure 1: AUC vs. scaled Skill Score for GPT-3.5 confidence measures on all tasks.
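
Platt scaling fits a simple logistic model that maps a raw confidence score to a probability of correctness, using a held-out set of (confidence, correctness) pairs. Below is a minimal sketch with synthetic placeholder data (scikit-learn is used for convenience; this is not the paper's implementation, and fitting on the logit of the confidence is a common variant):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic held-out data: raw confidences and correctness labels, where
# higher confidence is (noisily) associated with higher correctness.
rng = np.random.default_rng(0)
raw_conf = rng.uniform(0.05, 0.95, size=200)
labels = (rng.uniform(size=200) < 0.3 + 0.5 * raw_conf).astype(int)

# Platt scaling: fit sigma(a*x + b) on (confidence, correctness) pairs.
platt = LogisticRegression()
platt.fit(raw_conf.reshape(-1, 1), labels)

# Rescaled (calibrated) confidences for new outputs.
new_conf = np.array([[0.2], [0.5], [0.9]])
print(platt.predict_proba(new_conf)[:, 1])
```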

RQ Findings

  1. Intrinsic Calibration: Measures like Total Probability show reasonable calibration with rescaling, albeit with some inconsistencies.
  2. Reflective Calibration: Reflective measures, particularly those based on direct model output like Verbalized Self-Ask, generally exhibit poorer calibration compared to intrinsic measures.
  3. Impact of Rescaling: Platt scaling improves the Skill Score and reduces ECE across tasks, confirming its utility in refining confidence measures (Figure 2).

    Figure 2: Calibration plots per model, confidence measure, and task.

Discussion

The enhancement of calibration through rescaling suggests a path forward for utilizing confidence measures in real-world software engineering. The capability to distinguish between varying levels of correctness allows developers to allocate review resources more effectively, mitigate risks, and potentially integrate more sophisticated, probabilistic quality-control processes into software development practices.

Conclusion

Generative LLMs hold promise for increasing productivity in software engineering. However, the study highlights the necessity of well-calibrated confidence measures to maximize their utility while minimizing risks. Future research should focus on refining these confidence measures, exploring alternative calibration techniques, and validating findings across larger, diverse datasets to ensure robust integration into software development workflows.
