Is Self-Repair a Silver Bullet for Code Generation? (2306.09896v5)

Published 16 Jun 2023 in cs.CL, cs.AI, cs.PL, and cs.SE

Abstract: LLMs have shown remarkable aptitude in code generation, but still struggle to perform complex tasks. Self-repair -- in which the model debugs and repairs its own code -- has recently become a popular way to boost performance in these settings. However, despite its increasing popularity, existing studies of self-repair have been limited in scope; in many settings, its efficacy thus remains poorly understood. In this paper, we analyze Code Llama, GPT-3.5 and GPT-4's ability to perform self-repair on problems taken from HumanEval and APPS. We find that when the cost of carrying out repair is taken into account, performance gains are often modest, vary a lot between subsets of the data, and are sometimes not present at all. We hypothesize that this is because self-repair is bottlenecked by the model's ability to provide feedback on its own code; using a stronger model to artificially boost the quality of the feedback, we observe substantially larger performance gains. Similarly, a small-scale study in which we provide GPT-4 with feedback from human participants suggests that even for the strongest models, self-repair still lags far behind what can be achieved with human-level debugging.

Overview of the Paper: Is Self-Repair a Silver Bullet for Code Generation?

This paper provides a critical examination of the self-repair capability of LLMs applied to code generation tasks. The authors investigate whether self-repair, i.e. the ability of a model to introspectively debug and repair its own code, yields meaningful gains over simply sampling code snippets independently and evaluating them against the provided test cases. The analysis covers three LLMs: Code Llama, GPT-3.5, and GPT-4, evaluated on programming problems drawn from the HumanEval and APPS datasets.

Methodology

The paper provides a detailed breakdown of the self-repair methodology, which consists of four stages: code generation, code execution, feedback generation, and code repair. The authors model this process as a "repair tree," in which a specification branches into initial programs, then feedback, and then repaired versions of the code. Special attention is given to how self-repair compares to independently sampling programs at an equivalent computational budget, a comparison carried out by bootstrapping pass-rate estimates from a finite pool of sampled programs.
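
This pipeline can be made concrete with a minimal sketch. The code below is not the authors' implementation; `llm.generate_code`, `llm.explain_failure`, `llm.repair_code`, and `run_tests` are hypothetical interfaces standing in for the model calls and the unit-test harness, and `n_p`/`n_fr` mirror the budget notation used in the results section.

```python
# Hedged sketch of the four-stage self-repair loop (generation, execution,
# feedback, repair). All helper interfaces here are assumptions, not the
# authors' actual code.

def self_repair(spec, tests, llm, run_tests, n_p=10, n_fr=1):
    """Sample n_p initial programs; give each failing one n_fr feedback+repair attempts."""
    for _ in range(n_p):
        program = llm.generate_code(spec)              # stage 1: initial code generation
        passed, error_log = run_tests(program, tests)  # stage 2: execute against unit tests
        if passed:
            return program
        for _ in range(n_fr):
            feedback = llm.explain_failure(spec, program, error_log)  # stage 3: feedback on the failure
            repaired = llm.repair_code(spec, program, feedback)       # stage 4: repaired program
            if run_tests(repaired, tests)[0]:
                return repaired
    return None  # no program within the (n_p, n_fr) budget passed the tests
```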

Numerical Results

The quantitative analysis reveals that self-repair is not universally beneficial; performance improvements are scenario-dependent:

  • Limited Improvement: For certain models, including Code Llama on HumanEval and GPT-3.5 on simpler tasks, the gains over traditional sampling are modest or negligible.
  • Significant Gains: GPT-4 shows more pronounced improvements in more challenging scenarios, such as competition-style tasks selected from APPS, particularly when self-repair is combined with a strong base of initial code samples.
  • Parameter Sensitivity: The authors emphasize that how the budget is split between initial samples (n_p) and feedback-repair attempts (n_fr) dictates the efficacy of self-repair; allocating more of the budget to diverse initial samples generally yields better outcomes (see the budget-matched sketch below this list).
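
To illustrate how a budget-matched comparison can be made, here is a minimal bootstrap estimator in the spirit of the procedure mentioned in the Methodology section. It is a sketch under simplifying assumptions, not the paper's exact estimator: it takes a pre-recorded list of pass/fail outcomes for one task and estimates the probability that at least one of `budget` resampled attempts passes.

```python
import random

def bootstrap_success_rate(outcomes, budget, n_boot=10_000, seed=0):
    """Estimate P(at least one success among `budget` draws) by resampling
    recorded pass/fail outcomes with replacement. `outcomes` is a list of
    booleans, one per generated (or repaired) program for a given task."""
    rng = random.Random(seed)
    hits = sum(
        any(rng.choice(outcomes) for _ in range(budget))
        for _ in range(n_boot)
    )
    return hits / n_boot

# Hypothetical usage: compare an i.i.d.-sampling budget of 10 programs against
# a self-repair budget of, say, 5 initial programs plus 5 repairs, by scoring
# both outcome pools at the same total budget.
# iid_rate    = bootstrap_success_rate(iid_outcomes, budget=10)
# repair_rate = bootstrap_success_rate(repair_outcomes, budget=10)
```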

Importance of Feedback Quality

A pivotal finding is that the quality of feedback determines the upper bound of self-repair's effectiveness. Experiments indicate:

  • Artificial Feedback Improvement: Replacing a weaker model's feedback with feedback from a stronger model yields substantially better repair performance.
  • Human Feedback Impact: Human-written feedback boosts repair success even further, notably outperforming model-generated feedback on the more complex tasks. For instance, substituting GPT-4's own feedback with human feedback increased the rate of passing repairs by a factor of 1.58 (the feedback-swap setup is sketched below).
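
The feedback ablation can be pictured as swapping the component that writes the failure explanation while holding the repair model fixed. The sketch below is a hypothetical illustration of that setup (none of these names come from the paper's code); `feedback_source` may be the same model, a stronger model, or a human annotator.

```python
# Hypothetical illustration of the feedback-swap ablation: the repair model is
# held fixed while the source of the failure explanation varies.

def repair_with_external_feedback(spec, program, error_log, repair_llm, feedback_source):
    """`feedback_source` maps (spec, program, error_log) to a textual explanation;
    it can be the generating model itself, a stronger model, or a human."""
    feedback = feedback_source(spec, program, error_log)
    return repair_llm.repair_code(spec, program, feedback)
```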

Implications and Future Directions

The paper's findings underscore that while self-repair offers real potential, it cannot yet be considered a "silver bullet" for code generation. Current models lack the ability to consistently diagnose and correct their own errors, and their repair performance hinges heavily on the quality of the feedback they receive.

Practical Implications: The insights point towards a potential hybrid model where human expertise supplements LLMs to bridge current gaps in debugging and error correction. Moreover, strategies that optimize the exploration of diverse code solutions initially, followed by focused repair efforts, hold promise in real-world coding scenarios.

Theoretical Implications: Understanding the limitations in self-repair opens new avenues for enhancing LLMs' introspective abilities. This might involve augmented training protocols that emphasize diagnostic proficiency or innovative architectural changes to better parse and utilize feedback.

Conclusion

The research delivers substantive insights into the practical limits of self-repair and the conditions under which it benefits code generation. These findings are a useful reference both for improving model design and for guiding practitioners deploying LLMs on complex coding tasks. Future work could explore deeper integration of self-repair methods into the broader software engineering landscape, leveraging artificial and human intelligence in tandem.

Authors (5)
  1. Theo X. Olausson (5 papers)
  2. Jeevana Priya Inala (18 papers)
  3. Chenglong Wang (80 papers)
  4. Jianfeng Gao (344 papers)
  5. Armando Solar-Lezama (65 papers)
Citations (70)