The paper "Revisit Self-Debugging with Self-Generated Tests for Code Generation" explores the efficacy of self-debugging techniques in LLMs for code generation, focusing on scenarios where high-quality tests are unavailable. The paper investigates self-debugging with self-generated tests across diverse programming problems, introducing post-execution and in-execution self-debugging paradigms.
The key contributions and findings are:
- The paper introduces and formalizes two distinct paradigms for self-debugging:
Post-execution self-debugging: This approach validates code correctness by comparing execution outputs against expected outputs. It uses the failed test case, the execution output, and any error messages to refine the program, generating a revised version (a minimal code sketch follows the variable definitions):
$\Tilde{C} = \mathrm{M}(C, X_i, Y_i, \Tilde{Y}_i)$
where:
- $\Tilde{C}$ is the revised program
- $\mathrm{M}$ is the LLM
- $C$ is the initial program
- $X_i$ is the input for the $i$-th test
- $Y_i$ is the expected output for the $i$-th test
- $\Tilde{Y}_i$ is the execution output for the $i$-th test
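A minimal sketch of one post-execution round is shown below. It assumes a stdin/stdout test protocol and a generic `llm` callable for the refinement step; the prompt format and helper names are illustrative, not the paper's implementation.

```python
import subprocess
import sys

def run_program(code: str, test_input: str, timeout: float = 5.0) -> str:
    """Execute the candidate program on one test input and capture its output."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            input=test_input,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return result.stdout.strip() if result.returncode == 0 else result.stderr.strip()
    except subprocess.TimeoutExpired:
        return "TimeoutError"

def post_execution_debug(llm, code: str, tests: list[tuple[str, str]]) -> str:
    """One round of post-execution self-debugging against (possibly self-generated) tests."""
    for x_i, y_i in tests:                    # (input, expected output) pairs
        y_tilde = run_program(code, x_i)      # actual execution output
        if y_tilde != y_i:                    # labeled as failing this test
            # Refine with the failed test, expected output, and execution output,
            # i.e. C~ = M(C, X_i, Y_i, Y~_i).
            prompt = (
                "The following program failed a test.\n"
                f"Program:\n{code}\n"
                f"Input:\n{x_i}\n"
                f"Expected output:\n{y_i}\n"
                f"Actual output / error:\n{y_tilde}\n"
                "Return a corrected program."
            )
            return llm(prompt)                # hypothetical LLM call
    return code                               # all tests passed; keep the program
```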
- In-execution self-debugging: This approach leverages intermediate runtime states during program execution, dividing the program into basic blocks $\{B_1, \ldots, B_N\}$, where $B_j$ represents the $j$-th basic block and $N$ is the total number of blocks. For each test input $X_i$, the executor updates the variable set iteratively: $V_j = \mathrm{Exec}(V_{j-1}, B_j)$, where $V_j$ denotes the set of variables after executing block $B_j$. The sequence of intermediate states, represented as the execution trace $T = (V_1, V_2, \ldots, V_N)$, provides insights for the LLM to refine the program (a minimal sketch follows the variable definitions):
$\Tilde{C}=\mathrm{M}(C, X_i, T)$
where:
- $\Tilde{C}$ is the updated program
- $\mathrm{M}$ is the LLM
- $C$ is the initial program
- $X_i$ is the test input
- $T$ is the execution trace
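A minimal sketch of the in-execution variant is shown below, using Python's line-level tracing (`sys.settrace`) as a stand-in for basic-block granularity. The entry-point name `solution`, the trace format, and the `llm` callable are illustrative assumptions, not the paper's implementation.

```python
import sys
from types import FrameType

def collect_trace(code: str, test_input, max_steps: int = 200) -> list[dict]:
    """Run the program and record variable snapshots after each executed line,
    approximating the per-block trace T = (V_1, ..., V_N)."""
    trace: list[dict] = []

    def tracer(frame: FrameType, event: str, arg):
        if event == "line" and len(trace) < max_steps:
            # Snapshot only simple local variables to keep the trace readable.
            snapshot = {
                k: v for k, v in frame.f_locals.items()
                if isinstance(v, (int, float, str, bool, list, dict, tuple, type(None)))
            }
            trace.append({"line": frame.f_lineno, "locals": snapshot})
        return tracer

    namespace: dict = {}
    exec(code, namespace)             # define the candidate function
    entry = namespace["solution"]     # assumed entry-point name
    sys.settrace(tracer)
    try:
        entry(test_input)
    except Exception as exc:
        trace.append({"exception": repr(exc)})
    finally:
        sys.settrace(None)
    return trace

def in_execution_debug(llm, code: str, test_input) -> str:
    """One round of in-execution self-debugging: C~ = M(C, X_i, T)."""
    trace = collect_trace(code, test_input)
    prompt = (
        "Inspect the runtime trace of this program and fix any logic errors.\n"
        f"Program:\n{code}\n"
        f"Test input: {test_input!r}\n"
        f"Execution trace (line -> locals): {trace}\n"
        "Return a corrected program."
    )
    return llm(prompt)                # hypothetical LLM call
```

Note that no expected output appears anywhere in the prompt, which is what shields this paradigm from the bias of self-generated output labels.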
- Experimental results on self-contained Python programming tasks from HumanEval, MBPP, and LiveCodeBench, using GPT-4o (2024-05-13), Claude-3.5-Sonnet, Llama-3-70B-Instruct, and Qwen2.5-Coder-7B-Instruct, reveal the following:
- Post-execution self-debugging struggles on basic problems from HumanEval and MBPP, but shows potential for improvement on the more complex problems in LiveCodeBench.
- The discrepancy is attributed to the bias introduced by self-generated tests, i.e., the misalignment between self-testing labels and true labels (see the toy example after this list). The efficacy of post-execution self-debugging therefore relies on the model's ability to reflect on the feedback and to recognize when that feedback is faulty.
- In-execution self-debugging minimizes bias by focusing solely on intermediate states during execution, leading to improvements on both basic and competitive tasks.
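To make the bias concrete, the following toy example (not from the paper) shows how a wrong self-generated expected output turns a correct program into a false negative under post-execution checking, while an in-execution trace carries no such label.

```python
# A correct candidate program for "return the sum of squares of a list"
# (illustrative task, not taken from the benchmarks).
def sum_squares(xs):
    return sum(x * x for x in xs)

# Self-generated test: the input is valid, but the model mispredicted the
# expected output (the correct answer for [1, 2, 3] is 14, not 13).
x_i, y_i = [1, 2, 3], 13

y_tilde = sum_squares(x_i)   # actual output: 14

# Post-execution checking compares against the wrong self-generated label,
# so a correct program is marked as failing (a false-negative label) and the
# debugger may be asked to "fix" code that is already right.
print(y_tilde == y_i)        # False

# An in-execution trace exposes only intermediate states (e.g. the running
# accumulator inside sum), so this output-label bias cannot enter its feedback.
```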
- The paper analyzes the accuracy of self-generated tests, noting that predicting test outputs is more challenging than generating test inputs. For instance, GPT-4o achieves 97.63% input accuracy and 89.77% output accuracy on HumanEval, with an overall test suite accuracy of 59.15%. Similar trends are observed for other models and benchmarks.
- The paper investigates label changes after the first iteration of self-debugging, observing that self-testing on HumanEval and MBPP is more likely to result in false negative labels, while LiveCodeBench shows more true negative labels due to its challenging problems.
- The paper finds that post-execution self-debugging using label feedback leads to improvements across all difficulty levels on LiveCodeBench for GPT-4o. However, including detailed feedback can decrease performance on easier problems.
- In-execution self-debugging is found to be potentially effective on both basic and competitive programming problems by leveraging runtime execution information. It mitigates the bias introduced by self-generated tests but depends heavily on the LLM's code reasoning capabilities.
- The paper discusses future research directions, including enhancing the quality of LLM-generated tests, implementing iterative refinement processes, and designing sophisticated methods for collecting and analyzing runtime information.