Revisit Self-Debugging with Self-Generated Tests for Code Generation (2501.12793v1)

Published 22 Jan 2025 in cs.SE and cs.AI

Abstract: LLMs have shown significant advancements in code generation, but still face challenges on tasks beyond their basic capabilities. Recently, the notion of self-debugging has been proposed to boost the performance of code generation by leveraging execution feedback from tests. Despite its promise, the availability of high-quality tests in real-world scenarios is limited. In this context, self-debugging with self-generated tests is a promising solution but lacks a full exploration of its limitations and practical potential. Therefore, we investigate its efficacy on diverse programming problems. To deepen our understanding, we propose two distinct paradigms for the process: post-execution and in-execution self-debugging. Within the scope of self-contained Python programming tasks, we find that post-execution self-debugging struggles on basic problems but shows potential for improvement on competitive ones, due to the bias introduced by self-generated tests. On the other hand, in-execution self-debugging enables LLMs to mitigate the bias by solely leveraging intermediate states during execution, thereby enhancing code generation.

The paper "Revisit Self-Debugging with Self-Generated Tests for Code Generation" explores the efficacy of self-debugging techniques in LLMs for code generation, focusing on scenarios where high-quality tests are unavailable. The paper investigates self-debugging with self-generated tests across diverse programming problems, introducing post-execution and in-execution self-debugging paradigms.

The key contributions and findings are:

  • The paper introduces and formalizes two distinct paradigms for self-debugging:
    • Post-execution self-debugging: This approach validates code correctness by comparing execution outputs with expected outputs. It uses the failed test case, the execution output, and error messages to refine the program, generating a revised version (a code sketch of this step follows this list):

      $\Tilde{C} = \mathrm{M}(C, X_i, Y_i, \Tilde{Y}_i)$

      where:

      • $\Tilde{C}$ is the revised program
      • $\mathrm{M}$ is the LLM
      • $C$ is the initial program
      • $X_i$ is the input for the $i$-th test
      • $Y_i$ is the expected output for the $i$-th test
      • $\Tilde{Y}_i$ is the execution output for the ii-th test
    • In-execution self-debugging: This approach leverages intermediate runtime states during program execution, dividing the program into basic blocks $C = [B^1, B^2, ..., B^K]$, where $B^k$ is the $k$-th basic block and $K$ is the total number of blocks. For each test input $X_i$, the executor $\mathrm{E}$ updates the variable set iteratively: $V_i^{k+1} = \mathrm{E}(B^k, V_i^k)$, where $V_i^k$ denotes the variable state before executing block $B^k$ and $V_i^{k+1}$ the state after it. The sequence of intermediate states, represented as the execution trace $T = [B^1, V_i^1, ..., B^K, V_i^K]$, provides insights for the LLM $\mathrm{M}$ to refine the program (a second sketch follows this list):

      $\Tilde{C}=\mathrm{M}(C, X_i, T)$

      where:

      • $\Tilde{C}$ is the updated program
      • $\mathrm{M}$ is the LLM
      • $C$ is the initial program
      • $X_i$ is the test input
      • $T$ is the execution trace
  • Experimental results on self-contained Python programming tasks from HumanEval, MBPP, and LiveCodeBench, using GPT-4o (2024-05-13), Claude-3.5-Sonnet, Llama-3-70B-Instruct, and Qwen2.5-Coder-7B-Instruct, reveal the following:
    • Post-execution self-debugging faces challenges on basic problems from HumanEval and MBPP, but shows potential for improvement on the more complex problems in LiveCodeBench.
    • The discrepancy is attributed to the bias introduced by self-generated tests, which refers to the misalignment between self-testing labels and true labels. The efficacy of post-execution self-debugging relies on the model's ability to reflect on feedback and recognize faulty feedback.
    • In-execution self-debugging minimizes bias by focusing solely on intermediate states during execution, leading to improvements on both basic and competitive tasks.
  • The paper analyzes the accuracy of self-generated tests, noting that predicting test outputs is more challenging than generating test inputs. For instance, GPT-4o achieves 97.63% input accuracy and 89.77% output accuracy on HumanEval, with an overall test suite accuracy of 59.15% (a sketch of how such metrics can be tallied follows this list). Similar trends are observed for other models and benchmarks.
  • The paper investigates label changes after the first iteration of self-debugging, observing that self-testing on HumanEval and MBPP is more likely to result in false negative labels, while LiveCodeBench shows more true negative labels due to its challenging problems.
  • The paper finds that post-execution self-debugging using label feedback leads to improvements across all difficulty levels on LiveCodeBench for GPT-4o. However, including detailed feedback can decrease performance on easier problems.
  • In-execution self-debugging is found to be a potentially effective method by leveraging runtime execution information on both basic and competitive programming problems. It mitigates the bias introduced by self-generated tests but depends heavily on the LLMs' code reasoning capabilities.
  • The paper discusses future research directions, including enhancing the quality of LLM-generated tests, implementing iterative refinement processes, and designing sophisticated methods for collecting and analyzing runtime information.
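
To make the post-execution paradigm concrete, below is a minimal sketch of a single refinement step, $\Tilde{C} = \mathrm{M}(C, X_i, Y_i, \Tilde{Y}_i)$. The helper names and the `query_llm(prompt) -> str` wrapper are illustrative assumptions, not an interface defined by the paper.

```python
# Minimal sketch of one post-execution self-debugging step, i.e.
# C_tilde = M(C, X_i, Y_i, Y_tilde_i). Helper names and `query_llm`
# are hypothetical, not from the paper.
import subprocess
import sys


def run_program(code: str, test_input: str, timeout: float = 5.0) -> str:
    """Run a self-contained Python program, feeding the test input on stdin,
    and return its stdout (or the error text / a timeout marker)."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            input=test_input,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return proc.stdout if proc.returncode == 0 else proc.stderr
    except subprocess.TimeoutExpired:
        return "TimeoutError"


def post_execution_debug(code: str, test_input: str, expected: str, query_llm) -> str:
    """Keep the program if the self-generated test passes; otherwise ask the
    LLM for a revised program using the failure feedback."""
    actual = run_program(code, test_input)
    if actual.strip() == expected.strip():
        return code  # note: `expected` comes from a self-generated test and may itself be wrong
    prompt = (
        "The following program failed a test.\n"
        f"Program (C):\n{code}\n"
        f"Test input (X_i):\n{test_input}\n"
        f"Expected output (Y_i):\n{expected}\n"
        f"Actual output / error (Y_tilde_i):\n{actual}\n"
        "Return a corrected version of the program."
    )
    return query_llm(prompt)
```

Because the expected output $Y_i$ comes from a self-generated test, a wrong prediction can mislead this step; that misalignment is exactly the bias discussed above.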
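
The in-execution paradigm, $\Tilde{C} = \mathrm{M}(C, X_i, T)$, can be sketched in a similarly hedged way. Here line-level tracing via `sys.settrace` stands in for the paper's basic-block granularity, the program is assumed to expose a `solve` function, and `query_llm` is again a hypothetical wrapper.

```python
# Minimal sketch of one in-execution self-debugging step, i.e. C_tilde = M(C, X_i, T).
# Line-level tracing approximates the paper's basic-block trace.
import sys


def collect_trace(program: str, test_input, max_steps: int = 200) -> list:
    """Execute `program` (assumed to define a function `solve`) on `test_input`,
    recording the local variable state as each line of `solve` is reached."""
    trace = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code.co_name == "solve" and len(trace) < max_steps:
            trace.append((frame.f_lineno, {k: repr(v) for k, v in frame.f_locals.items()}))
        return tracer

    namespace = {}
    exec(program, namespace)  # define solve() without tracing
    sys.settrace(tracer)
    try:
        namespace["solve"](test_input)
    except Exception as exc:  # a crash is itself useful trace information
        trace.append(("exception", repr(exc)))
    finally:
        sys.settrace(None)
    return trace


def in_execution_debug(program: str, test_input, query_llm) -> str:
    """Ask the LLM to refine the program from its runtime trace alone;
    no expected output (and hence no self-testing label) is involved."""
    trace = collect_trace(program, test_input)
    trace_text = "\n".join(f"{step}: {state}" for step, state in trace)
    prompt = (
        "Inspect the runtime trace of this program and fix any bug you find.\n"
        f"Program (C):\n{program}\n"
        f"Test input (X_i): {test_input!r}\n"
        f"Execution trace (T):\n{trace_text}\n"
        "Return the corrected program."
    )
    return query_llm(prompt)
```

Since no expected output appears in the prompt, a wrong self-generated output cannot mislead the refinement step, which is the bias-mitigation property highlighted above.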
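
Finally, a hedged sketch of how the test-accuracy metrics might be tallied, assuming "input accuracy" means a generated input is accepted by a reference (oracle) solution, "output accuracy" means the predicted output matches the oracle's output, and "test suite accuracy" means every test generated for a problem is correct; the paper's exact definitions may differ.

```python
# Hedged sketch of input / output / test-suite accuracy over self-generated tests.
from typing import Callable, Iterable, Tuple


def score_test_suites(
    suites: Iterable[Iterable[Tuple[str, str]]],  # per-problem lists of (input, predicted_output)
    oracles: Iterable[Callable[[str], str]],      # per-problem reference solutions
) -> Tuple[float, float, float]:
    """Return (input accuracy, output accuracy, suite accuracy) over all problems."""
    valid_inputs = correct_outputs = total_tests = 0
    correct_suites = total_suites = 0
    for tests, oracle in zip(suites, oracles):
        suite_ok = True
        for test_input, predicted in tests:
            total_tests += 1
            try:
                reference = oracle(test_input)  # input counts as valid if the oracle accepts it
                valid_inputs += 1
            except Exception:
                suite_ok = False
                continue
            if predicted == reference:
                correct_outputs += 1
            else:
                suite_ok = False
        total_suites += 1
        correct_suites += suite_ok  # True/False adds as 1/0
    return (
        valid_inputs / total_tests,
        correct_outputs / total_tests,
        correct_suites / total_suites,
    )
```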
Authors (10)
  1. Xiancai Chen (8 papers)
  2. Zhengwei Tao (16 papers)
  3. Kechi Zhang (22 papers)
  4. Changzhi Zhou (4 papers)
  5. Wanli Gu (3 papers)
  6. Yuanpeng He (30 papers)
  7. Mengdi Zhang (37 papers)
  8. Xunliang Cai (63 papers)
  9. Haiyan Zhao (42 papers)
  10. Zhi Jin (160 papers)