Debug like a Human: A Large Language Model Debugger via Verifying Runtime Execution Step-by-step (2402.16906v6)

Published 25 Feb 2024 in cs.SE, cs.AI, and cs.CL

Abstract: LLMs are leading significant progress in code generation. Beyond one-pass code generation, recent works further integrate unit tests and program verifiers into LLMs to iteratively refine the generated programs. However, these works consider the generated programs as an indivisible entity, which falls short for LLMs in debugging the programs, especially when the programs contain complex logic flows and data operations. In contrast, when human developers debug programs, they typically set breakpoints and selectively examine runtime execution information. The execution flow and the intermediate variables play a crucial role in the debugging process, yet they are underutilized in the existing literature on code generation. In this study, we introduce LLM Debugger (LDB), a novel debugging framework that enables LLMs to refine their generated programs with the runtime execution information. Specifically, LDB segments the programs into basic blocks and tracks the values of intermediate variables after each block throughout the runtime execution. This allows LLMs to concentrate on simpler code units within the overall execution flow, verify their correctness against the task description block by block, and efficiently pinpoint any potential errors. Experiments demonstrate that LDB consistently enhances the baseline performance by up to 9.8% across the HumanEval, MBPP, and TransCoder benchmarks, achieving new state-of-the-art performance in code debugging for various LLM selections.

An Expert Overview of "LDB: A Large Language Model Debugger via Verifying Runtime Execution Step by Step"

In this paper, the authors introduce a novel debugging framework, the LLM Debugger (LDB), which emulates human debugging practices on programs generated by LLMs. The primary innovation of LDB lies in incorporating runtime execution information into the iterative refinement of generated programs.

Key Contributions

  1. Introduction of LDB Framework: LDB is designed to provide a systematic approach to debugging by leveraging runtime execution traces. The framework segments the code into basic blocks and verifies each block against the task description in a step-by-step manner.
  2. Runtime Execution Information: LDB tracks the values of intermediate variables after each basic block of runtime execution. This mirrors the practical debugging procedures of human developers who set breakpoints and analyze intermediate states.
  3. Experimental Validation: Experiments conducted on multiple benchmarks—HumanEval, MBPP, and TransCoder—demonstrate that LDB consistently enhances the baseline performance by up to 9.8%. This improvement underscores LDB's efficacy across various LLM backbones including GPT-3.5, StarCoder, and CodeLlama.

Methodological Advances

Profiling

LDB performs profiling by collecting runtime execution information using a failed visible test case. The key steps, illustrated in the sketch after this list, include:

  • Execution Traces: By mapping each program to a control flow graph (CFG), LDB segments the execution trace into basic blocks.
  • Intermediate States: LDB determines the runtime values of variables at the end of each basic block, which are critical for debugging the code incrementally.
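
The paper maps each program to a CFG and records states at basic-block boundaries. As a rough illustration only, the sketch below approximates this in Python using `sys.settrace`, recording per-line intermediate states rather than true block boundaries; `profile_run` and its return format are assumptions for this sketch, not the authors' implementation.

```python
import sys

def profile_run(func, test_input):
    """Run `func` on a failing visible test input and record intermediate states.

    A simplified stand-in for LDB's profiling step: the real framework groups the
    execution trace into basic blocks of a control flow graph, while this sketch
    records a (line number, local variables) snapshot at every executed line.
    """
    trace = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is func.__code__:
            # Shallow snapshot of the intermediate variable values at this point.
            trace.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result, error = func(*test_input), None
    except Exception as exc:  # a crashing test case still yields a useful trace
        result, error = None, exc
    finally:
        sys.settrace(None)
    return result, error, trace
```

Grouping consecutive line events that fall inside the same CFG basic block would recover the block-level granularity described in the paper.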

Debugging

With the execution trace and intermediate states in hand, LDB proceeds to the actual debugging (a sketch follows this list):

  • Debugging Verdicts and Explanations: For each intermediate state, the framework queries the LLM to verify the correctness of the corresponding code block and provide explanations if any discrepancies are found.
  • Selective and Batch Debugging: To handle lengthy execution traces, LDB employs selective debugging and batches the queries, improving both the efficiency and efficacy of the process.
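
A minimal sketch of how such block-by-block queries could be batched, assuming a generic `llm` callable that takes a prompt string and returns text; the prompt wording, the `blocks` format, and the batch size are illustrative assumptions rather than the authors' actual templates.

```python
def debug_blocks(llm, task_description, blocks, batch_size=3):
    """Ask the LLM for a verdict and explanation for each (code, state) block.

    `blocks` is a list of (code_snippet, intermediate_state) pairs produced by
    profiling. Batching several blocks per query keeps long execution traces
    within the context window, in the spirit of LDB's selective/batch debugging.
    """
    findings = []
    for start in range(0, len(blocks), batch_size):
        prompt = f"Task description:\n{task_description}\n\n"
        for i, (code, state) in enumerate(blocks[start:start + batch_size], start):
            prompt += (
                f"[Block {i}]\n{code}\n"
                f"Intermediate variables after this block: {state}\n\n"
            )
        prompt += (
            "For each block above, answer CORRECT or INCORRECT and briefly "
            "explain any discrepancy with the task description."
        )
        findings.append(llm(prompt))
    return findings
```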

Regeneration

Using the debugging insights, LDB iteratively refines the program (the overall loop is sketched after this list):

  • The intermediate states and task description are incorporated into the prompt to regenerate the refined program.
  • This iterative approach continues until the program passes all visible tests or reaches the maximum number of iterations.
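
Combining the three stages, the overall refinement loop might look like the sketch below. The injected callables are hypothetical stand-ins for the corresponding components, not the paper's actual API; `debug_blocks` refers to the previous sketch.

```python
def ldb_refine(llm, task_description, visible_tests,
               generate_program, run_visible_tests, profile_failed_test,
               max_iters=10):
    """Regenerate a program until it passes all visible tests or the budget runs out.

    A simplified view of LDB's generate -> profile -> debug -> regenerate loop.
    `generate_program` prompts the LLM for code (optionally with debugging
    feedback), `run_visible_tests` returns the failing test cases, and
    `profile_failed_test` returns (code_block, intermediate_state) pairs as in
    the profiling sketch above.
    """
    program = generate_program(llm, task_description)
    for _ in range(max_iters):
        failed = run_visible_tests(program, visible_tests)
        if not failed:
            return program  # all visible tests pass
        blocks = profile_failed_test(program, failed[0])
        findings = debug_blocks(llm, task_description, blocks)
        # Fold the block-level verdicts and explanations back into regeneration.
        program = generate_program(llm, task_description, feedback=findings)
    return program
```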

Experimental Results

The authors validate the effectiveness of LDB across three benchmarks:

  • HumanEval: LDB achieved a 9.1% improvement over the baseline for text-to-code generation.
  • MBPP: LDB demonstrated an 8.4% improvement, showcasing its robustness in refining complex logic flows.
  • TransCoder: LDB achieved a 5.4% improvement, highlighting its utility in code translation tasks.

Moreover, LDB showed substantial improvements even when starting with programs generated by more advanced LLMs, such as GPT-4 and Reflexion, detecting subtle bugs overlooked by these models.

Theoretical and Practical Implications

Theoretically, LDB introduces a new paradigm in the landscape of debugging with LLMs by incorporating runtime execution feedback. The segmentation of code into basic blocks and the use of runtime information enable finer-grained analysis and correction of errors.

Practically, the implementation of LDB can significantly aid in various downstream applications requiring correct code generation, such as software development, automated coding assistance, and educational tools for programming. The framework's ability to refine generated code iteratively promises advancements in achieving higher levels of accuracy and reliability in automated code generation tasks.

Future Directions

Future developments could explore:

  • Integration with More Complex Runtime Environments: Extending LDB to handle more complex languages and runtime environments could broaden its applicability.
  • Enhanced Debugging Algorithms: Developing more sophisticated algorithms for debugging at different levels of runtime granularity could further optimize performance.
  • Scalability and Efficiency: Conducting further studies on the scalability and efficiency of LDB with larger and more diverse datasets could provide deeper insights and improvements.

Overall, the LDB framework represents a significant step forward in the field of debugging for LLM-generated code, providing both theoretical contributions and practical tools for enhancing the accuracy and robustness of automated code generation.

Authors (3)
  1. Zilong Wang (99 papers)
  2. Jingbo Shang (141 papers)
  3. Li Zhong (16 papers)
Citations (24)