Leveraging Print Debugging to Improve Code Generation in Large Language Models (2401.05319v1)

Published 10 Jan 2024 in cs.CL and cs.SE

Abstract: LLMs have made significant progress in code generation tasks, but their performance in tackling programming problems with complex data structures and algorithms remains suboptimal. To address this issue, we propose an in-context learning approach that guides LLMs to debug by using a "print debugging" method, which involves inserting print statements to trace variables and analyzing logs to fix bugs. We collect a Leetcode problem dataset and evaluate our method using the Leetcode online judging system. Experiments with GPT-4 demonstrate the effectiveness of our approach, outperforming rubber duck debugging in easy and medium-level Leetcode problems by 1.5% and 17.9%, respectively.


Summary

  • The paper introduces a novel in-context learning method that uses print debugging to guide LLMs in identifying and resolving coding errors.
  • The methodology involves inserting print statements to capture execution logs, enabling precise tracing of variable states and bug localization.
  • Experiments with GPT-4 on Leetcode problems showed that the approach outperforms rubber duck debugging, especially by 17.9% on medium-level tasks.

The paper introduces a novel in-context learning approach to enhance code generation in LLMs by guiding them to debug using a "print debugging" method. This technique involves inserting print statements to trace variables and analyze logs, thereby facilitating bug identification and resolution.

The authors collected a dataset of Leetcode problems to evaluate the proposed method, utilizing the Leetcode online judging system for assessment. Experiments conducted with GPT-4 demonstrated the effectiveness of the "print debugging" approach, which outperformed rubber duck debugging in easy and medium-level Leetcode problems by 1.5% and 17.9%, respectively.

The core idea revolves around emulating the print debugging practices employed by human programmers. The method leverages in-context learning to guide LLMs through the debugging process, which includes adding print statements, executing the modified code, and analyzing the generated logs to identify and fix bugs. This iterative process continues until the generated code passes all test cases or a predefined stopping criterion is met.
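To make this loop concrete, below is a minimal sketch of the iterative procedure as described, not the authors' implementation. The helper callables (`generate_code`, `run_tests`, `add_prints`, `execute_with_logging`, `analyze_and_fix`) are hypothetical stand-ins for the LLM prompts and the code executor.

```python
def print_debug_loop(problem, test_cases, generate_code, add_prints,
                     run_tests, execute_with_logging, analyze_and_fix,
                     max_stale_rounds=3):
    """Iteratively debug LLM-generated code with print statements.

    The callables are illustrative stand-ins:
      generate_code(problem)            -> initial solution string
      run_tests(code, tests)            -> list of failing test cases
      add_prints(code, failed_case)     -> code instrumented with prints
      execute_with_logging(code, case)  -> captured stdout / error messages
      analyze_and_fix(code, case, log)  -> revised solution string
    """
    code = generate_code(problem)
    best_passed, stale_rounds = -1, 0

    while True:
        failed = run_tests(code, test_cases)
        if not failed:
            return code  # all test cases pass: stop

        passed = len(test_cases) - len(failed)
        if passed > best_passed:
            best_passed, stale_rounds = passed, 0
        else:
            stale_rounds += 1
            if stale_rounds >= max_stale_rounds:
                return code  # no improvement for several rounds: stop

        # One debugging round: instrument, execute, analyze, fix.
        instrumented = add_prints(code, failed[0])
        log = execute_with_logging(instrumented, failed[0])
        code = analyze_and_fix(code, failed[0], log)
```

The stopping criterion in the sketch mirrors the one reported in the paper's experimental setup: terminate once all test cases pass or after three consecutive rounds without improvement.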

The paper details the methodology, which comprises the following steps:

  • Adding Print Statements: LLMs are prompted to insert print statements into the code to output variable values and trace the program's execution flow. Given a piece of code S = [s_1, s_2, ..., s_i, ..., s_n], where s_i represents a line of code, the modified code with added print statements is represented as S_p = [p_1, s_1, p_2, s_2, ..., p_i, s_i, ..., p_n, s_n, p_{n+1}], where p_i denotes a potential print statement (see the sketch after this list).
  • Execution: The code, including the added print statements, is executed using the failed test case. The output generated by the print statements is captured, along with any error messages provided by the interpreter.
  • Analyzing and Fixing: The LLMs are instructed to explain the test case and the generated log, comparing them to identify inconsistencies and pinpoint the buggy code.
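The instrumentation and execution steps can be illustrated with a toy example. The buggy `max_subarray` function and the failing test case below are hypothetical and not taken from the paper; capturing stdout with `contextlib.redirect_stdout` is just one way to collect the log that, together with an explanation of the test case, would be fed back to the model.

```python
import contextlib
import io

# Toy buggy solution S: intended to return the maximum subarray sum, but the
# running sum is never reset, so it fails when a negative prefix drags it down.
def max_subarray(nums):
    best, cur = nums[0], 0
    for x in nums:
        cur += x                      # bug: should be cur = max(cur + x, x)
        best = max(best, cur)
    return best

# Instrumented version S_p: print statements p_i interleaved with the original
# lines s_i to expose intermediate variable values.
def max_subarray_debug(nums):
    print(f"input nums={nums}")
    best, cur = nums[0], 0
    for x in nums:
        cur += x
        print(f"x={x} cur={cur} best={best}")
        best = max(best, cur)
    print(f"return {best}")
    return best

# Execute on the failing test case and capture the log.
failed_case = [-2, 1, -3, 4]          # expected output: 4, buggy code returns 0
buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    max_subarray_debug(failed_case)
log = buf.getvalue()
print(log)                            # trace shows cur carrying negative prefixes
```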

The paper compares the proposed method against several baselines: simple feedback, unit test feedback, and rubber duck debugging. The experimental results indicate that the "print debugging" method is particularly effective for problems involving complex data structures and algorithms. While it showed marginal improvement in easy-level Leetcode problems, it significantly outperformed other methods in medium-level problems. However, for hard-level problems, none of the debugging methods yielded substantial improvements.

An ablation study was conducted to assess the impact of different components of the proposed method. The results indicated that using both test case explanations and logs is crucial for effectively debugging the code. Removing either component resulted in a drop in performance.

Further analysis was performed to evaluate the effectiveness of the "print debugging" method. The results showed that, on average, the model added 2.51 print statements per debugging round, and the generated logs typically comprised fewer than 20 lines, which the authors consider an appropriate length for effective analysis by current LLMs.

The authors collected programming problems from the Leetcode platform, which were categorized into three levels: easy, medium, and hard. The dataset includes problems released after September 2019. The solutions generated by the models were submitted to the Leetcode platform for evaluation, and the platform provided results on whether all test cases were passed.

The authors use gpt-4-32k for all experiments, and the model has access to all test cases in their experimental setting. They employ one-shot prompting to guide the model, repeating the debugging procedure until either all test cases pass or three consecutive rounds fail to yield any improvement. The temperature is set to 0 and the maximum output length to 4096 tokens. Accuracy, defined as the percentage of problems that pass all test cases, is used as the evaluation metric.
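As a rough illustration of these settings, the sketch below shows how a single model call might be configured with the reported decoding parameters. The use of the `openai` Python client and the `gpt-4-32k` model identifier here are assumptions; the paper does not describe its exact API interface.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def call_model(system_prompt, user_prompt):
    """One generation/debugging step with the reported settings:
    gpt-4-32k, temperature 0, and a 4096-token output limit."""
    response = client.chat.completions.create(
        model="gpt-4-32k",
        messages=[
            {"role": "system", "content": system_prompt},
            # one-shot debugging example plus the current problem/logs
            {"role": "user", "content": user_prompt},
        ],
        temperature=0,
        max_tokens=4096,
    )
    return response.choices[0].message.content
```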

In summary, the key contributions of the paper are:

  • A novel approach that leverages LLMs to execute print debugging.
  • A new dataset of programming problems comprising recent Leetcode questions across three difficulty levels.
  • Experiments with GPT-4 demonstrating that the proposed approach yields significant improvements compared to rubber duck debugging.