Leveraging Print Debugging to Improve Code Generation in Large Language Models (2401.05319v1)

Published 10 Jan 2024 in cs.CL and cs.SE

Abstract: LLMs have made significant progress in code generation tasks, but their performance in tackling programming problems with complex data structures and algorithms remains suboptimal. To address this issue, we propose an in-context learning approach that guides LLMs to debug by using a "print debugging" method, which involves inserting print statements to trace variables and analyzing logs to fix bugs. We collect a Leetcode problem dataset and evaluate our method using the Leetcode online judging system. Experiments with GPT-4 demonstrate the effectiveness of our approach, outperforming rubber duck debugging in easy and medium-level Leetcode problems by 1.5% and 17.9%, respectively.


Summary

  • The paper introduces a novel in-context learning method that uses print debugging to guide LLMs in identifying and resolving coding errors.
  • The methodology involves inserting print statements to capture execution logs, enabling precise tracing of variable states and bug localization.
  • Experiments with GPT-4 on Leetcode problems showed that the approach outperforms rubber duck debugging, especially by 17.9% on medium-level tasks.

The paper introduces a novel in-context learning approach to enhance code generation in LLMs by guiding them to debug using a "print debugging" method. This technique involves inserting print statements to trace variables and analyze logs, thereby facilitating bug identification and resolution.

The authors collected a dataset of Leetcode problems to evaluate the proposed method, utilizing the Leetcode online judging system for assessment. Experiments conducted with GPT-4 demonstrated the effectiveness of the "print debugging" approach, which outperformed rubber duck debugging in easy and medium-level Leetcode problems by 1.5% and 17.9%, respectively.

The core idea revolves around emulating the print debugging practices employed by human programmers. The method leverages in-context learning to guide LLMs through the debugging process, which includes adding print statements, executing the modified code, and analyzing the generated logs to identify and fix bugs. This iterative process continues until the generated code passes all test cases or a predefined stopping criterion is met.
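To make this loop concrete, below is a minimal sketch of the iterative procedure as described, not the authors' implementation. The helper callables (`generate_code`, `run_tests`, `add_prints`, `execute_with_logging`, `analyze_and_fix`) are hypothetical stand-ins for the LLM prompts and the code executor.

```python
def print_debug_loop(problem, test_cases, generate_code, add_prints,
                     run_tests, execute_with_logging, analyze_and_fix,
                     max_stale_rounds=3):
    """Iteratively debug LLM-generated code with print statements.

    The callables are illustrative stand-ins:
      generate_code(problem)            -> initial solution string
      run_tests(code, tests)            -> list of failing test cases
      add_prints(code, failed_case)     -> code instrumented with prints
      execute_with_logging(code, case)  -> captured stdout / error messages
      analyze_and_fix(code, case, log)  -> revised solution string
    """
    code = generate_code(problem)
    best_passed, stale_rounds = -1, 0

    while True:
        failed = run_tests(code, test_cases)
        if not failed:
            return code  # all test cases pass: stop

        passed = len(test_cases) - len(failed)
        if passed > best_passed:
            best_passed, stale_rounds = passed, 0
        else:
            stale_rounds += 1
            if stale_rounds >= max_stale_rounds:
                return code  # no improvement for several rounds: stop

        # One debugging round: instrument, execute, analyze, fix.
        instrumented = add_prints(code, failed[0])
        log = execute_with_logging(instrumented, failed[0])
        code = analyze_and_fix(code, failed[0], log)
```

The stopping criterion in the sketch mirrors the one reported in the paper's experimental setup: terminate once all test cases pass or after three consecutive rounds without improvement.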

The paper details the methodology, which comprises the following steps:

  • Adding Print Statements: LLMs are prompted to insert print statements into the code to output variable values and trace the program's execution flow. Given a piece of code S = [s_1, s_2, ..., s_i, ..., s_n], where s_i represents a line of code, the modified code with added print statements is represented as S_p = [p_1, s_1, p_2, s_2, ..., p_i, s_i, ..., p_n, s_n, p_{n+1}], where p_i denotes a potential print statement (see the sketch after this list).
  • Execution: The code, including the added print statements, is executed using the failed test case. The output generated by the print statements is captured, along with any error messages provided by the interpreter.
  • Analyzing and Fixing: The LLMs are instructed to explain the test case and the generated log, comparing them to identify inconsistencies and pinpoint the buggy code.
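The instrumentation and execution steps can be illustrated with a toy example. The buggy `max_subarray` function and the failing test case below are hypothetical and not taken from the paper; capturing stdout with `contextlib.redirect_stdout` is just one way to collect the log that, together with an explanation of the test case, would be fed back to the model.

```python
import contextlib
import io

# Toy buggy solution S: intended to return the maximum subarray sum, but the
# running sum is never reset, so it fails when a negative prefix drags it down.
def max_subarray(nums):
    best, cur = nums[0], 0
    for x in nums:
        cur += x                      # bug: should be cur = max(cur + x, x)
        best = max(best, cur)
    return best

# Instrumented version S_p: print statements p_i interleaved with the original
# lines s_i to expose intermediate variable values.
def max_subarray_debug(nums):
    print(f"input nums={nums}")
    best, cur = nums[0], 0
    for x in nums:
        cur += x
        print(f"x={x} cur={cur} best={best}")
        best = max(best, cur)
    print(f"return {best}")
    return best

# Execute on the failing test case and capture the log.
failed_case = [-2, 1, -3, 4]          # expected output: 4, buggy code returns 0
buf = io.StringIO()
with contextlib.redirect_stdout(buf):
    max_subarray_debug(failed_case)
log = buf.getvalue()
print(log)                            # trace shows cur carrying negative prefixes
```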

The paper compares the proposed method against several baselines: simple feedback, unit test feedback, and rubber duck debugging. The experimental results indicate that the "print debugging" method is particularly effective for problems involving complex data structures and algorithms. While it showed marginal improvement in easy-level Leetcode problems, it significantly outperformed other methods in medium-level problems. However, for hard-level problems, none of the debugging methods yielded substantial improvements.

An ablation study was conducted to assess the impact of different components of the proposed method. The results indicated that using both test case explanations and logs is crucial for effectively debugging the code. Removing either component resulted in a drop in performance.

Further analysis was performed to evaluate the effectiveness of the "print debugging" method. The results showed that, on average, the model added 2.51 print statements per debugging round, and the generated logs typically comprised fewer than 20 lines, which the authors consider an appropriate length for effective analysis by current LLMs.

The authors collected programming problems from the Leetcode platform, which were categorized into three levels: easy, medium, and hard. The dataset includes problems released after September 2019. The solutions generated by the models were submitted to the Leetcode platform for evaluation, and the platform provided results on whether all test cases were passed.

The authors use gpt-4-32k for all experiments, and the model has access to all test cases in their experimental setting. They employ one-shot prompting to guide the model, repeating the debugging procedure until either all test cases pass or three consecutive rounds fail to yield any improvement. The temperature is set to 0 and the maximum output length to 4096 tokens. Accuracy, defined as the percentage of problems that pass all test cases, is used as the evaluation metric.
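As a rough illustration of these settings, the sketch below shows how a single model call might be configured with the reported decoding parameters. The use of the `openai` Python client and the `gpt-4-32k` model identifier here are assumptions; the paper does not describe its exact API interface.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def call_model(system_prompt, user_prompt):
    """One generation/debugging step with the reported settings:
    gpt-4-32k, temperature 0, and a 4096-token output limit."""
    response = client.chat.completions.create(
        model="gpt-4-32k",
        messages=[
            {"role": "system", "content": system_prompt},
            # one-shot debugging example plus the current problem/logs
            {"role": "user", "content": user_prompt},
        ],
        temperature=0,
        max_tokens=4096,
    )
    return response.choices[0].message.content
```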

In summary, the key contributions of the paper are:

  • A novel approach that leverages LLMs to execute print debugging.
  • A new dataset of programming problems comprising recent Leetcode questions across three difficulty levels.
  • Experiments with GPT-4 demonstrating that the proposed approach yields significant improvements compared to rubber duck debugging.