- The paper found that GPT-3.5 can explain programming errors effectively from the erroneous source code alone, with roughly 70.8% of explanations rated useful even without traditional error messages.
- One-shot prompting and fine-tuning did not substantially improve the accuracy of explanations, but fine-tuning noticeably improved their conciseness and relevance.
- The findings suggest that LLMs, especially fine-tuned ones, can serve as valuable tools in programming education by bridging the understanding gap left by unclear error messages.
Debugging Without Error Messages: Influence of LLM Prompting Strategies on Programming Error Explanation
This paper explores an approach to improving programming error explanations by leveraging LLMs, specifically GPT-3.5, in contexts where traditional error messages are omitted. The motivation stems from the well-documented shortcomings of standard programming error messages, which are often criticized as terse, confusing, and laden with jargon that is particularly daunting for novice programmers.
Objective and Methodology
The core objective of the study is to assess how well GPT-3.5 can explain programming errors using only the erroneous source code, without the original compiler or interpreter error messages. This scenario addresses the critique that typical error messages often contribute little to understanding or resolving programming issues. The research evaluates several strategies for generating error explanations: a baseline prompt, one-shot prompting, and a fine-tuned model.
The researchers collected erroneous code samples from the TigerJython dataset, which consists of the kind of code novices might write. They manually crafted explanations for a portion of these samples to serve as benchmarks. The LLM was then tasked with generating explanations for the erroneous code snippets under three conditions: the baseline strategy, one-shot prompting with an example, and a fine-tuned model trained on manually curated examples.
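The paper's exact prompts are not reproduced here, but the setup can be illustrated with a brief sketch. Assuming the OpenAI chat completions API and the gpt-3.5-turbo model, the baseline and one-shot conditions might look roughly like the following; the system prompt, the example error, and the function names are illustrative assumptions rather than the study's actual materials. The fine-tuned condition would reuse the baseline call with the fine-tuned model's identifier in place of gpt-3.5-turbo.

```python
# Minimal sketch of the baseline and one-shot conditions (not the paper's actual
# harness). Assumes the OpenAI Python SDK; prompt wording and the worked example
# below are illustrative placeholders, not materials from the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = "Explain the error in the following program to a novice programmer."

# A hypothetical one-shot pair: a buggy snippet plus a hand-written explanation.
ONE_SHOT_CODE = "for i in range(10)\n    print(i)"
ONE_SHOT_EXPLANATION = (
    "The for statement is missing a colon at the end of its first line. "
    "In Python, a colon must follow the loop header before the indented body."
)

def explain_baseline(buggy_code: str) -> str:
    """Baseline condition: only the erroneous source code, no error message."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": buggy_code},
        ],
    )
    return response.choices[0].message.content

def explain_one_shot(buggy_code: str) -> str:
    """One-shot condition: the same request preceded by one worked example."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": ONE_SHOT_CODE},
            {"role": "assistant", "content": ONE_SHOT_EXPLANATION},
            {"role": "user", "content": buggy_code},
        ],
    )
    return response.choices[0].message.content
```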
Findings
The study reveals that even without compiler error messages, GPT-3.5 can generate useful error explanations: approximately 70.8% of the explanations produced from source code alone were rated as useful (either instrumental or helpful). Carefully crafted prompts, whether supplied as a worked example in one-shot prompting or baked into a fine-tuned model, did not significantly change the accuracy of the explanations, but they did noticeably reduce extraneous information. The fine-tuned model provided the most concise and relevant feedback, a result attributed to its training on focused examples of desirable output.
Implications
The paper's findings carry substantial implications for the deployment of generative AI in programming education. First, they suggest reconsidering the reliance on traditional error messages in educational settings: LLMs can serve as an effective tool to bridge the understanding gap caused by vague or misleading error messages. This role becomes particularly relevant in environments built on educational programming languages, where students may face atypical and less documented error cases.
Furthermore, the effectiveness and efficiency of feedback from fine-tuned models present a compelling case for customizing AI tools to fit specific educational settings. While the process of fine-tuning may require some initial investment in time and resources to gather diverse programming error instances, the improvement in output relevance and conciseness can significantly aid novice learners by minimizing extraneous cognitive load.
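As a rough illustration of that investment, the sketch below shows one plausible way to assemble such training data, assuming OpenAI's chat fine-tuning JSONL format; the buggy snippet and its explanation are hypothetical placeholders, not items from the paper's curated set.

```python
# Illustrative sketch of preparing fine-tuning data for this use case.
# Assumes OpenAI's chat fine-tuning JSONL format; the example pair is
# hypothetical, not drawn from the paper's dataset.
import json

SYSTEM_PROMPT = "Explain the error in the following program to a novice programmer."

# Each pair: a buggy novice program and a concise, hand-written explanation.
curated_pairs = [
    (
        "print('Hello'\n",
        "The closing parenthesis of the print call is missing, so Python "
        "reaches the end of the file while still expecting one.",
    ),
    # ... more manually curated (code, explanation) pairs ...
]

# Write one JSON object per line, as expected by the fine-tuning upload.
with open("error_explanations.jsonl", "w", encoding="utf-8") as f:
    for code, explanation in curated_pairs:
        record = {
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": code},
                {"role": "assistant", "content": explanation},
            ]
        }
        f.write(json.dumps(record) + "\n")
```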
Future Directions
The paper opens several avenues for future research. Exploring more advanced LLM architectures and larger datasets could deepen understanding of the potential and limitations of LLMs in educational contexts. In addition, empirical studies of actual student interactions with AI-generated explanations would provide valuable evidence about their practical effectiveness. Such studies could inform how LLM responses are refined, not only to enhance learning outcomes but also to foster greater learner independence in debugging.
In conclusion, the paper illustrates that while LLMs alone may not completely replace the pedagogic need for thoughtfully designed learning experiences, they offer promising prospects for enhancing the explanatory frameworks available to novice programmers, suggesting a pivotal role for AI in future educational landscapes.