ChatDBG: Dialogue-Based Debugging Assistant

Updated 20 September 2025
  • ChatDBG is an AI-augmented, dialogue-based debugging assistant that lets programmers interact with conventional debuggers through natural language queries.
  • It integrates large language models into tools like GDB, LLDB, and Pdb, enabling automated stack frame navigation, program state inspection, and command execution.
  • Evaluations demonstrate high rates of actionable bug resolution, with improved fix rates through iterative dialogues in both native and interpreted code.

ChatDBG is an AI-augmented, dialogue-based debugging assistant that integrates LLMs into the control loop of conventional debuggers. Its core innovation is to allow programmers to engage in natural language dialogues about program state, root cause analysis, and open-ended diagnostics (e.g., “why is x null?”). The system delegates autonomous agency to the LLM: the model may issue debugger commands independently, navigate through stack frames, inspect program state, and report its findings in response to human queries. ChatDBG can articulate both step-by-step explanations and actionable bug fixes, drawing on the world knowledge encoded in LLMs and enriched program state assembled from conventional debugging contexts. The ChatDBG methodology has been realized as an extension to widely used debuggers—including LLDB, GDB, and Python’s Pdb—supporting both native and interpreted languages in static scripts and interactive Jupyter sessions. Evaluation demonstrates high rates of actionable bug resolution among real-world programs, and community adoption underscores its impact on contemporary debugging workflows (Levin et al., 25 Mar 2024).

1. System Architecture and Integration

ChatDBG augments traditional debuggers by inserting an “LLM agent” into the command-processing loop. Its runtime flow distinguishes between two classes of user input: standard debugger commands (e.g., “p num_trials”, “bt”) are routed directly to the debugger, while natural language queries (“why is var null?”, “what does stats mean?”) are flagged and forwarded to an LLM.
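A minimal sketch of this routing step (the command set and the debugger/query_llm helpers are illustrative assumptions, not ChatDBG's actual interfaces):

```python
# Route each input line: known debugger commands go to the debugger,
# everything else is treated as a natural language query for the LLM.
DEBUGGER_COMMANDS = {"p", "bt", "up", "down", "info", "break", "continue", "next", "step"}

def route_input(line: str, debugger, query_llm) -> str:
    stripped = line.strip()
    first_token = stripped.split(maxsplit=1)[0] if stripped else ""
    if first_token in DEBUGGER_COMMANDS:
        return debugger.run_command(line)   # e.g. "p num_trials", "bt"
    return query_llm(line)                  # e.g. "why is var null?"
```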

The LLM receives an enriched prompt constructed from multiple components:

  • I: contextual instructions (system prompt, debugging target)
  • S: current stack trace (with extended source code for frames and variable values)
  • U: user inputs and queries
  • E: most recent error messages or assertion failures
  • H: dialogue and execution history

This prompt P is conceptually assembled as P = I ⊕ S ⊕ U ⊕ E ⊕ H, where ⊕ denotes the ordered concatenation of these sources of context.
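A sketch of that assembly, rendering each component as a labeled text segment (the serialization shown is an assumption; ChatDBG's exact prompt format differs):

```python
def assemble_prompt(instructions, stack, user_inputs, error, history) -> str:
    """P = I ⊕ S ⊕ U ⊕ E ⊕ H: ordered concatenation of context sources."""
    parts = [
        ("Instructions", instructions),   # I: system prompt, debugging target
        ("Stack", stack),                 # S: frames, source windows, variable values
        ("User", user_inputs),            # U: user inputs and queries
        ("Error", error),                 # E: most recent error or assertion message
        ("History", history),             # H: dialogue and execution history
    ]
    return "\n\n".join(f"### {label}\n{text}" for label, text in parts if text)
```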

During “autonomous” episodes, the LLM communicates with the underlying debugger via APIs or special function calls, such as debug("p len(stats)") or info(function_name). The system exploits function-calling capabilities in modern LLMs (e.g., OpenAI’s platform) to invoke these commands in a controlled sequence, with the LLM interleaving debug actions and explanatory text before yielding control to the human operator.
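As a sketch, the debug call can be exposed to the model as a tool definition in the OpenAI function-calling format; the schema below is illustrative rather than ChatDBG's actual definition:

```python
tools = [{
    "type": "function",
    "function": {
        "name": "debug",
        "description": "Run one command in the underlying debugger "
                       "(e.g. 'p len(stats)') and return its output.",
        "parameters": {
            "type": "object",
            "properties": {
                "command": {
                    "type": "string",
                    "description": "The debugger command to execute.",
                }
            },
            "required": ["command"],
        },
    },
}]

# Each tool call emitted by the model is executed against the live debugger,
# and its output is appended to the conversation before the next model turn.
```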

2. Autonomous Root Cause Analysis and Interaction

A salient advance in ChatDBG is its capacity for autonomous root-cause analysis. Upon encountering a query (such as “why doesn’t stats have 5 elements?”), the LLM interrogates program state:

  • Traverses stack frames and inspects local variables
  • Retrieves 10+ lines of source code per frame (beyond typical debugger snippets)
  • For Jupyter/IPython sessions, triggers backward slicing across cells (via tools like ipyflow) to reconstruct cross-cell dependencies and state provenance

The LLM can then iteratively request additional information, pose clarifying sub-queries (“show type(stats)”), or suggest and execute corrective actions. Throughout, ChatDBG maintains rich context, omitting library frames extraneous to user code, and surfaces relevant local and global variables for domain-specific reasoning.
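A minimal sketch of this state-gathering step for Python, using standard traceback and frame objects; the site-packages filter and ten-line window are illustrative heuristics, not ChatDBG's actual logic:

```python
import linecache

def summarize_frames(tb, window: int = 10):
    """Walk a traceback, skipping library frames, and collect local
    variables plus an extended source window for each user frame."""
    summaries = []
    while tb is not None:
        frame = tb.tb_frame
        filename = frame.f_code.co_filename
        if "site-packages" not in filename:  # crude user-code filter (illustrative)
            lineno = tb.tb_lineno
            source_lines = [
                linecache.getline(filename, n).rstrip()
                for n in range(max(1, lineno - window), lineno + window + 1)
            ]
            summaries.append({
                "function": frame.f_code.co_name,
                "file": filename,
                "line": lineno,
                "locals": {k: repr(v) for k, v in frame.f_locals.items()},
                "source": "\n".join(source_lines),
            })
        tb = tb.tb_next
    return summaries
```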

Notably, ChatDBG can perform domain-aware diagnosis. For failures in bootstrap simulations, it may leverage statistical concepts (e.g., the Law of Large Numbers) and best practices (e.g., a sufficient number of trials), or detect domain-typical idioms (such as a wrong reduction over an array). This enables actionable feedback (e.g., “Replace the return value with np.mean(sample == 'B') for correct aggregation”), as illustrated below.
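For concreteness, here is a hypothetical bootstrap snippet exhibiting the wrong-reduction idiom together with the suggested fix (the buggy np.max call is an assumed example, not taken from the paper):

```python
import numpy as np

sample = np.random.choice(['A', 'B'], size=1000)

# Buggy: np.max collapses the boolean mask to a single True/False,
# so the "proportion" is always 0.0 or 1.0.
buggy_proportion = np.max(sample == 'B')

# Fix in the style of ChatDBG's suggestion: the mean of the boolean
# mask gives the actual fraction of 'B' outcomes.
proportion = np.mean(sample == 'B')
```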

3. Capabilities, User Experience, and Dialogue

Unlike traditional debuggers, which require step-by-step command entry, ChatDBG supports hybrid free-form and command-driven interaction:

  • Programmers pose natural language questions and receive explanatory answers, hypotheses about fault causes, and repair suggestions
  • Multi-turn dialogues permit scenario refinement (allowing compound or clarifying follow-up queries)
  • The LLM reasons over both immediate local state and the broader static/dynamic context of execution—including variable types, historical user input, and error tracebacks

Actionable fixes are often included in the response. For example, after identifying a logic error, ChatDBG may present a code replacement ready to be applied. If the user grants permission, the fix is injected directly into the source and execution resumes.
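A minimal sketch of such gated fix application, assuming the fix arrives as a file path, a 1-indexed line range, and replacement text (all names here are hypothetical, not ChatDBG's interface):

```python
from pathlib import Path

def apply_fix(path: str, start: int, end: int, replacement: str) -> None:
    """Replace lines start..end (1-indexed, inclusive) with the suggested
    fix, but only after explicit user confirmation."""
    if input(f"Apply suggested fix to {path}:{start}-{end}? [y/N] ").lower() != "y":
        return
    lines = Path(path).read_text().splitlines(keepends=True)
    if not replacement.endswith("\n"):
        replacement += "\n"
    lines[start - 1:end] = [replacement]
    Path(path).write_text("".join(lines))
```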

4. Performance Evaluation and Impact

The evaluation reports quantitative experiments on C/C++ codebases (compiled with debug information) and on Python programs. For Python targets:

  • A single open-ended query led to correct, actionable fixes in 67% of tested cases
  • Allowing a follow-up query increased the fix rate to 85%

For C/C++ code, ChatDBG diagnosed true root causes in 36% of cases and immediate crash causes in an additional 55%, supporting debugging workflows in both managed and unmanaged runtime environments.

ChatDBG has seen substantial adoption, with over 75,000 downloads, indicating rapid community uptake and real-world relevance.

5. Enriched Context and Domain-Specific Reasoning

A central feature is the construction of enriched, context-aware prompts. Stack traces supplied to the LLM are not merely shallow call records; they aggregate larger source windows per frame, variable types, dynamic values, and filtered representations that omit less relevant library noise.

In interactive environments (e.g., Jupyter), ChatDBG performs backward slicing to track data flow and provenance across out-of-order cell execution. This capability is especially valuable for reconstructing variable values and dependencies in interactive or educational settings, where code is often written and executed nonlinearly. A toy illustration of the idea follows.
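This sketch shows only the dependency-chasing idea behind backward slicing (ipyflow's real API differs; the execution-log format here is an assumption):

```python
def backward_slice(target: str, executions):
    """executions: list of (cell_id, defined_vars, used_vars) in execution
    order. Returns the cell ids that transitively contribute to `target`."""
    needed, slice_cells = {target}, []
    for cell_id, defined, used in reversed(executions):
        if needed & set(defined):       # this cell defines something we need
            slice_cells.append(cell_id)
            needed |= set(used)         # now we also need its inputs
    return list(reversed(slice_cells))

# Cells run out of order; slicing for "stats" recovers its provenance.
log = [("c3", ["data"], []),
       ("c1", ["sample"], ["data"]),
       ("c2", ["stats"], ["sample"])]
print(backward_slice("stats", log))  # ['c3', 'c1', 'c2']
```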

By composing context-rich prompts and supporting complex reasoning chains, ChatDBG is able to embed domain-specific interpretation directly into the debugging workflow, ensuring both high diagnostic precision and actionable insight.

6. Comparison with Existing Approaches and Generalization

ChatDBG’s primary differentiators are its:

  • Direct integration with GDB/LLDB/Pdb, maintaining full compatibility with established debugging workflows for both native and interpreted code
  • Function call-driven LLM autonomy, yielding a mixed-initiative control modality where machine and human collaboratively traverse program execution and diagnostics
  • Rich prompt architecture, which combines source code, state, historical interaction, and error context for highly informed LLM reasoning

Unlike systems that restrict LLMs to passive code or error message analysis, ChatDBG enables “agentive” LLM behaviors—dynamic state interrogation and command generation—under controlled execution.

Even as open-source adoption and quantitative results underscore its practical value, the architecture also highlights key open research questions: balancing LLM autonomy and oversight, scaling context window and stack frame summarization to large codebases, and extending domain-specific reasoning across new programming paradigms.

7. Outlook and Further Developments

ChatDBG represents an evolution in augmenting human debugging by tightly coupling LLMs with program state, execution context, and interactive control. Future work may focus on:

  • Deeper integration of program analysis techniques (dynamic data-flow, symbolic execution) to further enhance LLM reasoning fidelity
  • Adapting the framework to support new modalities (e.g., distributed or concurrent system debugging, as exemplified in the GoTcha approach (Achar et al., 2019))
  • Augmenting dialogue strategies with more granular user intent recognition, repair validation, and safety checks before code edits are applied automatically

This suggests a broader shift in debugging methodologies toward autonomous, context-sensitive, dialogue-driven tooling, where LLMs serve as both diagnostic agents and collaborators. With quantitative evaluation across diverse codebases confirming strong performance, ChatDBG sets a benchmark for future work at the intersection of AI and program analysis.
