DSCHECKER: API Misuse Detection & Repair
- DSCHECKER is an approach that integrates static API documentation and dynamic runtime data via LLM prompts to detect and repair API misuses in data science libraries.
- It employs a multi-input prompt strategy, combining code snippets, API directives, and variable data to diagnose errors and synthesize repair diff patches.
- Evaluations on real-world benchmarks show improved F1 scores and fix rates, demonstrating its potential for automated code error detection and repair.
DSCHECKER is an approach for automatic detection and repair of API misuses in popular data science libraries (such as NumPy, pandas, Matplotlib, scikit-learn, and seaborn) using LLMs. It systematically augments LLMs with both static specification knowledge (API directives drawn from documentation) and dynamic runtime context (data characteristics, such as variable types, shapes, and sample values) to reason about correct API usage and synthesize code repairs. The methodology is evaluated on benchmarks of real-world misuses and is further instantiated as an agentic LLM system that dynamically acquires required information via function calls, reflecting practical development settings where the necessary context is not provided upfront (Galappaththi et al., 29 Sep 2025).
1. Problem Definition and Motivation
The detection and repair of API misuses in data science libraries—termed “black libraries”—is a persistent source of subtle defects in analytical code due to the data-dependent and rapidly evolving nature of these APIs. Errors may arise from violations of documented constraints (API directives) or from inappropriate use given a particular data context (such as calling a scikit-learn imputer's transform on columns containing only missing values when the imputation strategy is not set to “constant”). Existing rule-based static analyzers rarely capture these complex, context-dependent failure modes, making the problem a compelling target for LLM-based approaches that can integrate heterogeneous sources of code and runtime knowledge.
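As a minimal illustration of this kind of data-dependent failure (an illustrative snippet, not a benchmark case): scikit-learn's SimpleImputer silently discards all-missing columns when the strategy is not "constant", breaking downstream code that assumes the column count is preserved.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Second column contains only missing values.
X = np.array([[1.0, np.nan],
              [2.0, np.nan],
              [3.0, np.nan]])

imputer = SimpleImputer(strategy="mean")   # default strategy, not "constant"
X_out = imputer.fit_transform(X)

print(X_out.shape)  # (3, 1): the all-NaN column is dropped on transform,
                    # so code expecting two output columns silently breaks
```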
DSCHECKER aims to automatically identify when an API is misapplied and suggest a patch, leveraging LLMs’ ability to process code and natural language alongside structured documentation and observations about program state. Its innovation lies in the explicit and configurable fusion of API specification with introspected data, significantly improving the capacity to detect errors whose manifestation is tied to the current variable state (Galappaththi et al., 29 Sep 2025).
2. Methodology: Prompt Engineering and Context Fusion
The DSCHECKER workflow is anchored by carefully structured LLM prompts that synthesize three inputs:
- Code snippet: The candidate source code region (including the potentially misused API).
- API directives: Quotations or summaries of relevant documentation passages, encapsulating required constraints and behavioral guarantees (e.g., "Columns which only contained missing values at fit are discarded upon transform if strategy is not ‘constant’").
- Data information: Structured representations of dynamic variables, such as types, shapes, head of DataFrames, or the dtype of NumPy arrays.
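As a concrete illustration of the third input, a variable's runtime properties can be serialized into a compact structure. The sketch below shows the kind of introspection involved; it is not DSCHECKER's actual serialization format.

```python
import pandas as pd

def describe_variable(name, value):
    """Summarize a runtime variable's type, shape, and a small data sample."""
    info = {"name": name, "type": type(value).__name__}
    if isinstance(value, pd.DataFrame):
        info["shape"] = value.shape
        info["dtypes"] = value.dtypes.astype(str).to_dict()
        info["head"] = value.head(3).to_dict(orient="list")
    elif hasattr(value, "dtype") and hasattr(value, "shape"):  # e.g., NumPy arrays
        info["shape"] = value.shape
        info["dtype"] = str(value.dtype)
    return info
```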
The LLM is prompted to answer using a JSON schema, outputting a "correct" flag, a detailed explanation, and a unified diff patch to repair the detected misuse when necessary.
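A minimal sketch of parsing such a reply follows; the exact JSON field names are assumptions rather than the schema reproduced from the paper.

```python
import json

def parse_response(raw_reply: str) -> dict:
    """Map the model's JSON reply to the correct-flag, explanation, and diff patch."""
    reply = json.loads(raw_reply)
    return {
        "correct": reply["correct"],          # "yes"/"no" misuse verdict
        "explanation": reply["explanation"],  # natural-language diagnosis
        "patch": reply.get("patch", ""),      # unified diff, empty if no misuse
    }
```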
Several ablation prompt templates were assessed:
- base: Supplies only the code snippet.
- data: Code plus dynamic variable introspection.
- dir: Code plus API directive(s).
- full: Code, API directives, and dynamic data.
This decomposition enables systematic analysis of how specification and runtime context differentially contribute to correct identification and synthesis of repairs for misuses.
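The ablation can be pictured as toggling which prompt sections are populated. The variant names below follow the paper, while the composition logic is only a sketch.

```python
PROMPT_VARIANTS = {
    "base": {"directives": False, "data": False},
    "data": {"directives": False, "data": True},
    "dir":  {"directives": True,  "data": False},
    "full": {"directives": True,  "data": True},
}

def compose_prompt(variant: str, code: str, directives: str, data_info: str) -> str:
    """Concatenate only the prompt sections enabled for the chosen ablation setting."""
    flags = PROMPT_VARIANTS[variant]
    sections = [f"Code snippet:\n{code}"]
    if flags["directives"]:
        sections.append(f"API directives:\n{directives}")
    if flags["data"]:
        sections.append(f"Data information:\n{data_info}")
    sections.append('Answer in JSON with fields "correct", "explanation", and "patch".')
    return "\n\n".join(sections)
```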
Detection is operationally defined as the LLM returning "correct": "no" together with a plausible explanation of the misuse; a repair counts as successful when applying the diff patch yields correct behavior, as validated by execution.
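A hypothetical validation harness sketch, assuming repairs are applied with the standard `patch` utility and checked by re-execution; DSCHECKER's actual harness and correctness checks are not reproduced here.

```python
import subprocess
import tempfile

def apply_and_run(snippet_path: str, diff_text: str) -> bool:
    """Apply the LLM's unified diff to the faulty snippet and re-execute it."""
    with tempfile.NamedTemporaryFile("w", suffix=".diff", delete=False) as f:
        f.write(diff_text)
        diff_path = f.name
    # Patch the snippet in place; a failed application counts as an unsuccessful repair.
    if subprocess.run(["patch", snippet_path, diff_path]).returncode != 0:
        return False
    # The repair counts only if the patched snippet executes successfully;
    # the benchmark additionally checks that the resulting behavior is correct.
    return subprocess.run(["python", snippet_path]).returncode == 0
```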
The process is further formalized for evaluation using standard metrics (precision, recall, and F₁ for detection; fix rate for repair): the LLM's output is mapped to the binary "correct" flag and the diff patch, and both are validated against gold standards for detection and repair.
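Using the standard definitions, where TP counts faulty snippets correctly flagged as misuses, FP counts correct snippets wrongly flagged, and FN counts missed misuses:

$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$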
3. Experiments: Benchmarking Detection and Repair Performance
DSCHECKER was evaluated using a benchmark dataset originally compiled by Galappaththi et al., consisting of 38 reproducible API misuses from five libraries (NumPy, pandas, Matplotlib, scikit-learn, seaborn), with both faulty and corrected code variants (76 snippets total). Both zero-shot and few-shot LLM prompting regimes were studied; models tested included two versions of OpenAI’s GPT-4 and Meta’s Llama-3.1-405b-Instruct.
Performance on detection and repair was measured by F₁-score and fix rate (the fraction of misuses successfully remedied and validated by execution):
| Prompt setting | Best detection F₁ (%) | Best fix rate (%) |
|---|---|---|
| full | 61.18 | 51.28 |
| dir | lower | lower |
| data | lower | lower |
Incorporating both API directives and relevant data increased recall by ∼7% and fix rates by ∼5% over minimal-context prompts, and outperformed using API documentation or data in isolation. Few-shot prompting further improved performance for some LLMs.
Additional experiments extended to deep learning library APIs (such as TensorFlow), showing that while DL misuse detection is more challenging (F₁ = 48%, fix rate ≈ 27%), DSCHECKER achieved notably higher recall and repair than prior LLM-based baselines that had previously demonstrated low recall and negligible repair.
4. DSCHECKER Agent: Adaptive Function Calling and Real-World Application
To reflect realistic tool use where not all required context is available upfront, DSCHECKER implements an agentic architecture in which the LLM dynamically invokes function calls to obtain context as needed:
- get_variable_info: Statically analyzes the AST and injects runtime code to extract dynamic properties (types, shapes, head/first rows, etc.).
- get_api_documentation: Queries a local documentation corpus to retrieve the full API docstring for relevant API calls.
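A sketch of how these two tools might be declared for an OpenAI-style function-calling interface follows; the tool names mirror the paper, but the parameter schemas here are assumptions.

```python
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_variable_info",
            "description": "Return runtime properties (type, shape, head/first rows) "
                           "of a variable appearing in the analyzed snippet.",
            "parameters": {
                "type": "object",
                "properties": {"variable_name": {"type": "string"}},
                "required": ["variable_name"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_api_documentation",
            "description": "Return the full docstring of a library API used in the snippet.",
            "parameters": {
                "type": "object",
                "properties": {"api_name": {"type": "string"}},
                "required": ["api_name"],
            },
        },
    },
]
```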
In this “agentic” setting, the LLM decides when and what auxiliary information to request, then integrates additional context before issuing a final judgement on correctness and proposing a repair. On the same benchmark, the DSCHECKER agent obtained a detection F₁ of 48.65% and a fix rate of 39.47%, with ∼1–2 function calls per code snippet on average.
This suggests some degradation relative to the idealized "full" prompt, plausibly due to increased uncertainty or partial context, but underscores the system’s practicality for semi-automated review workflows.
5. Impact, Limitations, and Directions for Improvement
The experimental results demonstrate that:
- Explicit provision of (or on-demand access to) both API usage specifications and variable data information substantially enhances LLM-based API misuse detection and repair versus static code or documentation alone.
- LLMs achieve meaningful precision and recall using these prompts in data-centric “black library” settings, with the best model attaining a detection F₁ over 61% and an automatic fix rate above 51% on the core benchmark.
- The methodology generalizes, though with reduced absolute accuracy, to deep learning API misuse detection and repair, outperforming prior LLM-based approaches from the same authors by significant recall and fix rate margins.
The principal limitations identified include:
- Dependency on quality and granularity of API documentation (the need for concise, machine-consumable directives).
- Occasional LLM hallucinations, including unnecessary variable or documentation queries, particularly in the agentic mode.
- Degradation of performance in the absence of direct context or with solely agent-driven context gathering, indicating sensitivity to the completeness of information provision.
Proposed future enhancements encompass improving the agent’s orchestration and feedback control, better constraining or guiding LLM context retrieval, and encouraging software library authors to supply machine-readable, structured directives to facilitate more effective automated analysis.
6. Broader Significance and Applications
DSCHECKER’s framework points to a scalable approach for LLM-powered program analysis tools, especially for those libraries where correct usage is data-dependent and poorly captured by traditional static analysis. By dynamically fusing API directive knowledge and live variable data, it enables both automated defect discovery and patch synthesis. The methodology is particularly relevant as data science code bases grow in size and complexity, making static linting and testing alone increasingly insufficient for guaranteeing reliability.
A plausible implication is that further generalization of these context-augmentation and agentic strategies may extend the approach to other classes of program analysis and repair tasks, such as config file validation, pipeline assembly, or inter-library compatibility checking, especially where correct usage is contingent on runtime state.
Key Reference:
DSCHECKER: Detecting and Fixing API Misuses of Data Science Libraries Using LLMs (Galappaththi et al., 29 Sep 2025)