BugScope: LLM-Driven Bug Detection
- BugScope is an LLM-driven system that emulates human bug audits by adapting dynamic retrieval strategies and program slicing for improved detection of software defects.
- The system integrates a Context Retrieval Agent and a Bug Detection Agent to synthesize custom detection prompts from code examples using few-shot learning.
- Quantitative evaluations show BugScope achieves superior precision and recall over industrial baselines, uncovering novel bugs in large-scale open-source projects.
BugScope is an LLM-driven multi-agent system for software bug detection that emulates human auditors by learning new bug patterns from representative example sets and applying that knowledge during code auditing. Departing from traditional static analysis, which relies on fixed symbolic workflows with limited adaptability to diverse real-world defects and anti-patterns, BugScope synthesizes a custom retrieval and detection pipeline for each bug class using dynamically constructed prompts and program slicing. Quantitative evaluation demonstrates superior precision and recall compared to industrial baselines and reveals substantial practical impact through novel bug discovery in large open-source projects (Guo et al., 21 Jul 2025).
1. System Architecture and Workflow
BugScope’s architecture operationalizes the human bug-auditing workflow—identifying suspicious code elements as “seeds,” retrieving salient context via program slicing, and applying deep semantic reasoning—by deploying two collaborating LLM-powered agents:
- Context Retrieval Agent: Responsible for example selection, retrieval-strategy synthesis, and program slicing (forward or backward).
- Bug Detection Agent: Composed of detection-prompt synthesis (few-shot with embedded reasoning hints), LLM-driven decision, and a validator for hallucination filtering.
The pipeline begins with ingestion of paired buggy (BE) and non-buggy (NE) code examples representing the target anti-pattern. The LLM analyzes these to extract (a) the program constructs that serve as slicing seeds (faulty values or dangerous operands), and (b) the appropriate slicing direction (forward or backward).
This yields a retrieval strategy expressed as a seed extractor $S$ and a direction $d$. The Context Retrieval Agent then applies program slicing on the AST and/or call graph, isolating minimal, self-contained code snippets for further analysis. The Bug Detection Agent synthesizes a tailored detection prompt $P$ (including chain-of-thought and reasoning hints), which is fed to the LLM for the final bug decision and validation.
2. Formal Definitions and Algorithmic Constructs
BugScope’s operational semantics are formalized as follows:
2.1 Retrieval Strategy Synthesis
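Given an example set of labeled buggy (BE) and non-buggy (NE) examples, the LLM induces
- a seed-classification function $S$, which maps a program node to a slicing seed or to $\bot$ when the node is not a seed,
- and a retrieval direction $d \in \{\text{forward}, \text{backward}\}$, chosen to maximize a Score that measures discrimination between BE and NE (implicitly learned via chain-of-thought prompts).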
2.2 Program Slicing
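Given the program dependence graph (PDG) and a slicing seed $s$:
- Forward slice $FS(s)$: the nodes reached by following dependence edges forward from $s$
- Backward slice $BS(s)$: the nodes reached by following dependence edges backward from $s$
- Minimal subprogram: $C = \text{inline}(FS(s))$ or $C = \text{inline}(BS(s))$, with call depth limited to $K$.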
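A minimal sketch of depth-bounded forward and backward slicing may make the definitions concrete. It assumes a toy dependence graph given as a list of edges, each flagged by whether it crosses a call boundary; the `Edge` type, the `slice_nodes` function, and the node names are hypothetical and not part of BugScope.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

# Hypothetical edge format: (source node, dependent node, crosses a call boundary?)
Edge = Tuple[str, str, bool]


def slice_nodes(edges: List[Edge], seed: str, direction: str, max_call_depth: int) -> Set[str]:
    """Collect nodes reachable from `seed`, crossing at most `max_call_depth` call edges."""
    if direction not in ("forward", "backward"):
        raise ValueError(f"unknown direction: {direction!r}")

    # Forward slicing follows edges as given; backward slicing follows them reversed.
    adj: Dict[str, List[Tuple[str, bool]]] = {}
    for src, dst, is_call in edges:
        if direction == "forward":
            adj.setdefault(src, []).append((dst, is_call))
        else:
            adj.setdefault(dst, []).append((src, is_call))

    best = {seed: 0}  # fewest call edges needed to reach each node
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        for nxt, is_call in adj.get(node, []):
            depth = best[node] + (1 if is_call else 0)
            if depth <= max_call_depth and depth < best.get(nxt, max_call_depth + 1):
                best[nxt] = depth
                queue.append(nxt)
    return set(best)


# Toy graph: a length is read, multiplied, used for an allocation, then passed through two calls.
edges = [
    ("read_n", "mul", False),
    ("mul", "malloc", False),
    ("malloc", "fill", True),
    ("fill", "log", True),
]
forward = slice_nodes(edges, "mul", "forward", max_call_depth=1)
backward = slice_nodes(edges, "malloc", "backward", max_call_depth=0)
print(sorted(forward), sorted(backward))
```

With a call-depth bound of 1, the forward slice stops before `log`, which sits two calls away from the seed; this is the kind of truncation that the bounded expansion trades for scalability.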
2.3 Detection Prompt Construction
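The detection prompt $P$ is constructed from three components:
- few-shot examples drawn from (BE, NE)
- reasoning hints $H$ (“check numeric range,” “track pointer aliases,” etc.)
- chain-of-thought instructions
Inference: the LLM is applied to $P$ together with an extracted snippet $C$, yielding a label (for example, “Bug”) via LLM token probabilities.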
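One way the three prompt components could be assembled is sketched below. The section headings, the `Bug`/`NoBug` label names, and the `build_detection_prompt` helper are illustrative assumptions; the source does not specify the prompt template.

```python
from typing import List, Tuple


def build_detection_prompt(
    examples: List[Tuple[str, str]],  # (code, label) pairs taken from BE and NE
    hints: List[str],                 # reasoning hints
    snippet: str,                     # sliced context to classify
) -> str:
    """Assemble few-shot examples, hints, and chain-of-thought instructions into one prompt."""
    parts = ["You are auditing code for one specific anti-pattern."]
    for i, (code, label) in enumerate(examples, start=1):
        parts.append(f"### Example {i} ({label})\n{code}")
    parts.append("### Reasoning hints\n" + "\n".join(f"- {h}" for h in hints))
    # Chain-of-thought instruction; the label vocabulary is an assumption of this sketch.
    parts.append("Think step by step, then answer with exactly one label: Bug or NoBug.")
    parts.append(f"### Code under audit\n{snippet}")
    return "\n\n".join(parts)


prompt = build_detection_prompt(
    examples=[("q = total / n;", "Bug"), ("if (n > 0) q = total / n;", "NoBug")],
    hints=["check value-range >0, not just non-negative"],
    snippet="if (n >= 0) avg = sum / n;",
)
print(prompt)
```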
3. Implementation Details and Processing Steps
The following pseudocode summarizes BugScope’s three canonical phases:
Retrieval-Strategy Synthesis

    Prompt LLM with buggy/non-buggy code pairs and ask for:
      (a) bug-triggering variables/expressions
      (b) slicing direction (forward/backward)
    Parse LLM reply into S and d
    return (S, d)

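The parsing step might look like the following sketch. It assumes, purely for illustration, that the LLM is instructed to reply in JSON with a `direction` field and a `seed_patterns` map from seed label to regular expression; the reply format, the `RetrievalStrategy` class, and `parse_strategy` are hypothetical, and the source does not state how seeds are represented.

```python
import json
import re
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class RetrievalStrategy:
    seed_of: Callable[[str], Optional[str]]  # S: expression text -> seed label, or None for "not a seed"
    direction: str                           # d: "forward" or "backward"


def parse_strategy(reply: str) -> RetrievalStrategy:
    """Turn a JSON reply (assumed format) into a seed extractor and a slicing direction."""
    data = json.loads(reply)
    direction = data["direction"]
    if direction not in ("forward", "backward"):
        raise ValueError(f"unexpected slicing direction: {direction!r}")
    patterns = [(label, re.compile(rx)) for label, rx in data["seed_patterns"].items()]

    def seed_of(expr: str) -> Optional[str]:
        for label, rx in patterns:
            if rx.search(expr):
                return label
        return None

    return RetrievalStrategy(seed_of, direction)


# Example reply for a divide-by-zero pattern: anything following `/` or `%` is a divisor seed.
reply = r'{"direction": "forward", "seed_patterns": {"divisor": "[/%]\\s*\\w+"}}'
strategy = parse_strategy(reply)
print(strategy.direction, strategy.seed_of("total / count"), strategy.seed_of("a + b"))
```

Validating the direction up front keeps a malformed reply from silently selecting the wrong slice.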
Context Extraction via Slicing

    Parse F with Tree-sitter → AST + call-graph
    for each node n in F:
      if S(n) ≠ ⊥:
        slice_nodes = FS(n) if d == forward else BS(n)
        expand interprocedurally up to depth K
        C ← inline_and_simplify(slice_nodes)
        collect C
    return collected C

Prompt Generation & Refinement

    BasePrompt ← few-shot with (BE, NE)
    Insert H as guidance
    for i in 1..R:
      P′ ← LLM("Revise this prompt to better highlight anti-pattern", BasePrompt)
      if quality(P′) > quality(BasePrompt):
        BasePrompt ← P′
    return BasePrompt

Each snippet $C$ is concatenated into the detection prompt $P$ and fed to the LLM; candidate “Bug” results are re-validated before reporting.
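The refinement loop in the pseudocode above can be sketched as a small hill-climbing routine. The `revise` and `quality` callables stand in for the LLM revision call and for whatever prompt-quality measure is used; the source does not define `quality`, so the stubs below are placeholders for demonstration only.

```python
from typing import Callable


def refine_prompt(
    base: str,
    revise: Callable[[str], str],     # stand-in for the LLM revision call
    quality: Callable[[str], float],  # stand-in for the undefined quality measure
    rounds: int,
) -> str:
    """Keep a revised prompt only when it scores strictly higher than the current one."""
    best, best_q = base, quality(base)
    for _ in range(rounds):
        candidate = revise(best)
        q = quality(candidate)
        if q > best_q:
            best, best_q = candidate, q
    return best


# Deterministic stubs: each revision appends a hint; quality saturates at two hints.
def stub_revise(p: str) -> str:
    return p + "\nHint: check divisor > 0"


def stub_quality(p: str) -> float:
    return float(min(p.count("Hint:"), 2))


refined = refine_prompt("Detect insufficient zero checks.", stub_revise, stub_quality, rounds=5)
print(refined)
```

Because a candidate is accepted only on strict improvement, the loop stops changing the prompt once further revisions no longer raise the score.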
4. Representative Anti-Patterns and Specialized Processing
BugScope’s strategy adapts across a range of anti-patterns. The table below pairs each anti-pattern with its slice seed, slicing direction, and an example reasoning hint:
| Anti-Pattern | Slice Seed & Direction | Reasoning Hints / Prompt Example |
|---|---|---|
| Oversized Offset (OSO) ‒ OOB | Faulty (back) | “check size vs buffer length” |
| Allocation Size Overflow (ASO) ‒ OOB | (forward) | “track integer wrap-around after multiplication” |
| Insufficient Zero Check (IZC) ‒ DBZ | Divisor (forward) | “check value-range >0, not just non-negative” |
| System-Specific (OOB & DBZ) | (forward) | “flag all uses of d->block[0]; buffer & division” |
The system generalizes to system-specific and compound bug classes, fusing multiple reasoning hints within flexible prompt templates.
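Two of the reasoning hints in the table can be illustrated with plain arithmetic. The sketch below assumes a 32-bit unsigned size type for the allocation case (the width is an assumption, not stated in the source) and uses hypothetical helper names throughout.

```python
UINT32_MASK = 0xFFFFFFFF  # assumption: sizes are computed in 32-bit unsigned arithmetic


def alloc_size_u32(count: int, elem_size: int) -> int:
    """Size actually requested when count * elem_size is truncated to 32 bits."""
    return (count * elem_size) & UINT32_MASK


def wraps_u32(count: int, elem_size: int) -> bool:
    """True when the mathematical product does not fit in 32 bits."""
    return count * elem_size > UINT32_MASK


def passes_non_negative_check(divisor: int) -> bool:
    """The insufficient guard: admits zero."""
    return divisor >= 0


def passes_positive_check(divisor: int) -> bool:
    """The sufficient guard for a divisor: rejects zero."""
    return divisor > 0


# Allocation Size Overflow: the intended size is about 4 GiB, the truncated size is 4 bytes.
count, elem_size = 0x40000001, 4
print(alloc_size_u32(count, elem_size), wraps_u32(count, elem_size))

# Insufficient Zero Check: zero passes the non-negative guard but not the positive one.
print(passes_non_negative_check(0), passes_positive_check(0))
```

In the first case the buffer ends up far smaller than the element count implies, so later indexed writes run out of bounds; in the second, a divisor of zero slips past the guard.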
5. Quantitative Evaluation and Comparative Analysis
BugScope achieves demonstrable superiority over industrial baselines on two axes: controlled benchmarks and real-world discovery.
- Forty Real-World Bugs (across 7 anti-patterns)
- Precision: 87.04 %
- Recall: 90.00 %
- F1: $0.88$
- Breakdown by anti-pattern
- OSO: ,
- NOF: ,
- ASO: ,
- IZC: ,
- LZD, UEC, MSC: all above precision/recall
| Tool | Precision | Recall | F1 |
|---|---|---|---|
| BugScope | 87.04 % | 90.00 % | 0.88 |
| RepoAudit | 32.14 % | 42.50 % | 0.37 |
| Cursor BugBot | 71.43 % | 27.50 % | 0.40 |
| CodeRabbit | 76.92 % | 17.50 % | 0.29 |
| Meta Infer | 7.69 % | 2.50 % | 0.04 |
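The last column of the table above is consistent with the F1 score, the harmonic mean of precision and recall. The short check below recomputes it from the reported percentages; the `f1_score` helper and the `reported` dictionary are illustrative names, and the numbers are copied from the table.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision %, recall %, reported last-column value) as listed in the table
reported = {
    "BugScope": (87.04, 90.00, 0.88),
    "RepoAudit": (32.14, 42.50, 0.37),
    "Cursor BugBot": (71.43, 27.50, 0.40),
    "CodeRabbit": (76.92, 17.50, 0.29),
    "Meta Infer": (7.69, 2.50, 0.04),
}

computed = {tool: f1_score(p / 100, r / 100) for tool, (p, r, _) in reported.items()}
for tool, value in computed.items():
    print(f"{tool}: {value:.2f}")
```

Rounded to two decimals, each recomputed value matches the corresponding table entry.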
On large open-source projects, including the Linux kernel, BugScope uncovered 141 novel bugs, with 78 already fixed and 7 confirmed by developers as impactful.
6. Guarantees, Limitations, and Key Insights
No formal soundness or completeness guarantees are provided. This is consistent with Rice’s theorem, under which nontrivial semantic properties of programs are undecidable in general, so BugScope prioritizes empirical coverage and adaptability. Central design insights:
- Generalization to novel anti-patterns arises from learning retrieval seeds and slicing strategies via few-shot examples, as opposed to hard-coded symbolic rules.
- Multi-agent separation of context retrieval and detection mitigates LLM hallucination: slicing delivers focused code regions, and prompt synthesis ensures contextually-relevant reasoning.
- Limiting interprocedural slice depth ($K$) balances recall, precision, and scalability, with evaluation showing robust recall (90%) on real bugs.
Limitations include:
- Dependence on exemplars: poor BE/NE selection may compromise retrieval strategy.
- Bounded slice expansion: deep or dynamic control-flow may elude discovery.
- Residual LLM hallucinations: a validation pass reduces, but does not entirely eliminate, this issue.
A plausible implication is that BugScope’s “learn to learn” methodology introduces an extensible paradigm for pattern-driven, example-centric bug discovery that is systematically adaptable to new domains, anti-patterns, and evolving codebases. This approach demonstrates that the human-like audit workflow—study examples, extract context, reason semantically—covers a broad spectrum of defects with high precision and recall (Guo et al., 21 Jul 2025).