
BugScope: LLM-Driven Bug Detection

Updated 29 January 2026
  • BugScope is an LLM-driven system that emulates human bug audits, synthesizing dynamic retrieval strategies and applying program slicing to improve detection of software defects.
  • The system integrates a Context Retrieval Agent and a Bug Detection Agent to synthesize custom detection prompts from code examples using few-shot learning.
  • Quantitative evaluations show BugScope achieves superior precision and recall over industrial baselines, uncovering novel bugs in large-scale open-source projects.

BugScope is an LLM-driven multi-agent system for software bug detection that emulates human auditors by learning new bug patterns from representative example sets and applying that knowledge during code auditing. Departing from traditional static analysis, which relies on fixed symbolic workflows with limited adaptability to diverse real-world defects and anti-patterns, BugScope synthesizes a custom retrieval and detection pipeline for each bug class using dynamically constructed prompts and program slicing. Quantitative evaluation demonstrates superior precision and recall compared to industrial baselines and reveals substantial practical impact through novel bug discovery in large open-source projects (Guo et al., 21 Jul 2025).

1. System Architecture and Workflow

BugScope’s architecture operationalizes the human bug-auditing workflow—identifying suspicious code elements as “seeds,” retrieving salient context via program slicing, and applying deep semantic reasoning—by deploying two collaborating LLM-powered agents:

  • Context Retrieval Agent: Responsible for example selection, retrieval-strategy synthesis, and program slicing (forward or backward).
  • Bug Detection Agent: Composed of detection-prompt synthesis (few-shot with embedded reasoning hints), LLM-driven decision, and a validator for hallucination filtering.

The pipeline begins with ingestion of paired buggy (BE) and non-buggy (NE) code examples representing the target anti-pattern. The LLM analyzes these to extract (a) the program construct(s) that serve as slicing seeds (faulty values or dangerous operands), and (b) the appropriate slicing direction (forward or backward).

This yields a retrieval strategy expressed as a seed extractor $S(\cdot)$ and direction $d \in \{\text{forward}, \text{backward}\}$. The Context Retrieval Agent then applies program slicing on the AST and/or call graph, isolating minimal, self-contained code snippets for further analysis. The Bug Detection Agent synthesizes a tailored detection prompt $P$, including chain-of-thought and reasoning hints, which is fed to the LLM for the final bug decision and validation.
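
This two-agent workflow can be sketched as follows. All function names are illustrative, and `llm` is any callable standing in for a model call; this is a minimal sketch of the orchestration, not the paper's implementation.

```python
# Minimal sketch of the two-agent pipeline. The agent internals are stubbed:
# `llm` is any callable standing in for a model call, and all names here are
# illustrative, not taken from the paper's implementation.

def context_retrieval_agent(examples, code, llm):
    """Induce a retrieval strategy from the examples, then 'slice' the code."""
    strategy = llm(f"Derive seed extractor and slicing direction from: {examples}")
    # The real system applies program slicing on the AST/call graph; this
    # sketch simply returns the whole snippet as the slice.
    return {"strategy": strategy, "slices": [code]}

def bug_detection_agent(slices, examples, llm):
    """Synthesize a detection prompt and classify each slice."""
    prompt = (f"Few-shot examples: {examples}\n"
              "Reason step by step, then answer Bug or No Bug.")
    return [llm(f"{prompt}\nCode:\n{s}") for s in slices]

def bugscope_pipeline(examples, code, llm):
    ctx = context_retrieval_agent(examples, code, llm)
    return bug_detection_agent(ctx["slices"], examples, llm)

# Trivial stub model: flag a division by `n` with no visible zero check.
stub_llm = lambda text: "Bug" if "/ n" in text and "n == 0" not in text else "No Bug"
verdicts = bugscope_pipeline([("x = a / n", "buggy")], "y = total / n", stub_llm)
```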

2. Formal Definitions and Algorithmic Constructs

BugScope’s operational semantics are formalized as follows:

2.1 Retrieval Strategy Synthesis

Given example set $E = \{(c_i, y_i)\}$ with $c_i$ labeled $y_i \in \{\text{buggy}, \text{non-buggy}\}$, the LLM induces

  • a seed-classification function $S: \mathrm{AST} \to \{\text{FAULTY}, \text{DANGEROUS}, \bot\}$
  • and retrieval direction $d \in \{\rightarrow, \leftarrow\}$, maximizing $\sum_{(c,y)\in E} \text{Score}((S, d), c, y)$, where Score measures discrimination between BE and NE (implicitly learned via chain-of-thought prompts).
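
In the paper this induction is performed by the LLM itself; as a toy illustration of the Score-maximizing selection, one can enumerate candidate seed predicates and directions and keep the best-discriminating pair. The predicate names and the 0/1 scoring rule below are simplified stand-ins, not the paper's method.

```python
# Toy stand-in for retrieval-strategy synthesis: exhaustively score candidate
# (seed predicate, direction) pairs against labeled examples and keep the best.

from itertools import product

def select_strategy(examples, candidate_seeds, directions):
    """Pick the (seed name, direction) pair whose firing pattern best
    separates buggy from non-buggy examples (a toy Score)."""
    def score(seed_fn, code, label):
        # +1 when the seed fires exactly on the buggy examples.
        return 1 if seed_fn(code) == (label == "buggy") else 0

    best, best_score = None, -1
    for (name, seed_fn), d in product(candidate_seeds.items(), directions):
        total = sum(score(seed_fn, c, y) for c, y in examples)
        if total > best_score:
            best, best_score = (name, d), total
    return best

examples = [("buf[i + off]", "buggy"), ("buf[i]", "non-buggy")]
seeds = {
    "has_offset": lambda c: "off" in c,  # fires only on the buggy example
    "has_index": lambda c: "[" in c,     # fires on both, so discriminates worse
}
# Direction does not affect this toy score, so the first listed wins ties.
strategy = select_strategy(examples, seeds, ["backward", "forward"])
```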

2.2 Program Slicing

Given PDG $(V,\, E_{\text{data}} \cup E_{\text{ctrl}})$ and slicing seed $s \in V$:

  • Forward slice: $FS(s) = \{v \in V \mid \exists\ \text{path } s \to^{*} v \text{ following } E_{\text{data}} \cup E_{\text{ctrl}}\}$
  • Backward slice: $BS(s) = \{v \in V \mid \exists\ \text{path } v \to^{*} s \text{ following } E_{\text{data}} \cup E_{\text{ctrl}}\}$

The minimal subprogram is $C = \text{inline}(FS(s))$ or $\text{inline}(BS(s))$, with interprocedural call depth limited to $K = 3$.
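
With the PDG given as adjacency lists over the merged edge set, both slices reduce to plain graph reachability. The sketch below assumes that encoding; the node names and graph are illustrative.

```python
# Forward/backward slicing over a toy PDG, assuming the graph is supplied as
# adjacency lists over E_data ∪ E_ctrl (a simplification of the paper's
# AST/call-graph construction).

from collections import deque

def slice_pdg(edges, seed, direction="forward"):
    """Return all nodes reachable from `seed` along dependence edges.

    edges: dict mapping node -> list of successors (E_data ∪ E_ctrl merged).
    A backward slice is reachability in the reversed graph.
    """
    if direction == "backward":
        rev = {}
        for u, vs in edges.items():
            for v in vs:
                rev.setdefault(v, []).append(u)
        edges = rev
    seen, work = {seed}, deque([seed])
    while work:
        u = work.popleft()
        for v in edges.get(u, []):
            if v not in seen:
                seen.add(v)
                work.append(v)
    return seen

# Toy PDG: n1 -> n2 -> n3, with n4 -> n2 joining in.
pdg = {"n1": ["n2"], "n2": ["n3"], "n4": ["n2"]}
fs = slice_pdg(pdg, "n1", "forward")    # nodes n1 reaches
bs = slice_pdg(pdg, "n3", "backward")   # nodes that reach n3
```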

2.3 Detection Prompt Construction

Detection prompt $P$ is constructed by $\mathrm{BuildPrompt}(\mathrm{Ex}, H, \mathrm{COT})$, where:

  • $\mathrm{Ex}$ = few-shot examples
  • $H$ = reasoning hints (“check numeric range,” “track pointer aliases,” etc.)
  • $\mathrm{COT}$ = chain-of-thought instructions

Inference: $\Pr(y \mid C) = \mathrm{LLM}(P, C)$, yielding label $y \in \{\text{Bug}, \text{No Bug}\}$ via LLM token probabilities.
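
A plausible sketch of $\mathrm{BuildPrompt}$ follows: it concatenates the three ingredients into one detection prompt. The ingredient roles (Ex, H, COT) come from the text above; the template wording itself is an assumption.

```python
# Hedged sketch of BuildPrompt(Ex, H, COT): assemble few-shot examples,
# reasoning hints, and chain-of-thought instructions into one prompt.
# The exact template wording is an assumption, not taken from the paper.

def build_prompt(examples, hints, cot_instructions):
    parts = ["You are auditing code for a specific bug pattern."]
    for code, label in examples:                  # Ex: few-shot (BE, NE) pairs
        parts.append(f"Example ({label}):\n{code}")
    parts.append("Hints: " + "; ".join(hints))    # H: reasoning hints
    parts.append(cot_instructions)                # COT: chain-of-thought steps
    parts.append("Answer 'Bug' or 'No Bug'.")
    return "\n\n".join(parts)

prompt = build_prompt(
    [("memcpy(dst, src, n)", "buggy"),
     ("memcpy(dst, src, sizeof dst)", "non-buggy")],
    ["check size vs buffer length"],
    "Reason step by step before answering.",
)
```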

3. Implementation Details and Processing Steps

The following pseudocode summarizes BugScope’s three canonical phases:

Retrieval-Strategy Synthesis

```
Prompt LLM with buggy/non-buggy code pairs and ask for:
  (a) bug-triggering variables/expressions
  (b) slicing direction (forward/backward)
Parse LLM reply into S and d
return (S, d)
```

Context Extraction via Slicing

```
Parse F with Tree-sitter → AST + call graph
for each node n in F:
    if S(n) ≠ ⊥:
        slice_nodes = FS(n) if d == forward else BS(n)
        expand interprocedurally up to depth K
        C ← inline_and_simplify(slice_nodes)
        collect C
return collected C
```

Prompt Generation & Refinement

```
BasePrompt ← few-shot with (BE, NE)
Insert H as guidance
for i in 1..R:
    P ← LLM("Revise this prompt to better highlight anti-pattern", BasePrompt)
    if quality(P) > quality(BasePrompt):
        BasePrompt ← P
return BasePrompt
```

Each snippet $C$ is concatenated into $P$ and fed to the LLM; candidate “Bug” results are re-validated before reporting.
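
One plausible shape for this re-validation pass is a second, independent query on the isolated snippet, keeping a report only if the verdict reproduces. The two-call scheme and all names below are assumptions about how the hallucination filter could work, not the paper's exact procedure.

```python
# Hedged sketch of post-hoc validation: re-query the model on just the
# flagged snippet and keep the report only if the 'Bug' verdict survives.

def validate(snippet, detect, revalidate):
    """detect/revalidate are LLM-like callables returning 'Bug' or 'No Bug'."""
    if detect(snippet) != "Bug":
        return "No Bug"
    # An independent second pass on the isolated snippet filters one-off
    # hallucinations that do not survive re-questioning.
    return "Bug" if revalidate(snippet) == "Bug" else "No Bug"

# Stub models: the detector flags everything; the validator only confirms
# snippets that actually divide by an unchecked variable.
flag_all = lambda s: "Bug"
confirm = lambda s: "Bug" if "/ n" in s and "n != 0" not in s else "No Bug"
kept = validate("return total / n", flag_all, confirm)
dropped = validate("return total / n if n != 0 else 0", flag_all, confirm)
```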

4. Representative Anti-Patterns and Specialized Processing

BugScope’s strategy adapts fluidly across a range of anti-patterns, as illustrated by example code snippets:

| Anti-Pattern | Slice Seed & Direction | Reasoning Hints / Prompt Example |
|---|---|---|
| Oversized Offset (OSO) – OOB | Faulty `size` (backward) | “check size vs buffer length” |
| Allocation Size Overflow (ASO) – OOB | `n * sizeof(int)` (forward) | “track integer wrap-around after multiplication” |
| Insufficient Zero Check (IZC) – DBZ | Divisor `z` (forward) | “check value-range > 0, not just non-negative” |
| System-Specific (OOB & DBZ) | `d` (forward) | “flag all uses of d->block[0]; buffer & division” |

The system generalizes to system-specific and compound bug classes, fusing multiple reasoning hints within flexible prompt templates.

5. Quantitative Evaluation and Comparative Analysis

BugScope achieves demonstrable superiority over industrial baselines on two axes: controlled benchmarks and real-world discovery.

  • Forty Real-World Bugs (across 7 anti-patterns)
    • Precision: 87.04%
    • Recall: 90.00%
    • $F_1$: 0.88
  • Breakdown by anti-pattern
    • OSO: P = 66.7%, R = 80%
    • NOF: P = 77.8%, R = 100%
    • ASO: P = 100%, R = 100%
    • IZC: P = 87.5%, R = 66.7%
    • LZD, UEC, MSC: all above 85% precision/recall

| Tool | Precision | Recall | $F_1$ |
|---|---|---|---|
| BugScope | 87.04% | 90.00% | 0.88 |
| RepoAudit | 32.14% | 42.50% | 0.37 |
| Cursor BugBot | 71.43% | 27.50% | 0.40 |
| CodeRabbit | 76.92% | 17.50% | 0.29 |
| Meta Infer | 7.69% | 2.50% | 0.04 |

On large open-source projects, including the Linux kernel, BugScope uncovered 141 novel bugs, with 78 already fixed and 7 confirmed by developers as impactful.

6. Guarantees, Limitations, and Key Insights

No formal soundness or completeness guarantees are provided—consistent with Rice’s theorem on the undecidability of generic bug detection, BugScope prioritizes empirical coverage and adaptability. Central design insights:

  • Generalization to novel anti-patterns arises from learning retrieval seeds and slicing strategies via few-shot examples, as opposed to hard-coded symbolic rules.
  • Multi-agent separation of context retrieval and detection mitigates LLM hallucination: slicing delivers focused code regions, and prompt synthesis ensures contextually-relevant reasoning.
  • Limiting interprocedural slice depth ($K = 3$) balances recall, precision, and scalability, with evaluation showing robust recall (90%) on real bugs.

Limitations include:

  • Dependence on exemplars: poor BE/NE selection may compromise retrieval strategy.
  • Bounded slice expansion: deep or dynamic control-flow may elude discovery.
  • Residual LLM hallucinations: a validation pass reduces, but does not entirely eliminate, this issue.

A plausible implication is that BugScope’s “learn to learn” methodology introduces an extensible paradigm for pattern-driven, example-centric bug discovery that is systematically adaptable to new domains, anti-patterns, and evolving codebases. This approach demonstrates that the human-like audit workflow—study examples, extract context, reason semantically—covers a broad spectrum of defects with high precision and recall (Guo et al., 21 Jul 2025).
