CogniGent: LLM-Driven Bug Localization
- CogniGent is an agentic, LLM-powered framework for automated bug localization that integrates causal reasoning with disciplined context management.
- It deploys a modular pipeline of seven AI agents performing hypothesis-driven exploration, call-chain analysis, and multi-stage validation to achieve superior MAP and MRR scores.
- The framework uses the Click2Cause DFS algorithm to conduct causal root-cause analysis while efficiently managing context through scratchpad isolation.
CogniGent is an agentic, LLM-powered framework for automated bug localization in software systems, distinguished by its explicit incorporation of causal reasoning, call-graph-driven dynamic cognition, and disciplined context management. Unlike conventional information retrieval (IR) or LLM-only approaches—which typically assess code components in isolation or with superficial similarity matching—CogniGent emulates human debugging with a pipeline of specialized agents for hypothesis-driven exploration, call-chain root-cause analysis, and multi-stage validation. Empirical results show statistically significant improvements in localization accuracy, measured by MAP and MRR, across multiple Java codebases when compared to six strong baselines (Samir et al., 18 Jan 2026).
1. Modular Agent Architecture and Execution Flow
CogniGent comprises seven interconnected AI agents—each powered by a dedicated LLM instance and orchestrated via a shared state graph (“LangGraph”). These agents are partitioned into context engineering, dynamic cognitive debugging, and final validation phases:
- Context Engineering Agents
- Restructuring Agent: Sanitizes bug reports by excising irrelevant text while preserving coherence.
- Retrieval Agent: Integrates BM25 with a Neo4j+Lucene index to select the top 100 potentially relevant code segments for the exact codebase version.
- Filtering Agent: Employs Intelligent Relevance Feedback (IRF) using an LLM to prune candidates to ~10 by reasoning over symptom–code alignment.
- Dynamic Cognitive Debugging Agents
- Hypothesis Agent: Generates for each segment a causal hypothesis, confidence category (high/medium/low), and numerical score in [0,1]. Only non-low confidence segments proceed.
- Supervisor Agent: Implements a ReAct-style protocol to decide, for each segment, whether the current evidence is sufficient; if not, selects which outgoing calls to traverse and launches an Explorer Agent.
- Explorer Agent: Executes depth-first traversal (DFS) over the call graph, powered by the Click2Cause algorithm. Maintains a “scratchpad” context restricted to the DFS path, assigns confidence, and backtracks/prunes branches where confidence drops.
- Observer Agent: Validates all discovered call chains and isolated segments, scoring them against their causal hypotheses and accumulated evidence.
- Final Candidate Ranking: Aggregates supervisor and observer scores to produce method- and document-level top-K rankings.
This orchestrated workflow enables CogniGent to simulate dynamic cognitive debugging in a manner structurally analogous to human reasoning.
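The final candidate-ranking step can be sketched as follows. This is a minimal illustration only: the equal-weight average and the field names (`supervisor_score`, `observer_score`) are assumptions, not the paper's exact aggregation formula.

```python
# Minimal sketch of the final candidate-ranking step: combine per-segment
# scores from the Supervisor and Observer agents into a top-K list.
# The equal-weight average and field names are illustrative assumptions.

def rank_candidates(candidates, k=10, w_supervisor=0.5, w_observer=0.5):
    """candidates: list of dicts with 'id', 'supervisor_score', 'observer_score'."""
    scored = [
        (w_supervisor * c["supervisor_score"] + w_observer * c["observer_score"],
         c["id"])
        for c in candidates
    ]
    scored.sort(key=lambda t: t[0], reverse=True)   # highest aggregate first
    return [cid for _, cid in scored[:k]]
```

The same helper could serve both the method- and document-level rankings by feeding it candidates at the corresponding granularity.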
2. Causal Root Cause Analysis: Click2Cause Algorithm
The framework’s causal reasoning core is the Click2Cause algorithm, encapsulated in the Explorer Agent. This algorithm employs LLM-guided DFS over the call graph, selectively following method calls and pruning upon evidence of diminished causal alignment.
Click2Cause Pseudocode:
Algorithm Click2Cause(bugReport, startSeg, callsToExplore, maxDepth, τ, C_parent):
    visited ← ∅
    scratchPad ← ∅
    bestChain C* ← (∅, 0)
    for each call ∈ callsToExplore do
        DFS(call, [call], 1, C_parent, bugReport)
    return C*

Procedure DFS(seg, path, depth, C_parent, bugReport):
    if seg ∈ visited or depth > maxDepth then return
    add seg to visited, path, scratchPad
    A ← LLMReason(bugReport, path)          // “think” step
    C_llm ← A.confidence                    // assign new confidence
    if C_llm < C_parent then                // prune weak branch
        remove seg from scratchPad
        return
    if C_llm > C*.score then                // update best chain
        C* ← (path, C_llm)
    if C_llm ≥ τ then                       // early stop on high confidence
        return
    for each nextCall ∈ A.callsToExplore do
        DFS(nextCall, path + nextCall, depth + 1, C_llm, bugReport)
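The pseudocode above can be rendered as a runnable Python sketch. The `llm_reason` callback stands in for the real LLM call and is an assumption for illustration; it returns a confidence score and the outgoing calls worth exploring.

```python
# Runnable sketch of Click2Cause: LLM-guided DFS over a call graph with
# confidence-based pruning and a path-local scratchpad. The llm_reason
# callback is a stand-in for the actual LLM reasoning step.

def click2cause(bug_report, calls_to_explore, call_graph, llm_reason,
                max_depth=5, tau=0.9, c_parent=0.0):
    visited, scratchpad = set(), []
    best = {"path": [], "score": 0.0}                   # best chain C*

    def dfs(seg, path, depth, c_par):
        if seg in visited or depth > max_depth:
            return
        visited.add(seg)
        scratchpad.append(seg)                          # extend path-local context
        c_llm, next_calls = llm_reason(bug_report, path)  # "think" step
        if c_llm < c_par:                               # prune weak branch
            scratchpad.pop()
            return
        if c_llm > best["score"]:                       # update best chain
            best["path"], best["score"] = list(path), c_llm
        if c_llm >= tau:                                # early stop at high confidence
            return
        for nxt in next_calls:
            if nxt in call_graph.get(seg, []):          # only follow real edges
                dfs(nxt, path + [nxt], depth + 1, c_llm)

    for call in calls_to_explore:
        dfs(call, [call], 1, c_parent)
    return best["path"], best["score"]
```

A deterministic `llm_reason` (e.g. a lookup table of confidences) makes the traversal testable without any model in the loop.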
Performance Metrics:
- Mean Average Precision (MAP): $\mathrm{MAP} = \frac{1}{|Q|}\sum_{q=1}^{|Q|} \frac{1}{|G_q|}\sum_{k=1}^{n} P_q(k)\,\mathrm{rel}_q(k)$
- Mean Reciprocal Rank (MRR): $\mathrm{MRR} = \frac{1}{|Q|}\sum_{q=1}^{|Q|} \frac{1}{\mathrm{rank}_q}$
where $|Q|$ is the number of bug queries, $|G_q|$ is the number of gold buggy items for query $q$, $\mathrm{rel}_q(k)$ is the binary relevance of the item at rank $k$, $P_q(k)$ is precision at rank $k$, and $\mathrm{rank}_q$ is the rank of the first relevant item for query $q$.
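The two metrics can be computed with a few lines of Python; the rankings and gold sets below are illustrative, not drawn from the evaluation dataset.

```python
# Minimal sketch of the evaluation metrics: MAP averages per-query average
# precision over the gold buggy items; MRR averages the reciprocal rank of
# the first relevant item per query.

def average_precision(ranking, gold):
    hits, ap = 0, 0.0
    for k, item in enumerate(ranking, start=1):
        if item in gold:
            hits += 1
            ap += hits / k          # precision at rank k, summed over hits
    return ap / len(gold) if gold else 0.0

def mean_average_precision(rankings, golds):
    return sum(average_precision(r, g)
               for r, g in zip(rankings, golds)) / len(rankings)

def mean_reciprocal_rank(rankings, golds):
    total = 0.0
    for ranking, gold in zip(rankings, golds):
        for k, item in enumerate(ranking, start=1):
            if item in gold:
                total += 1.0 / k    # reciprocal rank of first relevant item
                break
    return total / len(rankings)
```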
3. Context Engineering and Prompting Strategies
CogniGent implements disciplined agent context management, necessary for robust LLM reasoning in deep codebases:
- Scratchpad Contexts: Each Explorer Agent maintains only its active DFS path and minimal bug report excerpts; abandoned branches have context dropped.
- Meta-Prompting & Chain-of-Thought: System prompts are constructed via catalogued patterns and refined using LLM-powered meta-prompting. Hypothesis and Supervisor Agents use chain-of-thought exemplars, guiding hypothesis generation, confidence assignment, and traversal decision-making.
Prompt Example (Hypothesis Agent):
Given this bug description and the following 10 code-segment snippets, generate for each: 1. A concise causal hypothesis 2. A confidence level {High, Medium, Low} 3. A numerical score 0.0–1.0. Use chain-of-thought to show your reasoning.
This context engineering directly aids scalability and prevents context overflow seen in prior LLM or agentic approaches.
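Scratchpad isolation can be sketched as a context stack tied to the active DFS path: entries are pushed on descent and popped when a branch is abandoned, so discarded exploration never reaches later prompts. This is an illustrative sketch, not the framework's actual data structure.

```python
# Illustrative sketch of scratchpad isolation: the Explorer's prompt context
# holds only a fixed bug-report excerpt plus the segments on the active DFS
# path; backtracking pops the abandoned segment's context.

class Scratchpad:
    def __init__(self, bug_excerpt):
        self.bug_excerpt = bug_excerpt
        self.path_context = []               # one entry per segment on the path

    def push(self, segment_id, snippet):
        self.path_context.append((segment_id, snippet))

    def pop(self):                           # drop context when backtracking
        self.path_context.pop()

    def render(self):
        body = "\n".join(f"[{sid}] {snip}" for sid, snip in self.path_context)
        return f"Bug: {self.bug_excerpt}\n{body}"
```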
4. Empirical Evaluation and Baseline Comparisons
The evaluation leverages a dataset of 591 bug reports spanning 132 versions of 15 major Java systems across Apache, Spring, and Wildfly projects. Bug reports are stratified by type: natural language (NL), program element (PE), and stack trace (ST).
- Granularity: Localization is measured both at the method/constructor-level and document-level (whole files).
- Baselines: Six established methods, including Lucene-only IR, BLUiR, BLIZZARD, BRaIn, Agentless LLM, and LocAgent.
Key Results (Devstral 20B CogniGent vs. Lucene IR):
| Level | MAP_baseline | MAP_CogniGent | % Improvement |
|---|---|---|---|
| Document | 0.330 | 0.407 | +23.33% |
| Method | 0.163 | 0.226 | +38.57% |
| Level | MRR_baseline | MRR_CogniGent | % Improvement |
|---|---|---|---|
| Document | 0.334 | 0.418 | +25.14% |
| Method | 0.165 | 0.254 | +53.74% |
Statistical testing via the Wilcoxon signed-rank method against BLIZZARD and LocAgent confirms statistically significant Top-K rank improvements, with Cliff’s δ indicating effect sizes ranging from small–medium to large.
5. Analytical Discussion and Limitations
CogniGent’s gains over IR and prior LLM baselines are attributed to several core principles:
- Hypothesis-Driven Exploration: Systematically forming, validating, and refining hypotheses provides explicit reasoning about causal links, not available via token-based or static graph walks.
- Call-Graph Causal Reasoning: DFS traversal across code dependencies uncovers latent failure propagation not accessible to surface-level IR or naive BFS graph methods.
- Focused Context Management: Scratchpad isolation prevents LLM cognitive overload, preserving precision even under deep call-chain explorations.
Limitations:
- Language Scope: Currently limited to Java. Adapting to other languages necessitates re-engineering of code indexing and parsing components.
- Computational Cost: The Devstral model incurs roughly \$0.0026 and ~3 minutes per report; scaling optimizations may be required for very large datasets.
- Structural Depth: Present call-graph analysis could benefit from richer program analysis, such as data-flow slicing, symbolic execution, or empirical runtime spectra.
6. Future Directions and Potential Extensions
Planned and plausible extensions include:
- Multi-language support: Integration of flexible code parsers and cross-language code segment retrieval for targeting diverse codebases.
- Scaling Strategies: Employing model distillation, quantization, or efficient retrieval to accommodate industrial-scale datasets.
- Semantic Enrichment: Augmentation with symbolic execution, trace spectra, or data-flow analysis for enhanced hypothesis generation.
- Hybrid Reasoning: Combining agentic LLM reasoning with program-analysis techniques for improved interpretability and transparency.
CogniGent combines dynamic cognitive debugging inspired by expert developer workflows with agentic LLM automation, yielding statistically significant improvements in bug localization accuracy and laying groundwork for scalable, causal reasoning in automated software maintenance (Samir et al., 18 Jan 2026).