SWE Context Bench for Software Agents
- SWE Context Bench is a benchmark suite that rigorously defines gold contexts to evaluate LLM-based agents in resolving real-world coding issues.
- It logs granular actions and employs metrics like recall, precision, and efficiency to diagnose the retrieval process in automated software engineering.
- The framework also supports experience reuse evaluation by comparing full versus summarized context reuse, revealing trade-offs in patch generation.
SWE Context Bench, in the context of automated software engineering, refers to a rigorous, repository-scale benchmark suite and framework designed to dissect and evaluate how coding agents—especially those powered by modern LLMs—retrieve, assemble, and leverage relevant code context during complex issue resolution. Rather than limiting evaluation to binary task outcomes, SWE Context Bench introduces human-verified “gold contexts” as granular intermediate signals and enables process-oriented analysis of context retrieval, experience reuse, and efficiency. The benchmark family encompasses datasets, action-logging protocols, and metrics for recall, precision, efficiency, and transfer, establishing a new standard for agent and LLM diagnostic evaluation in realistic software workflows (Li et al., 5 Feb 2026, Zhu et al., 9 Feb 2026).
1. Dataset Scope, Gold Contexts, and Annotation Protocols
SWE Context Bench encompasses 1,136 real-world issue-resolution tasks drawn from 66 open-source repositories spanning eight programming languages (Python, Java, JavaScript, TypeScript, Go, Rust, C, C++). The underlying dataset includes 4,548 files, 23,116 AST-level blocks, and 522,115 lines, each line range or block meticulously labeled as “gold context”—meaning it is strictly required to reconstruct a passing bug fix.
Gold context annotation follows an iterative, dependency-tracing workflow. Experts begin from the ground-truth patch and exhaustively trace function calls, class references, data/control dependencies, and definitions until only the minimal sufficient set of code regions remains. Each candidate context is verified by conditioning a state-of-the-art LLM (e.g., GPT-5) on these spans alone; if the model cannot generate a correct patch passing the designated suite, human annotators refine the context in up to two more rounds. All such spans are compacted to eliminate redundancy and maximize informativeness (Li et al., 5 Feb 2026).
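The verification loop described above can be sketched as follows. This is a hypothetical outline, not the benchmark's actual tooling: the patch generator, test runner, and refinement step are passed in as callables standing in for the LLM check, the repository's test suite, and the human annotator.

```python
def annotate_gold_context(initial_context, generate_patch, tests_pass, refine,
                          compact=lambda c: c, max_rounds=3):
    """Hypothetical sketch of the iterative gold-context verification loop:
    a candidate context is accepted only if a patch generated from those
    spans alone passes the test suite; otherwise it is refined, for up to
    max_rounds attempts (the initial check plus two refinement rounds)."""
    context = initial_context
    for _ in range(max_rounds):
        patch = generate_patch(context)   # condition the model on the spans only
        if tests_pass(patch):             # passing suite => context is sufficient
            return compact(context)       # compact spans to remove redundancy
        context = refine(context)         # annotator expands/adjusts the spans
    return None                           # could not verify within the budget


# Toy usage with stub callables: the "patch" passes only once span "b" is added.
result = annotate_gold_context(
    {"a"},
    generate_patch=lambda c: c,
    tests_pass=lambda p: "b" in p,
    refine=lambda c: c | {"b"},
)
print(result)  # {'a', 'b'}  (order within the set may vary)
```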
2. Process-Oriented Evaluation Framework and Retrieval Metrics
A distinguishing feature of SWE Context Bench is its process-oriented evaluation harness, which instruments agents to log every context access—specifically each “file-view” or “code-print” action—at every step of the agent’s trajectory. The agent’s final declared context is then compared against the gold at three granularities: file paths, AST-blocks, and line spans, using Tree-Sitter–based parsing and alignment.
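A logged context-access action and its projection onto the three comparison granularities might look like the following. The field names are illustrative, not the benchmark's actual log schema, and AST-block membership (derived via Tree-Sitter in the real harness) is left symbolic.

```python
import json

# Hypothetical shape of one logged "file-view" action.
action = {"step": 7, "tool": "file-view", "path": "src/parser.py",
          "start_line": 120, "end_line": 164}

def to_granularities(act):
    """Project one logged access onto file-path and line-span granularities.
    AST-block granularity would come from Tree-Sitter parsing of the file."""
    return {
        "file": act["path"],
        "lines": {(act["path"], ln)
                  for ln in range(act["start_line"], act["end_line"] + 1)},
    }

g = to_granularities(action)
print(json.dumps({"file": g["file"], "n_lines": len(g["lines"])}))
# {"file": "src/parser.py", "n_lines": 45}
```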
Three principal metrics are defined by set-overlap:
- Context Recall: the proportion of gold context retrieved by the agent.
- Context Precision: the fraction of the agent's retrieved context that is actually gold; low precision indicates noisy or irrelevant retrieval.
- Retrieval Efficiency: the ratio of retrieved context size to the agent's overall context token budget.
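As set-overlap quantities, the three metrics reduce to a few lines of code. The sketch below assumes contexts are represented as sets of (file path, line number) pairs; the function names are illustrative, not the benchmark's API.

```python
def context_recall(retrieved: set, gold: set) -> float:
    """Fraction of the gold context the agent retrieved."""
    return len(retrieved & gold) / len(gold) if gold else 0.0

def context_precision(retrieved: set, gold: set) -> float:
    """Fraction of the agent's retrieved context that is gold (purity)."""
    return len(retrieved & gold) / len(retrieved) if retrieved else 0.0

def retrieval_efficiency(retrieved_tokens: int, budget_tokens: int) -> float:
    """Retrieved context size relative to the overall context token budget."""
    return retrieved_tokens / budget_tokens if budget_tokens else 0.0

# Toy example: the agent sees all 10 gold lines but reads 20 lines in total.
gold = {("utils.py", i) for i in range(10, 20)}
retrieved = {("utils.py", i) for i in range(10, 30)}
print(context_recall(retrieved, gold))     # 1.0: every gold line was retrieved
print(context_precision(retrieved, gold))  # 0.5: half the retrieval is noise
```

The toy example mirrors the recall-precision trade-off reported in Section 3: perfect recall can coexist with heavily diluted precision.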
The harness further supports temporal analysis: by tracking the cumulative union of all context accesses up to each step of the trajectory, an area-under-coverage score (AUC-Cov) can be computed. Additional diagnostics include the fraction of relevant context observed but ultimately not carried forward ("evidence drop") and per-step redundancy (Li et al., 5 Feb 2026).
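The two temporal diagnostics can be sketched in the same set-based style, assuming the harness yields one set of accessed items per step; again the names are illustrative.

```python
def auc_cov(step_accesses: list, gold: set) -> float:
    """Mean gold coverage of the cumulative context across all steps:
    coverage at step t uses the union of accesses up to and including t."""
    seen, coverages = set(), []
    for accessed in step_accesses:
        seen |= accessed
        coverages.append(len(seen & gold) / len(gold))
    return sum(coverages) / len(coverages) if coverages else 0.0

def evidence_drop(step_accesses: list, final_context: set, gold: set) -> float:
    """Fraction of gold evidence observed at some step but not carried
    forward into the final declared context."""
    seen_gold = set().union(*step_accesses) & gold
    if not seen_gold:
        return 0.0
    return len(seen_gold - final_context) / len(seen_gold)

# Toy trajectory: all four gold items are eventually seen (high AUC-Cov),
# but only two survive into the final context (evidence drop of 0.5).
steps = [{1, 2}, {3}, {4}]
print(auc_cov(steps, gold={1, 2, 3, 4}))             # 0.75
print(evidence_drop(steps, {1, 2}, gold={1, 2, 3, 4}))  # 0.5
```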
3. Experimental Findings and Interpretation
SWE Context Bench has been employed to evaluate both agent frameworks (mini-SWE-agent, OpenHands, SWE-agent, Prometheus, Agentless) and backbone LLMs (GPT-5, Claude Sonnet 4.5, Gemini 2.5 Pro, Devstral 2).
Key findings include:
- Recall–Precision Trade-off: All evaluated systems exhibit high recall but comparatively low precision; for instance, GPT-5 and Prometheus achieve line-level recall above 0.60 but seldom surpass 0.35 precision, yielding F1 scores below 0.40.
- Marginal Returns from Sophisticated Scaffolding: Advanced agents integrating embedding-based search, graph navigation, or specialized file tools do not consistently outperform the basic mini-SWE-agent baseline. This recapitulates the “Bitter Lesson” in AI: simple, human-inspired heuristics coupled with robust LLMs can match more ornate orchestration layers, as advanced agents tend to merely rediscover context available via straightforward shell-style search (Li et al., 5 Feb 2026).
- Consolidation Gap (“Evidence Drop”): Even when agents access most or all relevant code during their trajectory (AUC-Cov > 0.70), only 50–70% of this evidence is retained in their final context. This retrieval–utilization gap results in agents failing to condition patch generation on critical lines despite having “seen” them, explaining many test failures.
4. Extensions: Experience Reuse and Efficiency in SWE-ContextBench
The SWE-ContextBench extension (Zhu et al., 9 Feb 2026) formalizes task-sequence structure and experience reuse. It augments 300 base tasks from SWE-Bench Lite with 99 related tasks derived from real dependency graphs (e.g., multi-issue resolutions, PR↔issue references, chains), establishing formal sequences with explicit (i→j) dependency edges.
Five agent experience-retrieval paradigms are benchmarked:
- No Experience: Baseline, no access to past runs.
- Free Experience Reuse: Agent autonomously retrieves full prior trajectories.
- Oracle Experience Reuse: Perfect prior trajectory handoff along ground-truth dependencies.
- Free Summary Reuse: Agent picks from a pool of compact prior summaries.
- Oracle Summary Reuse: Agent furnished with the correct concise summary for the dependency.
Metrics are multi-faceted:
- Task-level resolution accuracy
- Wall-clock runtime and percent speedup
- Token cost reductions
Key empirical results:
| Mode | Resolution Rate (%) | Avg. Runtime (s) | Avg. API Cost ($) |
|---|---|---|---|
| No Experience | 26.26 | 381.95 | 0.79 |
| Oracle Summary Reuse | 34.34 | 356.95 | 0.77 |
| Oracle Experience | 27.27 | — | — |
| Free Summary Reuse | 22.22 | — | 0.98 |
Only “oracle” summary provision reliably boosts both accuracy and efficiency. Autonomous or mismatched retrieval (free search or summary) is often detrimental—misleading or irrelevant memories dilute the agent’s reasoning. Summarized trajectories are more effective than full ones, confirming that information bottlenecking is essential for reliable knowledge reuse (Zhu et al., 9 Feb 2026).
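The headline gains of Oracle Summary Reuse over the No Experience baseline follow directly from the table above:

```python
# Worked arithmetic from the reported table (No Experience vs Oracle Summary Reuse).
baseline = {"resolve_pct": 26.26, "runtime_s": 381.95, "cost_usd": 0.79}
oracle_summary = {"resolve_pct": 34.34, "runtime_s": 356.95, "cost_usd": 0.77}

speedup = (baseline["runtime_s"] - oracle_summary["runtime_s"]) / baseline["runtime_s"]
accuracy_gain = oracle_summary["resolve_pct"] - baseline["resolve_pct"]

print(f"runtime speedup: {speedup:.1%}")               # runtime speedup: 6.5%
print(f"resolution gain: {accuracy_gain:.2f} points")  # resolution gain: 8.08 points
```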
5. Diagnostic and Practical Implications for Software Engineering Agents
The granularity and temporal resolution of SWE Context Bench metrics enable direct integration into the software development process:
- Agents can be regularized or stopped based on intermediate recall/precision thresholds, e.g., halting patch synthesis unless recall > 0.8, or penalizing overbroad preamble inclusion via low precision.
- IDE plugins or interactive coding assistants can visualize real-time gold-context recall bars, allowing human users to steer attention toward less-covered modules or dependencies.
- CI and code review bots can perform context-retrieval diagnostics either in “dry-run” (pre-commit) or as part of pull-request validation, automatically flagging fixes that fail to encompass all components referenced in human patches, thus tightening the gap between automated and human review rigor (Li et al., 5 Feb 2026).
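A gate of the kind described above could be as simple as the following sketch; the thresholds and function name are illustrative assumptions, not part of the benchmark.

```python
def should_submit(recall: float, precision: float,
                  min_recall: float = 0.8, min_precision: float = 0.2) -> bool:
    """Hypothetical CI gate: allow automated patch submission only when
    intermediate context-retrieval quality clears both thresholds."""
    return recall > min_recall and precision >= min_precision

print(should_submit(0.85, 0.30))  # True
print(should_submit(0.60, 0.50))  # False: recall below the 0.8 threshold
```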
Standardized JSON logs and AST-based alignments render the dataset amenable to downstream research: for example, enabling retriever training via Tree-Sitter alignments or facilitating auxiliary LLM objectives to directly predict gold-relevant context spans.
6. Connections to Broader Benchmarking and Limitations
SWE Context Bench sits at the convergence of recent trends in software agent evaluation. It augments end-to-end task benchmarks—such as SWE-bench (Jimenez et al., 2023) and LongCodeBench (Rando et al., 12 May 2025)—with process-level, context-focused scrutiny. Where earlier work was sensitive to benchmark overfitting and data contamination (e.g., leakage in SWE-bench leading to memorization over reasoning (Liang et al., 14 Jun 2025)), SWE Context Bench interposes intermediate, human-audited signals that illuminate agent behavior even amid high downstream pass rates.
Limitations include annotation cost for gold contexts, absence of dynamic or conversational user interactions (cf. ToM-SWE (Zhou et al., 24 Oct 2025)), and incomplete modeling of distribution shift and catastrophic forgetting. However, process-centric diagnostic instrumentation and sequence-based reuse evaluation offer a principled scaffold for future work on information efficiency, generalization, and interactive memory in software engineering LLM agents.
7. Outlook: Future Directions and Research Opportunities
Emergent research priorities include:
- Closing the consolidation gap by unifying retrieval and patch generation in end-to-end LLMs, using auxiliary losses to maximize alignment with human gold context traces.
- Scaling human-verified gold context annotation to broader language and repository diversity (multi-language, polyglot repositories, deeper histories).
- Integrating robust experience summarization, negative mining, and adaptive retrieval for long-horizon, multi-task workflows as formalized in SWE-ContextBench.
- Developing contamination-resistant evaluation harnesses and mutation-based protocols to guard against memorization, as highlighted by recent diagnostics (Liang et al., 14 Jun 2025).
- Leveraging context-level diagnostics for tool-use efficiency and continual learning metrics in agentic architectures (Joshi et al., 13 Jun 2025).
- Enabling dynamic, user-in-the-loop context guidance via retrieval metrics surfaced to both agents and practitioners.
SWE Context Bench thus establishes a rigorous foundation for advancing interpretable, efficient, and trustworthy LLM-based coding systems—transforming context from a latent variable into an explicit locus of measurement, optimization, and interaction (Li et al., 5 Feb 2026, Zhu et al., 9 Feb 2026).