SWE-ContextBench: A Benchmark for Context Learning in Coding
Abstract: LLMs are increasingly used as programming agents for repository-level software engineering tasks. While recent benchmarks evaluate correctness in realistic codebases, they largely treat tasks as independent and do not assess whether agents can reuse experience across related problems. As a result, the ability of agents to accumulate, retrieve, and apply prior experience, as well as the efficiency gains from such reuse, remains difficult to measure. We introduce SWE-ContextBench, a benchmark designed to explicitly evaluate experience reuse in programming agents. Built on SWE-Bench Lite, SWE-ContextBench augments 300 base tasks with 99 related tasks derived from real dependency and reference relationships among GitHub issues and pull requests, forming task sequences with shared context. The benchmark evaluates agents along three complementary dimensions: prediction accuracy, time efficiency, and cost efficiency. Using SWE-ContextBench, we study multiple experience reuse settings, including oracle-guided and autonomous retrieval, as well as full execution trajectories and compact summaries. Our results show that correctly selected summarized experience improves resolution accuracy and substantially reduces runtime and token cost, particularly on harder tasks. In contrast, unfiltered or incorrectly selected experience provides limited or negative benefits. These findings highlight the importance of experience representation and retrieval quality, and position SWE-ContextBench as a principled benchmark for studying experience reuse in programming agents.
Practical Applications
Immediate Applications
Below is a concise set of practical, deployable use cases that leverage the paper’s findings on experience reuse, summary-based memory, and efficiency metrics.
- Software/DevOps (industry): Pre-procurement benchmarking of code agents using SWE-ContextBench to compare accuracy, time, and cost (a minimal aggregation sketch appears after this list)
  - Tools/workflows: CI job that runs agents on the 99 related tasks; dashboard reporting “Resolved,” FAIL_TO_PASS, PASS_TO_PASS, runtime, and token cost
  - Assumptions/dependencies: Repositories have runnable test suites; reproducible environments; access to agent APIs and token accounting
- Developer tooling (industry): VSCode/JetBrains plugin that records fix trajectories and auto-generates compact “Fix Cards” (summaries) for reuse on related issues (see the Fix Card schema sketch after this list)
  - Tools/products: Experience pool manager; trajectory recorder; summary generator; retrieval panel that suggests relevant past fixes
  - Assumptions/dependencies: Teams allow logging of agent actions; storage for summaries; basic issue/PR linking in repos
- Agent orchestration (software): Cost- and time-aware “memory gate” that prioritizes concise summaries over full trajectories and falls back to baseline solving when neither fits the budget (a budget-gating sketch appears after this list)
  - Tools/workflows: Policy engine that triggers summary-first prompts; token budget monitor; guardrails for unhelpful memory (disable misleading summaries)
  - Assumptions/dependencies: Reliable token usage metrics; configurable orchestration; stable API pricing
- QA/Testing (software): Automated guardrails using FAIL_TO_PASS and PASS_TO_PASS as acceptance criteria for AI-generated patches (see the acceptance-gate sketch after this list)
  - Tools/workflows: CI checks that require all FAIL_TO_PASS tests to pass and enforce no regressions on PASS_TO_PASS
  - Assumptions/dependencies: Comprehensive tests; patch application pipeline; clear separation of test vs. solution patches
- Open-source maintenance (industry/academia): GitHub Actions that mine and surface related issues/PRs to seed an experience pool and encourage better linkage for reuse (a reference-mining sketch appears after this list)
  - Tools/workflows: Reference-analyzer action; “related-issues” labels; auto-generated summary comments on merged PRs
  - Assumptions/dependencies: Project maintainers adopt templates and linking practices; access to GitHub metadata
- AI platform/model cards (industry): Publishing efficiency-aware benchmark results (accuracy, time, cost) for coding agents
  - Products: Model cards that report SWE-ContextBench metrics and retrieval strategies tuned for summary-first reuse
  - Assumptions/dependencies: Benchmark license compatibility; vendor transparency; periodic re-runs for updates
- Engineering management/Finance (industry): Token and runtime budgeting for agent workloads grounded in benchmark-derived efficiency profiles
  - Tools/workflows: Budget planners; cost dashboards; SLAs that include efficiency targets (e.g., max runtime per task)
  - Assumptions/dependencies: Predictable workloads; known pricing; alignment between engineering and finance
- Documentation/Knowledge management (industry): Auto-generated “summary of fix” sections stored in a team knowledge base to accelerate future triage
  - Tools/products: Knowledge base integration; semantic search over summaries; “experience reuse” tags
  - Assumptions/dependencies: Adoption in PR templates; quality summarization; data retention policies
- Security (industry): Minimizing broad context exposure by favoring concise, vetted summaries over full logs to reduce accidental leakage
  - Tools/workflows: Secret scanning on trajectories; retrieval gating policies; summary sanitization steps
  - Assumptions/dependencies: Security tooling integrated into CI; privacy constraints; access control on memory stores
- Education (academia): Course labs using SWE-ContextBench to teach efficiency trade-offs in agent-assisted software engineering
  - Tools/workflows: Assignments contrasting baseline vs. summary reuse; analysis of retrieval quality effects on outcomes
  - Assumptions/dependencies: Compute quotas; dataset access; reproducible environments
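For the pre-procurement benchmarking item above, per-task agent results can be rolled up into the three benchmark dimensions with a small script. The sketch below is a minimal example under stated assumptions: the record fields (`agent`, `resolved`, `runtime_s`, `tokens`) and the blended per-token price are hypothetical placeholders, not part of SWE-ContextBench's official tooling.

```python
# Minimal sketch of a procurement-style comparison report.
# Assumptions: each record carries agent name, resolution flag, wall-clock
# runtime, and total tokens; pricing is a single blended placeholder rate.
from collections import defaultdict

PRICE_PER_1K_TOKENS = 0.01  # assumed blended price (USD); replace with real pricing

def summarize(records):
    """Aggregate per-task results into accuracy, time, and cost per agent."""
    by_agent = defaultdict(list)
    for r in records:
        by_agent[r["agent"]].append(r)
    report = {}
    for agent, rows in by_agent.items():
        n = len(rows)
        report[agent] = {
            "resolved_rate": sum(r["resolved"] for r in rows) / n,
            "mean_runtime_s": sum(r["runtime_s"] for r in rows) / n,
            "mean_cost_usd": sum(r["tokens"] / 1000 * PRICE_PER_1K_TOKENS for r in rows) / n,
        }
    return report

if __name__ == "__main__":
    demo = [
        {"agent": "baseline", "resolved": True, "runtime_s": 310, "tokens": 42_000},
        {"agent": "baseline", "resolved": False, "runtime_s": 520, "tokens": 71_000},
        {"agent": "summary-reuse", "resolved": True, "runtime_s": 190, "tokens": 23_000},
        {"agent": "summary-reuse", "resolved": True, "runtime_s": 240, "tokens": 27_000},
    ]
    for agent, stats in summarize(demo).items():
        print(agent, stats)
```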
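For the “Fix Cards” plugin idea, a compact structured summary is what makes a trajectory reusable. The schema below is a hypothetical sketch: the field names, the prompt rendering, and the example contents are assumptions for illustration, not the paper's summary format.

```python
# Hypothetical "Fix Card" schema: a compact summary distilled from a full
# agent trajectory so it can be attached to a related follow-up issue.
from dataclasses import dataclass, field

@dataclass
class FixCard:
    issue_id: str                          # identifier of the resolved task
    symptom: str                           # short description of the failing behaviour
    root_cause: str                        # what the trajectory identified as the bug
    files_touched: list = field(default_factory=list)
    fix_outline: str = ""                  # 2-4 sentences describing the change
    tests_added: list = field(default_factory=list)

    def to_prompt(self) -> str:
        """Render the card as a short context block for a follow-up task."""
        return (
            f"Previous fix for {self.issue_id}:\n"
            f"Symptom: {self.symptom}\n"
            f"Root cause: {self.root_cause}\n"
            f"Files: {', '.join(self.files_touched)}\n"
            f"Fix: {self.fix_outline}"
        )

# Illustrative card for a made-up repository and issue.
card = FixCard(
    issue_id="example__repo-123",
    symptom="Config loader crashes on empty files",
    root_cause="missing empty-input guard in Config.load",
    files_touched=["src/config.py"],
    fix_outline="Return a default Config when the file body is empty.",
    tests_added=["tests/test_config.py::test_empty_file"],
)
print(card.to_prompt())
```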
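The memory-gate policy above can be approximated with a simple budget check that prefers the summary, falls back to the full trajectory only if it still fits, and otherwise solves from scratch. The token estimator, section headers, and budget value below are rough assumptions, not a prescribed implementation.

```python
# Sketch of a cost-aware "memory gate": summary-first, trajectory as fallback,
# baseline solving when neither fits the remaining token budget.
def estimate_tokens(text):
    # Rough heuristic: roughly 4 characters per token for English text and code.
    return max(1, len(text) // 4)

def build_context(task_prompt, summary=None, trajectory=None, budget_tokens=8_000):
    """Return (context, mode), where mode records which kind of memory was used."""
    base = estimate_tokens(task_prompt)
    if summary and base + estimate_tokens(summary) <= budget_tokens:
        return task_prompt + "\n\n# Related experience (summary)\n" + summary, "summary"
    if trajectory and base + estimate_tokens(trajectory) <= budget_tokens:
        return task_prompt + "\n\n# Related experience (full trajectory)\n" + trajectory, "trajectory"
    return task_prompt, "baseline"

ctx, mode = build_context(
    "Fix the crash when loading an empty config file.",
    summary="Previous fix: add an empty-input guard in Config.load.",
)
print(mode)  # -> "summary"
```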
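The FAIL_TO_PASS / PASS_TO_PASS guardrail reduces to a two-condition check over test outcomes. The sketch below assumes test results have already been collected into a mapping from test id to pass/fail; how those results are produced (pytest, tox, an evaluation harness, ...) is project-specific and left out.

```python
# Minimal acceptance gate in the spirit of SWE-Bench-style evaluation:
# a patch is accepted only if every FAIL_TO_PASS test now passes and
# no PASS_TO_PASS test regresses.
def is_resolved(test_results, fail_to_pass, pass_to_pass):
    fixed = all(test_results.get(t, False) for t in fail_to_pass)
    no_regression = all(test_results.get(t, False) for t in pass_to_pass)
    return fixed and no_regression

# Illustrative test ids; real ids come from the benchmark's task metadata.
results = {
    "tests/test_config.py::test_empty_file": True,
    "tests/test_config.py::test_round_trip": True,
}
print(is_resolved(
    results,
    fail_to_pass=["tests/test_config.py::test_empty_file"],
    pass_to_pass=["tests/test_config.py::test_round_trip"],
))  # -> True
```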
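The related-issue mining workflow can start from plain reference extraction over issue and PR bodies. The regex and in-memory graph below are a deliberate simplification; the dependency and reference relations used to build SWE-ContextBench may be mined from richer GitHub metadata.

```python
# Sketch of a reference miner that seeds an experience pool: it looks for
# GitHub-style "#123" mentions and "fixes/closes/resolves #123" keywords.
import re
from collections import defaultdict

REF_PATTERN = re.compile(r"(?:fixes|closes|resolves)?\s*#(\d+)", re.IGNORECASE)

def extract_references(body):
    """Return the set of issue/PR numbers mentioned in a text body."""
    return {int(m.group(1)) for m in REF_PATTERN.finditer(body or "")}

def build_relation_graph(items):
    """items maps issue/PR number -> body text; returns an undirected adjacency map."""
    graph = defaultdict(set)
    for number, body in items.items():
        for ref in extract_references(body):
            if ref != number and ref in items:
                graph[number].add(ref)
                graph[ref].add(number)
    return graph

items = {
    101: "Crash when parsing empty config, likely related to #98",
    98: "Parser rejects files with no top-level keys",
}
print(dict(build_relation_graph(items)))  # -> {101: {98}, 98: {101}}
```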
Long-Term Applications
Below are forward-looking use cases that build on the benchmark’s insights and require additional research, scaling, or ecosystem development.
- Production memory OS for agents (software): Organization-wide “Agent Knowledge Graph” that mines relations among issues/PRs and serves high-quality, compact experience to agents
  - Tools/products: MemOS-like memory supervisor; graph-based retrieval; continual summarization and deduplication
  - Assumptions/dependencies: Scalable storage/compute; robust entity linking; governance around data provenance and access
- Domain-adapted experience-reuse benchmarks (robotics/healthcare/finance/energy): SWE-ContextBench-style datasets for sequential, related tasks (e.g., maintenance logs, clinical triage, incident resolution)
  - Products: ContextBench variants with validated outcomes beyond code; sector-specific test harnesses
  - Assumptions/dependencies: High-quality, privacy-compliant datasets; standardized evaluation protocols; domain experts for curation
- Industry standards and certifications (policy/industry): Efficiency-aware benchmarks adopted in procurement and compliance to ensure responsible agent deployment
  - Tools/workflows: Consortium-backed specs for accuracy/time/cost reporting; reproducibility requirements; audit trails
  - Assumptions/dependencies: Multi-stakeholder participation; regulatory alignment; shared reference implementations
- Learned retrieval/ranking approximating oracle selection (software/ML): Models that predict the “right summary” to present, trained on task relationship signals and resolution outcomes (a lightweight ranking sketch appears after this list)
  - Tools/products: Retrieval models with efficiency rewards; confidence calibration; drift monitoring
  - Assumptions/dependencies: Labels for relatedness; safe RL policies; continuous evaluation
- Knowledge marketplaces (industry): Secure exchange of anonymized “experience summaries” between companies/teams to accelerate fixes across similar stacks
  - Tools/products: Licensing and anonymization frameworks; federated search; IP-safe sharing protocols
  - Assumptions/dependencies: Legal agreements; privacy tech; standardized summary schemas
- Continuous-learning agents with memory management (software): Agents that accumulate, compress, and selectively forget experience to stay efficient on large, evolving codebases (a usefulness-scoring sketch appears after this list)
  - Tools/workflows: Compression policies; usefulness scoring; automatic pruning and refresh strategies
  - Assumptions/dependencies: Stable criteria for “useful experience”; robust telemetry; safe forgetting mechanisms
- Repository hygiene and refactoring for reuse (software): Automated suggestions to restructure issues/PRs and tests to maximize reuse potential and evaluability
  - Tools/products: Reference-mining bots; linkage recommendations; test coverage diagnostics tied to FAIL_TO_PASS/PASS_TO_PASS distribution
  - Assumptions/dependencies: Developer buy-in; minimal disruption to existing workflows; support from platform APIs
- Sustainability reporting (energy/policy): Efficiency-focused agent design tied to compute/energy metrics, enabling greener AI development practices (a token-to-energy sketch appears after this list)
  - Tools/workflows: Energy-to-token cost mapping; sustainability dashboards; efficiency targets in OKRs
  - Assumptions/dependencies: Reliable conversion of compute to energy; organizational commitment to ESG goals
- Financial planning for AI operations (finance/industry): Dynamic budget allocation models using efficiency metrics to forecast agent operating expenses and ROI
  - Tools/products: Scenario planners; cost simulators; efficiency SLAs
  - Assumptions/dependencies: Stable pricing; observability over workloads; integrated finance/engineering processes
- Standardized curricula and assessments (academia/education): Programs emphasizing experience reuse strategies, retrieval quality, and efficiency metrics in AI+SE education
  - Tools/workflows: Benchmark-aligned assignments; capstone projects building memory-augmented agents; shared evaluation rubrics
  - Assumptions/dependencies: Broad adoption of datasets/benchmarks; institutional support; accessible compute resources
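A learned retriever that approximates oracle selection could begin with a simple scorer combining lexical overlap and a success-rate prior, as sketched below. The features, weights, and acceptance threshold are illustrative assumptions; a production system would likely use trained embeddings or a reranker with efficiency-aware rewards.

```python
# Sketch of a stand-in for oracle summary selection: rank candidate summaries
# by lexical overlap with the new issue plus a prior for past usefulness.
import re

def _tokens(text):
    return set(re.findall(r"[a-z_]{3,}", text.lower()))

def score(issue_text, summary, past_success_rate, w_overlap=1.0, w_prior=0.5):
    issue_toks, summary_toks = _tokens(issue_text), _tokens(summary)
    overlap = len(issue_toks & summary_toks) / max(1, len(issue_toks | summary_toks))
    return w_overlap * overlap + w_prior * past_success_rate

def select_summary(issue_text, candidates, threshold=0.2):
    """candidates: list of (summary_text, past_success_rate). Returns best summary or None."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: score(issue_text, c[0], c[1]))
    if score(issue_text, best[0], best[1]) >= threshold:
        return best[0]
    return None  # below threshold: solve without memory rather than risk misleading context

chosen = select_summary(
    "Config loader crashes on empty files",
    [("Guard Config.load against empty input", 0.8),
     ("Rewrite the HTTP retry logic", 0.4)],
)
print(chosen)  # -> "Guard Config.load against empty input"
```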
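Continuous-learning agents need a usefulness signal to decide which experience to keep. The decay formula, half-life, and threshold below are arbitrary assumptions chosen only to illustrate a pruning loop over an experience pool.

```python
# Sketch of usefulness-based pruning: keep entries that were used recently
# and actually helped; decay the rest and drop them below a threshold.
import time

def usefulness(entry, now, half_life_days=30.0):
    """entry: {'last_used': epoch seconds, 'retrievals': int, 'successes': int}."""
    age_days = (now - entry["last_used"]) / 86_400
    recency = 0.5 ** (age_days / half_life_days)        # exponential recency decay
    success_rate = entry["successes"] / max(1, entry["retrievals"])
    return recency * (0.5 + success_rate)               # blend recency with observed utility

def prune(pool, keep_threshold=0.2):
    """pool maps summary id -> usage stats; returns the retained subset."""
    now = time.time()
    return {k: v for k, v in pool.items() if usefulness(v, now) >= keep_threshold}

pool = {
    "fix-config-empty": {"last_used": time.time() - 5 * 86_400, "retrievals": 4, "successes": 3},
    "stale-workaround": {"last_used": time.time() - 200 * 86_400, "retrievals": 6, "successes": 1},
}
print(sorted(prune(pool)))  # the stale, rarely helpful entry is dropped
```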
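Sustainability reporting ultimately requires mapping token usage to energy and emissions. The constants below (joules per thousand tokens, grid carbon intensity) are placeholder assumptions; a real report would substitute measured, provider- and region-specific figures.

```python
# Back-of-the-envelope token-to-energy mapping for sustainability dashboards.
JOULES_PER_1K_TOKENS = 300.0   # assumed inference energy; replace with measured values
GRID_KG_CO2_PER_KWH = 0.4      # assumed grid carbon intensity; region-specific in practice

def energy_report(total_tokens):
    """Convert a token count into rough energy (kWh) and CO2 (kg) estimates."""
    kwh = total_tokens / 1000 * JOULES_PER_1K_TOKENS / 3_600_000  # joules -> kWh
    return {
        "tokens": total_tokens,
        "energy_kwh": round(kwh, 4),
        "co2_kg": round(kwh * GRID_KG_CO2_PER_KWH, 4),
    }

print(energy_report(5_000_000))  # e.g. an agent run's total token usage
```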