
SWE Context Bench: A Benchmark for Context Learning in Coding

Published 9 Feb 2026 in cs.SE and cs.AI | (2602.08316v1)

Abstract: LLMs are increasingly used as programming agents for repository level software engineering tasks. While recent benchmarks evaluate correctness in realistic codebases, they largely treat tasks as independent and do not assess whether agents can reuse experience across related problems. As a result, the ability of agents to accumulate, retrieve, and apply prior experience, as well as the efficiency gains from such reuse, remains difficult to measure. We introduce SWE-ContextBench, a benchmark designed to explicitly evaluate experience reuse in programming agents. Built on SWE-Bench Lite, SWE-ContextBench augments 300 base tasks with 99 related tasks derived from real dependency and reference relationships among GitHub issues and pull requests, forming task sequences with shared context. The benchmark evaluates agents along three complementary dimensions: prediction accuracy, time efficiency, and cost efficiency. Using SWE-ContextBench, we study multiple experience reuse settings, including oracle guided and autonomous retrieval, as well as full execution trajectories and compact summaries. Our results show that correctly selected summarized experience improves resolution accuracy and substantially reduces runtime and token cost, particularly on harder tasks. In contrast, unfiltered or incorrectly selected experience provides limited or negative benefits. These findings highlight the importance of experience representation and retrieval quality, and position SWE-ContextBench as a principled benchmark for studying experience reuse in programming agents.

Summary

  • The paper introduces SWE-ContextBench, evaluating programming agents on experience reuse across interconnected coding tasks.
  • It builds on the 300 base tasks of SWE-Bench Lite plus 99 related tasks mined from GitHub issue and pull request links, and measures prediction accuracy, time efficiency, and cost efficiency.
  • Results show that precise experience selection, particularly through Oracle Summary Reuse, significantly boosts task resolution and runtime performance.

Introduction

The paper "SWE Context Bench: A Benchmark for Context Learning in Coding" (2602.08316) introduces SWE-ContextBench, a benchmark for evaluating experience reuse in programming agents. With LLMs increasingly deployed on repository-level software engineering tasks, the benchmark assesses an agent's ability to accumulate, retrieve, and apply past experience. SWE-ContextBench addresses a gap in existing benchmarks by folding context reuse into performance evaluation: it builds on 300 base tasks from SWE-Bench Lite, augmented with 99 related tasks derived from real-world GitHub issue and pull request relationships. The paper examines the benefits and limitations of experience reuse, focusing on its effects on prediction accuracy, runtime efficiency, and computational cost (Figure 1).

Figure 1: Comparison with existing agent benchmarks and long-context benchmarks.

Benchmark Design

Motivation

The core motivation for SWE-ContextBench is a gap in existing code benchmarks: they do not measure the sequential learning and experience-reuse capabilities of programming agents. Despite advances in repository-level evaluation, benchmarks have historically targeted isolated tasks rather than interconnected task sequences. SWE-ContextBench is therefore designed to evaluate how prior work in a repository can provide context that helps solve subsequent tasks efficiently, mirroring how human engineers let past successes and failures inform future problem solving.

Desiderata: Benchmark Evaluation Criteria

A robust benchmark should evaluate task accuracy, time efficiency, and cost efficiency:

  1. Accuracy: Correct solutions must be produced, typically verified through human-validated fixes within GitHub repositories.
  2. Time Efficiency: Experience reuse should enable quicker task resolution by leveraging previously identified solutions.
  3. Cost Efficiency: Experience reuse should minimize redundant computational steps, ultimately reflected in lower token usage.

These dimensions push benchmarks beyond merely resolving problems correctly, acknowledging the value of cumulative learning across related tasks.
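
As a concrete illustration, the sketch below shows one way these three dimensions could be computed for a batch of agent runs. It is a minimal sketch under assumptions: the field names (fail_to_pass, pass_to_pass, wall_time_s, tokens_in, tokens_out) and the per-token prices are hypothetical, not values defined by the benchmark.

```python
from dataclasses import dataclass

@dataclass
class TaskRun:
    """One agent attempt on a benchmark instance (field names are illustrative)."""
    instance_id: str
    fail_to_pass: dict[str, bool]   # tests that must flip from failing to passing
    pass_to_pass: dict[str, bool]   # tests that must keep passing (no regressions)
    wall_time_s: float              # end-to-end runtime in seconds
    tokens_in: int                  # prompt tokens consumed
    tokens_out: int                 # completion tokens produced

def resolved(run: TaskRun) -> bool:
    """Accuracy: all FAIL_TO_PASS tests pass and no PASS_TO_PASS test regresses."""
    return all(run.fail_to_pass.values()) and all(run.pass_to_pass.values())

def report(runs: list[TaskRun], price_in: float = 3e-6, price_out: float = 15e-6) -> dict:
    """Aggregate the three benchmark dimensions over a set of runs.

    The per-token prices are placeholders, not values from the paper.
    """
    n = len(runs)
    cost = sum(r.tokens_in * price_in + r.tokens_out * price_out for r in runs)
    return {
        "resolution_rate": sum(resolved(r) for r in runs) / n,
        "mean_runtime_s": sum(r.wall_time_s for r in runs) / n,
        "mean_cost_usd": cost / n,
    }
```

A harness would compute these aggregates for a no-experience baseline and for each reuse setting, then compare them.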

Data Construction

Base Dataset and Experience Trajectories

SWE-Bench Lite comprises 300 instances drawn from GitHub repositories, each pairing an issue description with its resolving pull request. The experience trajectory for each task, capturing the agent's reasoning process and tool interactions, forms an accessible pool for subsequent task evaluations. To gauge experience reuse accurately, related tasks are identified through references among issues and pull requests, constructing sequences with shared context.

Related tasks in SWE-ContextBench emerge from explicit analysis of repository interactions. The benchmark encapsulates varied interdependencies, including multi-issue resolutions and interconnected pull request discussions. Recursive context expansion uncovers task relationships that extend beyond immediate references, enhancing the realism of task sequences.
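
As a hedged sketch of this construction step, the snippet below expands issue/PR cross-references recursively to collect related tasks. The references mapping and the max_depth bound are illustrative assumptions and do not reflect the authors' exact extraction pipeline.

```python
from collections import deque

def related_tasks(seed_issue: str,
                  references: dict[str, set[str]],
                  max_depth: int = 2) -> set[str]:
    """Recursively expand issue/PR cross-references to collect related tasks.

    `references` maps an issue or PR id to the ids it explicitly links to
    (e.g. "fixes #123", "see PR #456"), assumed to be pre-extracted from
    GitHub metadata.
    """
    seen = {seed_issue}
    frontier = deque([(seed_issue, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth >= max_depth:
            continue
        for neighbor in references.get(node, set()):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    seen.discard(seed_issue)   # return only the related tasks, not the seed
    return seen

# Toy example: issue A references B, and B's fix references C.
refs = {"A": {"B"}, "B": {"C"}, "C": set()}
print(related_tasks("A", refs))  # -> {'B', 'C'}
```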

Experimental Setup and Results

Experience Reuse Settings

The benchmark evaluates experience reuse settings ranging from no experience to oracle-guided selection. Notably, "Oracle Summary Reuse" yields the largest improvement in task resolution rates, underscoring the value of concise, correctly chosen prior experience. Conversely, unrestricted access to the experience pool does not consistently improve performance, highlighting the need for precise retrieval mechanisms.
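
To make the contrast between these settings concrete, here is a minimal sketch of how prior experience might be injected into an agent's prompt, preferring a compact summary and falling back to a truncated trajectory or to baseline solving. The prompt layout, the preference order, and the max_experience_chars cap are assumptions, not the paper's protocol.

```python
from typing import Optional

def build_prompt(issue_text: str,
                 summary: Optional[str] = None,
                 trajectory: Optional[str] = None,
                 max_experience_chars: int = 4000) -> str:
    """Assemble a task prompt under different experience-reuse settings.

    Preference order (an assumption for illustration):
    1. a compact summary of a related prior fix, if one was selected;
    2. a truncated raw trajectory, if only full logs are available;
    3. no experience at all (baseline solving).
    """
    parts = ["You are a repository-level coding agent.", "", "## Issue", issue_text]
    if summary:
        parts += ["", "## Summary of a related prior fix", summary]
    elif trajectory:
        parts += ["", "## Excerpt of a related prior trajectory",
                  trajectory[:max_experience_chars]]
    parts += ["", "Produce a patch that resolves the issue."]
    return "\n".join(parts)
```

Under the oracle settings the summary or trajectory would come from the ground-truth related task; under autonomous retrieval it would come from whatever the agent's retriever selects from the experience pool.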

Efficiency and Cost Analysis

Oracle Summary Reuse markedly reduces both runtime and cost, particularly for complex tasks, showcasing its practical advantages. This setting minimizes token usage by efficiently leveraging cached experiences, highlighting that compact summaries of prior solutions are more beneficial than comprehensive but potentially irrelevant traces.

Conclusion

SWE-ContextBench advances the evaluation of programming agents by incorporating experience-reuse metrics across well-defined evaluation dimensions. The empirical studies show that precise experience selection and representation are vital for improving task resolution rates, runtime efficiency, and token consumption. The benchmark lays the groundwork for future work on memory-augmented coding systems, guiding the development of agents that learn cumulatively and adaptively across complex software engineering codebases.

Practical Applications

Immediate Applications

Below is a concise set of practical, deployable use cases that leverage the paper’s findings on experience reuse, summary-based memory, and efficiency metrics.

  • Software/DevOps (industry): Pre-procurement benchmarking of code agents using SWE-ContextBench to compare accuracy, time, and cost
    • Tools/workflows: CI job that runs agents on the 99 related tasks; dashboard reporting “Resolved,” FAIL_TO_PASS, PASS_TO_PASS, runtime, and token cost
    • Assumptions/dependencies: Repositories have runnable test suites; reproducible environments; access to agent APIs and token accounting
  • Developer tooling (industry): VSCode/JetBrains plugin that records fix trajectories and auto-generates compact “Fix Cards” (summaries) for reuse on related issues
    • Tools/products: Experience pool manager; trajectory recorder; summary generator; retrieval panel that suggests relevant past fixes
    • Assumptions/dependencies: Teams allow logging of agent actions; storage for summaries; basic issue/PR linking in repos
  • Agent orchestration (software): Cost- and time-aware “memory gate” that prioritizes using concise summaries over full trajectories and includes a fallback to baseline solving
    • Tools/workflows: Policy engine that triggers summary-first prompts; token budget monitor; guardrails for unhelpful memory (disable misleading summaries)
    • Assumptions/dependencies: Reliable token usage metrics; configurable orchestration; stable API pricing
  • QA/Testing (software): Automated guardrails using FAIL_TO_PASS and PASS_TO_PASS as acceptance criteria for AI-generated patches
    • Tools/workflows: CI checks that require all FAIL_TO_PASS tests to pass and enforce no regressions on PASS_TO_PASS (a minimal check is sketched after this list)
    • Assumptions/dependencies: Comprehensive tests; patch application pipeline; clear separation of test vs solution patches
  • Open-source maintenance (industry/academia): GitHub Actions that mine and surface related issues/PRs to seed an experience pool and encourage better linkage for reuse
    • Tools/workflows: Reference-analyzer action; “related-issues” labels; auto-generated summary comments on merged PRs
    • Assumptions/dependencies: Project maintainers adopt templates and linking practices; access to GitHub metadata
  • AI platform/model cards (industry): Publishing efficiency-aware benchmark results (accuracy, time, cost) for coding agents
    • Products: Model cards that report SWE-ContextBench metrics and retrieval strategies tuned for summary-first reuse
    • Assumptions/dependencies: Benchmark license compatibility; vendor transparency; periodic re-runs for updates
  • Engineering management/Finance (industry): Token and runtime budgeting for agent workloads grounded in benchmark-derived efficiency profiles
    • Tools/workflows: Budget planners; cost dashboards; SLAs that include efficiency targets (e.g., max runtime per task)
    • Assumptions/dependencies: Predictable workloads; known pricing; alignment between engineering and finance
  • Documentation/Knowledge management (industry): Auto-generated “summary of fix” sections stored in a team knowledge base to accelerate future triage
    • Tools/products: Knowledge base integration; semantic search over summaries; “experience reuse” tags
    • Assumptions/dependencies: Adoption in PR templates; quality summarization; data retention policies
  • Security (industry): Minimizing broad context exposure by favoring concise, vetted summaries over full logs to reduce accidental leakage
    • Tools/workflows: Secret scanning on trajectories; retrieval gating policies; summary sanitization steps
    • Assumptions/dependencies: Security tooling integrated into CI; privacy constraints; access control on memory stores
  • Education (academia): Course labs using SWE-ContextBench to teach efficiency trade-offs in agent-assisted software engineering
    • Tools/workflows: Assignments contrasting baseline vs summary reuse; analysis of retrieval quality effects on outcomes
    • Assumptions/dependencies: Compute quotas; dataset access; reproducible environments
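
For the QA/Testing guardrail item above, the following sketch shows one way a CI step could enforce the FAIL_TO_PASS / PASS_TO_PASS acceptance criteria from a pytest JSON report (e.g. as produced by the pytest-json-report plugin). The file formats and schema keys used here are assumptions rather than a prescribed interface.

```python
import json
import sys

def gate(report_path: str, fail_to_pass: list[str], pass_to_pass: list[str]) -> int:
    """Return a nonzero exit code unless every FAIL_TO_PASS test now passes
    and no PASS_TO_PASS test has regressed.

    `report_path` points to a pytest JSON report; the schema keys below
    ("tests", "nodeid", "outcome") are an assumption about that format.
    """
    with open(report_path) as f:
        report = json.load(f)
    outcomes = {t["nodeid"]: t["outcome"] for t in report.get("tests", [])}

    missing_fixes = [t for t in fail_to_pass if outcomes.get(t) != "passed"]
    regressions = [t for t in pass_to_pass if outcomes.get(t) != "passed"]

    if missing_fixes:
        print("FAIL_TO_PASS tests still failing:", missing_fixes)
    if regressions:
        print("PASS_TO_PASS regressions:", regressions)
    return 1 if (missing_fixes or regressions) else 0

if __name__ == "__main__":
    # Illustrative usage: python gate.py report.json f2p.txt p2p.txt
    report_file, f2p_file, p2p_file = sys.argv[1:4]
    f2p = open(f2p_file).read().split()
    p2p = open(p2p_file).read().split()
    sys.exit(gate(report_file, f2p, p2p))
```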

Long-Term Applications

Below are forward-looking use cases that build on the benchmark’s insights and require additional research, scaling, or ecosystem development.

  • Production memory OS for agents (software): Organization-wide “Agent Knowledge Graph” that mines relations among issues/PRs and serves high-quality, compact experience to agents
    • Tools/products: MemOS-like memory supervisor; graph-based retrieval; continual summarization and deduplication
    • Assumptions/dependencies: Scalable storage/compute; robust entity/linking; governance around data provenance and access
  • Domain-adapted experience-reuse benchmarks (robotics/healthcare/finance/energy): SWE-ContextBench-style datasets for sequential, related tasks (e.g., maintenance logs, clinical triage, incident resolution)
    • Products: ContextBench variants with validated outcomes beyond code; sector-specific test harnesses
    • Assumptions/dependencies: High-quality, privacy-compliant datasets; standardized evaluation protocols; domain experts for curation
  • Industry standards and certifications (policy/industry): Efficiency-aware benchmarks adopted in procurement and compliance to ensure responsible agent deployment
    • Tools/workflows: Consortium-backed specs for accuracy/time/cost reporting; reproducibility requirements; audit trails
    • Assumptions/dependencies: Multi-stakeholder participation; regulatory alignment; shared reference implementations
  • Learned retrieval/ranking approximating oracle selection (software/ML): Models that predict the “right summary” to present, trained on task relationship signals and resolution outcomes
    • Tools/products: Retrieval models with efficiency rewards; confidence calibration; drift monitoring
    • Assumptions/dependencies: Labels for relatedness; safe RL policies; continuous evaluation
  • Knowledge marketplaces (industry): Secure exchange of anonymized “experience summaries” between companies/teams to accelerate fixes across similar stacks
    • Tools/products: Licensing and anonymization frameworks; federated search; IP-safe sharing protocols
    • Assumptions/dependencies: Legal agreements; privacy tech; standardized summary schemas
  • Continuous-learning agents with memory management (software): Agents that accumulate, compress, and selectively forget experience to stay efficient on large, evolving codebases
    • Tools/workflows: Compression policies; usefulness scoring; automatic pruning and refresh strategies
    • Assumptions/dependencies: Stable criteria for “useful experience”; robust telemetry; safe forgetting mechanisms
  • Repository hygiene and refactoring for reuse (software): Automated suggestions to restructure issues/PRs and tests to maximize reuse potential and evaluability
    • Tools/products: Reference-mining bots; linkage recommendations; test coverage diagnostics tied to FAIL_TO_PASS/PASS_TO_PASS distribution
    • Assumptions/dependencies: Developer buy-in; minimal disruption to existing workflows; support from platform APIs
  • Sustainability reporting (energy/policy): Efficiency-focused agent design tied to compute/energy metrics, enabling greener AI development practices
    • Tools/workflows: Energy-to-token cost mapping; sustainability dashboards; efficiency targets in OKRs
    • Assumptions/dependencies: Reliable conversion of compute to energy; organizational commitment to ESG goals
  • Financial planning for AI operations (finance/industry): Dynamic budget allocation models using efficiency metrics to forecast agent operating expenses and ROI
    • Tools/products: Scenario planners; cost simulators; efficiency SLAs
    • Assumptions/dependencies: Stable pricing; observability over workloads; integrated finance/engineering processes
  • Standardized curricula and assessments (academia/education): Programs emphasizing experience reuse strategies, retrieval quality, and efficiency metrics in AI+SE education
    • Tools/workflows: Benchmark-aligned assignments; capstone projects building memory-augmented agents; shared evaluation rubrics
    • Assumptions/dependencies: Broad adoption of datasets/benchmarks; institutional support; accessible compute resources
