CursorBench: Realistic Code-Generation Benchmark

Updated 28 March 2026
  • CursorBench is an evolving evaluation suite for agentic code generation, built from authentic Cursor sessions and covering real-world tasks such as debugging and refactoring.
  • Tasks are mined from archived interactive sessions and replayed in isolated, production-like VMs to ensure realistic, end-to-end performance measurement.
  • Benchmark results show significant improvements in frontier coding agents such as Composer 2, highlighting long-horizon planning and efficient toolchain execution.

CursorBench is an internal, continuously evolving evaluation suite designed for agentic code generation and software engineering. Derived directly from authentic Cursor sessions conducted by an engineering team, CursorBench provides realistic, open-ended software-engineering tasks rooted in complex, large-scale codebases. Its primary purpose is to address the shortcomings of existing public benchmarks by representing the diverse, long-horizon problems encountered in production environments, including debugging, large-scale refactorings, and ambiguous bug reports. This benchmark plays a central role in the development and assessment of frontier coding agents, notably Composer 2 (Research et al., 25 Mar 2026).

1. Design Principles and Motivations

CursorBench was conceived to fill critical gaps in existing code-generation benchmarks. While prior public benchmarks often focus on narrowly scoped or contrived tasks, such as isolated bug fixes or synthetic puzzles, CursorBench encompasses tasks that typify real-world software engineering: debugging production failures, diagnosing build-tool edge cases, executing large-scale refactorings, adding major features, tuning performance, monitoring long-running experiments, performing complex data analysis, and resolving version-control conflicts. The tasks span a diversity of environments (backend monorepos, frontend frameworks, data pipelines, machine learning infrastructure, and observability tools) and incorporate real user artifacts such as terse bug tickets, observability logs, and partial error stacks. The guiding philosophy is to measure not just isolated code manipulation but agents' capacity to make decisions and execute sequences of actions under authentic conditions (Research et al., 25 Mar 2026).

2. Dataset Structure and Task Composition

Tasks in CursorBench are selected by mining archived interactive Cursor sessions. Each task is replayed using the same Cursor harness employed for RL training and model deployment, ensuring a high-fidelity match between training and evaluation environments. Execution occurs within isolated Anyrun virtual machines (VMs) provisioned with the production toolchain stack, including capabilities such as file I/O, shell commands, grep/semantic search, and web search. Prompts are intentionally terse (median 390 characters) and are accompanied by any directly linked logs or code pointers. Rather than prescribing narrow solutions, agents are permitted to devise any valid architectural approach.
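
As a concrete illustration, a mined task of this kind might be represented roughly as follows. This is a hypothetical sketch, since the internal CursorBench schema is not published; all field names (`prompt`, `vm_snapshot`, `linked_artifacts`, `verification`) and values are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class CursorBenchTask:
    """Hypothetical record for one mined Cursor session replayed as a benchmark task."""
    task_id: str
    prompt: str                       # terse user request (median ~390 characters)
    repo: str                         # codebase the original session was recorded against
    vm_snapshot: str                  # Anyrun VM image provisioned with the production toolchain
    linked_artifacts: list = field(default_factory=list)   # logs, stack traces, code pointers
    allowed_tools: tuple = ("read_file", "edit_file", "shell", "grep",
                            "semantic_search", "web_search")
    verification: str = "test_suite"  # or "golden_diff"


# Example instance in the spirit of the bug-fix tasks listed below
example_task = CursorBenchTask(
    task_id="cb3-0042",
    prompt="Build fails after the esbuild upgrade; fix the down-leveling error in the web bundle.",
    repo="backend-monorepo",
    vm_snapshot="anyrun/monorepo-2026-03",
    linked_artifacts=["logs/esbuild-error.txt"],
)
```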

The problem diversity includes:

  • Bug fixes: e.g., diagnosing runtime or build transpilation errors, such as esbuild down-leveling issues.
  • Feature additions: e.g., implementing new API endpoints or extending UI flows.
  • Refactorings: e.g., renaming classes across hundreds of files or migrating entire codebases to newer library versions.
  • Heuristic and data-analysis tasks: e.g., developing and tuning scripts over extensive log datasets.
  • Test-writing: creating or adapting comprehensive test suites.

Comparative statistics highlight the suite’s realism and difficulty:

Benchmark                 Median Lines Changed    Median Prompt Length (chars)
CursorBench-3             181                     390
SWE-bench Verified        7–10                    1,185–3,055
SWE-bench Multilingual    7–10                    1,185–3,055

Across iterations (CursorBench-1 to CursorBench-3), both the number of files changed and lines modified per task have more than doubled, indicating increasing challenge and scope (Research et al., 25 Mar 2026).

3. Evaluation Protocol and Metrics

Each agent is assessed end-to-end in a live Anyrun VM, executing a sequence of tool calls $(a_1, \ldots, a_t)$ which produce a corresponding sequence of environment states $(y_1, \ldots, y_t)$, culminating in a series of codebase modifications. The agent’s output is validated by executing the project’s test suite or comparing modifications against a golden diff. To ensure statistical reliability, each evaluation is repeated $R$ times, with accuracy reported as:

$$\text{Accuracy} = 100 \cdot \frac{1}{R\,|T|} \sum_{r=1}^{R} \sum_{t \in T} \mathbb{1}\left[\text{solution}_{r,t}\ \text{correct}\right]$$

Additional evaluation dimensions include:

  • Completion tokens: measuring inference efficiency.
  • End-to-end latency: wall-clock duration from task initiation to completion.
  • Inference cost per task: measured in dollars per evaluation.
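
A minimal sketch, assuming per-run result records with the fields shown, of how the repeated-run accuracy formula above and these auxiliary metrics could be aggregated; it is illustrative only, not the internal CursorBench evaluation code.

```python
from statistics import mean


def aggregate(results):
    """results: one dict per (run r, task t) pair, with keys 'correct' (bool),
    'completion_tokens', 'latency_s', and 'cost_usd'. Assumes every task in T
    appears in each of the R repeated runs, so len(results) == R * |T|."""
    n = len(results)
    return {
        # 100 * (1 / (R|T|)) * sum of correctness indicators, as in the formula above
        "accuracy_pct": 100.0 * sum(r["correct"] for r in results) / n,
        "mean_completion_tokens": mean(r["completion_tokens"] for r in results),
        "mean_latency_s": mean(r["latency_s"] for r in results),
        "mean_cost_usd": mean(r["cost_usd"] for r in results),
    }
```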

Rigorous environment controls are maintained: identical VM snapshots, tool versions, system prompt templates, and RPC services (semantic search, shadow Cursor-backend) are used in both training and evaluation to eliminate train/test mismatch and ensure fair tool interoperability (Research et al., 25 Mar 2026).
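
One way such controls can be made explicit is a single pinned-environment manifest consumed by both the RL training harness and the evaluator; the keys and values below are illustrative assumptions, not Cursor’s actual configuration.

```python
# Hypothetical manifest; real snapshot identifiers and service endpoints are not public.
PINNED_ENVIRONMENT = {
    "vm_snapshot": "anyrun/monorepo-2026-03",        # identical image for training and evaluation
    "tool_versions": {"node": "22.x", "python": "3.12", "ripgrep": "14.1"},
    "system_prompt_template": "cursor-harness-v3",
    "rpc_services": {
        "semantic_search": "grpc://semantic-search.internal",
        "shadow_cursor_backend": "grpc://cursor-backend.shadow.internal",
    },
}
```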

4. Benchmark Results and Model Performance

Composer 2’s performance on CursorBench demonstrates significant advances in coding intelligence and planning at scale. On CursorBench-3, Composer 2 achieves 61.3% accuracy, representing a 37% improvement over Composer 1.5 (44.2%) and a 61% improvement over Composer 1 (38.0%). Comparative performance for top models is summarized below:

Model                     CursorBench-3 Accuracy (%)
GPT-5.4                   63.9
Composer 2                61.3
Opus 4.6 (high-effort)    58.2
Composer 1.5              44.2
Composer 1                38.0

Composer 2 achieves a strong position on the cost–accuracy Pareto frontier, attaining token counts typical of other frontier systems but with inference costs similar to more compact models.

Factors driving model performance include:

  • Long-horizon planning via self-summarization chains: enables context retention and strategic chunking over extended tool call sequences.
  • Multi-step execution coherence: enforced using RL rewards that penalize unnecessary code churn and reward minimal, correct diffs.
  • Behavioral rewards: for example, a nonlinear length penalty of the form

$$C_{\text{length}}(x) = \frac{(1 + kx)^{1-q} - 1}{k(1-q)}$$

which promotes solution succinctness on easy problems while accommodating deeper reasoning for complex cases (Research et al., 25 Mar 2026).
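
Transcribing the penalty into code makes its shape easy to check; the parameter values below (k, q) are arbitrary assumptions chosen for illustration, not those used to train Composer 2.

```python
def length_penalty(x: float, k: float = 0.01, q: float = 1.5) -> float:
    """C_length(x) = ((1 + k*x)**(1 - q) - 1) / (k * (1 - q)), valid for q != 1.
    The slope is ~1 near x = 0 and shrinks as x grows (for q > 1), so short
    answers on easy problems incur little penalty while very long trajectories
    are not penalized in proportion to their full length."""
    assert q != 1.0, "at q = 1 the closed form degenerates to log(1 + k*x) / k"
    return ((1.0 + k * x) ** (1.0 - q) - 1.0) / (k * (1.0 - q))


# Example: the penalty grows with length but with diminishing increments
for tokens in (100, 1_000, 10_000):
    print(tokens, round(length_penalty(tokens), 1))
```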

5. Benchmarking Philosophy and Research Implications

CursorBench addresses several key evaluation needs within agentic code generation. By prioritizing underspecified prompts, multi-file patches, and genuine debugging and feature development, it provides a highly realistic, end-to-end assessment environment. Tasks are uncontaminated by public datasets, as they are sourced from closed, internal engineering workflows, ensuring novel challenge and evaluation integrity. CursorBench is thus positioned as a contamination-free benchmark, valuable for probing both generative coding and broader agentic capabilities—requiring navigation of toolchains and interpretation of ambiguous developer intent (Research et al., 25 Mar 2026).

An important design feature is the minimal specification of solutions: agents are evaluated on their ability to autonomously synthesize plausible end-to-end approaches rather than solve artificially narrow or over-instructed problems.

6. Limitations, Evolution, and Prospective Directions

Despite its strengths, CursorBench remains proprietary and inaccessible to the broader public. However, the underlying methodology—mining authentic development sessions, replaying them in high-fidelity environments, and evaluating end-to-end agent behavior—can be replicated on large open-source monorepos.

CursorBench is designed for continuous evolution: as developer practices adapt (e.g., increasing prevalence of data analysis tasks or long-running machine learning jobs), new task types and codebases are incorporated. Planned extensions include expanded task horizon (multi-day or multi-agent orchestration), aiming to evaluate sustained correctness over thousands of agent actions (Research et al., 25 Mar 2026).

A plausible implication is that as agentic systems mature, benchmarks such as CursorBench will serve as the foundational testbeds for measuring real-world readiness—beyond what is possible with current public alternatives.
