RExBench: Benchmark for LLM Code Extensions
- RExBench is a benchmark suite for assessing whether LLM-based coding agents can execute complex, nontrivial research extensions, using unpublished, expert-written tasks.
- It comprises 12 real-world extension challenges derived from AI/ML research that require agents to understand papers and modify codebases, with the resulting experiments validated automatically.
- The suite reveals current limitations in agent planning and comprehension, emphasizing the need for improved error diagnosis and multi-step problem solving in research automation.
RExBench is a benchmark suite designed to measure the capability of LLM-based coding agents to autonomously implement realistic research extensions. Its purpose is to systematically evaluate progress toward LLM agents that can perform nontrivial extensions to existing research codebases, a critical function for automating the research pipeline in machine learning and related scientific domains. Unlike prior benchmarks focused on code repair or isolated algorithmic challenges, RExBench centers on the end-to-end process of understanding a research paper, interpreting expert instructions for novel hypothesis-driven extensions, and executing complex codebase modifications that result in fully validated new experiments.
1. Definition and Construction
RExBench comprises 12 extension tasks derived from published AI/ML research papers, each accompanied by a codebase and a domain-expert-written instruction describing a novel, never-before-implemented experiment or research hypothesis. The benchmark is explicitly robust to data contamination: all extension tasks and their reference solutions are unpublished and absent from public web data, so agent outputs cannot simply reproduce material seen during pretraining.
For each task, the agent receives the original paper (typically in PDF or Markdown), the associated codebase, and the extension instruction. The agent then outputs a patch file—a set of code edits—to implement the extension. An automatic evaluation infrastructure applies this patch in a clean environment and runs the resulting pipeline, comparing outputs against expert reference solutions to assess success. The evaluation criteria require high fidelity: for deterministic experiments, output must match exactly; for stochastic pipelines, qualitative and quantitative agreement is enforced by running experiments with seed-controlled, expert-generated references.
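A minimal sketch of this apply-run-compare loop is shown below. The entry point (run_experiment.sh), the output file (results.json), and the comparison logic are illustrative assumptions rather than the actual RExBench harness.

```python
import json
import subprocess
from pathlib import Path

def evaluate_patch(task_dir: Path, patch_file: Path, reference: Path,
                   deterministic: bool, tolerance: float = 1e-6) -> dict:
    """Apply an agent patch in a clean copy of the task codebase, run the
    experiment pipeline, and compare its output against the expert reference."""
    # Apply the agent's patch; a failed apply counts as an execution failure.
    applied = subprocess.run(["git", "apply", str(patch_file)],
                             cwd=task_dir, capture_output=True)
    if applied.returncode != 0:
        return {"execution_success": False, "final_success": False}

    # Run the task's experiment pipeline (entry point assumed here).
    run = subprocess.run(["bash", "run_experiment.sh"],
                         cwd=task_dir, capture_output=True)
    if run.returncode != 0:
        return {"execution_success": False, "final_success": False}

    # Deterministic tasks require exact agreement; seed-controlled stochastic
    # tasks are compared within a numeric tolerance (scalar metrics assumed).
    produced = json.loads((task_dir / "results.json").read_text())
    expected = json.loads(reference.read_text())
    if deterministic:
        final = produced == expected
    else:
        final = all(abs(produced[k] - expected[k]) <= tolerance for k in expected)
    return {"execution_success": True, "final_success": final}
```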
2. Design of Tasks and Evaluation Protocol
The suite’s 12 tasks span a representative set of extension scenarios across contemporary AI/ML research, including but not limited to:
- Model and algorithmic modifications (e.g., changing architectures, adding new training objectives),
- Data and preprocessing changes (e.g., altering datasets in linguistically or empirically controlled ways),
- Experimental setup and evaluation procedure adjustments (e.g., implementing alternative metrics or ablations).
One example task requires the agent to modify a linguistics dataset so that novel-word replacements are actual English words matched for part of speech and frequency, regenerate the altered data, rerun the experiments, and wrap the whole process in a single script, all while adhering to the strict output conventions dictated by the instruction (a simplified sketch of the replacement logic follows).
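The sketch below illustrates the kind of POS- and frequency-matched replacement logic such a task calls for; the vocabulary format, the frequency banding, and the helper names are assumptions for illustration, not the task's reference solution.

```python
import random
from collections import defaultdict

def build_replacement_pool(vocab):
    """Group candidate English words by (part of speech, frequency band).

    `vocab` is assumed to be an iterable of (word, pos_tag, corpus_frequency)
    triples taken from some frequency lexicon.
    """
    pool = defaultdict(list)
    for word, pos, freq in vocab:
        band = int(freq).bit_length()  # coarse log-scale frequency band
        pool[(pos, band)].append(word)
    return pool

def replace_word(original, pos, freq, pool, rng=random):
    """Pick a real English word matched for POS and frequency band."""
    band = int(freq).bit_length()
    candidates = [w for w in pool.get((pos, band), []) if w != original]
    return rng.choice(candidates) if candidates else original
```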
Evaluation reports a complementary set of metrics:
- Final Success Rate: the proportion of tasks for which the agent's patch produces output matching the reference (exactly for deterministic experiments, or within the seed-controlled tolerance for stochastic ones).
- Execution Success Rate: the proportion of tasks for which the edited code runs end-to-end without errors.
- File Recall: the fraction of files modified in the gold-standard reference that the agent also edits, defined as File Recall = |F_ref ∩ F_agent| / |F_ref|, where F_ref is the set of files changed in the reference solution and F_agent the set of files changed by the agent (a minimal computation is sketched after this list).
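A minimal sketch of the file-recall computation, assuming that the touched files can be read from the unified-diff headers of the reference and agent patches:

```python
import re

def changed_files(patch_text: str) -> set[str]:
    """Extract the set of file paths touched by a unified diff."""
    # Matches headers of the form: "diff --git a/src/model.py b/src/model.py"
    return set(re.findall(r"^diff --git a/(\S+) b/\S+", patch_text, flags=re.M))

def file_recall(reference_patch: str, agent_patch: str) -> float:
    """Fraction of reference-modified files that the agent also edited."""
    ref_files = changed_files(reference_patch)
    agent_files = changed_files(agent_patch)
    return len(ref_files & agent_files) / len(ref_files) if ref_files else 0.0
```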
All evaluations are performed in a containerized environment, with tasks allocated resources such as A100 GPUs and driven by automated scripts (e.g., run_apptainer.sh). Reference solutions are never made public, preserving evaluation integrity.
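A batch driver for the containerized runs might look like the sketch below; the command-line interface assumed for run_apptainer.sh (task name, patch path, output directory) and the task identifiers are illustrative assumptions.

```python
import subprocess
from pathlib import Path

TASKS = ["task_01", "task_02"]  # placeholder task identifiers

def run_all(patch_root: Path, results_root: Path) -> None:
    """Launch the containerized evaluation for each task's agent patch."""
    for task in TASKS:
        patch = patch_root / f"{task}.patch"
        out_dir = results_root / task
        out_dir.mkdir(parents=True, exist_ok=True)
        # Assumed interface: task name, patch path, output directory.
        subprocess.run(["bash", "run_apptainer.sh", task, str(patch), str(out_dir)],
                       check=False)
```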
3. Agent Benchmarks and Outcomes
Nine LLM agent configurations, spanning three frameworks (aider, Claude Code, OpenHands) and multiple model backbones (Claude 3.7 Sonnet, OpenAI o1, o4-mini, DeepSeek-R1), were systematically assessed on RExBench. Each agent was required to autonomously produce a patch implementing each of the 12 tasks.
Observed performances were as follows:
- The highest final success rate for any agent configuration was 25%, achieved by both Claude Code and OpenHands when paired with Claude 3.7 Sonnet.
- Most agent/model pairs, particularly those based on OpenAI o1 and DeepSeek-R1, yielded close to 0% success.
- Execution success (the patched code runs without error) exceeded final success, indicating that code which runs is frequently semantically incorrect.
- File recall was generally high, suggesting that the agents could often localize the correct files even when their edits failed at the logic level.
The inclusion of human-written hints (at both the information-localization and detailed step-by-step levels) improved performance across agents, with the best configuration reaching 39% success. However, the gains saturated quickly, and overly detailed hints sometimes led to performance declines, suggesting bottlenecks in agent comprehension or in adapting to guidance.
| Agent | Best Success (%) | Backbone | With Hints (%) |
|---|---|---|---|
| aider | 14 | Claude 3.7 Sonnet | Up to 17 |
| Claude Code | 25 | Claude 3.7 Sonnet | Up to 28 |
| OpenHands | 25 | Claude 3.7 Sonnet | Up to 39 |
| ... | ... | o1, o4-mini, DeepSeek-R1 | ~0–8 |
4. Sources of Difficulty and Error Analysis
A salient finding of RExBench is the diverse range of failure modes exhibited by current coding agents, classified as follows:
- Complex Codebase Understanding: Tasks frequently require nuanced global reasoning about codebase structure, data flow, and integration, beyond the scope of local modifications.
- Planning and Decomposition: Agents often attempt one-shot fixes instead of modular, staged changes, leading to incomplete or non-functional solutions.
- Execution Errors: Common Python and programming errors (attribute errors, import problems, file-not-found, syntax errors) and non-operative (empty) code edits were prevalent.
- Implicit (Logic) Errors: The most difficult category consists of patches that run successfully but yield the wrong scientific output due to subtle bugs, incorrect hyperparameters, or misaligned logic.
- Hint Utilization Difficulties: Hints did not increase success linearly; some agents failed to use granular advice effectively, possibly due to conflicts between the frameworks' planning routines and the LLMs' own reasoning.
The primary statistical predictor of agent failure was the number of necessary line changes in the gold-standard reference, rather than codebase size, popularity, or field of the underlying research.
5. Broader Implications and Future Development
RExBench provides strong evidence that, as of 2025, LLM-based coding agents are markedly limited in their capacity to autonomously implement nontrivial research extensions, even with moderate human scaffolding. Outright success is rare, and patches that run but are silently incorrect present a significant undetected risk for scientific automation.
A plausible implication is the need for advances in agent planning and decomposition, improved comprehension of experimental context, and better integration of guidance and tool-use capabilities. The high prevalence of plausible but incorrect solutions highlights risks to the reproducibility and validity of AI-accelerated research pipelines.
Recognized directions for development include:
- Expansion to broader research domains and more complex multi-stage extensions.
- Introduction of intermediate, process-level assessments and external checking tools for error localization and diagnosis.
- Community curation of new benchmarks and systematic evaluation of both agent and human-augmented workflows.
- Continued balancing of benchmark realism (complexity and openness of tasks) with automatable and reproducible assessment methods.
- Consideration of societal impacts, particularly in error detection and mitigation, prior to possible deployment in production research environments.
6. Summary
RExBench systematically measures the ability of LLM-driven coding agents to autonomously implement research experiment extensions by requiring the integration of novel, unpublished experimental logic into real codebases under rigorous, contamination-robust, and fully automated evaluation. Results to date indicate that while current agents can often search code and localize the relevant files, they cannot reliably execute the multi-step, logic-rich modifications required for authentic research progress, even with moderate human guidance. RExBench thus establishes a baseline for progress toward autonomous research agents and serves as a controlled platform for tracking community advancement in this critical area.