ResearchCodeBench: ML Code Benchmark
- ResearchCodeBench is a benchmark that evaluates LLMs' ability to translate novel machine learning research from recent papers into correct, executable code.
- It employs deterministic, execution-based testing protocols in standardized Docker/Conda environments to ensure reproducibility and measure metrics like Pass@1 and Scaled Pass Rate.
- Its extensible framework, built on contributions from top conferences, allows continuous updates and community submissions to track evolving ML research innovations.
ResearchCodeBench is a community-driven benchmark designed to rigorously assess LLMs on their ability to implement code for novel machine learning research contributions, specifically those described in recent research papers and thus unseen during the models’ pretraining. This evaluation framework emphasizes deterministic, execution-based correctness for core conceptual code in contemporary scientific literature, directly addressing the limitations of prior benchmarks that have mostly focused on well-known algorithms or engineering tasks. It offers a scalable methodology for extracting and unit-testing central innovations in ML research code, explicitly controlling for data contamination, and is extensible with continuous updates as new papers appear and new model versions are released (Hua et al., 2 Jun 2025).
1. Motivation and Distinctive Goals
ResearchCodeBench shifts the evaluation paradigm from standard code completion benchmarks (e.g., HumanEval, MBPP, SWE-bench, BigCodeBench) to a focus on translating new scientific ideas into executable code. The motivation is twofold:
- Research code generation differs fundamentally from vanilla code synthesis: it requires scientific reasoning and faithful translation of novel concepts, which are often described only in the text and equations of recent papers and remain unseen by any model at the time of evaluation.
- Prior “subjective” correctness metrics (peer-review style, or LLM-based judging) are inconsistent with functional code correctness, especially in cutting-edge scientific contexts. ResearchCodeBench thus enforces evaluation strictly through executable tests.
This approach fills notable gaps: Many previous benchmarks either verify existing canonical codebases or test implementations of standard, likely memorized algorithms rather than the core new mechanisms at the heart of current ML research. ResearchCodeBench isolates and tests the most innovative portions of recent contributions, ensuring evaluation on truly out-of-distribution material (Hua et al., 2 Jun 2025).
2. Benchmark Construction and Task Taxonomy
ResearchCodeBench is constructed around contributions from 20 highly impactful papers published in NeurIPS, ICLR, CVPR, and arXiv from 2024–2025, explicitly targeting state-of-the-art generative models, optimization procedures, loss functions, and sampling algorithms. Annotators extract the core scientific code from each repository and design fill-in-the-blank snippets focused on these novel mechanisms. The dataset comprises 212 independently unit-tested code challenges.
Challenges are not formally bucketed, but fall into these empirically derived categories:
- Architecture components (e.g., Transformer block variants, hypernetworks).
- Algorithmic updates (e.g., min-max steps, custom optimization routines).
- Loss computation and gradient mechanisms (e.g., diffusion/objective innovations).
- Data pipelines and samplers (e.g., Gumbel or min-p sample selectors).
- Analysis and mathematical theory code (e.g., fixed-point initializations, isometry checking).
Each snippet is delimited by explicit XML-style comment tags (e.g., # <ResearchCodeBench hint> ... # </ResearchCodeBench hint>), with the enclosed code masked and a corresponding natural-language hint provided for LLM prompting. The surrounding context for each snippet is kept minimal: model inputs include only the corresponding paper's text, the relevant code context, and the masked block (Hua et al., 2 Jun 2025).
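The listing below is an illustrative annotated snippet in the spirit of this format, using a min-p style token filter as the core mechanism; the function, its body, and the hint text are hypothetical examples rather than material from the benchmark, and the code between the tags is what would be masked when prompting a model.

```python
import torch

def min_p_filter(logits: torch.Tensor, p_base: float = 0.1) -> torch.Tensor:
    """Mask out tokens whose probability falls below p_base times the top token's probability."""
    probs = torch.softmax(logits, dim=-1)
    # <ResearchCodeBench hint> Scale the base threshold by the maximum token probability,
    # then set logits of tokens below that threshold to -inf. (This region is blanked
    # out in the prompt; the comment stands in for the natural-language hint.)
    threshold = p_base * probs.max(dim=-1, keepdim=True).values
    logits = logits.masked_fill(probs < threshold, float("-inf"))
    # </ResearchCodeBench hint>
    return logits
```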
3. Evaluation Protocol: Functional Validation
A deterministic, execution-based testing pipeline is central:
- All codebases and custom tests are run in a uniform Docker or Conda environment without a GPU.
- Test harnesses employ unit tests and functional equivalence checks, with random seeds fixed for reproducibility.
- Each model receives a prompt comprising the paper's full text extracted from the PDF (≈30,000 tokens), the surrounding code context (≈20,000 tokens), and the masked code snippet.
- Functional correctness is measured via unit testing: a snippet passes only if it produces the correct outputs on every test (a minimal harness sketch follows this list).
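A harness in this spirit might look as follows; the module layout, function names (load_candidate, core_function), seed value, and tolerance are assumptions for illustration, not the benchmark's actual scripts.

```python
import importlib.util
import random

import numpy as np
import torch


def load_candidate(path: str, module_name: str = "candidate"):
    """Dynamically load a model-completed snippet file as a Python module."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


def run_snippet_tests(candidate_path: str, reference_module, test_inputs) -> bool:
    """Return True iff the candidate matches the reference on every test case."""
    # Fix all random seeds so candidate and reference see identical inputs.
    random.seed(0)
    np.random.seed(0)
    torch.manual_seed(0)

    try:
        candidate = load_candidate(candidate_path)
        for x in test_inputs:
            expected = reference_module.core_function(x)
            actual = candidate.core_function(x)
            # Functional-equivalence check with a numeric tolerance.
            if not torch.allclose(expected, actual, atol=1e-6):
                return False
    except Exception:
        # Any runtime error (NameError, TypeError, import failure, ...) counts as a failure.
        return False
    return True
```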
Primary metrics are:
- Pass rate: fraction of code snippets passed for a model.
- Scaled Pass Rate: pass rate in which each snippet is weighted by the number of lines of code it requires, so longer completions contribute proportionally more.
Both metrics are computed with greedy decoding (Pass@1); no sampling-based Pass@k estimation is used (Hua et al., 2 Jun 2025).
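A plausible formalization of the line-weighted metric, consistent with the description above (the paper's exact normalization may differ), is:

$$\text{Scaled Pass Rate} \;=\; \frac{\sum_{i=1}^{N} \ell_i \,\mathbb{1}\!\left[\text{snippet } i \text{ passes all tests}\right]}{\sum_{i=1}^{N} \ell_i},$$

where $N$ is the number of snippets and $\ell_i$ is the number of masked lines in snippet $i$; setting $\ell_i = 1$ recovers the unweighted pass rate.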
4. Contamination Analysis and Model Selection
To decontaminate the evaluation, every paper’s repository commit history is compared to the pretraining cutoff date for each evaluated model. A subset of 13 papers with first commits strictly after all model cutoffs allows for robust, contamination-safe evaluation. Over 30 models (Google Gemini-2.5-Pro, OpenAI GPT-4.1, O3 High, O4-mini High, Claude 3.5 Sonnet, Meta LLaMA, etc.) are evaluated, both proprietary and open-source.
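The decontamination check itself reduces to a date comparison; the sketch below illustrates the idea with hypothetical paper and cutoff dates (the real values come from repository commit histories and vendor documentation).

```python
from datetime import date

# Hypothetical first-commit dates for benchmark papers (real values come from git history).
first_commit = {
    "paper_a": date(2025, 1, 15),
    "paper_b": date(2024, 9, 3),
}

# Hypothetical pretraining cutoffs for evaluated models (real values from vendor documentation).
model_cutoff = {
    "model_x": date(2024, 12, 1),
    "model_y": date(2024, 6, 1),
}

# A paper is contamination-safe only if its first commit postdates every model's cutoff.
latest_cutoff = max(model_cutoff.values())
safe_papers = [paper for paper, first in first_commit.items() if first > latest_cutoff]
print(safe_papers)  # ['paper_a'] under these hypothetical dates
```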
Relative rankings remain stable on the uncontaminated subset, and pass rates drop by only 3–8 percentage points, indicating that contamination has at most a modest effect. Top models rely heavily on the textual paper context: pass rates for Gemini and O3 rise by 20–30 absolute percentage points when the paper text is included in the prompt, while smaller models show much less benefit (Hua et al., 2 Jun 2025).
5. Empirical Performance Results
The Scaled Pass@1 results underscore persistent difficulty:
| Model | Scaled Pass@1 (%) |
|---|---|
| Gemini-2.5-Pro-Preview | 37.3 |
| O3 (High) | 32.3 |
| O4-mini (High) | 30.8 |
| GPT-4.1 | ~28 |
| Claude 3.5 Sonnet | ~25 |
| LLaMA-2-70B (open) | ~15 |
By category, observed success rates are low across the board: ~22% for architecture components, ~29% for optimization, ~31% for loss/gradient mechanisms, ~20% for samplers/pipelines, and ~25% for theory routines.
Ablations show that high-end models depend heavily on the long paper context, whereas it has negligible impact for small LLaMA-class models (Hua et al., 2 Jun 2025).
6. Error Modes and Analysis
Observed error distribution:
- Functional (semantic) errors: 58.6%
- NameError: 8.7%
- SyntaxError/Indentation: 8.4%
- TypeError: 8.1%
- ImportError/ModuleNotFound: 6.9%
- AttributeError: 6.3%
- IndexError/KeyError: 2.3%
Semantic failures dominate (e.g., incorrect algorithmic implementations, wrong choices of activation or mathematical operation). Syntactic errors (indentation, missing imports, undefined names) form a substantial but secondary class. These results indicate that present-day LLMs fail more on deep scientific reasoning and mathematical fidelity than on surface syntax (Hua et al., 2 Jun 2025).
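A taxonomy of this kind can be derived mechanically from harness failures; the sketch below shows one way to bucket a snippet run by exception type versus semantic (wrong-output) error, with run_candidate and check_output as hypothetical harness hooks.

```python
def classify_failure(run_candidate, check_output) -> str:
    """Bucket a single snippet run by its error mode (run_candidate/check_output are hypothetical hooks)."""
    try:
        output = run_candidate()  # execute the model-completed snippet
    except (SyntaxError, IndentationError):
        return "SyntaxError/Indentation"
    except (ImportError, ModuleNotFoundError):
        return "ImportError/ModuleNotFound"
    except NameError:
        return "NameError"
    except TypeError:
        return "TypeError"
    except AttributeError:
        return "AttributeError"
    except (IndexError, KeyError):
        return "IndexError/KeyError"
    if not check_output(output):  # runs without raising, but produces the wrong result
        return "Functional (semantic) error"
    return "Pass"
```

Aggregating these labels over all evaluated snippets (e.g., with collections.Counter) yields a distribution like the one reported above.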
7. Community Infrastructure and Continuous Extension
ResearchCodeBench is designed for extensibility and openness:
- Community submissions for new tasks are accepted via standard forms, with metadata for each problem tracked in a persistent YAML registry (a hypothetical registry entry is sketched after this list).
- Evaluation scripts, environments (Docker/Conda), and prompt templates are open-source (https://researchcodebench.github.io/).
- The leaderboard is updated live as new papers and models are tested.
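The field names below are hypothetical and only illustrate the kind of per-problem metadata such a registry might carry; the sketch assumes PyYAML and is not the benchmark's actual schema.

```python
import yaml  # PyYAML

# Hypothetical registry entry; the benchmark's real schema may differ.
entry_text = """
paper_id: example-paper-2025
title: Example Diffusion Sampler
venue: ICLR 2025
repo: https://github.com/example-org/example-repo
first_commit: 2025-01-15
snippets:
  - name: min_p_filter
    file: sampler.py
    lines_of_code: 4
    test: tests/test_min_p_filter.py
"""

entry = yaml.safe_load(entry_text)
print(entry["snippets"][0]["name"])  # min_p_filter
```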
This infrastructure enables ongoing, robust assessment of model progress as scientific literature and foundation models evolve. By combining focused challenge definition, functional testing, rigorous contamination control, and community extensibility, ResearchCodeBench is positioned as a canonical resource for the evaluation and advancement of research-oriented code generation (Hua et al., 2 Jun 2025).
8. Related Benchmarks and Contextual Significance
ResearchCodeBench builds on lessons from CodeFlowBench (multi-turn, dependency-aware code generation; Wang et al., 30 Apr 2025), BigCodeBench (API-rich, compositional tool use; Zhuo et al., 22 Jun 2024), as well as domain-specific efforts such as BioCoder (bioinformatics, topic coverage, domain adaptation; Tang et al., 2023). By focusing evaluation on scientific novelty and deterministic functional correctness, ResearchCodeBench uniquely calibrates LLM capabilities against the demands of rapid scientific innovation, closing a critical gap in benchmarking methodology.