
The Automated LLM Speedrunning Benchmark: Reproducing NanoGPT Improvements (2506.22419v2)

Published 27 Jun 2025 in cs.AI, cs.CL, and cs.LG

Abstract: Rapid advancements in LLMs have the potential to assist in scientific progress. A critical capability toward this endeavor is the ability to reproduce existing work. To evaluate the ability of AI agents to reproduce results in an active research area, we introduce the Automated LLM Speedrunning Benchmark, leveraging the research community's contributions on the NanoGPT speedrun, a competition to train a GPT-2 model in the shortest time. Each of the 19 speedrun tasks provides the agent with the previous record's training script, optionally paired with one of three hint formats, ranging from pseudocode to paper-like descriptions of the new record's improvements. Records execute quickly by design, and speedrun improvements encompass diverse code-level changes, ranging from high-level algorithmic advancements to hardware-aware optimizations. These features make the benchmark both accessible and realistic for the frontier problem of improving LLM training. We find that recent reasoning LLMs combined with SoTA scaffolds struggle to reimplement already-known innovations in our benchmark, even when given detailed hints. Our benchmark thus provides a simple, non-saturated measure of an LLM's ability to automate scientific reproduction, a necessary (but not sufficient) skill for an autonomous research agent.

Summary

  • The paper introduces a benchmark that evaluates LLM research agents by requiring them to reproduce sequential NanoGPT speedrun innovations.
  • Agents iteratively modify code under a variety of search scaffolds, with progress measured by the fraction of speedup recovered (FSR) metric.
  • Empirical results show that even with detailed hints, current LLM agents recover only a fraction of the human-achieved speedups, underscoring reproducibility challenges.

The Automated LLM Speedrunning Benchmark: Evaluating Reproducibility in LLM Research Agents

The paper introduces the Automated LLM Speedrunning Benchmark, a novel evaluation suite designed to assess the ability of LLM-based research agents to reproduce incremental scientific improvements in the context of LLM training. The benchmark is constructed from the NanoGPT Speedrun, a community-driven competition focused on minimizing the wall-clock time required to train a GPT-2 model to a fixed validation loss on a standardized dataset and hardware configuration. Each of the 19 benchmark tasks corresponds to a transition between consecutive speedrun records, where each record represents a concrete, code-level innovation that reduced training time.
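
To make the task structure concrete, the sketch below shows one plausible way to represent a single benchmark task. The field names and types are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class SpeedrunTask:
    """Illustrative representation of one record-to-record reproduction task."""
    record_index: int            # which of the 19 record transitions this is
    prev_record_script: str      # the previous record's full training script
    prev_record_time: float      # wall-clock training time of the previous record (minutes)
    next_record_time: float      # wall-clock training time of the next (target) record (minutes)
    # Optional hints describing the next record's improvement, keyed by format:
    # "pseudocode", "description", or "mini_paper".
    hints: dict[str, str] = field(default_factory=dict)
```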

Benchmark Design and Motivation

The benchmark is motivated by the centrality of reproducibility in scientific progress, particularly in machine learning, where the ability to faithfully reimplement and validate published results is foundational. Unlike prior reproducibility benchmarks that focus on isolated papers or codebases, this benchmark is sequential and cumulative: agents are tasked with reproducing each successive improvement, starting from the previous record’s codebase. This design enables the evaluation of an agent’s ability to follow a chain of innovations, reflecting the iterative nature of real-world research.

Each task provides the agent with the previous record’s training script and, optionally, one or more hints describing the next improvement. Hints are available in three formats: pseudocode, natural language description, and a mini-paper summary. The benchmark’s success metric is the fraction of speedup recovered (FSR), defined as the proportion of the human-achieved training time reduction that the agent is able to match.
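
As a concrete illustration, a minimal sketch of how FSR could be computed is given below; the variable names and edge-case handling are assumptions rather than the paper's exact implementation.

```python
def fraction_of_speedup_recovered(t_prev: float, t_human: float, t_agent: float) -> float:
    """Fraction of the human-achieved training-time reduction matched by the agent.

    t_prev  -- wall-clock training time of the previous record
    t_human -- wall-clock training time of the next (human) record
    t_agent -- wall-clock training time achieved by the agent's best solution
    """
    human_speedup = t_prev - t_human   # time saved by the human record
    agent_speedup = t_prev - t_agent   # time saved by the agent's attempt
    if human_speedup <= 0:
        raise ValueError("the human record must improve on the previous one")
    return agent_speedup / human_speedup


# Example: previous record 12.0 min, human record 10.0 min, agent reaches 11.1 min
# -> (12.0 - 11.1) / (12.0 - 10.0) = 0.45, i.e. 45% of the speedup is recovered
```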

Agent Scaffolds and Evaluation Protocol

The authors implement a flexible search scaffold, extending the AIDE framework, to structure the agent’s search for improved solutions. The scaffold supports various search strategies, including flat (best-of-M), tree, forest, AIDE, and multi-AIDE, each parameterized by branching factor, debug probability, and search depth. At each step, the agent generates code modifications, executes the resulting script on a fixed 8xH100 node, and analyzes the outcome to guide further search.
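
The sketch below outlines, in simplified form, the kind of draft/debug/improve loop such a scaffold runs. The llm and run_script callables are placeholders, and the node-selection policy is an assumption, not the paper's exact AIDE extension.

```python
from __future__ import annotations
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    """One candidate training script in the search tree."""
    code: str
    parent: Node | None = None
    train_time: float | None = None   # wall-clock time if the run succeeded
    buggy: bool = False
    children: list[Node] = field(default_factory=list)

def search(llm, run_script, root_code: str, steps: int = 20, p_debug: float = 0.5) -> Node:
    """Simplified AIDE-style loop: repeatedly debug a buggy leaf or improve the best node.

    llm(prompt) -> str and run_script(code) -> (train_time, buggy) are placeholders
    for the model call and a training run on the fixed GPU node.
    """
    root = Node(code=root_code)
    best = root
    for _ in range(steps):
        buggy_leaves = [n for n in _leaves(root) if n.buggy]
        if buggy_leaves and random.random() < p_debug:
            parent = random.choice(buggy_leaves)   # try to fix a failed attempt
            prompt = f"Fix the bug in this training script:\n{parent.code}"
        else:
            parent = best                          # try to improve the current best
            prompt = f"Make this training script faster without changing the target loss:\n{parent.code}"
        child = Node(code=llm(prompt), parent=parent)
        parent.children.append(child)
        child.train_time, child.buggy = run_script(child.code)
        if not child.buggy and (best.train_time is None or child.train_time < best.train_time):
            best = child
    return best

def _leaves(node: Node) -> list[Node]:
    return [node] if not node.children else [leaf for c in node.children for leaf in _leaves(c)]
```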

The evaluation covers four leading LLMs (DeepSeek-R1, o3-mini, Gemini-2.5-Pro, Claude-3.7-Sonnet) across all search scaffolds and hint regimes. Each agent is allotted a fixed search budget, and performance is measured in terms of both FSR and code similarity to the human solutions, using embedding-based and LLM-judge metrics.
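
For the embedding-based similarity metric, a minimal sketch might look like the following; embed is a placeholder for whichever code-embedding model is used, which this summary does not specify.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def embedding_code_similarity(agent_code: str, human_code: str, embed) -> float:
    """Embedding-based similarity between an agent solution and the human record.

    embed(text) -> np.ndarray is a placeholder for a code-embedding model; the
    paper's exact choice of embedder and any normalization are assumptions here.
    """
    return cosine_similarity(embed(agent_code), embed(human_code))
```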

Empirical Findings

The results reveal several key insights:

  • Current LLM agents, even when provided with detailed hints, recover only a fraction of the speedup achieved by human experts. For example, the best-performing agent (o3-mini with multi-AIDE and all hints) recovers approximately 46% of the human speedup on average.
  • Hints are necessary but not sufficient: Without hints, agents recover less than 20% of the speedup. Pseudocode hints are generally the most effective, but combining multiple hint types can sometimes degrade performance, particularly for models less capable of handling long contexts.
  • Search strategy matters: Flat search often matches or outperforms more complex iterative scaffolds for individual tasks, but multi-AIDE provides the best aggregate performance when combining hints.
  • Model differences are pronounced: o3-mini and Claude-3.7-Sonnet outperform open-weight models like DeepSeek-R1, especially in iterative search settings. However, Claude-3.7-Sonnet generates a higher proportion of buggy solutions, which negatively impacts its interquartile mean performance.
  • Cumulative reproduction is challenging: When agents are required to build on their own previous solutions (rather than ground-truth human code), performance degrades rapidly, with the agent failing to recover any speedup by the fourth record in the chain.
  • External knowledge integration remains weak: Providing agents with documentation for novel modules (e.g., FlexAttention) does not improve, and can even harm, performance, indicating limitations in leveraging out-of-distribution knowledge.

Implications and Limitations

The benchmark exposes a significant gap between the current capabilities of LLM-based research agents and the requirements for reliable, automated scientific reproduction. Even with access to explicit, high-level descriptions of code changes, agents struggle to implement non-trivial optimizations in a complex, real-world codebase. This finding challenges optimistic assumptions about the near-term prospects of fully autonomous research agents and highlights reproducibility as a critical bottleneck.

From a practical perspective, the benchmark provides a rigorous, non-saturated testbed for tracking progress in agentic AI research. Its design—grounded in real, impactful LLM training improvements—ensures relevance to both the research community and practitioners focused on accelerating model development cycles.

The authors note several limitations and directions for future work:

  • Scaling external knowledge: The current benchmark provides succinct, manually curated hints. Realistic research agents will need to retrieve, filter, and integrate large, heterogeneous sources of information, potentially exceeding context window limits.
  • Memorization vs. generalization: As models are trained on increasingly comprehensive corpora, disentangling genuine reasoning from memorization of benchmark solutions will require careful analysis.
  • Beyond code similarity: Future evaluations could incorporate semantic diffing and natural language explanations to better capture the fidelity and novelty of agent-generated solutions.
  • Generalization to broader ML tasks: While focused on LLM training, the benchmark’s methodology could be extended to other domains, including multi-file codebases, distributed training, and optimization for metrics beyond speed.

Societal and Research Impact

The Automated LLM Speedrunning Benchmark sets a new standard for evaluating the reproducibility capabilities of AI research agents. Its findings suggest that, despite rapid progress in LLM reasoning and code generation, substantial advances are required before agents can reliably automate the reproduction of state-of-the-art scientific results. The benchmark’s open-source release will facilitate community-driven progress and provide a valuable yardstick for future developments in autonomous scientific discovery.

In summary, this work provides a rigorous, practically grounded assessment of the current limits of LLM-based research automation, and establishes a challenging, extensible benchmark for the next generation of AI research agents.
