Automated LLM Speedrunning Benchmark

Updated 1 July 2025
  • The Automated LLM Speedrunning Benchmark is a framework that measures LLM agents' ability to quickly and reliably reproduce, and improve on, known speedups in LLM training, using the NanoGPT Speedrun as its task source.
  • This benchmark assesses LLM agents' critical capacity for scientific reproduction, highlighting bottlenecks in their ability to automate experimental science toward autonomous research systems.
  • Current LLM agents tested on the benchmark recover less than 50% of human-achieved speedups and face challenges with debugging, chaining improvements, and processing complex hints.

The Automated LLM Speedrunning Benchmark is a framework for measuring the capacity of LLMs and research agents to rapidly, reliably, and reproducibly implement and improve upon sophisticated computational workloads. It originates from the need to evaluate not just models' performance on static datasets but also their ability to keep pace with new advancements, and it operationalizes the concept of "speedrunning" in terms of both scientific reproducibility and automated research. The design is centered on the NanoGPT Speedrun, a competitive, community-driven challenge to train GPT-2 models as quickly as possible, and extends it into a sequence of code-level tasks, each corresponding to a verified innovation by human researchers. The framework systematically evaluates LLM agents' ability to automate the process of scientific reproduction, a critical capability on the pathway toward autonomous research systems (2506.22419).

1. Benchmark Structure and Methodology

The Automated LLM Speedrunning Benchmark (ALLMsB) is constructed from a set of sequential tasks, each representing the reproduction of a documented "record improvement" in LLM training speed on NanoGPT. Each task is defined by a pair of scripts and timings: a predecessor $(R_{i-1}, t_{i-1})$ and a successor $(R_i, t_i)$. The goal is for an agent, given access to $R_{i-1}$ (with or without an informative hint), to modify the training code so as to replicate the improvement leading to $R_i$.

Agents are evaluated by running the modified code on standardized hardware (e.g., single 8xH100 node with PyTorch), ensuring that timing and performance are directly comparable to the original records.
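A minimal timing harness consistent with this kind of protocol might look as follows; the function name `time_training_run` and the `train_step` callable are illustrative assumptions for exposition, not part of the benchmark's actual runner.

```python
import time

import torch


def time_training_run(train_step, num_steps: int) -> float:
    """Wall-clock a fixed number of training steps, synchronizing the GPU so
    that asynchronous CUDA kernels are included in the measurement."""
    torch.cuda.synchronize()      # flush pending GPU work before starting the clock
    start = time.perf_counter()
    for _ in range(num_steps):
        train_step()              # one optimizer step of the candidate training script
    torch.cuda.synchronize()      # wait for all queued GPU work to finish
    return time.perf_counter() - start
```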

The core metric is the Fraction of Speedup Recovered (FSR): $\mathrm{FSR}_i = \frac{t_{i-1} - t'_i}{t_{i-1} - t_i}$, where $t'_i$ is the agent's achieved runtime, $t_{i-1}$ the runtime of the previous record $R_{i-1}$, and $t_i$ the runtime of the human record $R_i$ for the improvement. This metric quantifies an agent's ability to close the performance gap achieved by human innovation at each step.
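Written out as code, the metric is a single ratio. The helper below is a sketch of the definition above, not the benchmark's own evaluation utility.

```python
def fraction_of_speedup_recovered(t_prev: float, t_record: float, t_agent: float) -> float:
    """FSR_i = (t_{i-1} - t'_i) / (t_{i-1} - t_i).

    t_prev   -- runtime t_{i-1} of the predecessor record R_{i-1}
    t_record -- runtime t_i of the human successor record R_i
    t_agent  -- runtime t'_i achieved by the agent's modified script
    """
    return (t_prev - t_agent) / (t_prev - t_record)


# Example: previous record 12.0 min, human record 10.0 min, agent reaches 11.0 min:
# (12.0 - 11.0) / (12.0 - 10.0) = 0.5, i.e. half of the human speedup recovered.
print(fraction_of_speedup_recovered(12.0, 10.0, 11.0))  # 0.5
```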

Tasks are run both with and without “hints”—natural language descriptions, pseudocode, or paper-style writeups describing the nature and implementation of the improvement.

2. Scope and Types of Tasks

The ALLMsB comprises 19 primary tasks derived from the NanoGPT Speedrun, excluding framework-only changes. The tasks span a representative range of research advances in large-scale LLM training, including:

  • Introduction of architectural innovations (rotary embeddings, untied embeddings and heads, U-Net pattern skip connections)
  • Algorithmic improvements (novel optimizers, momentum warmup, value embedding splits)
  • Hardware-level changes (migration to bfloat16/FP8, parallelization strategies)
  • Efficient attention mechanisms (FlexAttention, block sliding window, long-short attention)
  • Optimized hyperparameter schedules and learning-rate strategies

Each step is grounded in real scientific contributions as validated by the speedrun’s cumulative public advancements.
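As a concrete example of the architectural changes listed above, a generic rotary position embedding routine can be sketched as follows. This is a standard RoPE formulation for illustration only, not the speedrun's exact implementation.

```python
import torch


def apply_rotary_embedding(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary position embeddings (RoPE) to a query or key tensor of
    shape (batch, seq_len, n_heads, head_dim), with head_dim even."""
    _, seq_len, _, head_dim = x.shape
    half = head_dim // 2
    # Per-channel rotation frequencies and per-position angles.
    freqs = 1.0 / (base ** (torch.arange(half, device=x.device) / half))
    angles = torch.arange(seq_len, device=x.device)[:, None] * freqs[None, :]  # (seq_len, half)
    cos = angles.cos()[None, :, None, :]  # broadcast over batch and heads
    sin = angles.sin()[None, :, None, :]
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1, x2) channel pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```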

3. Hint Formats and Communication Simulations

The benchmark is distinguished by the systematic presentation of hints in three formats for each task:

  • Level 1 (L1): Pseudocode. Concise, code-like summaries outlining the skeleton of the innovation.
  • Level 2 (L2): Text description. Detailed natural-language explanations giving context, motivation, and the exact changes.
  • Level 3 (L3): Mini-paper. Simulated scientific write-ups providing full methodological, tabular, and narrative detail.

Each hint is manually prepared and quality-controlled, enabling analysis of the effect of communication style on LLM-augmented scientific reproduction. This structure allows for controlled comparison between direct coding, guided “engineering log,” and formal research formats, reflecting real-world scientific communication.
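For illustration only, a hypothetical L1 hint for a rotary-embedding record might be no more than a short before/after pseudocode fragment like the one below; the benchmark's actual hints are longer and task-specific, and the names used here are invented.

```python
# Hypothetical L1 (pseudocode) hint, invented here for illustration:
#
#   "Replace learned positional embeddings with rotary embeddings applied to
#    queries and keys inside attention."
#
#   before:
#       x = token_embedding(idx) + position_embedding(pos)
#       q, k, v = attention_projections(x)
#   after:
#       x = token_embedding(idx)                      # drop additive positions
#       q, k, v = attention_projections(x)
#       q, k = apply_rotary(q, pos), apply_rotary(k, pos)
```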

4. Performance, Key Results, and Diagnostic Insights

Evaluations reveal that current LLMs, even when scaffolded with multi-step search or debugging, recover only a fraction of the performance delta realized by human engineers, frequently less than 50% FSR, even when provided with explicit pseudocode (2506.22419). The best results under flat search with o3-mini using only L1 hints reach 40% FSR; combining all hint types can marginally raise this to 46%, though with high variance and diminishing returns as context size increases.

Main challenges observed:

  • Many agent solutions are buggy or fail to run, especially for later tasks requiring sophisticated, cumulative modifications.
  • Agents exhibit rapid performance decay when required to chain multiple prior improvements (multi-step reasoning), with a marked inability to recursively build on their own outputs.
  • More verbose hints sometimes degrade performance for some models, possibly due to token overload or distraction from key information.
  • Code-embedding and LLM-judge metrics both positively correlate with FSR, substantiating FSR as a proxy for reproduction fidelity.
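The last point can be made concrete with even a very simple similarity measure. The sketch below uses a plain sequence-matching ratio as a stand-in for the code-embedding and LLM-judge metrics reported in the paper; it does not reproduce either of them.

```python
import difflib


def code_similarity(agent_script: str, reference_script: str) -> float:
    """Crude textual similarity in [0, 1] between an agent's solution and the
    human record script it attempts to reproduce. A simple stand-in for the
    benchmark's code-embedding and LLM-judge metrics, not a reimplementation."""
    return difflib.SequenceMatcher(None, agent_script, reference_script).ratio()
```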

Empirical analysis thus demonstrates that LLMs are far from achieving human-level speedrunning across even moderately deep research arcs, and the incremental approach of the ALLMsB provides a sensitive, non-saturated diagnostic for progress.

5. Implications for Automated Research and Reproducibility

The ALLMsB directly measures an LLM’s or agent’s proficiency at the basic, yet foundational, requirement for autonomous science: scientific reproducibility. By structuring tasks as a series of documented, cumulative research improvements and requiring execution-level speedup, the benchmark uncovers bottlenecks in current systems’ capacity to:

  • Parse and implement complex research communication in multiple forms
  • Debug, iterate, and recombine improvements in sequence
  • Select necessary information from among variable hint types
  • Robustly handle cumulative code complexity and interdependency

These are prerequisites for any credible claim of agentic autonomy in scientific discovery. The benchmark thus becomes an early warning system for the limits of automation in experimental science over and above mere code generation or summarization.

Furthermore, by using real, accessible research artifacts, ALLMsB avoids artificial saturation, ensuring that progress on the benchmark reflects true advancement in LLM agent capabilities rather than overfitting to static or synthetic datasets.

6. Benchmark Implementation and Community Usage

The ALLMsB includes a reproducible pipeline, with all code, data, and evaluation infrastructure publicly provided (see https://github.com/facebookresearch/LLM-speedrunner). Each record is associated with full scripts, hardware and dataset docstrings, and standardized execution and timing protocols. Benchmark runners are designed for fair, direct head-to-head comparison of agent solutions.

Hints are provided in electronic format (Markdown, text), and evaluation utilities for FSR computation, log analysis, and code similarity metrics are included.

The benchmark is extensible to new research domains or agent scaffolds, and can serve as a long-term reporting ground for incremental LLM research progress in scientific reproduction and related agent tasks.


Summary Table: Key Properties of the Automated LLM Speedrunning Benchmark

Aspect       | Details
-------------|------------------------------------------------------------------
Source       | NanoGPT Speedrun: 19 sequential, real-world speedup records
Task Type    | Code modification, reproduction of algorithmic/hardware advances
Core Metric  | Fraction of Speedup Recovered (FSR)
Hint Types   | Pseudocode (L1), text description (L2), mini-paper (L3)
Evaluation   | Execution and runtime on standardized hardware
Findings     | Current LLMs recover <50% of human speedup; failures compound on cumulative tasks
Implication  | Reproducibility is a major bottleneck for agentic research
Availability | https://github.com/facebookresearch/LLM-speedrunner