ALLMsB: Automated LLM Speedrunning Benchmark
- ALLMsB is a benchmark framework that evaluates LLM agents by requiring them to reproduce sequential, record-setting code improvements from speedrun competitions.
- It employs a structured progression of tasks with hints ranging from pseudocode to mini-papers, testing whether agents can accurately translate technical descriptions into optimized code.
- Performance is quantified using the Fraction of Speedup Recovered (FSR) metric, highlighting challenges in cumulative error propagation and empirical validation.
The Automated LLM Speedrunning Benchmark (ALLMsB) is a framework for evaluating the ability of LLM research agents to reproduce, implement, and improve upon rapid, sequential chains of scientific or engineering advancements, particularly in the domain of LLM development. ALLMsB operationalizes this evaluation by requiring agents to implement record-setting improvements from community-driven speedrun competitions—most notably the NanoGPT speedrun—using only prior source code and structured textual hints as guidance. The overarching goal is to provide a non-saturated, quantitative measure of an agent’s ability to automate scientific reproduction, which is considered a critical capability for the development of autonomous research agents (Zhao et al., 27 Jun 2025).
1. Benchmark Architecture and Design Principles
ALLMsB is instantiated using a collection of speedrun records from active research competitions such as the NanoGPT speedrun. Each record comprises three components: (a) a validated training script, (b) a measured empirical result (typically wall-clock time to a target loss), and (c) a changelog describing the improvement over the previous record. Tasks are constructed as directed transitions between consecutive records, from the previous record to the next state-of-the-art result, so that the agent must implement the precise set of code-level changes needed to achieve the reported empirical gain.
Distinctive features include:
- Sequential (cumulative) task structure: Each record builds on the previous state, requiring the agent to preserve all prior improvements while incorporating new ones.
- Explicit acceptance criteria: Solutions are validated by exact empirical reproduction (measured training time/efficiency on a fixed hardware setup), not by superficial code similarity.
- Accessibility and realism: Tasks execute rapidly (45 minutes per record in the NanoGPT use-case), enabling practical experimentation and benchmark extension.
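As a concrete illustration of the record and transition structure described above, the sketch below models a speedrun record and the tasks derived from consecutive record pairs. The field names (`train_script`, `wall_clock_seconds`, `changelog`) and the `build_tasks` helper are illustrative assumptions, not the benchmark's actual schema.

```python
from dataclasses import dataclass

@dataclass
class SpeedrunRecord:
    """One validated record: training script, measured runtime, and changelog."""
    record_id: int
    train_script: str          # path to the validated training script (assumed field)
    wall_clock_seconds: float  # measured time to reach the target loss
    changelog: str             # description of the improvement over the prior record

@dataclass
class ReproductionTask:
    """A directed transition: reproduce the improvement from prev to target."""
    prev: SpeedrunRecord           # codebase handed to the agent
    target: SpeedrunRecord         # record whose speedup must be recovered
    hint_level: int | None = None  # 1=pseudocode, 2=text, 3=mini-paper, None=no hint

def build_tasks(records: list[SpeedrunRecord],
                hint_level: int | None = None) -> list[ReproductionTask]:
    """Pair each record with its successor to form the sequential task chain."""
    return [
        ReproductionTask(prev=prev, target=nxt, hint_level=hint_level)
        for prev, nxt in zip(records, records[1:])
    ]
```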
2. Input Modalities and Hint Formats
Each speedrun task provides the agent with the previous record’s codebase plus one optional hint chosen from three granularity levels:
- Level 1 (Pseudocode): High-level, language-agnostic summary of the algorithmic change (e.g., Muon optimizer initialization).
- Level 2 (Text): Natural language description of the improvement and its expected performance implications.
- Level 3 (Mini-paper): Scientific-style summary including detailed rationale, technical analysis, and LaTeX-formatted mathematical formulas (such as learning rate or scheduling equations).
The hints are designed to probe the agent’s ability to translate both algorithmic and implementation details into correct code, from brief sketches to full technical expositions. Agents may also be evaluated with no hints, isolating pure code synthesis capability.
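A minimal sketch of how a task input might be assembled from these modalities is shown below; the directory layout and file names (`train_gpt.py`, `hint_*.md`) are assumptions for illustration, not the benchmark's actual file structure.

```python
from pathlib import Path

# Assumed file names for each hint granularity level (illustrative only).
HINT_FILES = {
    1: "hint_pseudocode.md",  # Level 1: language-agnostic pseudocode sketch
    2: "hint_text.md",        # Level 2: natural-language description
    3: "hint_minipaper.md",   # Level 3: mini-paper with rationale and formulas
}

def load_task_input(record_dir: Path, hint_level: int | None = None) -> dict:
    """Assemble the agent's input: the previous record's code plus at most one hint."""
    task = {"code": (record_dir / "train_gpt.py").read_text()}
    if hint_level is not None:
        task["hint"] = (record_dir / HINT_FILES[hint_level]).read_text()
    return task
```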
3. Evaluation Metrics and Scoring
The central metric for benchmarking is the Fraction of Speedup Recovered (FSR), defined for each transition $i$ (from record $i-1$ to record $i$) as

$$\mathrm{FSR}_i = \frac{t_{i-1} - t^{\text{agent}}_i}{t_{i-1} - t_i},$$

where $t_{i-1}$ is the wall-clock training time of the previous record, $t_i$ is the human record time for the improved script, and $t^{\text{agent}}_i$ is the agent-achieved time after implementing the required changes. The mean FSR over all benchmark tasks serves as the primary aggregate score for agent performance.
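A direct transcription of this definition into code is given below, assuming per-transition times are available; how invalid or slower-than-previous runs are clipped is a scoring detail left to the benchmark harness and not fixed here.

```python
def fraction_of_speedup_recovered(t_prev: float, t_human: float, t_agent: float) -> float:
    """FSR for one transition: (t_prev - t_agent) / (t_prev - t_human).

    t_prev:  wall-clock training time of the previous record
    t_human: wall-clock time of the human's improved record
    t_agent: wall-clock time achieved by the agent's script
    Returns 1.0 when the agent matches the human record and 0.0 when it only
    matches the previous record (no speedup recovered).
    """
    return (t_prev - t_agent) / (t_prev - t_human)

def mean_fsr(transitions: list[tuple[float, float, float]]) -> float:
    """Primary aggregate score: mean FSR over all benchmark transitions."""
    return sum(fraction_of_speedup_recovered(*t) for t in transitions) / len(transitions)
```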
Additional evaluations may include:
- Code similarity: Distance in code embedding space between agent and human solutions.
- Error analysis: Rates and propagation of implementation bugs across chained records.
Notably, execution-based validation is required: only agent scripts that reach the target loss within hardware and runtime constraints are considered valid for scoring.
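The execution-based acceptance criterion can be sketched as follows; the assumed command invocation and final JSON output line are stand-ins for whatever the real harness actually parses, and hardware pinning is omitted.

```python
import json
import subprocess
import time

def validate_run(script_path: str, target_loss: float, time_budget_s: float) -> tuple[bool, float]:
    """Run a candidate training script and check the acceptance criterion:
    the target validation loss must be reached within the runtime budget.
    Assumes the script prints a final JSON line such as {"val_loss": 3.28}.
    """
    start = time.time()
    try:
        proc = subprocess.run(
            ["python", script_path],
            capture_output=True, text=True, timeout=time_budget_s,
        )
    except subprocess.TimeoutExpired:
        return False, time_budget_s                   # exceeded the runtime budget
    elapsed = time.time() - start
    if proc.returncode != 0:
        return False, elapsed                         # crashed or buggy solution
    final = json.loads(proc.stdout.strip().splitlines()[-1])
    return final["val_loss"] <= target_loss, elapsed  # only valid runs count toward FSR
```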
4. Experimental Results and Empirical Findings
Empirical studies reveal key insights into current LLM capabilities:
- State-of-the-art reasoning models (DeepSeek-R1, o3-mini, Gemini-2.5-Pro, Claude-3.7-Sonnet), even when combined with sophisticated agent scaffolds (AIDE, iterative search), typically recover less than 20% FSR without hints.
- Providing detailed hint formats improves performance: some models recover 40–46% FSR for individual record reproduction tasks.
- Branching and debugging scaffolds (AIDE variants) reduce buggy solution nodes but do not fully solve code translation and debugging challenges.
- Performance degrades substantially in cumulative experiments, as implementation errors compound across sequential tasks, underscoring the difficulty of chained scientific reproduction.
These results demonstrate that the benchmark is not saturated and remains a credible challenge for research agents.
5. Relevance to Autonomous Research Agent Development
ALLMsB addresses a crucial bottleneck in the automation of scientific progress: reliable, execution-verified reproduction of prior work, rather than mere generation of textual or code summaries. Effective agents must implement algorithmic improvements from minimal description, validate their empirical correctness, and chain successive improvements with minimal error propagation.
A plausible implication is that achieving high FSR consistently across ALLMsB tasks constitutes a necessary—but not sufficient—criterion for research automation in domains characterized by cumulative methodological advancement. The benchmark’s structure provides clear differentiation: a system that merely generates plausible code may achieve low FSR due to subtle but critical semantic errors; only robust, implementation-driven reasoning achieves high scores.
6. Benchmarks, Related Frameworks, and Extensions
The ALLMsB paradigm shares foundational similarities with other benchmarking and evaluation initiatives:
- LLMeBench (Dalvi et al., 2023): While LLMeBench offers rapid customization and in-context learning for general NLP benchmarking, its focus is flexibility over speedrun validation; ALLMsB emphasizes sequential, empirical record reproduction.
- Project MPG (Spangher et al., 28 Oct 2024): Aggregates “Goodness” (accuracy) and “Fastness” (QPS/speed) metrics for interpretable model comparison. MPG’s Bayesian aggregation framework could extend ALLMsB by integrating uncertainty measures across empirical speedrun records.
- BenchAgents (Butt et al., 29 Oct 2024): Multi-agent systems for automated benchmark creation with explicit constraint verification and structured planning—complementary for constructing synthetic speedrunning benchmarks with high diversity.
- OSS-Bench (Jiang et al., 18 May 2025): Automates coding benchmarks based on live OSS projects with metrics for compilability, functional correctness, memory safety, and fuzzing-derived robustness. Incorporates chained metric computation analogous to ALLMsB’s FSR calculation and enforces non-triviality via dissimilarity bonuses.
A plausible implication is that integration of reliability (OSS-Bench), multi-agent planning (BenchAgents), and statistical aggregation (Project MPG) methodologies into ALLMsB could further generalize the framework for domains beyond LLM training.
7. Technical Limitations and Open Challenges
Current agent systems struggle to deliver high FSR, especially in cumulative multi-step reproduction scenarios, primarily due to:
- Incomplete or imprecise translation of textual hints into actionable code modifications.
- Limited or non-robust debugging and validation routines within scaffolded agent architectures.
- Rapid compounding of minor errors in sequential tasks.
This suggests that improvements in interpretive reasoning, semantic understanding of scientific innovations, and agent-level debugging are required to meaningfully advance autonomous research automation as measured by ALLMsB.
Table: Key Components of ALLMsB
| Benchmark Component | Description | Technical Role |
|---|---|---|
| Speedrun Records | Empirically validated code improvements (script, wall-clock time, changelog) | Define sequential tasks; ground truth for reproduction |
| Hint Formats | Pseudocode / text / mini-paper descriptions | Vary input specificity; probe agent interpretive capabilities |
| FSR Metric | Fraction of Speedup Recovered | Quantifies empirical success in implementation |
| Sequential Chaining | Each record builds on the last | Tests accumulated reasoning and error propagation |
| Agent Scaffold | Iterative code generation, debugging, and validation (e.g., AIDE, branching) | Supports search, debugging, and execution-based validation |
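The "Agent Scaffold" row corresponds to an iterative generate, execute, and debug loop. The skeleton below is a minimal sketch of such a loop, not AIDE itself; `llm_propose_patch` and `run_and_score` are hypothetical callables standing in for the model and the execution harness.

```python
def speedrun_agent_loop(task: dict, llm_propose_patch, run_and_score, max_iters: int = 8):
    """Minimal generate -> execute -> debug loop over candidate solutions.

    llm_propose_patch(code, hint, feedback) -> candidate code  (hypothetical LLM call)
    run_and_score(code) -> (fsr, error_log_or_None)            (hypothetical executor)
    """
    best_code, best_fsr = task["code"], 0.0
    feedback = None
    for _ in range(max_iters):
        candidate = llm_propose_patch(best_code, task.get("hint"), feedback)
        fsr, error_log = run_and_score(candidate)
        if error_log is not None:
            feedback = error_log          # debug branch: feed the error back to the model
            continue
        if fsr > best_fsr:                # keep the best empirically validated candidate
            best_code, best_fsr = candidate, fsr
        feedback = None                   # reset feedback after a clean run
    return best_code, best_fsr
```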
In summary, the Automated LLM Speedrunning Benchmark operationalizes empirical reproduction of scientific record chains as a core competency for autonomous research agents. It provides a sequential, hint-driven task architecture paired with execution-based scoring (FSR), exposing current limitations and informing future research directions at the intersection of LLMs, programming agents, and empirical validation (Zhao et al., 27 Jun 2025).