
NanoGPT Speedrun

Updated 1 July 2025
  • The NanoGPT Speedrun is a research challenge and benchmark aiming to minimize wall-clock time to train a GPT-2 model to a target validation loss on real data, serving as a crucible for LLM methods and systems innovation.
  • Innovations range from advanced optimizers (Muon, Scion) and efficient attention mechanisms (FlexAttention) to mixed-precision training and principled regularization techniques, cumulatively reducing training time from hundreds of minutes to tens of minutes.
  • Formalized as an Automated LLM Speedrunning Benchmark, it evaluates both human and AI agents on their ability to reproduce code-level speedups, revealing gaps in current agent capabilities while providing a metric for progress in automated scientific discovery.

The NanoGPT Speedrun refers both to a research and engineering effort and to a corresponding community benchmarking suite, each aiming to minimize the wall-clock time required to train a GPT-2 model (124M parameters) to a fixed target validation loss on real data. The paradigm encompasses advances in optimization, architecture, regularization, efficient data handling, and hardware utilization, and has become a crucible for LLM methods research, code-level systems innovation, and automated agent benchmarking.

1. Origins and Definition

The NanoGPT Speedrun emerged as an open challenge and code-centric competition focused on rapidly training a GPT-2-sized autoregressive model, using the widely adopted NanoGPT implementation, to a pre-specified validation loss on a standardized dataset (FineWeb). The defining metric is the total wall-clock training time on fixed hardware—typically a single 8×H100 (NVIDIA Hopper) cluster node. The "speedrun" framing encourages iterative, reproducible, and incremental improvements akin to those seen in competitive programming or computer systems benchmarking.
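
The defining measurement reduces to a timing loop around training. Below is a minimal sketch of the protocol, not the actual record scripts (which are full distributed PyTorch programs); the `train_step` and `val_loss` callables are placeholders supplied by the caller, and 3.28 is the commonly cited FineWeb target loss:

```python
import time
from typing import Callable

TARGET_VAL_LOSS = 3.28  # commonly cited GPT-2 speedrun target on FineWeb

def speedrun(train_step: Callable[[], None],
             val_loss: Callable[[], float],
             max_steps: int = 10_000,
             eval_every: int = 100) -> float:
    """Train until validation loss reaches the target; the record is the
    elapsed wall-clock time, not step count or FLOPs."""
    start = time.time()
    for step in range(max_steps):
        train_step()  # placeholder: one optimizer update on real data
        if step % eval_every == 0 and val_loss() <= TARGET_VAL_LOSS:
            return time.time() - start  # wall-clock time is the only metric
    raise RuntimeError("target validation loss not reached")
```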

The concept has evolved to cover:

  • Script-level innovations in model training that reduce time-to-target,
  • Unified, community-documented "records" where each new solution builds upon the previous,
  • A formal benchmarking suite (the “Automated LLM Speedrunning Benchmark” (2506.22419)) designed to test not only human but also agentic (LLM-based) ability to implement and reproduce these improvements.

2. Scope of Innovations

The NanoGPT Speedrun records a sequence of innovations, each representing a measurable reduction in training time. As detailed in the Automated LLM Speedrunning Benchmark (2506.22419), these innovations span the following broad categories:

| Task Index | Key Innovation | Type |
|---|---|---|
| 1 → 2 | Rotary embeddings, tuned learning rate | Embeddings/Optimizer |
| 2 → 3 | Muon optimizer | Optimizer |
| 4 → 5 | QK-normalization, zero-init projections, ReLU² | Architecture |
| 5 → 6 | Distributed Muon overhead | Parallelization |
| 8 → 9 | Value/embedding skip connections, momentum warmup, logit softcap | Architecture/Optimizer |
| 9 → 10 | bfloat16 activations | Data type |
| 11 → 12 | FlexAttention (efficient 64K-context attention) | Attention mechanism |
| … | … | … |
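
As a concrete example of a record transition, the 1 → 2 step above swapped learned positional embeddings for rotary embeddings (RoPE). A minimal PyTorch sketch (the `theta=10000.0` base is the conventional default, assumed rather than taken from the record script):

```python
import torch

def apply_rotary_emb(x: torch.Tensor, theta: float = 10000.0) -> torch.Tensor:
    """Rotate consecutive channel pairs of queries/keys by position-dependent
    angles. x has shape (batch, heads, seq_len, head_dim)."""
    B, H, T, D = x.shape
    freqs = theta ** (-torch.arange(0, D, 2, dtype=torch.float32) / D)       # (D/2,)
    angles = torch.arange(T, dtype=torch.float32)[:, None] * freqs[None, :]  # (T, D/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(1, 12, 64, 64)  # (batch, heads, seq, head_dim)
q_rot = apply_rotary_emb(q)     # applied to queries and keys before attention
```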

Innovations further include:

  • Hardware-data type optimizations (FP8, bfloat16, memory pinning),
  • Enhanced attention mechanisms (windowing, FlexAttention),
  • Architectural changes (skip connections, untying embedding/output layers, U-net style connections),
  • Optimizer improvements (e.g., Muon, Scion, Gluon, with operator norm-aware learning rates (2502.07529, 2505.13416)),
  • Scheduler and hyperparameter refinements (trapezoidal LR schedule, logit softcap); a trapezoidal schedule is sketched below.

Each step represents a reproducible change to the training script or underlying model, validated by a measured speedup.
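
As one illustration, the trapezoidal LR schedule from the list above is a plain function of the step index; the warmup, plateau, and cooldown lengths below are illustrative assumptions, not the record's tuned values:

```python
def trapezoidal_lr(step: int, max_lr: float = 1e-3, warmup: int = 250,
                   total: int = 5_000, cooldown: int = 1_000) -> float:
    """Linear warmup, flat plateau, then linear cooldown to zero."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    if step < total - cooldown:
        return max_lr
    return max_lr * max(0.0, (total - step) / cooldown)
```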

3. Benchmark Structure and Adoption

The Automated LLM Speedrunning Benchmark (2506.22419) formalizes this sequence as a 19-stage code-reproduction suite for LLM and agent evaluation:

  • Each stage (“task”) provides the previous record’s code and, optionally, a “hint” (pseudocode/text/mini-paper) explaining the next improvement,
  • Agents (human or LLM) must produce a script that, when run, recovers the wall-clock speedup associated with the next record,
  • Success is measured objectively via "Fraction of Speedup Recovered" (FSR):

$$\text{FSR}_i = \frac{t_i - t'_{i+1}}{t_i - t_{i+1}}$$

where $t_i$ is the previous record's wall-clock time, $t_{i+1}$ is the next record's (target) time, and $t'_{i+1}$ is the time achieved by the agent's solution.
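
The metric translates directly into code; the numbers below are hypothetical:

```python
def fraction_of_speedup_recovered(t_prev: float, t_next: float,
                                  t_agent: float) -> float:
    """FSR for one task: 1.0 means the agent matched the record's full
    speedup; 0.0 means no improvement over the previous record."""
    return (t_prev - t_agent) / (t_prev - t_next)

# Hypothetical: previous record 30 min, next record 20 min, and the
# agent's reproduction reaches 25 min, so half the speedup is recovered.
print(fraction_of_speedup_recovered(30.0, 20.0, 25.0))  # 0.5
```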

The tasks run in minutes, making the entire benchmark practically accessible for iterative agent and human evaluation.

4. Technical Principles Behind Major Speedups

Key technical ideas that have systematically advanced NanoGPT Speedrun records include:

  • Optimization and Parameterization Advances: Use of operator norm-based optimizers (Muon, Scion, Gluon), LMO-based optimizers with proven scaling-law transfer (2502.07529, 2505.13416), checkpoint averaging (LAWA) for early generalization (2306.03241), and CompleteP for guaranteed non-lazy feature learning and hyperparameter robustness across depth/width (2505.01618); the Newton-Schulz orthogonalization at the core of Muon is sketched after this list.
  • Efficiency in Attention: FlexAttention and other efficient attention mechanisms provide tractable quadratic/linear attention for very long contexts, permitting both effective pretraining and inference at large sequence lengths without memory bottlenecks.
  • Gradient and Activation Scaling: Schemes ensuring robust signal propagation across depth and width, such as maximal update parameterization and its sparse/generalized variants (SμPar (2405.15743)), maintain stable gradients, accelerate convergence, and allow hyperparameter transfer.
  • Mixed and Low-Precision Training: Training scripts shift activations and weights to bfloat16 or FP8 wherever possible without loss of accuracy, taking advantage of hardware-supported sparse and quantized operations; a minimal autocast pattern is sketched after this list.
  • Plug-and-Play Regularization: Regularizers rooted in theoretical frameworks, notably optimal control/optimal transport-inspired kinetic energy terms (2505.13499), yield substantial reductions in overfitting and parameter count, with proof-based stability and generalization guarantees.
  • Data and Pipeline Optimization: Adoption of efficient data loaders, memory pinning, and parallelized preprocessing/prefetching further compresses training pipelines.
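
As referenced in the first bullet, Muon's core operation approximately orthogonalizes each 2-D gradient-momentum matrix with a quintic Newton-Schulz iteration. A minimal sketch; the coefficients follow the publicly released Muon implementation and should be treated as an assumption here:

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5,
                                eps: float = 1e-7) -> torch.Tensor:
    """Approximate the nearest semi-orthogonal matrix to G, the update
    Muon applies to each momentum matrix before the parameter step."""
    a, b, c = 3.4445, -4.7750, 2.0315  # quintic coefficients (public Muon code)
    X = G.float() / (G.float().norm() + eps)  # bound the spectral norm by 1
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T  # iterate on the wide orientation for efficiency
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)
```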
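
Similarly, the mixed-precision bullet corresponds to a standard PyTorch pattern (a generic illustration rather than the record script; note that bfloat16, unlike float16, needs no gradient scaler):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()  # requires a CUDA device
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
x = torch.randn(8, 1024, device="cuda")

# Forward/backward run in bfloat16; parameters and optimizer state stay float32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).square().mean()
loss.backward()
opt.step()
opt.zero_grad(set_to_none=True)
```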

5. Empirical Results and Impact

Aggregate impact across records includes:

  • Dramatic reduction in wall-clock time to the target loss, from hundreds of minutes to tens of minutes on the same compute,
  • FlexAttention alone enabled training with an order-of-magnitude longer context window,
  • Operator-norm and LMO-based optimizers (Muon, Scion, Gluon) show superior speed and generalization compared to AdamW, with robust hyperparameter transferability,
  • Plug-and-play OT regularization reduced final test loss by 46% and parameter count by 42% for character-level nanoGPT (2505.13499),
  • SμPar parameterization allowed up to 11.9% relative loss improvement at 99.2% sparsity (2405.15743),
  • Modular dualization (via Newton-Schulz iteration) enabled record-setting speedups (2410.21265).

Each innovation was validated on real runs and, as reported in the benchmark, the cumulative effect is consistently and reproducibly measurable.

6. Reproducibility, Automated Agent Evaluation, and Scientific Automation

The NanoGPT Speedrun, as instantiated by the Automated LLM Speedrunning Benchmark, also functions as a litmus test for LLM-based agents’ scientific reproducibility (2506.22419):

  • When tasked with reconstructing the sequence of improvements, SOTA LLM agents recover less than half of the human-achieved speedup, even when provided with rich hints (pseudocode, descriptive or mini-paper summaries).
  • Failure rate increases with the complexity and abstraction of improvements, highlighting gaps in code reasoning, implementation, and debugging by LLM agents.
  • The benchmark's structure and metric (FSR) offer a direct, functional “progress bar” for agentic research: future agent innovations that surpass human records will be objectively attributable and quantifiable.

7. Broader Implications and Future Directions

The NanoGPT Speedrun has catalyzed not only rapid advances in practical LLM training efficiency, but also the emergence of a principled, reproducible curriculum for method development and automated science agents. Its design, with open code, rigidly defined metrics, and granular version control, enables:

  • Transparent research progress with objectively measurable milestones,
  • A testbed for future agentic and human-in-the-loop systems to close the reproducibility gap,
  • A blueprint applicable to faster, greener, and more interpretable large-scale model training.

The emphasis on precise code implementation and measurable speed improvements positions the NanoGPT Speedrun as both an engine and reference point for practical LLM methodology, efficient deployment, and the emerging field of AI-automated science.