
NanoGPT Speedrun

Updated 1 July 2025
  • The NanoGPT Speedrun is a research challenge and benchmark aiming to minimize wall-clock time to train a GPT-2 model to a target validation loss on real data, serving as a crucible for LLM methods and systems innovation.
  • Innovations range from advanced optimizers (Muon, Scion) and efficient attention mechanisms (FlexAttention) to mixed-precision training and principled regularization techniques, cumulatively reducing training time from hundreds of minutes to tens of minutes.
  • Formalized as an Automated LLM Speedrunning Benchmark, it evaluates both human and AI agents on their ability to reproduce code-level speedups, revealing gaps in current agent capabilities while providing a metric for progress in automated scientific discovery.

The NanoGPT Speedrun refers both to a research and engineering effort and to a corresponding community benchmarking suite, each aiming to minimize the wall-clock time required to train a GPT-2 model (124M parameters) to a fixed target validation loss on real data. The paradigm encompasses advances in optimization, architecture, regularization, efficient data handling, and hardware utilization, and has become a crucible for LLM methods research, code-level systems innovation, and automated agent benchmarking.

1. Origins and Definition

The NanoGPT Speedrun emerged as an open challenge and code-centric competition focused on rapidly training a GPT-2-sized autoregressive model, using the widely adopted NanoGPT implementation, to a pre-specified validation loss on a standardized dataset (FineWeb). The defining metric is the total wall-clock training time on fixed hardware—typically a single 8×H100 (NVIDIA Hopper) cluster node. The "speedrun" framing encourages iterative, reproducible, and incremental improvements akin to those seen in competitive programming or computer systems benchmarking.
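
The defining measurement reduces to a timing loop around training. Below is a minimal sketch of the protocol, not the actual record scripts (which are full distributed PyTorch programs); the `train_step` and `val_loss` callables are placeholders supplied by the caller, and 3.28 is the commonly cited FineWeb target loss:

```python
import time
from typing import Callable

TARGET_VAL_LOSS = 3.28  # commonly cited GPT-2 speedrun target on FineWeb

def speedrun(train_step: Callable[[], None],
             val_loss: Callable[[], float],
             max_steps: int = 10_000,
             eval_every: int = 100) -> float:
    """Train until validation loss reaches the target; the record is the
    elapsed wall-clock time, not step count or FLOPs."""
    start = time.time()
    for step in range(max_steps):
        train_step()  # placeholder: one optimizer update on real data
        if step % eval_every == 0 and val_loss() <= TARGET_VAL_LOSS:
            return time.time() - start  # wall-clock time is the only metric
    raise RuntimeError("target validation loss not reached")
```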

The concept has evolved to cover:

  • Script-level innovations in model training that reduce time-to-target,
  • Unified, community-documented "records" where each new solution builds upon the previous,
  • A formal benchmarking suite (the “Automated LLM Speedrunning Benchmark” (2506.22419)) designed to test not only human but also agentic (LLM-based) ability to implement and reproduce these improvements.

2. Scope of Innovations

The NanoGPT Speedrun records a sequence of innovations, each representing a measurable reduction in training time. As detailed in the Automated LLM Speedrunning Benchmark (2506.22419), these innovations span the following broad categories:

| Task Index | Key Innovation | Type |
|---|---|---|
| 1 → 2 | Rotary embeddings, tuned learning rate | Embeddings/Optimizer |
| 2 → 3 | Muon optimizer | Optimizer |
| 4 → 5 | QK-normalization, zero-init projections, ReLU² | Architecture |
| 5 → 6 | Distributed Muon overhead | Parallelization |
| 8 → 9 | Value/embedding skip connections, momentum warmup, logit softcap | Architecture/Optimizer |
| 9 → 10 | bfloat16 activations | Data type |
| 11 → 12 | FlexAttention (efficient 64K-context attention) | Attention mechanism |
| … | … | … |
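
As a concrete example of a record transition, the 1 → 2 step above swapped learned positional embeddings for rotary embeddings (RoPE). A minimal PyTorch sketch (the `theta=10000.0` base is the conventional default, assumed rather than taken from the record script):

```python
import torch

def apply_rotary_emb(x: torch.Tensor, theta: float = 10000.0) -> torch.Tensor:
    """Rotate consecutive channel pairs of queries/keys by position-dependent
    angles. x has shape (batch, heads, seq_len, head_dim)."""
    B, H, T, D = x.shape
    freqs = theta ** (-torch.arange(0, D, 2, dtype=torch.float32) / D)       # (D/2,)
    angles = torch.arange(T, dtype=torch.float32)[:, None] * freqs[None, :]  # (T, D/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

q = torch.randn(1, 12, 64, 64)  # (batch, heads, seq, head_dim)
q_rot = apply_rotary_emb(q)     # applied to queries and keys before attention
```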

Innovations further include:

  • Hardware-data type optimizations (FP8, bfloat16, memory pinning),
  • Enhanced attention mechanisms (windowing, FlexAttention),
  • Architectural changes (skip connections, untying embedding/output layers, U-net style connections),
  • Optimizer improvements (e.g., Muon, Scion, Gluon, with operator norm-aware learning rates (2502.07529, 2505.13416)),
  • Scheduler and hyperparameter refinements (trapezoidal LR schedule, logit softcap); a trapezoidal schedule is sketched below.

Each step represents a reproducible change to the training script or underlying model, validated by a measured speedup.
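
As one illustration, the trapezoidal LR schedule from the list above is a plain function of the step index; the warmup, plateau, and cooldown lengths below are illustrative assumptions, not the record's tuned values:

```python
def trapezoidal_lr(step: int, max_lr: float = 1e-3, warmup: int = 250,
                   total: int = 5_000, cooldown: int = 1_000) -> float:
    """Linear warmup, flat plateau, then linear cooldown to zero."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    if step < total - cooldown:
        return max_lr
    return max_lr * max(0.0, (total - step) / cooldown)
```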

3. Benchmark Structure and Adoption

The Automated LLM Speedrunning Benchmark (2506.22419) formalizes this sequence as a 19-stage code-reproduction suite for LLM and agent evaluation:

  • Each stage (“task”) provides the previous record’s code and, optionally, a “hint” (pseudocode/text/mini-paper) explaining the next improvement,
  • Agents (human or LLM) must produce a script that, when run, recovers the wall-clock speedup associated with the next record,
  • Success is measured objectively via "Fraction of Speedup Recovered" (FSR):

$$\text{FSR}_i = \frac{t_i - t'_{i+1}}{t_i - t_{i+1}}$$

where $t_i$ is the previous record's wall-clock time, $t_{i+1}$ is the next record's (target) time, and $t'_{i+1}$ is the time achieved by the agent's solution.
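
The metric translates directly into code; the numbers below are hypothetical:

```python
def fraction_of_speedup_recovered(t_prev: float, t_next: float,
                                  t_agent: float) -> float:
    """FSR for one task: 1.0 means the agent matched the record's full
    speedup; 0.0 means no improvement over the previous record."""
    return (t_prev - t_agent) / (t_prev - t_next)

# Hypothetical: previous record 30 min, next record 20 min, and the
# agent's reproduction reaches 25 min, so half the speedup is recovered.
print(fraction_of_speedup_recovered(30.0, 20.0, 25.0))  # 0.5
```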

The tasks run in minutes, making the entire benchmark practically accessible for iterative agent and human evaluation.

4. Technical Principles Behind Major Speedups

Key technical ideas that have systematically advanced NanoGPT Speedrun records include:

  • Optimization and Parameterization Advances: Use of operator norm-based optimizers (Muon, Scion, Gluon), LMO-based optimizers with proven scaling-law transfer (2502.07529, 2505.13416), checkpoint averaging (LAWA) for early generalization (2306.03241), and CompleteP for guaranteed non-lazy feature learning and hyperparameter robustness across depth/width (2505.01618); the Newton-Schulz orthogonalization at the core of Muon is sketched after this list.
  • Efficiency in Attention: FlexAttention and other efficient attention mechanisms provide tractable quadratic/linear attention for very long contexts, permitting both effective pretraining and inference at large sequence lengths without memory bottlenecks.
  • Gradient and Activation Scaling: Schemes ensuring robust signal propagation across depth and width, such as maximal update parameterization and its sparse/generalized variants (SμPar (2405.15743)), maintain stable gradients, accelerate convergence, and allow hyperparameter transfer.
  • Mixed and Low-Precision Training: Training scripts shift activations and weights to bfloat16 or FP8 wherever possible without loss of accuracy, taking advantage of hardware-supported sparse and quantized operations; a minimal autocast pattern is sketched after this list.
  • Plug-and-Play Regularization: Regularizers rooted in theoretical frameworks, notably optimal control/optimal transport-inspired kinetic energy terms (2505.13499), yield substantial reductions in overfitting and parameter count, with proof-based stability and generalization guarantees.
  • Data and Pipeline Optimization: Adoption of efficient data loaders, memory pinning, and parallelized preprocessing/prefetching further compresses training pipelines.
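
As referenced in the first bullet, Muon's core operation approximately orthogonalizes each 2-D gradient-momentum matrix with a quintic Newton-Schulz iteration. A minimal sketch; the coefficients follow the publicly released Muon implementation and should be treated as an assumption here:

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5,
                                eps: float = 1e-7) -> torch.Tensor:
    """Approximate the nearest semi-orthogonal matrix to G, the update
    Muon applies to each momentum matrix before the parameter step."""
    a, b, c = 3.4445, -4.7750, 2.0315  # quintic coefficients (public Muon code)
    X = G.float() / (G.float().norm() + eps)  # bound the spectral norm by 1
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T  # iterate on the wide orientation for efficiency
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    if transposed:
        X = X.T
    return X.to(G.dtype)
```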
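
Similarly, the mixed-precision bullet corresponds to a standard PyTorch pattern (a generic illustration rather than the record script; note that bfloat16, unlike float16, needs no gradient scaler):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()  # requires a CUDA device
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
x = torch.randn(8, 1024, device="cuda")

# Forward/backward run in bfloat16; parameters and optimizer state stay float32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    loss = model(x).square().mean()
loss.backward()
opt.step()
opt.zero_grad(set_to_none=True)
```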

5. Empirical Results and Impact

Aggregate impact across records includes:

  • Dramatic reduction in wall-clock time to the target loss, from hundreds of minutes to tens of minutes on the same compute,
  • FlexAttention alone enabled training with an order-of-magnitude longer context window,
  • Operator-norm and LMO-based optimizers (Muon, Scion, Gluon) show superior speed and generalization compared to AdamW, with robust hyperparameter transferability,
  • Plug-and-play OT regularization reduced final test loss by 46% and parameter count by 42% for character-level nanoGPT (2505.13499),
  • SμPar parameterization allowed up to 11.9% relative loss improvement at 99.2% sparsity (2405.15743),
  • Modular dualization (via Newton-Schulz iteration) enabled record-setting speedups (2410.21265).

Each innovation was validated on real runs and, as reported in the benchmark, the cumulative effect is consistently and reproducibly measurable.

6. Reproducibility, Automated Agent Evaluation, and Scientific Automation

The NanoGPT Speedrun, as instantiated by the Automated LLM Speedrunning Benchmark, also functions as a litmus test for LLM-based agents’ scientific reproducibility (2506.22419):

  • When tasked with reconstructing the sequence of improvements, SOTA LLM agents recover less than half of the human-achieved speedup, even when provided with rich hints (pseudocode, descriptive or mini-paper summaries).
  • Failure rate increases with the complexity and abstraction of improvements, highlighting gaps in code reasoning, implementation, and debugging by LLM agents.
  • The benchmark's structure and metric (FSR) offer a direct, functional “progress bar” for agentic research: future agent innovations that surpass human records will be objectively attributable and quantifiable.

7. Broader Implications and Future Directions

The NanoGPT Speedrun has catalyzed not only rapid advances in practical LLM training efficiency, but also the emergence of a principled, reproducible curriculum for method development and automated science agents. Its design, with open code, rigidly defined metrics, and granular version control, enables:

  • Transparent research progress with objectively measurable milestones,
  • A testbed for future agentic and human-in-the-loop systems to close the reproducibility gap,
  • A blueprint applicable to faster, greener, and more interpretable large-scale model training.

The emphasis on precise code implementation and measurable speed improvements positions the NanoGPT Speedrun as both an engine and reference point for practical LLM methodology, efficient deployment, and the emerging field of AI-automated science.