On the Origin of Algorithmic Progress in AI (2511.21622v1)
Abstract: Algorithms have been estimated to increase AI training FLOP efficiency by a factor of 22,000 between 2012 and 2023 [Ho et al., 2024]. Running small-scale ablation experiments on key innovations from this time period, we are able to account for less than 10x of these gains. Surveying the broader literature, we estimate that additional innovations not included in our ablations account for less than 10x, yielding a total under 100x. This leads us to conduct scaling experiments, which reveal that much of this efficiency gap can be explained by algorithms with scale-dependent efficiency improvements. In particular, we conduct scaling experiments between LSTMs and Transformers, finding exponent differences in their compute-optimal scaling law while finding little scaling difference for many other innovations. These experiments demonstrate that - contrary to standard assumptions - an algorithm's efficiency gains are tied to compute scale. Using experimental extrapolation and literature estimates, we account for 6,930x efficiency gains over the same time period, with the scale-dependent LSTM-to-Transformer transition accounting for the majority of gains. Our results indicate that algorithmic progress for small models has been far slower than previously assumed, and that measures of algorithmic efficiency are strongly reference-dependent.
Explain it Like I'm 14
What is this paper about?
This paper tries to answer a big question in AI: where do the huge efficiency gains in training AI models really come from? Many people say that smarter algorithms have made training far more efficient over the last decade. The authors check this claim and find that the giant efficiency jump comes mainly from a few big changes that only show their full power at large scales, not from lots of small tweaks.
What questions did the researchers ask?
They focused on simple, practical questions:
- Which specific algorithm changes made training AI models more efficient?
- How big are those efficiency gains?
- Do these gains stay the same for small and large models, or do they grow with more compute (more raw computing power)?
- Can we explain the large reported efficiency gains with actual experiments?
- Does it matter what “baseline” algorithm you compare against when measuring progress?
How did they study it?
They used three approaches. Here are the technical ideas explained in everyday terms:
Small “ablation” experiments (swapping parts)
Think of a neural network like a car with many parts. They swapped one part at a time—like the activation function, the optimizer, or how positions are encoded—to see how much each part helps. This is called an “ablation” experiment. They measured how much less compute (computer work) you need to get the same performance after each change.
- Compute is measured in FLOPs (how many tiny math operations the computer does).
- Performance was measured using loss/perplexity (how well the model predicts text).
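To make the "same performance for less compute" comparison concrete, here is a minimal Python sketch of the matching step (not the paper's actual code). It assumes we already have fitted loss-versus-compute curves for a baseline and a modified model, and it asks how much less compute the modified model needs to reach the loss the baseline hits at a reference budget. The curve shapes and coefficients are invented; only the 1.9 irreducible-loss term mirrors what the paper reports subtracting.

```python
import math

# Hypothetical fitted curves: loss(C) = irreducible + A * C**(-b).
# Coefficients are invented for illustration; 1.9 is the irreducible
# loss value the paper reports subtracting when it fits scaling curves.
def loss_baseline(C):
    return 1.9 + 8.0 * C ** -0.05     # model before the tweak

def loss_modified(C):
    return 1.9 + 7.6 * C ** -0.05     # model after the tweak (slightly better)

def compute_to_reach(loss_fn, target_loss, lo=1e12, hi=1e24):
    """Bisect in log-space for the compute at which loss_fn reaches target_loss."""
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        if loss_fn(mid) > target_loss:
            lo = mid                   # not enough compute yet
        else:
            hi = mid                   # already at or below the target loss
    return hi

C_ref = 1e18                           # reference training budget in FLOP
target = loss_baseline(C_ref)          # loss the baseline reaches at that budget
C_needed = compute_to_reach(loss_modified, target)
print(f"compute-equivalent gain at 1e18 FLOP: ~{C_ref / C_needed:.1f}x")
```

With these made-up numbers the answer comes out to roughly 2.8×; the point is only to show the mechanics of the comparison, not any real measurement.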
Scaling experiments (testing small vs big)
They compared two types of model “engines”:
- LSTMs: an older type of neural network.
- Transformers: the newer type used in modern AI like GPT.
They trained both at different sizes and compute budgets to see how performance improves as you throw more compute at them. This reveals “scaling laws”—patterns that show how performance improves with more data, parameters, and compute.
- A key idea: some algorithms get relatively better as you scale up. That means their efficiency gain isn’t constant; it grows with model size and compute.
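The sketch below shows, on synthetic data, how such scaling comparisons can be read off: fit a power law to each architecture's compute-loss frontier (after subtracting the 1.9 irreducible loss the paper uses), then compare exponents. All numbers here are invented; the only thing borrowed from the paper is the qualitative pattern that the Transformer exponent is steeper than the LSTM exponent.

```python
import numpy as np

# Synthetic (compute, loss) frontier points for two architectures.
# Exponents and prefactors are invented for illustration only.
rng = np.random.default_rng(0)
C = np.logspace(16, 20, 12)                 # training budgets in FLOP
loss_lstm = 1.9 + 25.0 * C ** -0.04 * np.exp(rng.normal(0, 0.01, C.size))
loss_tfmr = 1.9 + 50.0 * C ** -0.06 * np.exp(rng.normal(0, 0.01, C.size))

def fit_power_law(C, loss, irreducible=1.9):
    """Fit loss - irreducible = A * C**(-b) by linear regression in log-log space."""
    slope, intercept = np.polyfit(np.log(C), np.log(loss - irreducible), 1)
    return np.exp(intercept), -slope        # returns (A, b)

A_l, b_l = fit_power_law(C, loss_lstm)
A_t, b_t = fit_power_law(C, loss_tfmr)
print(f"fitted exponents: LSTM ~{b_l:.3f}, Transformer ~{b_t:.3f}")

# Different exponents make the compute-equivalent gain scale-dependent:
# find the compute at which the Transformer matches the LSTM's loss.
for C_ref in (1e17, 1e19, 1e21):
    target = A_l * C_ref ** -b_l            # LSTM's reducible loss at C_ref
    C_match = (target / A_t) ** (-1.0 / b_t)
    print(f"at {C_ref:.0e} FLOP the gain is ~{C_ref / C_match:.0f}x")
```

Because the exponents differ, the implied multiplier keeps growing as the budget grows; a scale-independent innovation would instead give the same multiplier at every budget.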
Theory about data vs parameters (Kaplan vs Chinchilla)
This part is about how to split your compute between model size (parameters) and training data:
- Kaplan’s earlier advice leaned toward scaling parameters faster than data.
- Chinchilla’s later advice says you should scale data and parameters together (roughly equally) for best results.
They show that switching from Kaplan-style to Chinchilla-style training gives extra efficiency—especially at larger scales—even if the basic model stays the same.
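As a tiny worked example of the Chinchilla-style split (a sketch, not the paper's code): using the common rule of thumb that training compute is roughly 6 × parameters × tokens, together with the roughly 20-tokens-per-parameter ratio the paper's experiments use, a fixed budget splits as follows. The 1e21 FLOP budget is arbitrary.

```python
import math

def chinchilla_split(compute_flop, tokens_per_param=20, flop_per_param_token=6):
    """Split a training budget into parameters and tokens, assuming the common
    approximation C ~= 6 * N * D and a fixed tokens-per-parameter ratio."""
    n_params = math.sqrt(compute_flop / (flop_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

N, D = chinchilla_split(1e21)   # the 1e21 FLOP budget is arbitrary
print(f"~{N / 1e9:.1f}B parameters trained on ~{D / 1e9:.0f}B tokens")
```

Doubling the budget raises both the parameter count and the token count by about √2 under this rule; Kaplan-style advice would instead tilt the extra budget toward parameters.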
What did they find and why it matters?
Here are the main takeaways in plain language:
- Most small tweaks (like changing the activation function, normalization, learning rate schedule, etc.) give only modest efficiency gains—often less than 2×—and these gains don’t grow much with scale.
- Two changes dominate the overall progress:
- Switching from LSTMs to Transformers gives much bigger gains at larger scales. The Transformer “engine” outscales the LSTM as you increase compute.
- Switching from Kaplan-style to Chinchilla-style scaling (balancing data and parameters) adds meaningful efficiency, especially as models get bigger.
- When they add everything up, they can explain about 6,930× efficiency gains from 2012 to 2023, and most of this comes from the two scale-dependent changes above. That’s smaller than another paper’s 22,000× claim, but still huge.
- Measuring “algorithmic efficiency” depends on your reference point:
- If you measure progress relative to LSTMs, it looks explosive.
- If you measure relative to dense Transformers (and you just add Mixture-of-Experts, for example), the gains may look flat.
- So “how fast are we improving?” depends a lot on what you’re comparing against.
Why this matters: It changes how we think about progress. It’s not that lots of little tweaks added up to massive gains; it’s mainly that a few big shifts (architecture and scaling strategy) pay off more and more as compute increases. That suggests you need large compute to fully benefit from “better algorithms.”
What does this mean for the future?
- Compute drives algorithmic gains: Big improvements often need big compute to show up. So continuing to invest in larger training runs is key to unlocking future gains.
- Small models benefit less: If you only train small models, many of the famous improvements won’t help as much. This favors groups who can afford large-scale training.
- Be careful when measuring “progress”: The number you report depends on the baseline algorithm you choose. Two groups can honestly report very different progress depending on what they compare to.
- Don’t just multiply gains: Combining lots of small improvements doesn’t simply multiply out (2× times 3× times 1.5× rarely delivers a real 9×). Parts interact, and the total is often less than the naive product suggests.
- Planning and policy: If progress depends heavily on compute scale, that affects how researchers, companies, and funders plan projects, share resources, and set expectations. It also suggests that making compute more available could accelerate useful algorithmic discoveries.
In short: The biggest AI training efficiency gains mainly come from a few scale-dependent changes—like moving from LSTMs to Transformers and adopting Chinchilla-style scaling—and these gains grow with compute. Many smaller tweaks help a bit, but they don’t explain the huge leaps by themselves.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a single, consolidated list of specific gaps and open questions that remain unresolved and could guide future research.
- Quantify the scale-dependent training efficiency and scaling exponents of Mixture-of-Experts (MoE) models using controlled, iso-performance/isoFLOP comparisons across 10^18–10^24 FLOPs; standardize routing, load-balancing, and expert activation to separate training vs inference effects.
- Establish tokenizer-independent evaluation protocols (e.g., bits-per-byte, byte-level language modeling) to rigorously measure tokenization’s contribution to CEG functions and scaling exponents; report results across scales and sequence lengths.
- Systematically measure data selection, deduplication, filtering, and curriculum effects on neural scaling exponents and CEG functions; run controlled ablations across common corpora (e.g., The Pile, RefinedWeb, DCLM) with standardized tokenizers and metrics.
- Test whether modernized LSTMs (e.g., xLSTM, Mogrifier LSTM, RHN, GRU variants) with contemporary training tricks (normalization, better initialization, longer sequences) alter the LSTM–Transformer exponent gap under Chinchilla-optimal scaling.
- Identify the causal drivers of the LSTM→Transformer exponent difference by isolating architectural factors (attention depth/width, head dimension, recurrence length) and training factors (optimizer, initialization, batch size) via factorial experiments.
- Validate the claimed scale invariance of “small” innovations (RMSNorm, SwiGLU, rotary embeddings, LR schedules) at much larger model sizes and context lengths (e.g., ≥128k tokens), explicitly testing sequence-length-dependent gains.
- Develop a formal framework to quantify sub-multiplicative, non-orthogonal interactions among algorithms; design factorial/combination ablation studies to estimate joint effects rather than multiplying independent CEG multipliers.
- Create a compute-budgeted, architecture-agnostic hyperparameter optimization (HPO) protocol to ensure fair, stable training across scales; quantify how HPO choices change CEG estimates and scaling exponents.
- Test the sensitivity of scaling exponent and CEG estimates to the assumed irreducible loss (set to 1.9 in this work); re-estimate using multiple tokenizers/datasets and report uncertainty bands.
- Extend the scaling and CEG analyses beyond language modeling to vision, speech, RL, and multimodal settings to assess whether scale-dependent gains similarly dominate in other domains.
- Define and measure a multidimensional CEG that includes wall-clock time, energy, memory footprint, and inference cost; quantify trade-offs for algorithms like MoE, FlashAttention, and reasoning-optimized models.
- Formalize reference-dependence in CEG reporting: propose standard reference algorithms/baselines, guidelines for selection, and methods to compare CEG across changing baselines while correcting for compute scaling confounders.
- Reconcile the 22,000× vs 6,930× discrepancy by measuring missing contributors under iso-performance conditions: tokenization advances, data curation, regularization, distributed training (ZeRO, pipeline), compiler/kernel-level optimizations, and memory-saving techniques.
- Investigate whether SGD (or quasi-SGD with small batches and high momentum) can be made stable and compute-optimal at large scales; characterize the critical batch size and optimizer-dependent scaling behavior across architectures.
- Empirically validate the analytic Kaplan→Chinchilla CEG function across scales (10^16–10^24 FLOPs), including the role of embedding parameters, warmup length, and optimizer-specific effects; report where convergence→divergence transitions occur.
- Benchmark architectural alternatives (SSMs/Mamba, KANs, recurrence-augmented Transformers) under uniform experimental setups and Chinchilla-optimal scaling to determine exponent differences and potential surpassing of Transformers.
- Study context-window scaling explicitly: isolate how positional encodings and attention variants affect scaling exponents as sequence length grows, controlling for data size and tokenizer.
- Standardize FLOP accounting, dataset/tokenizer reporting, and evaluation thresholds; release open benchmark suites and code/configurations enabling reproducible estimation of CEG functions and exponent fits.
- Define domains of validity for CEG functions and methods to handle cases where performance plateaus render CEG undefined (i.e., “infinite” compute requirements); provide decision rules for extrapolation vs truncation.
- Distinguish algorithmic vs hardware/implementation gains (kernel fusion, sparsity kernels, memory optimizations) in “effective compute” measurements; either incorporate or control for these factors in CEG to avoid conflation.
- Causally disentangle whether compute growth drives measured algorithmic progress or vice versa: build time-series models correcting for observational bias (correlation between compute and efficiency) and instrument for exogenous shocks.
- Identify algorithms that deliver scale-agnostic gains beneficial to low-compute practitioners (e.g., curriculum strategies, data augmentation, parameter sharing); measure their CEG at small scales.
- Integrate reasoning models into the CEG framework by defining metrics that account for chain-of-thought inference compute; quantify training vs inference efficiency and their scaling behavior relative to non-reasoning baselines.
- Assess robustness of conclusions to token-to-parameter ratios beyond the fixed value of 20; map how this ratio affects compute-optimality, scaling exponents, and CEG across architectures.
- Characterize optimizer batch-size scaling laws and critical batch sizes across architectures; determine whether optimizer choice subtly affects exponents at the frontier despite appearing scale-invariant at small scales.
- Analyze embedding parameter treatment at small scales and its impact on exponent fits and CEG estimation; standardize embedding accounting practices across studies.
Glossary
- Ablation experiments: Controlled removal or substitution of components to measure their individual effect on performance or efficiency. "Running small-scale ablation experiments on key innovations from this time period, we are able to account for less than 10x of these gains."
- Adam: A first-order optimizer with adaptive moment estimation commonly used for training neural networks. "Prior work has shown that SGD performs notably worse on transformers in comparison to Adam."
- AdamW: A variant of Adam that decouples weight decay from the gradient update for improved regularization. "We compare Adam to AdamW but find a negligible difference for our model."
- Chinchilla re-balancing: The shift to equal scaling of parameters and data (Chinchilla-optimal) rather than parameter-heavy scaling, improving efficiency. "We find two strongly scale-dependent algorithmic innovations: LSTMs to Transformers, and Kaplan to Chinchilla re-balancing."
- Chinchilla scaling: An approach that scales data and parameters proportionally to compute for optimal training efficiency. "Accounted for all parameters and found that data and parameters should be scaled equally while scaling compute (so-called 'Chinchilla-scaling')."
- Compute equivalent gain (CEG) function: A function f(C) defining how much less compute a new algorithm needs across scales to match a baseline’s performance. "We define the CEG function and CEG multiplier distinctly:"
- Compute equivalent gain (CEG) multiplier: The value f(C) for a fixed compute C, indicating the relative efficiency gain of one algorithm over another at a set performance level. "Then we call the CEG multiplier of relative to ."
- Compute equivalent gains (CEG) framework: A methodology for quantifying algorithmic efficiency improvements as compute-equivalent gains. "We introduce a generalized notion of the standard compute equivalent gains (CEG) framework."
- Compute frontier: The leading edge of models’ training compute over time, used as a reference to compare progress. "We use the exponential relationship between time and compute among notable models (Epoch AI, 2023) and refer to this as the 'compute frontier'."
- Critical batch size: The largest batch size beyond which training no longer improves throughput or model quality efficiently. "Adam is compatible with much higher critical batch sizes than SGD."
- Flash Attention: An attention implementation optimized to reduce memory access and wall-clock time without changing FLOPs. "Other innovations like Flash Attention reduce wall-clock time and energy consumption without affecting total FLOP operations."
- FLOPs: Floating point operations; a measure of computational cost, often used to quantify training efficiency. "Supposedly allowing the same performance level with drastically fewer FLOPs."
- Hessian heterogeneity: Variability in curvature across parameters that affects optimizer performance and stability. "This is theorized to be due to Hessian heterogeneity (Zhang et al., 2024)."
- Irreducible loss: The minimum achievable loss due to intrinsic data/model constraints, subtracted when fitting scaling laws. "We subtract the irreducible loss component (1.9)."
- Isoflop curves: Performance comparisons holding compute (FLOPs) constant across different model configurations or architectures. "He et al. (2024) looks at isoflop curves for dense and MoE models."
- Kaplan scaling laws: Early scaling guidelines that advocated less aggressive data scaling relative to parameters, later found suboptimal. "Kaplan misestimated scaling law exponents due to small sizes, not accounting for embedding parameters, and using a constant warmup size."
- Kaplan Transformers: Transformer models adjusted to follow Kaplan-style scaling rather than Chinchilla-optimal scaling. "We also define 'Kaplan' Transformers to be our Retro Transformer adjusted analytically to approximate Kaplan scaling."
- Kolmogorov-Arnold Networks (KANs): A proposed architecture class claimed to have steeper scaling behavior than Transformers. "KANs (Kolmogorov-Arnold Networks) that are purported to have a larger/steeper scaling exponent."
- Layer normalization (LayerNorm): A normalization method applied across features within a layer to stabilize training. "Pre-layernorm (RMSNorm) applies layernorm before multi-head attention in the residual stream."
- Long Short-Term Memory (LSTM): A recurrent neural network architecture designed to capture long-range dependencies via gating mechanisms. "We implement two main model architectures to serve as baselines in our experiments: LSTMs and transformers."
- Mixture of Experts (MoE): Architectures that route inputs to specialized expert subnetworks to improve efficiency and capacity. "MoE is generally considered a substantial inference improvement, but MoE architectures do significantly improve training performance as well."
- Negative log-likelihood (NLL): A loss function commonly used for probabilistic models and language modeling.
- Neural scaling laws: Empirical relationships describing how performance improves with increased compute, data, or parameters. "The theory literature has focused on the intrinsic properties of training data as the most important factor in determining the exponent in neural scaling laws."
- Pareto-optimal (frontier) points: Points along the best achievable trade-off curve (e.g., loss vs. compute) where no improvement is possible without worsening another metric. "Fit a power law to all Pareto-optimal (frontier) points from FLOP onward."
- Perplexity: A metric for LLM performance measuring how well the model predicts a sample; lower is better. "Perplexity is inherently tokenizer-dependent."
- Positional encodings: Methods for injecting sequence position information into models, such as sinusoidal or rotary encodings. "We test three different types of positional encoding: rotary, sinusoidal, and learned positional encodings."
- Power law: A functional relationship where one quantity varies as a power of another, often used to fit scaling curves. "Fit a power law to all Pareto-optimal (frontier) points from FLOP onward."
- Pre-layer normalization (Pre-LN): Applying normalization before the attention or feedforward sublayers for stability. "Pre-layernorm (RMSNorm) applies layernorm before multi-head attention in the residual stream."
- Pre-RMSNorm: Applying RMSNorm before sublayers, analogous to pre-LayerNorm but using RMS normalization. "We observe substantial improvements transitioning from post-RMSNorm to pre-RMSNorm."
- RMSNorm: A normalization technique that scales by root mean square without mean-centering, offering efficiency benefits. "RMSNorm is an update to layernorm that removes the mean centering step and can have runtime improvements."
- Rotary encodings: Position encoding method (RoPE) enabling flexible and long context handling via rotational embedding operations. "We estimate that rotary encoding constitutes an improvement in training efficiency over sinusoidal encoding."
- Scaling experiments: Empirical studies that vary compute, data, and model size to assess performance trends and exponents. "We conduct scaling experiments to measure differences in optimal scaling across architectures."
- Scaling exponent: The exponent in a scaling law that quantifies the rate at which performance improves with increased resources. "Transformers have a steeper scaling exponent than LSTMs."
- SGD: Stochastic Gradient Descent, a first-order optimizer often used with momentum; can underperform Adam in Transformers. "Prior work has shown that SGD performs notably worse on transformers in comparison to Adam."
- SwiGLU: An activation function variant that often improves training efficiency over GeLU in Transformers. "SwiGLU, has a measurable efficiency gain over GeLU even taking into account its increased parameter requirements."
- Token-to-parameter ratio: The ratio of training tokens to model parameters, relevant for compute-optimal training. "All of our ablation experiments use a token-to-parameter ratio of 20."
- Transformer: A neural architecture leveraging attention mechanisms, now dominant for language modeling. "We implement two main model architectures to serve as baselines in our experiments: LSTMs and transformers."
- Variational dropout: A dropout technique that applies consistent masks across time or dimensions for RNNs. "We include a variational dropout with separate dropout rates for embedding matrices vs hidden matrices."
- Width-to-depth ratio: The relative scaling of layer width (hidden size) versus depth (number of layers) in model architecture. "We use a vanilla transformer with rotary-based encodings and keep a constant width-to-depth aspect ratio."
- Xavier initialization: A weight initialization scheme designed to maintain signal variance across layers. "Xavier-uniform embedding weights."
Practical Applications
Immediate Applications
Below are practical uses that organizations and individuals can deploy now, based on the paper’s empirical findings, generalized CEG framework, and scaling analyses.
- Compute-aware training planners for Chinchilla-optimal scaling (software/AI labs)
- Application: Build internal tools that set tokens-to-parameters ratios and training budgets to follow equal-scale data/parameter allocation (Chinchilla) rather than under-scaling data (Kaplan).
- Potential tools/workflows: “Chinchilla Planner” that ingests available compute, target loss, and dataset/tokenizer information to recommend parameter count, dataset size, and schedule.
- Dependencies/Assumptions: Task evaluated via perplexity or comparable loss; sufficient data availability; results assume Chinchilla-optimal regime is applicable to your domain/data.
- Reference-aware benchmarking dashboards (academia, MLops)
- Application: Report compute-equivalent gain (CEG) multipliers relative to explicit reference algorithms (e.g., LSTM, dense Transformer) and compute scale to avoid misleading comparisons.
- Potential tools/workflows: Model-evaluation dashboards with selectable baselines, compute scales, and visualized CEG functions.
- Dependencies/Assumptions: Consistent datasets/tokenizers across runs; community adoption of reference-dependent reporting.
- Architecture selection for small-scale/edge deployments (robotics, IoT, mobile)
- Application: For tight compute budgets, evaluate LSTM vs “Retro” Transformer vs “Modern” Transformer, prioritizing training stability improvements (e.g., pre-RMSNorm) over FLOP-only gains.
- Potential tools/workflows: Lightweight architecture decision checklists for embedded teams; small-scale hyperparameter templates.
- Dependencies/Assumptions: Edge tasks may not share scaling regimes with internet-scale pretraining; memory/latency constraints dominate.
- Training-stability upgrades with small FLOP gains (software/AI labs)
- Application: Adopt pre-RMSNorm, rotary encodings (for long contexts), and robust learning-rate schedules to reduce instability and convergence failures despite modest FLOP savings.
- Potential tools/workflows: “Stability-first” training templates; automated LR schedule selection; sanity-check regressions for norm and positional choices.
- Dependencies/Assumptions: Benefits are larger on fragile training regimes; sequence-length dependence for positional encodings.
- Tokenizer efficiency audits (software/AI labs, MLOps)
- Application: Evaluate tokenization efficiency (e.g., perplexity-per-byte) to avoid the reported ~68% training cost overhead from inefficient tokenizers.
- Potential tools/workflows: Tokenizer benchmarking harnesses; migration plans to better subword schemes; byte-level metrics alongside standard perplexity.
- Dependencies/Assumptions: Perplexity is tokenizer-dependent; results vary by language/domain; alternative metrics required for principled comparisons.
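To make such an audit concrete, here is a minimal sketch (an illustration, not the paper's protocol) of the standard conversion from per-token loss to bits-per-byte, which lets models with different tokenizers be compared on equal footing. The corpus statistics and losses below are placeholders.

```python
import math

def bits_per_byte(mean_loss_nats_per_token, n_tokens, n_bytes):
    """Convert mean per-token cross-entropy (in nats) into bits per byte of raw
    text, so that models using different tokenizers can be compared directly."""
    return (mean_loss_nats_per_token / math.log(2)) * (n_tokens / n_bytes)

# Placeholder numbers: the same 1 MB of text encoded by two different tokenizers.
print(bits_per_byte(2.60, n_tokens=210_000, n_bytes=1_000_000))  # ~0.79 bits/byte
print(bits_per_byte(2.30, n_tokens=260_000, n_bytes=1_000_000))  # ~0.86 bits/byte
```

In this made-up example the tokenizer with the lower per-token loss is actually worse per byte, which is exactly the kind of distortion a byte-level metric is meant to expose.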
- Pragmatic expectations and ROI modeling for compute vs “algorithmic progress” (finance, enterprise IT)
- Application: Treat capability gains as primarily compute-driven; plan budgets and timelines around compute growth rather than assuming large algorithmic leaps.
- Potential tools/workflows: CFO-friendly ROI calculators embedding ~2.2× annual effective efficiency growth from algorithmic changes and ~4.2×/year compute frontier growth.
- Dependencies/Assumptions: Cloud pricing trends, hardware roadmaps, and task-specific returns may deviate; the paper’s rates are historical estimates.
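A minimal sketch of such a calculator (illustrative only): it compounds the paper's historical estimates of roughly 2.2×/year effective algorithmic gains and roughly 4.2×/year frontier compute growth. Treating the two as independent multipliers is a simplification the paper itself cautions against, since algorithmic gains are partly scale-dependent.

```python
ALGO_GROWTH = 2.2      # approx. annual effective-efficiency gain from algorithms (paper's estimate)
COMPUTE_GROWTH = 4.2   # approx. annual growth of frontier training compute (paper's estimate)

def effective_compute_multiplier(years, algo=ALGO_GROWTH, compute=COMPUTE_GROWTH):
    """Rough 'effective compute' growth if the two factors simply compounded."""
    return (algo * compute) ** years

for y in (1, 3, 5):
    print(f"after {y} year(s): ~{effective_compute_multiplier(y):,.0f}x effective compute")
```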
- Cloud/HPC capacity planning aligned with scale-dependent gains (energy/HPC operators)
- Application: Prioritize resource allocation for workloads that realize scale-dependent benefits (e.g., larger dense Transformers) and assess MoE training benefits realistically (roughly ≤2× FLOP efficiency).
- Potential tools/workflows: Queue and quota policies favoring jobs that hit scaling “sweet spots”; inference-focused MoE deployments for throughput.
- Dependencies/Assumptions: Routing quality for MoE; domain-specific performance targets; available long-duration compute windows.
- Publication and review protocols to standardize efficiency reporting (academia, policy)
- Application: Require authors to state compute budgets, reference algorithms, and scale ranges when reporting CEG multipliers.
- Potential tools/workflows: Conference/journal checklists; model cards with “reference-dependent” efficiency disclosures.
- Dependencies/Assumptions: Community buy-in; alignment across venues and standards bodies.
- Architecture decision memos for MoE training vs inference (software/AI product teams)
- Application: Recognize MoE training gains are modest and uncertain; pursue MoE primarily for inference throughput and latency advantages rather than training FLOP savings.
- Potential tools/workflows: Architecture trade-off templates; routing-loss monitoring; isoflop comparison protocols.
- Dependencies/Assumptions: Task fit for sparse activation; quality of expert routing; inference cost constraints.
- Energy and wall-clock tracking orthogonal to FLOP metrics (sustainability/ops)
- Application: Track energy and time savings separately (e.g., Flash Attention) even when FLOP counts are unchanged.
- Potential tools/workflows: Energy dashboards; kernel-level optimizations; hardware-aware training/inference profiles.
- Dependencies/Assumptions: Hardware support (e.g., CUDA kernels); energy metering at job level; FLOP ≠ energy/time in practice.
- Data–parameter balancing checks in existing pipelines (MLops)
- Application: Audit current training runs to ensure tokens and parameters are scaled in line (Chinchilla-optimal) rather than under-scaling data (Kaplan practice).
- Potential tools/workflows: Pretraining readiness checks; data budget calculators; sampling strategies to reach target token counts.
- Dependencies/Assumptions: Availability of high-quality data; privacy/licensing constraints; tokenizer and deduplication quality.
- Internal training for ML teams on scale dependence and sub-multiplicative interactions (education, enterprise training)
- Application: Update curricula to emphasize that many “small” algorithms offer modest, scale-invariant gains and interact sub-multiplicatively.
- Potential tools/workflows: Workshops; onboarding modules; case studies reproducing the paper’s ablations.
- Dependencies/Assumptions: Willingness to prioritize rigorous benchmarks over anecdotal claims; datasets/hardware to run small ablations.
Long-Term Applications
Below are applications that require more research, standardization, scaling, or development before broad deployment.
- Open “Algorithmic Progress Observatory” for CEG functions (academia, policy)
- Application: A shared repository of standardized CEG functions and multipliers across algorithms, scales, datasets, tokenizers, and tasks.
- Potential tools/products: Open benchmark suite; interactive explorer for reference-dependent efficiency; annual “state of algorithmic progress” reports.
- Dependencies/Assumptions: Community-maintained datasets; reproducibility standards; sustained funding.
- Compute access policies and public compute commons (policy, government, energy)
- Application: Address the paper’s implication that scale-dependent gains favor large compute holders; create public/academic compute pools and fair-access policies.
- Potential tools/products: National research clouds; grants tied to transparent compute reporting; antitrust guidance on compute concentration.
- Dependencies/Assumptions: Political will; cost-benefit analysis; alignment with energy/grid planning.
- Automated training planners with dynamic scaling-law inference (software tooling)
- Application: Tools that learn task-specific scaling exponents on-the-fly and recommend architecture, data/parameter allocations, optimizer/hyperparameters.
- Potential tools/products: “AutoScale Planner” integrating telemetry, irreducible loss estimates, and uncertainty bounds; continuous optimizer selection.
- Dependencies/Assumptions: Robust estimation under noisy runs; standardized logging; generalization across domains.
- Sector-specific decision playbooks for scale vs specialization (healthcare, finance, education, robotics)
- Application: Formal guides on when to pursue scale-dependent gains vs bespoke small models, reflecting domain data quality, privacy, and compute constraints.
- Potential tools/products: Compliance-aware pretraining strategies; domain tokenization standards; small-model fine-tuning kits.
- Dependencies/Assumptions: Access to domain-curated datasets; legal/regulatory constraints; domain evaluation metrics beyond perplexity.
- Reference-aware model comparison platforms (MLops, evaluation)
- Application: Platforms that let users pick baselines (LSTM, dense Transformer, MoE) and compute budgets to visualize relative progress.
- Potential tools/products: Multi-reference leaderboards; scale-normalized scorecards; CEG-over-time projections.
- Dependencies/Assumptions: Agreement on baselines; curated cross-model metadata; community adoption.
- Tokenizer standards and research programs (academia, tools)
- Application: Standardize tokenizer benchmarking with perplexity-per-byte and task-specific metrics; reduce inefficiency and fragmentation.
- Potential tools/products: Tokenizer registries; cross-lingual efficiency suites; guidance on segmentation for domain-specific corpora.
- Dependencies/Assumptions: Broad multilingual support; cooperation among tool vendors; evaluation beyond English web text.
- Rigorous MoE scaling studies to clarify training gains and exponents (academia, industry R&D)
- Application: Determine whether MoE offers meaningful training exponent changes, under what routing/loss formulations, and at which scales.
- Potential tools/products: Isoflop benchmarking protocols; routing-quality metrics; open-source MoE baselines.
- Dependencies/Assumptions: Large-scale experiments; standardized datasets; stable routing algorithms.
- Architecture exploration beyond Transformers (academia, industry R&D)
- Application: Systematic scaling comparisons (e.g., KANs, xLSTMs) to quantify exponent differences and compute-optimal regimes.
- Potential tools/products: Cross-architecture CEG libraries; reproducible pipelines for exponent estimation; small-to-large scale transfer tools.
- Dependencies/Assumptions: Reproducible implementations; fair tokenizers/datasets; enough compute to observe scaling regimes.
- Compute-aware energy and grid planning (energy sector, government)
- Application: Incorporate expected compute growth (~4.2×/year at the frontier) into regional energy planning, cooling, and data-center siting.
- Potential tools/products: Forecast models; incentives for energy-efficient hardware; renewable energy integration for AI clusters.
- Dependencies/Assumptions: Accurate demand projections; hardware roadmaps; regulatory frameworks.
- Curriculum redesign for ML education (education)
- Application: Integrate scaling-law literacy, reference-dependent metrics, and compute-centered progress into ML courses and bootcamps.
- Potential tools/products: Teaching modules; hands-on labs replicating ablations and scaling fits; ethics modules on compute concentration.
- Dependencies/Assumptions: Faculty training; access to educational compute; alignment with accreditation bodies.
- Model-card standards mandating compute and reference disclosures (policy, standards)
- Application: Require models to disclose total compute, scaling practice (Kaplan/Chinchilla), reference algorithms for CEG, and scale ranges.
- Potential tools/products: Extended model-card schemas; audit guidelines; compliance reporting portals.
- Dependencies/Assumptions: Standards body adoption; tooling support; enforcement mechanisms.
- Strategies for startups and smaller labs (entrepreneurship, strategy)
- Application: Focus on high-quality data, domain specialization, fine-tuning, and reasoning-style enhancements rather than chasing frontier-scale training.
- Potential tools/products: Data curation pipelines; small-model reasoning toolkits; compute cooperatives/shared clusters.
- Dependencies/Assumptions: Access to domain data; viable inference cost controls for reasoning models; collaborative ecosystems.
- Edge-capable reasoning models with controlled inference costs (robotics, education apps)
- Application: Research and develop small-scale reasoning-optimized models to gain capability without massive pretraining budgets.
- Potential tools/products: Budgeted chain-of-thought strategies; distillation of long-thinking into efficient inference; hybrid on-device/cloud reasoning.
- Dependencies/Assumptions: Method maturity; task-appropriate evaluation beyond perplexity; sustainable inference economics.
- End-to-end training simulators for executive planning (enterprise IT, finance)
- Application: Simulate multi-year capability trajectories under different compute budgets, architectures, and scaling practices to support strategy.
- Potential tools/products: Scenario planners; risk/return dashboards; sensitivity analyses (e.g., tokenization, data quality, optimizer choices).
- Dependencies/Assumptions: Reliable models of scaling behavior; integration with procurement and cloud provider pricing; executive literacy on ML metrics.