Model Averaging for LLM Curriculum Pretraining

Updated 1 December 2025
  • Model averaging is a technique that combines late-stage checkpoints through simple, exponential, or LR-weighted moving averages to counteract the reduced gradient impact of high-quality data under learning-rate decay.
  • EMA with coefficient $\alpha \approx 0.2$ applied over the last $K = 6$ checkpoints effectively restores curriculum benefits, often surpassing the traditional warmup–stable–decay schedule.
  • Combining model averaging with moderate learning-rate decay ensures that the crucial signal from late, high-quality samples persists in the final model.

Model averaging refers to post-hoc strategies that synthesize multiple model checkpoints (usually from late-stage training) into a final set of weights, typically via arithmetic or exponential averaging. In curriculum-based pretraining, especially for LLMs, model averaging is employed to address the incompatibility between curriculum data ordering and standard learning rate (LR) decay schedules: empirical and theoretical evidence shows that averaging can recover, and even surpass, the curriculum gains that aggressive LR decay would otherwise erase (Luo et al., 24 Nov 2025).

1. Principle of Model Averaging in Curriculum-Based Pretraining

Model averaging is motivated by the empirical observation that curriculum ordering, in which data is presented from low to high "quality" or "difficulty" according to a metric, yields clear benefits when optimization uses a constant LR, but that these gains disappear under standard warmup–stable–decay schedules, because gradients from high-quality data at the end of training are heavily downweighted as the LR approaches zero (Luo et al., 24 Nov 2025). To address this, model averaging produces a final checkpoint by weighted averaging over several late-stage checkpoints:

$$\theta_{\text{avg}} = \sum_{i=1}^{K} w_i\, \theta_{T-(K-i)s}, \qquad \sum_{i=1}^{K} w_i = 1,$$

where $\theta_{T-(K-i)s}$ are the model parameters at the last $K$ checkpoints, spaced $s$ training steps apart, and $w_i$ are averaging weights: uniform for the simple moving average (SMA), exponentially decreasing for the exponential moving average (EMA), or computed from the LR schedule for the weighted moving average (WMA).
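For concreteness, a minimal sketch of this weighted checkpoint average, assuming PyTorch-style state dicts; the function name and plumbing are illustrative, not taken from the paper:

```python
import torch

def average_checkpoints(state_dicts, weights):
    """theta_avg = sum_i w_i * theta_i over K checkpoint state dicts."""
    assert abs(sum(weights) - 1.0) < 1e-6, "weights must sum to 1"
    # Accumulate in float32 regardless of the checkpoints' storage dtype.
    avg = {k: torch.zeros_like(v, dtype=torch.float32)
           for k, v in state_dicts[0].items()}
    for sd, w in zip(state_dicts, weights):
        for key, value in sd.items():
            avg[key] += w * value.float()
    return avg
```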

2. Formal Model Averaging Strategies

Three common formulations are highlighted:

  • Simple Moving Average (SMA): equal weights, $w_i = 1/K$.
  • Exponential Moving Average (EMA): defined recursively as

$$\hat{\theta}^{(i)} = \alpha\, \theta_{T-(K-i)s} + (1-\alpha)\, \hat{\theta}^{(i-1)}, \qquad 0 < \alpha < 1.$$

  • Weighted Moving Average (WMA): $w_i \propto \eta(t_i) - \eta(t_{i+1})$, reflecting the LR schedule.

In the context of curriculum-based LLM pretraining under a constant LR, EMA with $\alpha \approx 0.2$ over the last $K = 6$ checkpoints is recommended ("CMA" in the paper) (Luo et al., 24 Nov 2025).
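A sketch of the three weighting schemes, with $\alpha = 0.2$ and $K = 6$ as defaults per the paper's recommendation. The closed-form EMA weights below come from unrolling the recursion with the oldest checkpoint as the seed, which is one common convention rather than something the paper specifies:

```python
def sma_weights(K=6):
    """SMA: uniform weights w_i = 1/K."""
    return [1.0 / K] * K

def ema_weights(K=6, alpha=0.2):
    """EMA: unroll the recursion, seeding with the oldest checkpoint.
    Checkpoint i (0 = oldest, K-1 = newest) gets alpha * (1-alpha)^(K-1-i);
    the seed absorbs the remaining (1-alpha)^(K-1) so the weights sum to 1."""
    w = [alpha * (1 - alpha) ** (K - 1 - i) for i in range(K)]
    w[0] = (1 - alpha) ** (K - 1)
    return w

def wma_weights(lrs):
    """WMA: w_i proportional to eta(t_i) - eta(t_{i+1}), taking eta = 0 after T."""
    diffs = [lrs[i] - (lrs[i + 1] if i + 1 < len(lrs) else 0.0)
             for i in range(len(lrs))]
    total = sum(diffs)
    return [d / total for d in diffs]
```

Any of these weight vectors can be passed to `average_checkpoints` from the sketch above.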

3. Empirical Results: Model Averaging Restores Curriculum Gains

When employing a curriculum (ascending data-quality order $C_+(D)$) and a constant LR, model averaging nearly doubles the improvement in average downstream validation score over random data ordering. Specifically, combining moderate LR decay and model averaging (termed "CDMA": moderate decay + EMA on last checkpoints + ascending curriculum order) achieved a +1.64 point (3.3%) increase in average downstream benchmark score versus warmup–stable–decay with random shuffling, and +1.20/+1.68 points in mid-training curriculum applications (Luo et al., 24 Nov 2025). These results were obtained on a 1.5B-parameter model pretrained on 30B tokens.

Empirically, curriculum pretraining benefits are strongest when the highest-quality data is not underweighted—either by maintaining a higher final learning rate or by averaging model parameters from late-stage checkpoints to counterbalance diminished LR effects. This is rigorously validated in both standard and "mid-training" (multi-phase) curriculum settings.

4. Practical Guidelines

The effectiveness of model averaging for curriculum-ordered LLM pretraining is contingent on matching the optimization schedule to the data curriculum:

  • Always tune the LR schedule together with the data order: aggressive LR decay designed for uniformly shuffled data suppresses the gradient signal from late-stage, high-quality samples, negating curriculum effects.
  • Apply moderate LR decay ($\eta_T \approx \eta_0/3$) instead of aggressive decay ($\eta_T \ll \eta_0$); see the schedule sketch after this list.
  • Perform a moving average over the final $K$ checkpoints: use EMA (with coefficient $\alpha \approx 0.2$) or SMA over the last $K = 6$ checkpoints as a default.
  • In multi-phase curricula, apply averaging within each high-quality phase.
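A minimal sketch of a moderate-decay schedule consistent with the first two bullets; the cosine shape is an illustrative assumption, since the guideline only constrains the final ratio $\eta_T \approx \eta_0/3$:

```python
import math

def moderate_decay_lr(step, total_steps, lr0, final_ratio=1 / 3):
    """Cosine decay from lr0 down to lr0 * final_ratio (eta_T ~ eta_0 / 3),
    never approaching zero as aggressive warmup-stable-decay schedules do."""
    lr_min = lr0 * final_ratio
    cos_factor = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_min + (lr0 - lr_min) * cos_factor
```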

Algorithmic steps:

  1. Sort the dataset $D$ by the data-quality metric $Q(x)$ to obtain $C_+(D)$.
  2. Train with a constant or moderately decaying LR on $C_+(D)$.
  3. Save the last $K$ checkpoints.
  4. Compute the final model as $\theta_{\text{avg}} = \text{EMA/SMA/WMA}$ over the final checkpoint sequence (Luo et al., 24 Nov 2025); a minimal end-to-end sketch follows.
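Putting the four steps together, a hypothetical end-to-end sketch: `quality_score`, `batches`, and `train_one_step` are placeholder helpers, while `average_checkpoints` and `ema_weights` are reused from the sketches above.

```python
import copy
from collections import deque

def curriculum_pretrain_with_averaging(model, dataset, optimizer,
                                       total_steps, ckpt_every, K=6, alpha=0.2):
    # Step 1: sort by the quality metric Q(x) to get the ascending curriculum C_+(D).
    ordered = sorted(dataset, key=quality_score)
    # Steps 2-3: train with a constant or moderately decaying LR,
    # retaining only the last K checkpoints.
    last_ckpts = deque(maxlen=K)
    for step, batch in enumerate(batches(ordered, total_steps)):
        train_one_step(model, batch, optimizer)
        if (step + 1) % ckpt_every == 0:
            last_ckpts.append(copy.deepcopy(model.state_dict()))
    # Step 4: collapse the stored checkpoints into the final model via EMA (CMA).
    return average_checkpoints(list(last_ckpts),
                               ema_weights(len(last_ckpts), alpha))
```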

5. Theoretical Justification

The LR acts as an importance weight for each sample's gradient. Standard curriculum schedules send the best data to the end of training; when LR decays aggressively toward zero, late samples contribute little to the final weights. Model averaging sidesteps this problem by synthesizing parameter states from the high-LR regime, effectively reweighting the high-quality data's influence on the final model. This principle holds for both statically sorted curricula and dynamic or multi-phase curriculum training (Luo et al., 24 Nov 2025).
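This importance-weight view can be made explicit by unrolling plain SGD over a single pass (a standard identity, not specific to the cited paper):

$$\theta_T = \theta_0 - \sum_{t=1}^{T} \eta_t\, \nabla_\theta \ell(x_t; \theta_{t-1}),$$

so each sample $x_t$ contributes to the final weights in proportion to its step size $\eta_t$. When $\eta_t \to 0$ late in training, the gradients of the last (highest-quality) samples are nearly erased; averaging over earlier, high-LR checkpoints restores their contribution.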

Model averaging is complementary to, and frequently more effective than, pure curriculum pacing or hand-tuned schedules in LLM pretraining. While data-centric approaches such as influence-driven ordering (Schoenegger et al., 21 Aug 2025), irreducible curriculum (Fan et al., 2023), and preference-based scheduling (Zhang et al., 21 Jan 2025) focus on optimizing the data sequence, model averaging ensures that the optimization schedule does not erase the benefits provided by such curricula.

Selected integration points:

| Curriculum Paper | Model Averaging Addressed | Integration Context |
|---|---|---|
| (Luo et al., 24 Nov 2025) | Central | Curriculum-based LLM pretraining |
| (Zhang et al., 21 Jan 2025); (Fan et al., 2023) | Not primary | Could benefit from checkpoint averaging |
| (Zhang et al., 12 Jun 2025); (Lin et al., 11 Mar 2024) | Not implemented | Focus on data ordering and pacing |

In summary, model averaging is a critical optimization strategy in curriculum-based large-model pretraining. When combined with appropriate data ordering and moderate LR decay, it unlocks consistent improvements in both convergence speed and downstream accuracy by ensuring that the final model reflects the strongest signals from high-quality data, regardless of late-stage learning rate dynamics (Luo et al., 24 Nov 2025).
