
Warmup-Stable and Merge (WSM) Techniques

Updated 3 July 2026
  • WSM is a paradigm that splits processes into warmup, stable, and merge phases to enhance efficiency, stability, and interpretability.
  • In LLM pre-training, WSM replaces decay schedules with checkpoint averaging, achieving improvements of up to +5.5% on benchmarks.
  • The approach extends to parallel merge algorithms and merge tree metrics, providing near-optimal load balancing and precise structural comparisons.

Warmup-Stable and Merge (WSM) refers to a class of techniques spanning learning rate scheduling in LLM pre-training (Tian et al., 23 Jul 2025), parallel merge algorithms (Siebert et al., 2013), and metric geometry on merge trees (Pegoraro, 2021). Each instantiation divides a process into an initial "warmup" or configuration phase, a "stable" or main operation phase, and a final "merge" or aggregation step applied to models, data, or structures. The WSM methodology emphasizes efficiency, stability, and interpretability.

1. WSM in Learning Rate Scheduling for LLMs

The WSM framework for LLM pre-training is a decay-free, post-hoc protocol that replaces conventional Warmup-Stable-Decay (WSD) learning rate schedules. Instead of decaying the learning rate over time, WSM consists of three successive phases: linear warmup, a constant-rate stable phase, and a model merge via checkpoint averaging.

The three phases are:

  • Warmup: Linear increase of the learning rate from 0 to $\mathrm{lr}_{\mathrm{peak}}$ over $T_\mathrm{warmup}$ steps.
  • Stable: Continue training at $\mathrm{lr}_{\mathrm{peak}}$ without decay, possibly with an annealing dataset after $T_\mathrm{switch}$.
  • Merge: Aggregate the parameters from the last $n$ checkpoints, taken every $T_\mathrm{cpt}$ steps over a merge window $T_\mathrm{merge} = n \cdot T_\mathrm{cpt}$, using principled weighted averaging.
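The first two phases and the checkpoint bookkeeping for the third can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names `wsm_learning_rate` and `merge_window_checkpoints` and their parameters are hypothetical, and the sketch assumes the final training step itself coincides with a saved checkpoint.

```python
def wsm_learning_rate(step: int, lr_peak: float, t_warmup: int) -> float:
    """Warmup-then-constant schedule; there is deliberately no decay branch."""
    if t_warmup > 0 and step < t_warmup:
        # Linear ramp from 0 to lr_peak over t_warmup steps.
        return lr_peak * step / t_warmup
    # Stable phase: hold the peak learning rate for the rest of training.
    return lr_peak


def merge_window_checkpoints(final_step: int, n: int, t_cpt: int) -> list[int]:
    """Steps of the last n checkpoints, saved every t_cpt steps (assumed layout).

    The returned steps lie inside a merge window of n * t_cpt steps that
    ends at final_step.
    """
    return [final_step - i * t_cpt for i in range(n - 1, -1, -1)]
```

For instance, `merge_window_checkpoints(10000, 4, 500)` returns `[8500, 9000, 9500, 10000]`, the four checkpoints inside a 2000-step merge window.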

WSM establishes a rigorous equivalence between checkpoint merging and synthetic learning-rate decay. Specifically, for checkpoints $\{\theta_{t_1},\dots,\theta_{t_k}\}$ and nonnegative weights $\{\alpha_i\}$, the merged model computes

$$\theta_{\mathrm{merged}} = \frac{\sum_{i=1}^k \alpha_i \theta_{t_i}}{\sum_{i=1}^k \alpha_i}.$$

A suitable choice of the weights $\{\alpha_i\}$ recovers standard decay schemes, as shown by a general inverse mapping (Theorem 3.1 of Tian et al., 23 Jul 2025): for example, uniform averaging corresponds to linear decay, and 1-sqrt weights to inverse-square-root decay. This direct mapping from decay curves to discrete merging weights allows precise post-hoc emulation of any standard schedule.
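The weighted average itself is simple to express in code. The following is a minimal pure-Python sketch under stated assumptions: each checkpoint is represented as a dictionary mapping a parameter name to a flat list of floats, and `merge_checkpoints` is a hypothetical helper name. A real training stack would apply the same arithmetic to framework tensors.

```python
def merge_checkpoints(checkpoints, alphas):
    """Normalized weighted average of checkpoints with nonnegative weights.

    checkpoints: list of dicts, parameter name -> flat list of floats
                 (an assumed toy representation, not a framework format).
    alphas:      one nonnegative weight per checkpoint.
    """
    if not checkpoints or len(checkpoints) != len(alphas):
        raise ValueError("need one weight per checkpoint")
    if any(a < 0 for a in alphas):
        raise ValueError("weights must be nonnegative")
    total = sum(alphas)
    if total <= 0:
        raise ValueError("at least one weight must be positive")

    merged = {}
    for name, reference in checkpoints[0].items():
        # Average each coordinate across checkpoints, then normalize.
        merged[name] = [
            sum(a * ckpt[name][j] for a, ckpt in zip(alphas, checkpoints)) / total
            for j in range(len(reference))
        ]
    return merged
```

With two checkpoints `{"w": [1.0, 2.0]}` and `{"w": [3.0, 6.0]}`, uniform weights `[1, 1]` give `{"w": [2.0, 4.0]}`, while weights `[1, 3]` pull the result toward the later checkpoint and give `{"w": [2.5, 5.0]}`.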

Empirical evaluation on the MATH, HumanEval, and MMLU-Pro benchmarks shows that WSM consistently outperforms WSD (e.g., +3.5% on MATH and +5.5% on MMLU-Pro). The results also indicate that merge duration is the most critical hyperparameter, with a larger impact than either the checkpoint interval or the number of checkpoints.

2. WSM in Stable, Parallel Merging Algorithms

In parallel algorithms, WSM principles are embodied in perfectly load-balanced, stable merge procedures for sorted sequences (Siebert et al., 2013). The central innovation is the co-ranking algorithm: given sorted arrays $A$ and $B$ and an output rank, it computes via binary search the exact input prefixes that produce the corresponding output prefix, so that each processing element (PE) can independently and stably merge its assigned output block.

Steps:

  1. Warmup: Compute co-ranks for the block boundaries using an $O(\log \min(m, n))$ search per boundary, where $m$ and $n$ are the lengths of the two input arrays.
  2. Stable Main Work: Each PE merges its contiguous subarrays independently, preserving input order (stability).
  3. Merge: The full result is constructed by concatenating the merged blocks.

Every PE receives an (almost) equal-sized output interval, and the input partition is determined such that all boundaries preserve stability. With input lengths $m$ and $n$ and $p$ PEs, the algorithm achieves total work $O(m + n)$ and linear speedup $\Theta(p)$ for up to $p = O((m + n)/\log \min(m, n))$ PEs. Stability is enforced inherently by the co-ranking conditions, with no extra cost or need for tie-breaking indices. Synchronization is minimized, requiring only local computation and transfer.
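The three steps can be illustrated with a short sketch. It is a sequential simulation under stated assumptions, not the paper's code: `co_rank`, `stable_merge`, and `blocked_stable_merge` are illustrative names, the co-rank search follows the commonly presented binary-search formulation in which ties are resolved in favor of the first array, and the loop over blocks stands in for PEs that would run independently.

```python
def co_rank(i, a, b):
    """Return (j, k) with j + k == i such that stably merging a[:j] and
    b[:k] yields exactly the first i elements of the stable merge of a, b."""
    m, n = len(a), len(b)
    j = min(i, m)
    k = i - j
    j_low = max(0, i - n)
    k_low = max(0, i - m)
    while True:
        if j > 0 and k < n and a[j - 1] > b[k]:
            # Too many elements taken from a: shift the split toward b.
            delta = (j - j_low + 1) // 2
            k_low = k
            j -= delta
            k += delta
        elif k > 0 and j < m and b[k - 1] >= a[j]:
            # Too many from b (ties belong to a first): shift toward a.
            delta = (k - k_low + 1) // 2
            j_low = j
            k -= delta
            j += delta
        else:
            return j, k


def stable_merge(a, b):
    """Sequential stable merge: on equal keys, elements of a come first."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if b[j] < a[i]:
            out.append(b[j])
            j += 1
        else:
            out.append(a[i])
            i += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def blocked_stable_merge(a, b, p):
    """Simulate p PEs; each block depends only on its own two co-ranks."""
    total = len(a) + len(b)
    out = []
    for r in range(p):
        lo, hi = r * total // p, (r + 1) * total // p
        j0, k0 = co_rank(lo, a, b)   # warmup: locate block boundaries
        j1, k1 = co_rank(hi, a, b)
        out.extend(stable_merge(a[j0:j1], b[k0:k1]))  # stable main work
    return out                        # merge: blocks are simply concatenated
```

For example, `blocked_stable_merge([1, 3, 5, 7], [2, 3, 6], 3)` returns `[1, 2, 3, 3, 5, 6, 7]`. Because every block only reads its own input slices, the loop body could be dispatched to separate threads or processes without coordination beyond the final concatenation.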

3. WSM as a Metric for Merge Trees

WSM also denotes a finitely stable edit distance for merge trees in topological data analysis (Pegoraro, 2021). Given two merge trees, the WSM distance between them is defined as the minimum total cost over finite sequences of allowed tree-edit operations (shrink/rescale, delete/insert, ghost, split), where edge-weight changes and deletions carry specified costs.

Key properties:

  • Stability: The distance is finitely stable, meaning that it admits a bound in terms of the interleaving distance between the two merge trees.
  • Metric Status: The distance is a metric (in particular, it satisfies the triangle inequality).
  • Computation: Calculated by solving a sequence of small binary linear programs, with overall complexity comparable to classical unordered tree distances.
  • Discriminativity: WSM distinguishes subtle tree differences undetected by persistence diagram Wasserstein or bottleneck metrics.

WSM has demonstrated utility in curve regression, clustering, and biomedical imaging, where its finite stability and interpretability enable robust structure-aware comparisons.

4. Algorithmic and Theoretical Details

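For $k$ checkpoints $\theta_1,\dots,\theta_k$ and normalized weights $c_i = \alpha_i / \sum_{j=1}^{k} \alpha_j$, the merged model equals a decayed sum of the updates accumulated between consecutive checkpoints. Writing $\Delta_i = \theta_i - \theta_{i-1}$ for the learning-rate-scaled gradient steps taken between checkpoints $i-1$ and $i$,

$$\theta_{\mathrm{merged}} = \sum_{i=1}^{k} c_i \theta_i = \theta_1 + \sum_{i=2}^{k} w_i \Delta_i, \qquad w_i = \sum_{j=i}^{k} c_j.$$

Conversely, any decay curve $1 \ge w_2 \ge \dots \ge w_k \ge 0$ yields unique non-negative checkpoint weights

$$c_1 = 1 - w_2, \qquad c_i = w_i - w_{i+1} \ \ (2 \le i < k), \qquad c_k = w_k.$$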

Approximations for linear, cosine, and 1-sqrt decays are explicit, enabling curvature-matched merging.
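The mapping between decay curves and checkpoint weights can be checked numerically. The sketch below assumes the following notation, which is an illustrative convention rather than the paper's exact indexing: checkpoints $\theta_1,\dots,\theta_k$, updates $\Delta_i = \theta_i - \theta_{i-1}$, decay multipliers $w_2 \ge \dots \ge w_k$ in $[0, 1]$, and normalized weights $c_i$ obtained as successive differences of the multipliers (with $w_1 = 1$ and $w_{k+1} = 0$). The names `weights_from_decay` and `decayed_sum` are hypothetical, and checkpoints are scalars for brevity.

```python
def weights_from_decay(decay):
    """Map decay multipliers [w_2, ..., w_k] to merge weights [c_1, ..., c_k].

    Uses c_i = w_i - w_{i+1} with the conventions w_1 = 1 and w_{k+1} = 0,
    so the weights are nonnegative and sum to 1 by construction.
    """
    ext = [1.0] + list(decay) + [0.0]
    if any(ext[i] < ext[i + 1] for i in range(len(ext) - 1)):
        raise ValueError("decay multipliers must be non-increasing in [0, 1]")
    return [ext[i] - ext[i + 1] for i in range(len(ext) - 1)]


def decayed_sum(thetas, decay):
    """theta_1 plus each later update scaled by its decay multiplier."""
    result = thetas[0]
    for i, w in enumerate(decay, start=1):
        result += w * (thetas[i] - thetas[i - 1])
    return result
```

A linear decay over four checkpoints, `weights_from_decay([0.75, 0.5, 0.25])`, returns `[0.25, 0.25, 0.25, 0.25]`, which matches the stated correspondence between uniform averaging and linear decay. Averaging scalar checkpoints with those weights agrees with `decayed_sum` up to floating-point rounding.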

For two ordered arrays $A$ (length $m$), $B$ (length $n$) and $p$ PEs:

  • Divide the output array $C$ of length $m + n$ into $p$ output blocks.
  • For the block starting at output rank $i$ on PE $r$: Use co_rank to find $(j, k)$ with $j + k = i$ such that stable_merge$(A[0..j), B[0..k))$ produces prefix $C[0..i)$.
  • Each PE merges $A[j_r..j_{r+1})$, $B[k_r..k_{r+1})$ to $C[i_r..i_{r+1})$.
  • Time per PE: $O((m + n)/p + \log \min(m, n))$; total work optimal for $p \le (m + n)/\log \min(m, n)$.

Given merge trees truncated at a sufficiently large height and represented as weighted trees:

  • Edit operations (with costs) are shrink, delete/insert, ghost, split.
  • The minimal sum over edit paths gives the distance, which does not depend on the truncation height.
  • Equivalently, the distance can be obtained by minimizing the total cost over admissible edge matchings between the two trees.

5. Comparative Performance, Stability, and Recommendations

Empirical investigations in LLM scheduling show that WSM delivers consistent, often significant, improvements over WSD across tasks and fine-tuning scenarios (Tian et al., 23 Jul 2025). Merge duration proves to be the dominant hyperparameter for performance, while merging with 1-sqrt weights and using an offline-to-online merge transition are robust default recommendations. WSM is compatible with all major optimizers and decouples schedule selection from advance knowledge of the total training budget.

In parallel merging (Siebert et al., 2013), the WSM approach attains near-optimal work efficiency, ties stability to a simple invariant (never splitting equal-key blocks at block boundaries), and simplifies distributed implementation relative to previous multi-selection schemes.

For merge trees (Pegoraro, 2021), the WSM distance offers a finitely stable metric that better reflects structural changes than both bottleneck and Wasserstein distances on persistence diagrams, with computational cost comparable to classical tree edit distances and no requirement for ad-hoc saddle corrections or index augmentations.

6. Applications and Extensions

  • LLM Pre-training and Fine-tuning: Post-hoc model averaging using WSM yields better-performing end models without live LR decay, is suitable for schedule sweeps via checkpoint buffering, and generalizes to continual and curriculum learning scenarios (Tian et al., 23 Jul 2025).
  • High-performance Sorting and Merging: WSM-style block partitioning and co-ranking facilitate linear scaling parallel merges in sorting and database applications on both shared-memory and distributed-memory systems (Siebert et al., 2013).
  • Topological Data Summaries: The WSM merge-tree metric is applied for shape-aware curve clustering, regression, and biomedical imaging comparisons, capitalizing on its stability and interpretability (Pegoraro, 2021).

7. Summary Table: WSM Variants and Contexts

| Domain              | Core Operation            | Stability Notion         |
|---------------------|---------------------------|--------------------------|
| LLM Scheduling      | Weighted checkpoint merge | Emulation of LR decay    |
| Parallel Algorithms | Blocked stable merge      | Input-order preservation |
| Merge Trees         | Weighted edit distance    | Finite metric stability  |

These applications illustrate the broad utility of the Warmup-Stable and Merge paradigm as a foundation for post-hoc optimality, stability, and robust aggregation across machine learning, algorithms, and computational topology.
