Warmup-Stable and Merge (WSM) Techniques
- WSM is a paradigm that splits processes into warmup, stable, and merge phases to enhance efficiency, stability, and interpretability.
- In LLM pre-training, WSM replaces decay schedules with checkpoint averaging, achieving improvements of up to +5.5% on benchmarks.
- The approach extends to parallel merge algorithms and merge tree metrics, providing near-optimal load balancing and precise structural comparisons.
Warmup-Stable and Merge (WSM) refers to a family of techniques spanning learning rate scheduling in LLM pre-training (Tian et al., 23 Jul 2025), parallel merge algorithms (Siebert et al., 2013), and metric geometry on merge trees (Pegoraro, 2021). Each instantiation divides a process into an initial "warmup" or configuration phase, a "stable" or main operation phase, and a final "merge" or aggregation step applied to models, data, or structures. The WSM methodology emphasizes efficiency, stability, and interpretability.
1. WSM in Learning Rate Scheduling for LLMs
The WSM framework for LLM pre-training is a decay-free, post-hoc protocol that replaces conventional Warmup-Stable-Decay (WSD) learning rate schedules. Instead of decaying the learning rate over time, WSM consists of three successive phases: linear warmup, a constant-rate stable phase, and a model merge via checkpoint averaging.
The three phases are:
- Warmup: Linearly increase the learning rate from 0 to its peak value over a fixed number of warmup steps.
- Stable: Continue training at the peak learning rate without decay, optionally switching to an annealing dataset after a chosen step.
- Merge: Aggregate the parameters of the most recent checkpoints, saved at a fixed step interval within a merge window, using principled weighted averaging.
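The first two phases, and the choice of which checkpoints enter the merge, can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the names `wsm_learning_rate`, `merge_checkpoint_steps`, `peak_lr`, `warmup_steps`, `merge_window`, and `interval` are hypothetical, and the rule that checkpoints are saved at multiples of `interval` is an assumption.

```python
from typing import List


def wsm_learning_rate(step: int, peak_lr: float, warmup_steps: int) -> float:
    """Warmup-stable schedule: linear ramp to peak_lr, then constant (no decay)."""
    if warmup_steps > 0 and step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr


def merge_checkpoint_steps(total_steps: int, merge_window: int,
                           interval: int) -> List[int]:
    """Steps whose checkpoints fall inside the final merge window.

    Assumes (hypothetically) that a checkpoint is saved at every multiple
    of `interval` and that the window covers the last `merge_window` steps.
    """
    first_step_in_window = total_steps - merge_window
    return [s for s in range(interval, total_steps + 1, interval)
            if s > first_step_in_window]
```

For example, with 1000 total steps, a merge window of 300 steps, and a checkpoint every 100 steps, the checkpoints at steps 800, 900, and 1000 would be merged.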
WSM establishes a rigorous equivalence between checkpoint merging and synthetic learning-rate decay. Specifically, for checkpoints $\theta_n, \theta_{n+1}, \dots, \theta_{n+k}$ and nonnegative weights $c_0, c_1, \dots, c_k$ with $\sum_j c_j = 1$, the merged model computes

$$\hat{\theta}_{n+k} = \sum_{j=0}^{k} c_j\,\theta_{n+j}.$$

A suitable choice of the weights $c_j$ recovers standard decay schemes, as shown by a general inverse mapping (Theorem 3.1 in (Tian et al., 23 Jul 2025)), e.g., uniform averaging for linear decay or 1-sqrt weights for inverse-square-root decay. This direct mapping of decay curves to discrete merging weights allows precise emulation of any standard schedule post-hoc.
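The merge step itself is a weighted average of parameter values. The sketch below assumes, purely for illustration, that each checkpoint is a dictionary from parameter name to a flat list of floats; `merge_checkpoints` and the `Checkpoint` alias are hypothetical names, and a real training stack would operate on framework tensors instead.

```python
from typing import Dict, List, Sequence

Checkpoint = Dict[str, List[float]]  # parameter name -> flat list of values


def merge_checkpoints(checkpoints: Sequence[Checkpoint],
                      weights: Sequence[float]) -> Checkpoint:
    """Return the weighted average sum_j weights[j] * checkpoints[j]."""
    if not checkpoints or len(checkpoints) != len(weights):
        raise ValueError("need one weight per checkpoint")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be nonnegative and sum to 1")
    merged: Checkpoint = {}
    for name, reference in checkpoints[0].items():
        # Average each scalar entry of this parameter across checkpoints.
        merged[name] = [
            sum(w * ckpt[name][i] for w, ckpt in zip(weights, checkpoints))
            for i in range(len(reference))
        ]
    return merged
```

Passing equal weights gives plain uniform averaging; other weight vectors emulate other decay shapes.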
Empirical evaluation on the MATH, HumanEval, and MMLU-Pro benchmarks shows that WSM consistently outperforms WSD (e.g., +3.5% on MATH, +5.5% on MMLU-Pro). Merge duration emerges as the most critical hyperparameter, with more impact than either the checkpoint interval or the number of checkpoints.
2. WSM in Stable, Parallel Merging Algorithms
In parallel algorithms, WSM principles are embodied in perfectly load-balanced, stable merge procedures for sorted sequences (Siebert et al., 2013). The central innovation is the co-ranking algorithm, which, given sorted arrays $A$ and $B$ of lengths $m$ and $n$, uses binary search to compute the exact input prefixes needed so that each processing element (PE) can independently and stably merge its assigned output block.
Steps:
- Warmup: Compute co-ranks for the block boundaries using an $O(\log \min(m, n))$ search per boundary.
- Stable Main Work: Each PE merges its contiguous subarrays independently, preserving input order (stability).
- Merge: The full result is constructed by concatenating the merged blocks.
Every PE receives an (almost) equal-sized output interval, and the input partition is chosen so that every boundary preserves stability. With $p$ PEs, the algorithm achieves total work $O(m + n)$ and speedup $\Theta(p)$ for $p$ up to $O((m + n) / \log \min(m, n))$. Stability is enforced inherently by the co-ranking conditions, with no extra cost or need for tie-breaking indices. Synchronization is minimized, requiring only local computation and transfer.
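The three steps can be sketched as follows. This is an illustrative, sequential simulation under stated assumptions, not the paper's code: `co_rank`, `stable_merge`, and `blocked_stable_merge` are names chosen here, the `key` parameter is an added convenience, and the loop over blocks stands in for work that separate PEs would do independently. Ties are resolved by taking elements of the first input before equal elements of the second.

```python
from typing import Callable, List, Sequence, Tuple


def co_rank(i: int, a: Sequence, b: Sequence,
            key: Callable = lambda x: x) -> Tuple[int, int]:
    """Find (j, k), j + k == i, so that a[:j] and b[:k] stably merge
    into the first i output elements. Binary search, no merging."""
    m, n = len(a), len(b)
    j, k = min(i, m), i - min(i, m)
    j_low, k_low = max(0, i - n), max(0, i - m)
    while True:
        if j > 0 and k < n and key(a[j - 1]) > key(b[k]):
            delta = (j - j_low + 1) // 2      # too many elements taken from a
            k_low = k
            j, k = j - delta, k + delta
        elif k > 0 and j < m and key(b[k - 1]) >= key(a[j]):
            delta = (k - k_low + 1) // 2      # too many elements taken from b
            j_low = j
            j, k = j + delta, k - delta
        else:
            return j, k


def stable_merge(a: Sequence, b: Sequence,
                 key: Callable = lambda x: x) -> List:
    """Sequential stable merge: on ties, elements of a come first."""
    out, j, k = [], 0, 0
    while j < len(a) and k < len(b):
        if key(a[j]) <= key(b[k]):
            out.append(a[j]); j += 1
        else:
            out.append(b[k]); k += 1
    return out + list(a[j:]) + list(b[k:])


def blocked_stable_merge(a: Sequence, b: Sequence, p: int,
                         key: Callable = lambda x: x) -> List:
    """Merge a and b in p near-equal output blocks (simulated sequentially)."""
    if p < 1:
        raise ValueError("p must be at least 1")
    total = len(a) + len(b)
    bounds = [(r * total) // p for r in range(p + 1)]   # output block edges
    ranks = [co_rank(i, a, b, key) for i in bounds]     # warmup: co-ranks
    out: List = []
    for (j0, k0), (j1, k1) in zip(ranks, ranks[1:]):    # one block per PE
        out.extend(stable_merge(a[j0:j1], b[k0:k1], key))
    return out
```

Because each block depends only on its own pair of co-ranks, the loop body could be dispatched to independent workers without any coordination beyond the final concatenation.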
3. WSM as a Metric for Merge Trees
WSM also denotes a finitely stable edit distance for merge trees in topological data analysis (Pegoraro, 2021). Given two merge trees $T$ and $T'$, the WSM distance $d(T, T')$ is defined as the minimum total cost over finite sequences of allowed tree-edit operations (shrink/rescale, delete/insert, ghost, split), where edge-weight changes and deletions carry specified costs.
Key properties:
- Stability: The WSM distance is finitely stable, meaning it admits a bound in terms of the interleaving distance $d_I$.
- Metric Status: The WSM distance is a metric; in particular, it satisfies the triangle inequality.
- Computation: Calculated by solving a sequence of small binary linear programs, with overall complexity comparable to classical unordered tree distances.
- Discriminativity: WSM distinguishes subtle tree differences undetected by persistence diagram Wasserstein or bottleneck metrics.
WSM has demonstrated utility in curve regression, clustering, and biomedical imaging, where its finite stability and interpretability enable robust structure-aware comparisons.
4. Algorithmic and Theoretical Details
Learning Rate Merging (Tian et al., 23 Jul 2025)
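For $k+1$ checkpoints $\theta_n, \dots, \theta_{n+k}$ and weights $c_0, \dots, c_k$ summing to one, the merged model equals a decayed sum of the updates $g_{n+i} = \theta_{n+i-1} - \theta_{n+i}$ applied between consecutive checkpoints:

$$\hat{\theta}_{n+k} = \sum_{j=0}^{k} c_j\,\theta_{n+j} = \theta_n - \sum_{i=1}^{k} w_i\, g_{n+i}, \qquad w_i = \sum_{j=i}^{k} c_j.$$

Conversely, any non-increasing decay curve $1 \ge w_1 \ge \dots \ge w_k \ge 0$ yields unique non-negative checkpoint weights

$$c_0 = 1 - w_1, \qquad c_j = w_j - w_{j+1} \ \ (1 \le j < k), \qquad c_k = w_k.$$

Approximations for linear, cosine, and 1-sqrt decays are explicit, so the merge weights can be matched to the shape of a chosen decay curve.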
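The correspondence between decay factors and merge weights can be checked numerically. The sketch below assumes the telescoping relation in which the decay factor applied to the i-th update equals the sum of the weights of all checkpoints taken at or after it; the function names and the scalar toy model are illustrative choices, not taken from the paper.

```python
from typing import List, Sequence


def decay_to_merge_weights(decay: Sequence[float]) -> List[float]:
    """Map decay factors [w_1..w_k] to checkpoint weights [c_0..c_k]."""
    w = [1.0, *decay, 0.0]  # pad with w_0 = 1 and w_{k+1} = 0
    if any(w[i] < w[i + 1] for i in range(len(w) - 1)):
        raise ValueError("decay factors must be non-increasing within [0, 1]")
    return [w[j] - w[j + 1] for j in range(len(w) - 1)]


def merged_value(theta_n: float, updates: Sequence[float],
                 weights: Sequence[float]) -> float:
    """Weighted average of the checkpoints produced by applying `updates`."""
    checkpoints = [theta_n]
    for g in updates:
        checkpoints.append(checkpoints[-1] - g)  # constant-rate training step
    return sum(c * theta for c, theta in zip(weights, checkpoints))


def decayed_value(theta_n: float, updates: Sequence[float],
                  decay: Sequence[float]) -> float:
    """Result of applying the same updates scaled by the decay factors."""
    return theta_n - sum(w * g for w, g in zip(decay, updates))
```

With three updates and the linear decay factors 0.75, 0.5, 0.25, `decay_to_merge_weights` returns four equal weights of 0.25, which illustrates why uniform averaging corresponds to linear decay; `merged_value` and `decayed_value` then agree on any sequence of updates.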
Parallel Merge (Siebert et al., 2013)
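For two ordered arrays $A$ (length $m$), $B$ (length $n$) and $p$ PEs:
- Divide the output $C$ of length $m + n$ into $p$ output blocks of (almost) equal size.
- For the block starting at output index $i_r$ on PE $r$: Use co_rank to find $(j_r, k_r)$ with $j_r + k_r = i_r$ such that stable_merge($A[0..j_r)$, $B[0..k_r)$) produces prefix $C[0..i_r)$.
- Each PE merges $A[j_r..j_{r+1})$, $B[k_r..k_{r+1})$ to $C[i_r..i_{r+1})$.
- Time per PE: $O((m + n)/p + \log \min(m, n))$; total work optimal for $p = O((m + n)/\log \min(m, n))$.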
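The work-optimality condition follows from multiplying the per-PE time by the number of PEs. A short derivation, assuming input lengths $m$ and $n$, $p$ PEs, one block merge of about $(m+n)/p$ elements per PE, and one logarithmic co-rank search per block boundary (the symbols $T_p$ and $W_p$ are introduced here for illustration):

```latex
\begin{align*}
T_p &= O\!\left(\frac{m+n}{p} + \log\min(m,n)\right)
      && \text{(block merge plus co-rank search)} \\
W_p &= p \cdot T_p = O\!\bigl(m + n + p\,\log\min(m,n)\bigr)
      && \text{(total work over all PEs)} \\
W_p &= O(m+n)
      && \text{whenever } p = O\!\left(\frac{m+n}{\log\min(m,n)}\right)
\end{align*}
```

Beyond that number of PEs the co-rank searches dominate, so adding processors no longer yields proportional speedup.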
Merge Tree Metric (Pegoraro, 2021)
Given merge trees truncated at a large height $K$, represented as weighted trees $(T, w_T)$:
- Edit operations (with costs) are shrink, delete/insert, ghost, split.
- The minimal cost sum over edit paths gives the distance $d(T, T')$, which does not depend on $K$.
- Minimization over edge matchings $M$ between the two trees yields: $d(T, T') = \min_{M} \operatorname{cost}(M)$
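To make the notion of an edit path concrete, the sketch below applies three operations to a small weighted tree and sums their costs. The cost rules used here are assumptions made for illustration, since the text above only says that weight changes and deletions carry costs: a shrink costs the absolute change in edge weight, a deletion costs the weight of the removed edge, and a ghost (removing a vertex with exactly one child and fusing its two edges) is free. The tree encoding and function names are hypothetical, and the code evaluates one edit path rather than searching for the minimum.

```python
from typing import Dict, List, Tuple

# child vertex -> (parent vertex, weight of the edge to the parent)
Tree = Dict[str, Tuple[str, float]]


def children(tree: Tree, v: str) -> List[str]:
    return [c for c, (parent, _) in tree.items() if parent == v]


def shrink(tree: Tree, v: str, new_weight: float) -> Tuple[Tree, float]:
    """Change the weight of the edge above v. Assumed cost: |old - new|."""
    if new_weight <= 0:
        raise ValueError("edge weights must stay positive")
    parent, weight = tree[v]
    out = dict(tree)
    out[v] = (parent, new_weight)
    return out, abs(weight - new_weight)


def delete(tree: Tree, v: str) -> Tuple[Tree, float]:
    """Remove v and its edge; v's children attach to v's parent.
    Assumed cost: the weight of the removed edge."""
    parent, weight = tree[v]
    out = {c: ((parent, w) if p == v else (p, w))
           for c, (p, w) in tree.items() if c != v}
    return out, weight


def ghost(tree: Tree, v: str) -> Tuple[Tree, float]:
    """Remove a vertex with exactly one child, fusing its two edges.
    Assumed cost: zero, since total edge length is preserved."""
    kids = children(tree, v)
    if len(kids) != 1:
        raise ValueError("ghosting needs a vertex with exactly one child")
    parent, weight = tree[v]
    out = dict(tree)
    del out[v]
    out[kids[0]] = (parent, tree[kids[0]][1] + weight)
    return out, 0.0
```

Starting from the tree `{"a": ("r", 2.0), "b": ("a", 1.0), "c": ("a", 0.5)}`, deleting `c`, ghosting `a`, and shrinking the edge above `b` to 2.5 gives a path whose summed cost is an upper bound on the distance to the single-edge tree `{"b": ("r", 2.5)}` under these assumed costs.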
5. Comparative Performance, Stability, and Recommendations
Empirical investigations in LLM scheduling show that WSM delivers consistent, often significant, improvements over WSD across tasks and fine-tuning scenarios (Tian et al., 23 Jul 2025). Merge duration is the dominant hyperparameter; the recommended defaults are merge weights derived from the 1-sqrt decay curve and an offline-to-online merge transition. WSM is compatible with all major optimizers and decouples schedule selection from advance knowledge of the total training length.
In parallel merging (Siebert et al., 2013), the WSM approach attains near-optimal work efficiency, ties stability to a simple invariant (at every block boundary, equal keys from the first input precede those from the second), and simplifies distributed implementation relative to previous multi-selection schemes.
For merge trees (Pegoraro, 2021), the WSM distance offers a finitely stable metric that reflects structural changes better than both bottleneck and Wasserstein distances on persistence diagrams, with computational cost comparable to classical tree edit distances and no requirement for ad-hoc saddle corrections or index augmentations.
6. Applications and Extensions
- LLM Pre-training and Fine-tuning: Post-hoc model averaging using WSM yields better-performing end models without live LR decay, is suitable for schedule sweeps via checkpoint buffering, and generalizes to continual and curriculum learning scenarios (Tian et al., 23 Jul 2025).
- High-performance Sorting and Merging: WSM-style block partitioning and co-ranking enable linearly scaling parallel merges in sorting and database applications on both shared-memory and distributed-memory systems (Siebert et al., 2013).
- Topological Data Summaries: The WSM merge-tree metric is applied for shape-aware curve clustering, regression, and biomedical imaging comparisons, capitalizing on its stability and interpretability (Pegoraro, 2021).
7. Summary Table: WSM Variants and Contexts
| Domain | Core Operation | Stability Notion |
|---|---|---|
| LLM Scheduling | Weighted checkpoint merge | Emulation of LR decay |
| Parallel Algorithms | Blocked stable merge | Input-order preservation |
| Merge Trees | Weighted edit distance | Finite metric stability |
These applications illustrate the broad utility of the Warmup-Stable and Merge paradigm as a foundation for post-hoc optimality, stability, and robust aggregation across machine learning, algorithms, and computational topology.