Warmup-Stable and Merge (WSM) Techniques

Updated 3 July 2026

WSM is a paradigm that splits processes into warmup, stable, and merge phases to enhance efficiency, stability, and interpretability.
In LLM pre-training, WSM replaces decay schedules with checkpoint averaging, achieving improvements of up to +5.5% on benchmarks.
The approach extends to parallel merge algorithms and merge tree metrics, providing near-optimal load balancing and precise structural comparisons.

Warmup-Stable and Merge (WSM) refers to a class of techniques spanning learning rate scheduling in LLM pre-training (Tian et al., 23 Jul 2025), parallel merge algorithms (Siebert et al., 2013), and metric geometry on merge trees (Pegoraro, 2021). Each instantiation divides a process into an initial "warmup" or configuration phase, a "stable" or main operation phase, and a final "merge" or aggregation step applied to models, data, or structures. The WSM methodology emphasizes efficiency, stability, and interpretability.

1. WSM in Learning Rate Scheduling for LLMs

The WSM framework for LLM pre-training is a decay-free, post-hoc protocol that replaces conventional Warmup-Stable-Decay (WSD) learning rate schedules. Instead of decaying the learning rate over time, WSM consists of three successive phases: linear warmup, a constant-rate stable phase, and a model merge via checkpoint averaging.

The three phases are:

Warmup: Linear increase of learning rate from 0 to $\mathrm{lr}_{\mathrm{peak}}$ over $T_\mathrm{warmup}$ steps.
Stable: Continue training at $\mathrm{lr}_{\mathrm{peak}}$ without decay, optionally switching to an annealing dataset after step $T_\mathrm{switch}$.
Merge: Aggregate the parameters from the last $n$ checkpoints taken every $T_\mathrm{cpt}$ steps over a merge window $T_\mathrm{merge} = n \cdot T_\mathrm{cpt}$ using principled weighted averaging.
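
The warmup and stable phases, and the placement of the merge window, can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names `wsm_learning_rate` and `merge_window_steps` and the 0-based step convention are assumptions made here.

```python
def wsm_learning_rate(step: int, lr_peak: float, warmup_steps: int) -> float:
    """Learning rate at a 0-based step under warmup + stable (no decay)."""
    if warmup_steps > 0 and step < warmup_steps:
        # Warmup: linear ramp from 0 towards lr_peak.
        return lr_peak * step / warmup_steps
    # Stable: hold the peak rate for the rest of training.
    return lr_peak


def merge_window_steps(last_step: int, n_checkpoints: int, cpt_interval: int) -> list:
    """Steps of the last n checkpoints, spaced cpt_interval apart (hypothetical helper).

    The window they span has length T_merge = n_checkpoints * cpt_interval.
    """
    return [last_step - i * cpt_interval for i in reversed(range(n_checkpoints))]
```

For example, with `last_step=1000`, `n_checkpoints=4`, and `cpt_interval=100`, the merge would draw on the checkpoints saved at steps 700, 800, 900, and 1000.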

WSM establishes a rigorous equivalence between checkpoint merging and synthetic learning-rate decay. Specifically, for checkpoints $\{\theta_{t_1},\dots,\theta_{t_k}\}$ with parameters $W_i$ (the weights stored in $\theta_{t_i}$) and nonnegative merge weights $\{\alpha_i\}$, the merged model computes

$W_{\mathrm{merged}} = \frac{\sum_{i=1}^k \alpha_i W_i}{\sum_{i=1}^k \alpha_i}.$

A suitable choice of weights $\{\alpha_i\}$ recovers standard decay schemes, as shown by a general inverse mapping (Theorem 3.1 in Tian et al., 23 Jul 2025): for example, uniform averaging corresponds to linear decay, and 1-sqrt weights correspond to inverse-square-root decay. This direct mapping from decay curves to discrete merging weights allows standard schedules to be emulated post-hoc.
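
As a minimal sketch of the weighted average above, the following Python function merges checkpoints stored as plain dictionaries of float lists. The storage format and the name `merge_checkpoints` are assumptions for illustration; a real training stack would operate on framework tensors instead.

```python
def merge_checkpoints(checkpoints, alphas):
    """Weighted average of checkpoints.

    checkpoints: list of dicts mapping a parameter name to a flat list of floats
                 (an assumed, simplified storage format).
    alphas:      nonnegative merge weights, one per checkpoint.
    """
    if not checkpoints or len(checkpoints) != len(alphas):
        raise ValueError("need one weight per checkpoint")
    if any(a < 0 for a in alphas) or sum(alphas) <= 0:
        raise ValueError("weights must be nonnegative with a positive sum")

    total = sum(alphas)
    merged = {}
    for name, reference in checkpoints[0].items():
        # W_merged = sum_i alpha_i * W_i / sum_i alpha_i, applied elementwise.
        merged[name] = [
            sum(a * ckpt[name][j] for a, ckpt in zip(alphas, checkpoints)) / total
            for j in range(len(reference))
        ]
    return merged
```

Passing equal weights gives the uniform average; putting all weight on the last checkpoint returns that checkpoint unchanged.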

Empirical evaluation on the MATH, HumanEval, and MMLU-Pro benchmarks shows that WSM consistently outperforms WSD (e.g., +3.5% on MATH, +5.5% on MMLU-Pro). The same study identifies merge duration ($T_\mathrm{merge}$) as the most critical hyperparameter, ahead of checkpoint interval and checkpoint count.

2. WSM in Stable, Parallel Merging Algorithms

In parallel algorithms, WSM principles are embodied in perfectly load-balanced, stable merge procedures for sorted sequences (Siebert et al., 2013). The central innovation is the co-ranking algorithm: given sorted arrays $A$ and $B$ and an index into the not-yet-constructed output, it computes by binary search the exact input prefixes that produce that output prefix, so that each processing element (PE) can independently and stably merge its assigned output block.

Steps:

Warmup: Compute co-ranks for the block boundaries using an $O(\log \min(|A|, |B|))$ binary search per boundary.
Stable Main Work: Each PE merges its contiguous subarrays independently, preserving input order (stability).
Merge: The full result is constructed by concatenating the merged blocks.

Every PE receives an (almost) equal-sized output interval, and the input partition is chosen so that every block boundary preserves stability. The algorithm is work-optimal: each PE performs a merge that is linear in its block length plus a logarithmic co-rank search per boundary, so speedup scales with the number of PEs as long as the blocks are large relative to the search cost. Stability is enforced inherently by the co-ranking conditions, with no extra cost or need for tie-breaking indices. Synchronization is minimized, requiring only local computation and transfer.
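
The three steps can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the algorithm as published: `co_rank` here is a one-sided binary search over the number of elements taken from the first array, ties are resolved in favor of the first array, the `key` parameter is an addition for the example, and a sequential loop stands in for the PEs.

```python
def co_rank(i, a, b, key=lambda x: x):
    """Return (j, k), j + k == i, such that stably merging a[:j] and b[:k]
    yields the first i output elements (equal keys: a before b)."""
    lo, hi = max(0, i - len(b)), min(i, len(a))
    while lo < hi:
        j = (lo + hi) // 2
        k = i - j
        # If b[k-1] >= a[j], then a[j] precedes b[k-1] in the stable output,
        # so the prefix must take more elements from a.
        if key(b[k - 1]) >= key(a[j]):
            lo = j + 1
        else:
            hi = j
    return lo, i - lo


def stable_merge(a, b, key=lambda x: x):
    """Sequential stable two-way merge; on equal keys, elements of a come first."""
    out, j, k = [], 0, 0
    while j < len(a) and k < len(b):
        if key(a[j]) <= key(b[k]):
            out.append(a[j])
            j += 1
        else:
            out.append(b[k])
            k += 1
    out.extend(a[j:])
    out.extend(b[k:])
    return out


def blocked_merge(a, b, p, key=lambda x: x):
    """Merge a and b in p output blocks of (almost) equal size."""
    total = len(a) + len(b)
    # Warmup: co-ranks at the p + 1 block boundaries of the output.
    bounds = [r * total // p for r in range(p + 1)]
    cuts = [co_rank(i, a, b, key) for i in bounds]
    out = []
    # Stable main work: each iteration is what one PE would do independently.
    for (j0, k0), (j1, k1) in zip(cuts, cuts[1:]):
        # Merge: blocks are concatenated in order.
        out.extend(stable_merge(a[j0:j1], b[k0:k1], key))
    return out
```

Because each block depends only on its own pair of co-ranks, the loop body could be dispatched to separate workers without further coordination.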

3. WSM as a Metric for Merge Trees

WSM also denotes a finitely stable edit distance for merge trees in topological data analysis (Pegoraro, 2021). Given two merge trees $T$ and $T'$, the WSM distance $d_{\mathrm{WSM}}(T, T')$ is defined as the minimum total cost over finite sequences of allowed tree-edit operations (shrink/rescale, delete/insert, ghost, split), where edge-weight changes and deletions carry specified costs.

Key properties:

Stability: The WSM distance $d_{\mathrm{WSM}}$ is finitely stable, in that it admits an upper bound in terms of the interleaving distance between the two trees.
Metric Status: $d_{\mathrm{WSM}}$ is a metric; in particular, it satisfies the triangle inequality.
Computation: Calculated by solving a sequence of small binary linear programs, with overall complexity comparable to classical unordered tree distances.
Discriminativity: WSM distinguishes subtle tree differences undetected by persistence diagram Wasserstein or bottleneck metrics.

WSM has demonstrated utility in curve regression, clustering, and biomedical imaging, where its finite stability and interpretability enable robust structure-aware comparisons.

4. Algorithmic and Theoretical Details

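Learning Rate Merging (Tian et al., 23 Jul 2025)

For $k$ checkpoints $W_1, \dots, W_k$ and normalized weights $c_i = \alpha_i / \sum_{j=1}^k \alpha_j$, write $\Delta_i = W_i - W_{i-1}$ for the accumulated update (the sum of scaled gradient steps) between consecutive checkpoints. The merged model then equals the first checkpoint plus a decayed sum of these updates:

$W_{\mathrm{merged}} = W_1 + \sum_{i=2}^k w_i \Delta_i, \qquad w_i = \sum_{j=i}^k c_j.$

Conversely, any non-increasing decay curve $1 = w_1 \ge w_2 \ge \dots \ge w_k \ge 0$ yields unique non-negative checkpoint weights

$c_i = w_i - w_{i+1}, \qquad w_{k+1} = 0.$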

Approximations for linear, cosine, and 1-sqrt decays are explicit, enabling curvature-matched merging.
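
The mapping from a decay curve to checkpoint weights can be checked numerically. The sketch below assumes scalar "checkpoints" for readability, a decay curve given as multipliers $w_1 = 1 \ge w_2 \ge \dots \ge w_k \ge 0$, and weights obtained as successive differences of that curve; the function names and the particular 1-sqrt parameterization are illustrative choices, not taken from the source.

```python
import math


def checkpoint_weights(decay):
    """Convert decay multipliers w_1..w_k into merge weights c_i = w_i - w_{i+1}."""
    if abs(decay[0] - 1.0) > 1e-12:
        raise ValueError("decay curve must start at 1")
    if decay[-1] < 0 or any(later > earlier for earlier, later in zip(decay, decay[1:])):
        raise ValueError("decay curve must be non-increasing and nonnegative")
    shifted = list(decay[1:]) + [0.0]  # w_{k+1} = 0
    return [w - w_next for w, w_next in zip(decay, shifted)]


def linear_decay(k):
    """Multipliers 1, (k-1)/k, ..., 1/k."""
    return [(k - i) / k for i in range(k)]


def one_minus_sqrt_decay(k):
    """Multipliers 1 - sqrt(i/k) for i = 0..k-1 (assumed 1-sqrt form)."""
    return [1.0 - math.sqrt(i / k) for i in range(k)]


def merged_value(checkpoints, weights):
    """Weighted checkpoint average; weights are assumed to sum to 1."""
    return sum(c * w for c, w in zip(weights, checkpoints))


def decayed_value(first, deltas, decay):
    """First checkpoint plus updates scaled by the decay multipliers w_2..w_k."""
    return first + sum(w * d for w, d in zip(decay[1:], deltas))


# Toy run: four scalar checkpoints produced by three constant-rate updates.
deltas = [-0.4, -0.3, -0.2]
checkpoints = [1.0, 0.6, 0.3, 0.1]
decay = one_minus_sqrt_decay(4)
weights = checkpoint_weights(decay)
assert abs(merged_value(checkpoints, weights) - decayed_value(1.0, deltas, decay)) < 1e-9
```

With `linear_decay(4)` the resulting weights are all 0.25, which matches the statement that uniform averaging corresponds to linear decay.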

Parallel Merge (Siebert et al., 2013)

For two ordered arrays $A$, $B$ and $p$ PEs:

Divide the output array $C$, of length $|A| + |B|$, into $p$ output blocks.
For block $r$ on PE $r$: Use co_rank to find $(j_r, k_r)$ such that stable_merge of $A[0..j_r)$ and $B[0..k_r)$ produces the prefix $C[0..i_r)$, where $i_r$ is the start of block $r$.
Each PE merges $A[j_r..j_{r+1})$ and $B[k_r..k_{r+1})$ into $C[i_r..i_{r+1})$.
Time per PE: $O\left(\frac{|A| + |B|}{p} + \log \min(|A|, |B|)\right)$; total work is optimal for $p = O\left(\frac{|A| + |B|}{\log \min(|A|, |B|)}\right)$.

Merge Tree Metric (Pegoraro, 2021)

Given merge trees truncated at a large height $K$, represented as weighted trees $(T, w_T)$:

Edit operations (with costs) are shrink, delete/insert, ghost, split.
The minimal sum over edit-paths gives the WSM distance $d_{\mathrm{WSM}}$, which does not depend on $K$.
Minimization over edge matchings $M$ yields: $d_{\mathrm{WSM}}(T, T') = \min_{M} \mathrm{cost}(M)$.

5. Comparative Performance, Stability, and Recommendations

Empirical investigations in LLM scheduling show that WSM delivers consistent, often significant, improvements over WSD across tasks and fine-tuning scenarios (Tian et al., 23 Jul 2025). Merge duration is the dominant hyperparameter for performance; 1-sqrt merge weights and a transition from offline to online merging are robust recommendations. WSM is compatible with all major optimizers and decouples schedule selection from knowledge of the total number of training steps.

In parallel merging (Siebert et al., 2013), the WSM approach attains near-optimal work efficiency, ties stability to a simple invariant (never splitting equal-key blocks at block boundaries), and simplifies distributed implementation relative to previous multi-selection schemes.

For merge trees (Pegoraro, 2021), the WSM distance offers a finitely stable metric that reflects structural changes better than both bottleneck and Wasserstein distances on persistence diagrams, with computational cost comparable to classical tree edit distances and no requirement for ad-hoc saddle corrections or index augmentations.

6. Applications and Extensions

LLM Pre-training and Fine-tuning: Post-hoc model averaging using WSM yields better-performing end models without live LR decay, is suitable for schedule sweeps via checkpoint buffering, and generalizes to continual and curriculum learning scenarios (Tian et al., 23 Jul 2025).
High-performance Sorting and Merging: WSM-style block partitioning and co-ranking facilitate linear scaling parallel merges in sorting and database applications on both shared-memory and distributed-memory systems (Siebert et al., 2013).
Topological Data Summaries: The WSM merge-tree metric is applied for shape-aware curve clustering, regression, and biomedical imaging comparisons, capitalizing on its stability and interpretability (Pegoraro, 2021).

7. Summary Table: WSM Variants and Contexts

Domain	Core Operation	Stability Notion
LLM Scheduling	Weighted checkpoint merge	Emulation of LR decay
Parallel Algorithms	Blocked stable merge	Input-order preservation
Merge Trees	Weighted edit distance	Finite metric stability

These applications illustrate the broad utility of the Warmup-Stable and Merge paradigm as a foundation for post-hoc optimality, stability, and robust aggregation across machine learning, algorithms, and computational topology.

References

Tian et al. (2025). WSM: Decay-Free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training.
Siebert et al. (2013). Perfectly load-balanced, optimal, stable, parallel merge.
Pegoraro (2021). A Finitely Stable Edit Distance for Merge Trees.
