Extra-Merge: Tracing the Rank-1 Subspace of Model Merging in Language Model Pre-Training
Abstract: Model merging has emerged as a lightweight paradigm for enhancing LLMs, yet its underlying mechanisms remain poorly understood. In this work, we analyze late-stage pre-training trajectories and uncover a Rank-1 Subspace phenomenon: while raw optimization steps oscillate violently, consecutive merged checkpoints collapse onto a stable, approximately one-dimensional linear manifold. We theoretically ground this observation in a river-valley landscape analysis: averaging acts as a geometric low-pass filter that dampens high-curvature noise to reveal the optimal descent direction. Capitalizing on this insight, we propose Extra-Merge, a training-free strategy that extrapolates along this subspace to minimize loss without additional gradient updates. Extensive experiments across GPT-2 and LLaMA families (124M to 2B) demonstrate that Extra-Merge consistently outperforms standard merging baselines. Notably, it yields consistent zero-shot accuracy gains on Pythia-12B downstream tasks and generalizes effectively to the Muon optimizer (Jordan et al., 2024).
Explain it Like I'm 14
What this paper is about (big picture)
The paper looks at a simple, low-cost trick to make LLMs a bit better without more training: “model merging,” which means averaging together several saved versions of a model near the end of training. The authors discover why this works and how to push it further. They find that when you average late-stage checkpoints, the models line up along almost a straight line (an “almost 1D path”) in the huge space of model weights. Using this, they propose “Extra-Merge,” a way to take a careful extra step along that straight path to get even better performance—without doing any new training.
What questions the paper asks
The paper focuses on two simple questions:
- What shape do the “averaged models” make if you look at a sequence of them? Is there a simple pattern?
- If there is a simple pattern, can we follow it a little farther to improve the model without doing more gradient updates (i.e., without more expensive training)?
How they studied it (in everyday terms)
First, a few quick definitions in plain language:
- A “checkpoint” is a saved copy of the model during training—like a snapshot of its brain at a moment in time.
- “Averaging checkpoints” means taking several of those snapshots near the end and averaging their parameters (their numbers) to get a smoother, more stable model.
- “Loss” is a score for how wrong the model is; lower is better.
- “Principal Component Analysis (PCA)” is a tool that finds the main direction in which a bunch of points vary—like noticing that footprints on a beach mostly follow one direction along the shore.
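To make "averaging checkpoints" concrete, here is a minimal Python sketch. It assumes each checkpoint is a plain dictionary of NumPy arrays, which is only a stand-in for a real framework's saved state; the function name `average_checkpoints` and the toy numbers are made up for illustration.

```python
import numpy as np

def average_checkpoints(checkpoints):
    """Uniformly average a list of checkpoints.

    Each checkpoint is a dict mapping a parameter name to a NumPy array
    (an assumed stand-in for a real framework's state dict).
    """
    names = checkpoints[0].keys()
    return {
        name: np.mean([ckpt[name] for ckpt in checkpoints], axis=0)
        for name in names
    }

# Three toy snapshots of a model with a single 2-element weight vector.
# The first coordinate makes steady progress; the second one wobbles.
snapshots = [
    {"w": np.array([1.0, 4.0])},
    {"w": np.array([2.0, -4.0])},
    {"w": np.array([3.0, 3.0])},
]
merged = average_checkpoints(snapshots)
print(merged["w"])  # the wobbling coordinate shrinks toward the middle
```

In this toy case the wobbling coordinate (4, -4, 3) averages to 1, much closer to zero than any single snapshot, while the steadily moving coordinate keeps its middle value.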
Their approach is a bit like hiking in a narrow valley:
- Raw training steps: The model’s path bounces side-to-side between steep valley walls (high-curvature directions), sometimes going forward, sometimes wobbling.
- Averaging: If you take several recent positions and average them, you “smooth out” the side-to-side wobbles and end up closer to the valley floor (the safe middle path).
- Geometry check: They test whether averaged checkpoints line up along a simple path by:
- Interpolation tests: Draw a straight line between two checkpoints and measure the loss along that line. Between raw checkpoints, the loss often dips in the middle (a U-shape), showing they’re on opposite valley walls. Between averaged checkpoints, the loss decreases smoothly—showing a straight, downhill path.
- PCA on sequences of checkpoints: They collect several consecutive checkpoints and use PCA to see if most of the variation is along one direction. If so, that means the path is almost a straight line.
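Both geometry checks can be tried on a toy valley. The sketch below is an illustration under assumptions, not the paper's setup: the two-parameter loss, the bouncing trajectory, and the helper names are all invented here. It measures the loss along straight lines and the share of variance captured by the first principal component.

```python
import numpy as np

def loss(theta, slope=1.0, curvature=10.0):
    """Toy valley (assumed): gentle downhill along x, steep walls along y."""
    x, y = theta
    return -slope * x + 0.5 * curvature * y**2

def interpolation_losses(theta_a, theta_b, num_points=5):
    """Loss at evenly spaced points on the line from theta_a to theta_b."""
    ts = np.linspace(0.0, 1.0, num_points)
    return np.array([loss((1 - t) * theta_a + t * theta_b) for t in ts])

def pc1_explained_variance(points):
    """Fraction of total variance captured by the first principal component."""
    centered = points - points.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values**2
    return variances[0] / variances.sum()

# Raw checkpoints: steady progress in x, side-to-side bouncing in y.
raw = np.array([[float(i), (-1.0) ** i] for i in range(6)])
# Merged checkpoints: average of each pair of consecutive raw checkpoints.
merged = 0.5 * (raw[:-1] + raw[1:])

raw_curve = interpolation_losses(raw[0], raw[1])
merged_curve = interpolation_losses(merged[0], merged[-1])
raw_evr = pc1_explained_variance(raw)
merged_evr = pc1_explained_variance(merged)

print("raw interpolation:   ", raw_curve)     # dips in the middle
print("merged interpolation:", merged_curve)  # decreases steadily
print("PC1 share, raw vs merged:", raw_evr, merged_evr)
```

In this toy example the raw line crosses the valley floor, so its loss is lowest in the middle, while the merged line runs along the floor and only goes down. The merged points also put a larger share of their variance on one direction than the raw points do.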
Based on this, they propose Extra-Merge:
- Step 1: Find the main direction using the last few averaged checkpoints (PCA finds the “straight path”).
- Step 2: Take a small step farther along that direction and check the loss. If it improved, keep the step; if not, stop. This is like taking one extra stride down the valley floor after smoothing out your path.
Importantly, this needs no gradients or retraining—it’s “training-free.”
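The two steps can be sketched in a few lines of Python. This is a simplified illustration, not the authors' implementation: the fixed stride, the stopping rule, the toy loss, and the names `principal_direction` and `extra_merge` are all assumptions made for the example.

```python
import numpy as np

def principal_direction(merged):
    """Step 1: first principal component of the merged checkpoints."""
    centered = merged - merged.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]
    # PCA leaves the sign ambiguous; point it along the direction of travel.
    if np.dot(direction, merged[-1] - merged[0]) < 0:
        direction = -direction
    return direction

def extra_merge(merged, loss_fn, stride=0.5, max_steps=20):
    """Step 2: keep stepping along the direction while the loss improves."""
    direction = principal_direction(merged)
    theta = merged[-1].copy()
    best = loss_fn(theta)
    for _ in range(max_steps):
        candidate = theta + stride * direction
        candidate_loss = loss_fn(candidate)
        if candidate_loss >= best:
            break  # no improvement: keep the last good point
        theta, best = candidate, candidate_loss
    return theta, best

def toy_loss(theta):
    """Assumed toy loss: minimum at x = 6 along the valley, steep in y."""
    return 0.5 * (theta[0] - 6.0) ** 2 + 5.0 * theta[1] ** 2

# Merged checkpoints lying almost on a straight line, with small leftover wobble.
merged = np.array([[0.5, 0.05], [1.5, -0.03], [2.5, 0.02], [3.5, -0.01], [4.5, 0.0]])

start_loss = toy_loss(merged[-1])
theta, final_loss = extra_merge(merged, toy_loss)
print("loss before:", start_loss, "loss after:", final_loss)
```

Only loss evaluations are used here, with no gradients, which mirrors the "training-free" property. In a real setting `loss_fn` would be a validation-loss evaluation, and the stride would need care to avoid overshooting.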
What they found and why it matters
- Almost 1D path after averaging: Across different model sizes (GPT-2 124M–1.55B and LLaMA 0.5B–2B), more than 94% of the movement between averaged checkpoints is along one main direction. That means the averaged models lie on an almost straight line.
- Smoother progress: Between raw checkpoints, the loss curve looks like a U-shape (showing side-to-side bouncing). Between averaged checkpoints, the loss just goes down steadily (monotonic), meaning you’re following the valley floor.
- Extra-Merge helps consistently: By stepping a little farther along that main direction, Extra-Merge reduces validation loss more than standard averaging methods. It also gives small but steady gains on real tasks:
- On Pythia-12B, Extra-Merge improves average zero-shot accuracy across benchmarks (ARC, HellaSwag, PIQA) by about +0.59% over the raw model and +0.40% over uniform averaging.
- Works across optimizers: It helps not only with AdamW (a common optimizer) but also with Muon (which uses different update rules), suggesting the effect is general.
- Theory that matches the picture: They explain the valley idea mathematically. Averaging acts like a “low-pass filter,” which means it smooths out the noisy wiggles and reveals the underlying direction downhill. PCA then finds that direction, and stepping a bit farther along it keeps lowering the loss.
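One way to see the "low-pass filter" effect is a deliberately simplified model. This is an assumption of the sketch, not the paper's own derivation: the weights drift at a constant rate $v$ along the valley, plus independent zero-mean noise $\varepsilon_t$ with covariance $\sigma^2 I$ at every checkpoint.

```latex
\theta_t = \theta_0 + t\,v + \varepsilon_t,
\qquad
\bar\theta_t = \frac{1}{N}\sum_{i=0}^{N-1}\theta_{t-i}
            = \theta_0 + \Bigl(t - \tfrac{N-1}{2}\Bigr)v
              + \frac{1}{N}\sum_{i=0}^{N-1}\varepsilon_{t-i},

\operatorname{Cov}\Bigl(\frac{1}{N}\sum_{i=0}^{N-1}\varepsilon_{t-i}\Bigr)
  = \frac{\sigma^2}{N}\,I,
\qquad
\bar\theta_{t+1} - \bar\theta_t
  = v + \frac{\varepsilon_{t+1} - \varepsilon_{t-N+1}}{N}.
```

Under these assumptions, averaging $N$ checkpoints keeps the drift $v$ intact while shrinking the noise variance by a factor of $N$, so the step between consecutive averaged checkpoints is dominated by $v$. Real training noise is correlated over time, so the actual reduction will differ.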
Why this is useful (simple implications)
- Free improvements: You can make a model better after training ends—without extra expensive gradient steps—by smartly averaging and then taking one careful extra step (Extra-Merge).
- Faster “convergence”: You reach a better model in fewer training steps overall because you get extra gains from the optimization path that’s already there.
- Better understanding: It gives a clear mental image of late-stage training: raw updates wobble, averaging smooths, and the smoothed path is almost a straight line you can follow a bit further.
- Future ideas: Knowing that a “straight path” emerges suggests new training tricks—like designing optimizers that stay close to that path, or adapting the path when switching domains (for transfer learning) to get benefits faster and cheaper.
In short, the paper shows that near the end of training, averaging checkpoints reveals a simple, almost straight path toward better models. Extra-Merge follows that path a little farther—no extra training required—and reliably improves results.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a concise, action-oriented list of what remains missing, uncertain, or unexplored in the paper.
- Scale generality: Does the Rank-1 subspace phenomenon (EVR > 94% in PC1) persist for substantially larger models (e.g., 7B–70B and beyond) and production-scale LLMs, not just up to 2B pretraining and 12B downstream evaluation?
- Post-training regimes: How do the Rank-1 phenomenon and Extra-Merge behave for instruction-tuned, RLHF-aligned, or multi-modal models, where loss geometry and training dynamics differ from pure pre-training?
- Earlier training phases: The analysis focuses on late-stage pre-training; is there a measurable Rank-1 manifold (and net gain from Extra-Merge) during warmup/high-LR phases or mid-training, and how does it evolve over the course of training?
- Data shifts and domain adaptation: How stable is the subspace under distribution shifts (e.g., continued pre-training on new corpora, domain adaptation, mixture-of-domains)? Does the principal direction reorient, and can Extra-Merge adapt safely?
- Optimizer coverage: Beyond AdamW and a small-scale Muon test, does the effect hold for Adafactor, SGD+momentum, RMSProp, Sophia, and other adaptive/second-order methods, especially at larger scales?
- Baseline breadth: Comparisons are limited to PMA and simple EMAs; how does Extra-Merge fare against stronger merging/acceleration baselines such as Lookahead, SWA with cyclical LRs, Fisher-weighted averaging, Bayesian-optimized merge weights, model souping, and recent scheduler-integrated merging (e.g., WSM)?
- Sensitivity and ablations: The paper lacks systematic ablations on merge interval T, window size N, PCA window K, and extrapolation stride α; what are safe, robust defaults and the sensitivity ranges across models, datasets, and optimizers?
- Automated applicability criterion: The theory requires SNR p > 1 for PCA to recover the descent direction; how can p (or a proxy) be estimated online to decide when Extra-Merge is safe/useful, and to adapt N, K, T automatically?
- Failure modes and guardrails: Under what conditions does extrapolation overshoot or increase loss (e.g., valley bends, regime changes, optimizer restarts)? What trust-region heuristics, curvature checks, or rollback criteria minimize risk?
- Beyond 1D extrapolation: Is there measurable gain from 2–3D subspace search (e.g., trust-region optimization in the span of top PCs), or from incorporating curvature along PC1 to choose step sizes in a more principled way than greedy line search?
- Layer-/module-wise structure: Does the Rank-1 structure hold uniformly across layers/modules, or do different blocks exhibit different principal directions? Would blockwise/local PCA and per-module extrapolation yield larger or safer gains?
- Metric and parameterization issues: PCA is performed in raw Euclidean weight space, which ignores parameter symmetries and scale invariances; would Fisher-Rao or function-space metrics (or weight reparameterizations) change the observed rank and direction?
- Computational scalability: Performing PCA on billions of parameters is nontrivial; what are the memory/time costs at scale, and do randomized/streaming/blockwise PCA approximations preserve direction quality and gains?
- Checkpointing overhead: Extra-Merge assumes frequent, regularly spaced checkpoints; what is the trade-off between checkpoint frequency, storage budget, and downstream gains, and how sparse can checkpoints be before performance degrades?
- Validation cost: The adaptive line search requires repeated validation evaluations; what is the overhead at scale, and can one reduce it (e.g., with proxy losses, subsets, or low-cost estimators) without harming outcomes?
- Continued-training stability: If training resumes from an Extra-Merge-extrapolated model, how do missing optimizer states (e.g., moments) and extrapolated weights affect stability, convergence, and final generalization?
- Generalization breadth: Improvements are shown on a small set of zero-shot tasks; do gains transfer to broader suites (e.g., MMLU, GSM8K, BBH, HumanEval, multilingual) and to calibration, robustness, and toxicity metrics?
- Robustness across runs: Are the Rank-1 EVR statistics and Extra-Merge gains consistent across multiple random seeds, data shuffles, and hardware/software stacks?
- Nonlinear path geometry: The monotonic interpolation/extrapolation analyses assume locally linear valleys; how often does the valley bend or fork, and how far along PC1 can one move before leaving the low-loss manifold?
- Theoretical scope: The river–valley analysis posits a 1D “river”; when is the river higher-dimensional, and how does that affect PCA’s ability to recover the descent subspace and the correctness of 1D extrapolation?
- Optimizer-state-aware theory: The theoretical model approximates late-stage SGD-like dynamics; can the theory be extended to AdamW/Muon with momentum, decoupled weight decay, and adaptive preconditioning to match practice?
- Practical guidance: The paper gives qualitative hyperparameter advice but lacks quantitative recipes (e.g., closed-form or empirical rules for N, K, T, α given noise levels, batch sizes, and LR schedules).
- Interaction with learning-rate schedules and batch size: How do different LR schedules (constant, cosine restarts, cyclical) and batch sizes/noise scales affect SNR, rank collapse, and the efficacy of Extra-Merge?
- Safety and alignment: Do extrapolated weights preserve alignment properties, refusal behavior, and safety guarantees in RLHF/instruction-tuned models, or can extrapolation degrade alignment even if perplexity improves?
- Cross-task and multi-task merging: Can the method be generalized from a single-run trajectory to merging trajectories across tasks/datasets (e.g., task arithmetic) without harming performance due to destructive interference?
- Function-space alternatives: Would extrapolating in function space (e.g., via feature representations or ensemble distillation) provide safer or more predictable improvements than weight-space extrapolation?
- Reproducibility and release: Implementation details for large-scale PCA and line search are sparse; releasing code and detailed settings (especially for 2B and Pythia-12B runs) would enable independent verification and stress testing.
Practical Applications
Practical Applications of “Extra-Merge: Tracing the Rank‑1 Subspace of Model Merging in LLM Pre‑Training”
Below are actionable real-world applications derived from the paper’s findings and the Extra-Merge method. They are grouped by time-to-deploy and annotated with sectors, possible tools/workflows, and feasibility assumptions.
Immediate Applications
These can be deployed now with modest engineering effort, using existing checkpoints and tooling.
- Training-free performance boosts at the end of pre-training
- What: Add an Extra-Merge “finalization” step that extrapolates along the rank-1 subspace extracted from the K most recent merged checkpoints to reduce loss without gradient updates.
- Sectors: Software/AI infrastructure, Cloud ML, Foundation model labs; indirect benefits to all downstream sectors using LLMs (healthcare, finance, education).
- Tools/workflows:
- A PyTorch/DeepSpeed/Megatron-LM plugin that: (1) retains a rolling window of late-stage checkpoints, (2) performs PMA/LAWA, (3) runs PCA on merged checkpoints, (4) does a 1D adaptive line search, (5) emits finalized weights.
- CLI utility (“extra-merge”) for post-hoc improvement of open checkpoints (e.g., Pythia-12B) before release.
- Assumptions/dependencies:
- Access to several late-stage checkpoints and a validation set for line-search evaluation.
- The late-stage trajectory exhibits the paper’s observed Rank-1 subspace (validated for 124M–2B parameters; successful zero-shot gains for Pythia‑12B).
- Modest forward-pass budget for line search; careful step-size to avoid overshoot.
- Compute and carbon savings via earlier stopping at higher quality
- What: Use Extra-Merge to reach a lower loss at the same step (or the same loss earlier), cutting GPU hours and CO₂ emissions.
- Sectors: Energy/sustainability reporting, Cloud cost optimization; any industry running LLM training.
- Tools/workflows:
- “Early-stop with Extra-Merge” policy: when LR enters decay and metrics plateau, trigger Extra-Merge to meet targets sooner.
- Carbon dashboards that attribute reductions to training-free extrapolation.
- Assumptions/dependencies:
- Reliable validation metrics correlate with downstream goals.
- Checkpoint cadence (T) and window (N, K) are tuned to reach SNR > 1 as per the paper’s theory.
- Optimizer-agnostic model finalization
- What: Apply Extra-Merge regardless of optimizer (validated on AdamW and Muon’s orthogonal updates), making it a robust, generic post-processing step.
- Sectors: Software/AI infrastructure, open-source model release pipelines.
- Tools/workflows:
- Optimizer-agnostic checkpoint finalizer embedded in training frameworks.
- Assumptions/dependencies:
- Late-stage checkpoints saved with sufficient spacing to decorrelate “mountain” noise (theory suggests larger T helps).
- MLOps best practice: checkpoint retention and geometric health monitoring
- What:
- Adopt a late-stage checkpoint retention policy (e.g., keep last 8–10 at fixed intervals) to enable Extra-Merge.
- Add a “Rank‑1 monitor” that runs PCA on merged checkpoints and alerts when PC1 variance ratio drops (potential training instability).
- Sectors: MLOps/DevOps, enterprise AI.
- Tools/workflows:
- Lightweight PCA on parameter deltas or low-rank parameter sketches to monitor EVR(PC1).
- Assumptions/dependencies:
- Storage budget for late-stage checkpoints; PCA on large models may require sharded or blockwise methods.
- “Free” quality bump for open-source and internal models before release
- What: For models with accessible checkpoints (e.g., research or internal), use Extra-Merge to improve perplexity and obtain small but consistent zero-shot accuracy gains (e.g., +0.59% avg over the raw model on Pythia‑12B across ARC, HellaSwag, PIQA).
- Sectors: Open-source communities, product teams deploying general-purpose chatbots and assistants.
- Tools/workflows:
- Release engineering step that consumes a directory of checkpoints and exports an extrapolated “v1.0-final” model.
- Assumptions/dependencies:
- Public or internal availability of intermediate checkpoints; small additional evaluation budget for line search.
- Lightweight enhancement for continual/periodic training workflows
- What: In continuous pre-training or periodic refreshes, attach an Extra-Merge pass at the end of each cycle to “harvest” late-stage improvements.
- Sectors: Enterprise search, recommendation systems, in-house LLMs supporting healthcare/finance/education content.
- Tools/workflows:
- Scheduled job in Airflow/Argo to run Extra-Merge after each training tranche.
- Assumptions/dependencies:
- Stable pre-training objective within each tranche; retained late-stage checkpoints.
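The "Rank‑1 monitor" idea from the MLOps item above can be sketched without ever forming a parameter-by-parameter covariance matrix. The example below is a minimal sketch under assumptions: checkpoints are flattened NumPy vectors, the function names are hypothetical, and the 0.94 threshold is borrowed from the >94% figure reported in the paper purely as an illustrative default.

```python
import numpy as np

def pc1_evr_from_gram(checkpoints):
    """EVR of PC1 computed from the small K x K Gram matrix.

    The eigenvalues of Xc @ Xc.T equal the squared singular values of the
    centered K x d matrix Xc, so only a K x K eigenproblem is solved.
    """
    X = np.stack(checkpoints)            # K x d
    Xc = X - X.mean(axis=0)              # center each parameter
    gram = Xc @ Xc.T                     # K x K
    eigvals = np.linalg.eigvalsh(gram)   # ascending order
    eigvals = np.clip(eigvals, 0.0, None)
    total = eigvals.sum()
    return eigvals[-1] / total if total > 0 else 0.0

def rank1_alert(checkpoints, threshold=0.94):
    """True when PC1 explains less variance than the (assumed) threshold."""
    return pc1_evr_from_gram(checkpoints) < threshold

rng = np.random.default_rng(0)
dim = 1000
direction = rng.standard_normal(dim)
direction /= np.linalg.norm(direction)

# Healthy: steady drift along one direction plus small noise.
healthy = [k * direction + 0.005 * rng.standard_normal(dim) for k in range(8)]
# Unhealthy: no shared direction at all.
noisy = [rng.standard_normal(dim) for _ in range(8)]

print("healthy EVR:", pc1_evr_from_gram(healthy), "alert:", rank1_alert(healthy))
print("noisy EVR:  ", pc1_evr_from_gram(noisy), "alert:", rank1_alert(noisy))
```

Because centering is done per parameter, the Gram matrix is a sum of per-shard Gram matrices, so in principle it could be accumulated shard by shard for models that do not fit in memory. Whether that preserves the quality reported in the paper at scale is untested here.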
Long-Term Applications
These require further research, scaling, or broader validation (e.g., larger models, diverse tasks, modalities, safety considerations).
- Subspace-aware optimizers and schedulers
- What: Design training algorithms that explicitly align updates with the inferred rank‑1 “river” direction; adapt LR schedules/samplers to remain on the valley floor.
- Sectors: Software/AI infrastructure, chip vendors (compiler/firmware level), research labs.
- Tools/products:
- “River-following” optimizers; LR schedulers that use EVR(PC1) as a control signal.
- Assumptions/dependencies:
- Robust online estimation of subspace under non-stationary data; validation on larger scales (e.g., >13B).
- Fast domain adaptation via subspace reorientation
- What: Map how the rank‑1 subspace shifts when the data distribution changes; perform training-free or few-step extrapolation to adapt foundation models to new domains (e.g., clinical, legal).
- Sectors: Healthcare, finance, legal-tech, scientific NLP.
- Tools/products:
- “Geometric adapters” that compute domain-specific subspace adjustments from short adaptation runs.
- Assumptions/dependencies:
- Stable and measurable subspace drift across domains; careful evaluation of safety and compliance risks in regulated sectors.
- Safer and more reproducible model finalization standards
- What: Establish release standards that include a short window of late-stage checkpoints and a documented Extra-Merge recipe to improve reproducibility and energy efficiency.
- Sectors: Policy/governance, procurement, academic publishing.
- Tools/workflows:
- Model cards with checkpoint-window metadata; reproducible “finalization” scripts and seeds.
- Assumptions/dependencies:
- Willingness to share late-stage checkpoints (IP/safety considerations); standardized evaluation suites.
- AutoML and training controllers that use geometric signals
- What: Incorporate EVR(PC1), monotonicity along PC1, and SNR estimates into AutoML systems for dynamic hyperparameter control (e.g., when to decay LR, when to stop, when to checkpoint more frequently).
- Sectors: Cloud ML platforms, enterprise ML.
- Tools/products:
- Controllers that adjust T, N, K online to satisfy SNR > 1 for stable extrapolation.
- Assumptions/dependencies:
- Reliable online metrics; low-overhead geometric estimation for very large models.
- Extending to fine-tuning, RLHF, and multimodal models
- What: Explore whether the rank‑1 subspace and Extra-Merge extrapolation hold during instruction tuning, RLHF phases, and cross-modal pre-training (vision, speech, robotics).
- Sectors: General AI products, robotics, multimodal assistants, education technology.
- Tools/products:
- “Training-free” quality boosts post-RLHF; extrapolated checkpoints for multimodal encoders/decoders.
- Assumptions/dependencies:
- Empirical verification in non-language domains and post-RLHF phases; safety testing to avoid undesired behavior shifts.
- Cross-run and multi-objective “soups with extrapolation”
- What: Combine model soups/merging from multiple seeds or objectives, then perform rank‑1 extrapolation on the merged trajectory to push further into low-loss regions.
- Sectors: Research, ensemble/deployment optimization in industry.
- Tools/products:
- Soup-then-extrapolate pipelines; Bayesian or evolutionary search over merge weights followed by Extra-Merge.
- Assumptions/dependencies:
- Low-loss linear connectivity among models; robust detection of a coherent descent direction after merging.
- Hardware- and system-level accelerations
- What: Add parameter-sketching PCA and 1D line search kernels into training stacks (e.g., NCCL-integrated sharded PCA, GPU-friendly eigen-solvers) for large-scale deployment.
- Sectors: Cloud providers, accelerator vendors.
- Tools/products:
- Sharded/streaming PCA for 70B+ models; inference-only extrapolation services.
- Assumptions/dependencies:
- Efficient memory/distributed implementations; validation of numerical stability at scale.
- Public-good and sustainability initiatives
- What: Policymakers or funding agencies incentivize “training-free finalization” steps (like Extra-Merge) as best practice for energy-efficient research and procurement.
- Sectors: Policy, funding bodies, sustainability.
- Tools/workflows:
- Grant/reporting templates tracking checkpoint-based finalization and estimated compute/emissions savings.
- Assumptions/dependencies:
- Consensus on metrics linking extrapolated perplexity improvements to societal benefit; standardized reporting.
Notes on feasibility and scope
- The method is most reliable in late-stage pre-training where the paper observes a strong Rank‑1 subspace (PC1 explains >94% variance in their settings). Earlier phases or unstable regimes may not benefit.
- Dependencies include: checkpoint cadence (T), averaging window (N), PCA window (K), and a validation set for line search. The paper’s theory suggests SNR improves with larger N and temporal span KT but must stay local enough for the “straight-river” approximation.
- Memory/computation for PCA on very large models may require approximate, sharded, or layer-wise approaches.
- Although empirically robust across AdamW and Muon and across 124M–2B and Pythia‑12B zero-shot tasks, further validation is prudent for much larger models, other modalities, and safety-sensitive fine-tuning.
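To illustrate how the checkpoint cadence (T) and averaging window (N) interact, here is a small Python sketch of a rolling merger. The class name `RollingMerger`, its method, and the toy trajectory are hypothetical; a real pipeline would read checkpoints from disk rather than hold them in memory.

```python
from collections import deque

import numpy as np

class RollingMerger:
    """Keep every `interval`-th checkpoint and average the last `window_size`."""

    def __init__(self, window_size, interval):
        self.interval = interval                 # plays the role of T
        self.window = deque(maxlen=window_size)  # holds the last N checkpoints

    def observe(self, step, weights):
        """Return a merged checkpoint when one is available, else None."""
        if step % self.interval != 0:
            return None  # not a checkpoint step
        self.window.append(np.array(weights, dtype=float))
        if len(self.window) < self.window.maxlen:
            return None  # window not full yet
        return np.stack(list(self.window)).mean(axis=0)

merger = RollingMerger(window_size=3, interval=2)
outputs = []
for step in range(9):
    # Toy trajectory: steady progress in x, sign-flipping wobble in y.
    wobble = 3.0 if (step // 2) % 2 == 0 else -3.0
    merged = merger.observe(step, [float(step), wobble])
    if merged is not None:
        outputs.append(merged)

print(np.array(outputs))
```

The merged outputs keep advancing in the first coordinate while the wobble shrinks from 3 to 1 in magnitude. A larger window would damp it further but also spans more training time, which is the locality trade-off noted above.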
Glossary
- AdamW: An optimizer that decouples weight decay from the gradient-based update to improve generalization. "typically using optimizers such as AdamW."
- ARC-Challenge: A difficult multiple-choice question answering benchmark assessing reasoning. "ARC-Challenge, ARC-Easy, HellaSwag, and PIQA."
- ARC-Easy: An easier subset of the ARC question answering benchmark. "ARC-Challenge, ARC-Easy, HellaSwag, and PIQA."
- Colossal Clean Crawled Corpus (C4): A large-scale cleaned web text dataset commonly used for LLM pre-training. "on the Colossal Clean Crawled Corpus (C4) dataset."
- Convex hull: The smallest convex set containing a collection of points; here, the set spanned by observed model weights. "extrapolate beyond the convex hull of observed weights."
- Cosine annealing: A learning rate schedule that decays the rate following a cosine curve. "or Cosine annealing."
- Explained Variance Ratio (EVR): The fraction of total variance captured by a given principal component in PCA. "the Explained Variance Ratio (EVR) of the k-th principal component"
- Exponential Moving Average (EMA): A weighted averaging scheme that emphasizes recent checkpoints via exponential decay. "While EMA is standard in convex optimization,"
- Extra-Merge: A training-free method that extrapolates along an inferred low-dimensional subspace of merged checkpoints to reduce loss. "we propose Extra-Merge, a training-free algorithm designed to exploit this geometric stability."
- FineWeb: A curated large-scale web text dataset used for pre-training. "GPT-2 Small on FineWeb"
- Hessian: The matrix of second derivatives of the loss; its eigenstructure characterizes curvature directions. "H is a positive semi-definite Hessian matrix"
- Latest Weight Averaging (LAWA): A framework that averages the latest pre-training checkpoints to improve performance. "Latest Weight Averaging (LAWA) (Kaddour, 2022; Sanyal et al., 2023)"
- Linear Mode Connectivity (LMC): The empirical observation that independently trained models can be connected by a low-loss linear path. "the phenomenon of Linear Mode Connectivity (LMC)"
- LLaMA: A family of open LLM architectures. "LLaMA on C4."
- Model merging: Combining multiple model checkpoints (typically by averaging) to obtain a better-performing model. "model merging has emerged as a potent, training-free paradigm"
- Muon optimizer: An optimizer that employs orthogonal update strategies, differing from AdamW-style updates. "the Muon optimizer (Jordan et al., 2024)."
- Orthogonal updates: Parameter updates whose update matrices are (approximately) orthogonalized before being applied, as in Muon, which alters trajectory geometry during training. "Muon is characterized by its use of orthogonal updates"
- Polyak averaging: Averaging successive parameter iterates to accelerate convergence and reduce variance in stochastic approximation. "Polyak averaging (Polyak & Juditsky, 1992)"
- Pre-trained Model Averaging (PMA): Uniform averaging of late-stage checkpoints, a strong baseline for LLM pre-training. "Baseline: Pre-trained Model Averaging (PMA)."
- Principal Component Analysis (PCA): A spectral method to identify dominant directions of variation in parameter trajectories. "perform Principal Component Analysis (PCA)."
- Rank-1 Subspace: An approximately one-dimensional subspace capturing the dominant direction of the merged trajectory. "the Rank-1 Subspace is a robust, optimizer-agnostic property of LLM training."
- River-Valley landscape: A loss landscape model with a flat “river” direction and sharp “mountain” subspace governing dynamics. "under the river-valley loss framework (Wen et al., 2024)"
- Signal-to-Noise Ratio (SNR): The ratio quantifying strength of directional drift relative to residual noise in the trajectory. "Define the Signal-to-Noise Ratio (SNR) of the trajectory"
- Sliding window: A procedure that uses the most recent K checkpoints to estimate local structure (e.g., for PCA). "We apply PCA to a sliding window of the K most recent merged checkpoints"
- Stochastic Weight Averaging (SWA): Averaging model weights sampled along the training path to land in wider optima. "Stochastic Weight Averaging (SWA) (Izmailov et al., 2018)"
- Warmup-Stable-Decay (WSD): A learning rate schedule with warmup, a stable phase, and a gradual decay. "such as Warmup-Stable-Decay (WSD) (Hu et al., 2024)"
- Zero-shot accuracy: Evaluation performance on tasks without any task-specific fine-tuning. "yields consistent zero-shot accuracy gains"