
Pre-trained Model Averaging (PMA)

Updated 5 April 2026
  • Pre-trained Model Averaging (PMA) is a technique that combines multiple fine-tuned neural models by averaging their parameters to create a single, robust model.
  • It encompasses methods such as uniform and greedy model soups, stochastic weight averaging, and PAC-Bayes approaches for adaptive weight-space aggregation.
  • Empirical results demonstrate that PMA enhances generalization, reduces variance, and improves out-of-distribution performance across diverse modalities.

Pre-trained Model Averaging (PMA) is a collective term for an expanding family of techniques designed to synthesize a single neural model from multiple trained (usually fine-tuned) models, or sequential model snapshots, by directly averaging their parameters or by more expressive weight-space aggregation. PMA methods have been demonstrated to enhance generalization, boost out-of-distribution robustness, and accelerate transfer, all at negligible inference overhead, across diverse modalities (NLP, vision), model scales, and data regimes. The following sections review core definitions, algorithmic frameworks, theoretical underpinnings, representative practical applications, empirical findings, and prominent variants within the PMA landscape.

1. Core Definitions and Model Averaging Methods

PMA is fundamentally the process of synthesizing a new set of model weights \theta_\mathrm{avg} from a collection \{\theta_i\}_{i=1}^K of reference weights, typically as a convex combination:

\theta_\mathrm{avg} = \sum_{i=1}^K \alpha_i\,\theta_i \quad \text{with} \quad \sum_{i=1}^K \alpha_i = 1, \quad \alpha_i \geq 0

The most frequently used choice is uniform averaging (\alpha_i = 1/K for all i), as in model soups (Wortsman et al., 2022), but numerous schemes select weights adaptively, e.g. greedy soups (Wortsman et al., 2022), PAC-Bayesian posteriors (Huang et al., 2019), entropy-minimizing mixture weights (Park, 28 May 2025), or meta-learned pseudogradient updates (Lee et al., 20 Aug 2025).

A distinction exists between:

  • Trajectory averaging: combining multiple checkpoints along a single training run (e.g., Stochastic Weight Averaging (SWA) (Lu et al., 2022), moving averages in pre-training (Li et al., 17 May 2025), Polyak-Ruppert for convex losses).
  • Checkpoint fusion across runs/tasks: merging weights from models trained on different data, tasks, or hyperparameters, as in model soups (Wortsman et al., 2022), task-family fusion (Choshen et al., 2022), and cross-run accumulative schemes in cross-lingual transfer (Schmidt et al., 2023).

In neural PMA, all reference models must be strictly compatible in architecture and weight layout (parameter name and shape alignment).
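The convex combination above can be sketched directly on weight dictionaries. A minimal NumPy illustration, where small arrays stand in for model parameters and `average_weights` is an illustrative name rather than a library API:

```python
import numpy as np

def average_weights(state_dicts, alphas=None):
    """Convex combination of K compatible weight dictionaries.

    state_dicts: list of dicts mapping parameter names to arrays, all
    with identical keys and shapes (the compatibility requirement noted
    above). alphas defaults to uniform averaging (1/K each).
    """
    K = len(state_dicts)
    if alphas is None:
        alphas = [1.0 / K] * K  # uniform soup
    assert abs(sum(alphas) - 1.0) < 1e-9 and all(a >= 0 for a in alphas)
    return {k: sum(a * sd[k] for a, sd in zip(alphas, state_dicts))
            for k in state_dicts[0]}

# Toy usage: three "models", each a single 2x2 weight matrix.
models = [{"w": np.full((2, 2), float(i))} for i in (1, 2, 3)]
soup = average_weights(models)
print(soup["w"][0, 0])  # uniform average of 1, 2, 3 -> 2.0
```

The same function covers uniform and non-uniform weighting; adaptive schemes differ only in how the `alphas` are chosen.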

2. Algorithmic Frameworks and Representative Variants

Several canonical PMA recipes are prominent in the literature:

  • Stochastic Weight Averaging (SWA) is applied during fine-tuning by accumulating a running average over model checkpoints after the model has reached initial convergence. For PLMs, snapshots are typically collected every 50–100 steps, starting at 50% of the fine-tuning budget, and averaged with a high constant learning rate. The resulting average converges to flatter loss basins and is empirically shown to improve generalization over knowledge distillation (KD) and baseline fine-tuning (Lu et al., 2022).
  • Uniform and Greedy Model Soups: Given a pool of fine-tuned models (often from a hyperparameter grid), uniform soup averages all weights; greedy soup incrementally builds a subset ordered by validation accuracy, at each step adding a model only if it maintains or improves held-out performance (Wortsman et al., 2022).
  • Amortized Model Ensembling (AME) generalizes PMA into a meta-optimization procedure over the model weights, where model differences serve as “pseudogradients” and adaptive optimizers such as AdamW are used in the weight space (Lee et al., 20 Aug 2025). Model soup is recovered as a special case (one non-adaptive step).
  • PAC-Bayes Model Averaging employs principled Bayesian aggregation of models, learning a prior from historical tasks and then updating a posterior using new-task data, with guarantees for generalization risk (Huang et al., 2019).
  • Bayesian Model Averaging (BMA) and Optimizable Model Averaging (OMA) average over frozen foundation models plus new trainable heads, using validated posteriors in BMA or directly optimizing mixture weights to minimize output entropy in OMA (Park, 28 May 2025).
  • Soup-of-Experts introduces learnable parameterized model averaging: it linearly merges a bank of domain-expert parameter vectors and shared parameters, with combination coefficients predicted by a learned gating MLP that conditions on a desired domain mixture (Ablin et al., 3 Feb 2025).
  • Moving average ensembling protocols for domain-generalization maintain an SMA or EMA of model weights during fine-tuning and optionally ensemble multiple such averages across independent runs for improved out-of-domain reliability (Arpit et al., 2021, Schmidt et al., 2023).
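The greedy soup recipe above can be sketched as follows, assuming `models` are supplied in order of individual validation score (best first) and `val_score` is a user-supplied metric where higher is better; both names are illustrative, not a library API:

```python
import numpy as np

def greedy_soup(models, val_score):
    """Greedy soup sketch (after Wortsman et al., 2022): keep a model
    only if adding it to the running average does not hurt validation."""
    soup = [models[0]]
    best = val_score(models[0])
    for m in models[1:]:
        # Average of the current soup plus the candidate model.
        candidate = {k: (sum(s[k] for s in soup) + m[k]) / (len(soup) + 1)
                     for k in m}
        score = val_score(candidate)
        if score >= best:
            soup.append(m)
            best = score
    K = len(soup)
    return {k: sum(s[k] for s in soup) / K for k in soup[0]}

# Toy usage with scalar "weights": the metric prefers values near zero,
# so the outlier model (5.0) should be rejected by the greedy step.
models = [{"w": np.array([0.1])}, {"w": np.array([-0.3])},
          {"w": np.array([5.0])}]
score = lambda sd: -abs(float(sd["w"][0]))
merged = greedy_soup(models, score)
```

In practice `val_score` would evaluate the candidate weights on a held-out validation set, which is what makes greedy soup robust to outlier ingredients.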

3. Theoretical Underpinnings and Loss Landscape Geometry

PMA efficacy is grounded in the geometry of loss surfaces for modern deep networks:

  • Flatness-Minima Correspondence: A flat minimum is characterized by a small maximum Hessian eigenvalue \lambda_\mathrm{max} at the solution. Averaging trajectories perturbed by SGD with high learning rate gravitates toward these central, flat minima, empirically associated with improved generalization (Lu et al., 2022).
  • Mode Connectivity and Interpolation: In over-parameterized regimes, fine-tuned models from the same pre-trained initialization typically lie in a connected region such that their linear interpolation maintains low loss. This justifies naive weight averaging (“model soups”) as the ensemble remains in a low-error basin (Wortsman et al., 2022, Choshen et al., 2022).
  • Bias-Variance Decomposition: EMAs and prediction-layer averaging reduce variance due to data sampling or hyperparameter noise, strictly lowering expected OOD loss (Arpit et al., 2021).
  • Analytical Approximation: For logit ensembles and weight-averaged soups, the difference in expected cross-entropy loss is governed by the curvature along interpolation directions and the confidence of the softmax outputs (Wortsman et al., 2022).
  • Meta-learning and PAC-Bayes: PMA can be interpreted as finding the Euclidean center of optima or, in PAC-Bayes, as meta-learning a prior that optimizes expected task-adapted generalization bounds (Huang et al., 2019).
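The variance-reduction mechanics behind moving-average protocols reduce to two one-line updates. A minimal NumPy sketch, assuming weights are stored as dicts of arrays; the function names and the decay value are illustrative:

```python
import numpy as np

def sma_update(avg, new, n):
    """Simple moving average: fold the (n+1)-th checkpoint into a
    running mean over the previous n snapshots (SWA-style), without
    storing past checkpoints."""
    return {k: (avg[k] * n + new[k]) / (n + 1) for k in avg}

def ema_update(avg, new, decay=0.99):
    """Exponential moving average: recent checkpoints dominate.
    The decay value is illustrative."""
    return {k: decay * avg[k] + (1 - decay) * new[k] for k in avg}

# Toy trajectory: three snapshots of a single scalar "parameter".
snaps = [{"w": np.array([1.0])}, {"w": np.array([2.0])},
         {"w": np.array([3.0])}]
avg = snaps[0]
for n, s in enumerate(snaps[1:], start=1):
    avg = sma_update(avg, s, n)
print(avg["w"][0])  # running mean of 1, 2, 3 -> 2.0
```

SMA gives every snapshot equal weight (matching uniform trajectory averaging), while EMA discounts older ones, which is the usual choice when early checkpoints are far from convergence.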

4. Practical Implementation and Experimental Evaluation

Empirical protocols for PMA center on effectiveness, minimal overhead, and robustness:

  • Model Selection: PMA typically eliminates the need for fine-grained hyperparameter selection, as uniform averaging across multiple runs or hyperparameter settings consistently matches or exceeds performance of the best single run, even approaching oracle-tuned performance on cross-lingual tasks (Schmidt et al., 2023).
  • Snapshot Collection Policy: In trajectory averaging, best results are achieved by collecting 60–120 snapshots post-convergence with high, constant learning rates (Lu et al., 2022). For LLM pre-training, 10 snapshots at 8B–80B token intervals suffice and allow skipping the decay phase (Li et al., 17 May 2025).
  • Architecture Constraints: All source models to be merged must be strictly compatible in parameterization (Wortsman et al., 2022, Choshen et al., 2022).
  • Efficiency: Parameter averaging is a simple \mathcal{O}(|W|) operation. For context, SWA, PMA, and model soups add <10\% to training cost and impose no additional inference cost, compared to \sim 2.8\times for knowledge distillation and substantial cost for traditional ensembles (Lu et al., 2022, Wortsman et al., 2022, Park, 28 May 2025).
  • Specialist Instantiation: Soup-of-Experts can generate a small specialist model for arbitrary domain mixtures via a single forward pass through the gating network and vector sum, at inference latency and memory identical to a single model (Ablin et al., 3 Feb 2025).

Performance metrics from landmark studies include:

  • SWA improves test accuracy on GLUE, SQuAD, and XSum beyond both vanilla tuning and knowledge distillation (Lu et al., 2022).
  • Greedy soups consistently increase top-1 accuracy on ImageNet for ViT and CLIP models, and outperform output ensembles out-of-distribution (Wortsman et al., 2022).
  • In cross-lingual zero-shot transfer, PMA delivers up to +5 F1 improvement on NER and near-oracle performance on NLI and QA (Schmidt et al., 2023).
  • In large-scale LLM pre-training, PMA matches the annealed checkpoint performance with only constant-rate training, saving up to 10–20% compute (Li et al., 17 May 2025).
  • PAC-Bayes meta-learned priors yield generalization bounds and performance exceeding baselines, especially in noisy or few-shot regimes (Huang et al., 2019).

5. Limitations, Robustness, and Extensions

Known constraints and limitations include:

  • Basin Connectivity: PMA is most effective when reference checkpoints reside in a common connected low-loss basin. If models are in distant optima, averaging can produce high loss (Wortsman et al., 2022, Choshen et al., 2022).
  • Architecture Homogeneity: Merging requires exactly matching parameters; no direct merging of models with different architectures or sizes is possible (Wortsman et al., 2022, Choshen et al., 2022).
  • Data/Task Heterogeneity: Extreme diversity among merging ingredients may erode gains unless adaptive weighting (e.g., AME, BMA, OMA) or careful selection (greedy/learned soups) is used (Lee et al., 20 Aug 2025, Park, 28 May 2025).
  • Loss of Interpretability: Averaged weights are not guaranteed to be interpretable as an individual expert; knowledge is blended.
  • Extensions: Adaptive or meta-learned weighting, task-informed fusion, federated variants (FedSoup), and Bayesian or PAC-Bayes perspectives offer paths to improved robustness and enable application to federated averaging, continual learning, and OOD adaptation (Lee et al., 20 Aug 2025, Park, 28 May 2025, Huang et al., 2019).
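As a toy illustration of adaptive, entropy-minimizing weighting (in the spirit of OMA's mixture-weight optimization, but not the paper's exact procedure), mixture weights over model output distributions can be parameterized by a softmax and tuned to reduce predictive entropy. The finite-difference optimizer here is purely for exposition:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p, eps=1e-12):
    """Mean Shannon entropy (nats) of a batch of distributions (N, C)."""
    return -np.sum(p * np.log(p + eps), axis=-1).mean()

def fit_mixture_weights(model_probs, steps=200, lr=0.5, h=1e-4):
    """Tune softmax-parameterized mixture weights to minimize the
    entropy of the averaged predictive distribution, via simple
    central-difference gradient descent (illustrative only)."""
    K = len(model_probs)
    stacked = np.stack(model_probs)  # (K, N, C)
    z = np.zeros(K)                  # logits over the K models

    def loss(zv):
        mix = np.tensordot(softmax(zv), stacked, axes=1)  # (N, C)
        return entropy(mix)

    for _ in range(steps):
        g = np.array([(loss(z + h * np.eye(K)[i])
                       - loss(z - h * np.eye(K)[i])) / (2 * h)
                      for i in range(K)])
        z -= lr * g
    return softmax(z)

# Toy example: model 0 is confident, model 1 near-uniform; entropy
# minimization should upweight the confident model.
p_confident = np.array([[0.95, 0.05], [0.05, 0.95]])
p_uniform = np.array([[0.55, 0.45], [0.45, 0.55]])
w = fit_mixture_weights([p_confident, p_uniform])
```

Real implementations would optimize with autodiff and regularize against degenerate (overconfident) mixtures; the point here is only that mixture weights can be learned from unlabeled outputs.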

Ongoing research investigates non-uniform weighting based on validation performance, data domains, Fisher information, or explicit meta-learning objectives, as well as deployment of PMA strategies in massive foundation-model and federated-data settings.

6. Applications, Benchmarks, and Impact

PMA has demonstrated efficacy across diverse settings and scales:

  • Vision: Substantial accuracy gains on ImageNet, WILDS, domain generalization (DomainBed), and out-of-distribution benchmarks via model soups, SWA, and EoA (Wortsman et al., 2022, Arpit et al., 2021).
  • NLP: Improved GLUE, SQuAD, NLI, QA, summarization, multi-lingual zero-shot, and domain-specialized models with both trajectory and cross-run averaging (Lu et al., 2022, Schmidt et al., 2023, Ablin et al., 3 Feb 2025).
  • LLMs: PMA in pre-training allows efficient development of LLMs by averaging constant-rate stable-phase checkpoints, achieving annealed-equivalent performance and rapid convergence in continual/fine-tuning (Li et al., 17 May 2025).
  • Foundation Model Ensembling: BMA and OMA outperform naïve output-averaging and model soups across image and text classification tasks by optimizing mixture weights over diverse frozen backbones (Park, 28 May 2025).
  • Meta-learning and Uncertainty: PAC-Bayes PMA provides risk-aware predictions and robust adaptation in the presence of model, data, and task uncertainty, especially in few-shot and heterogeneous-task settings (Huang et al., 2019).
  • Specialist Synthesis: Soup-of-Experts yields generalist-to-specialist transfer with the ability to instantiate domain-specialized models instantly from a shared parameter bank (Ablin et al., 3 Feb 2025).

7. Theoretical and Practical Guidelines

Effective deployment of PMA relies on certain universal principles:

  • Uniform averaging suffices when models are well-behaved and similarly optimized. Greedy and learned weighting offer marginal but robust gains, especially with diverse model pools.
  • Incorporating adaptive optimizers (e.g. AdamW) into weight aggregation, as in AME, improves resilience under heavy-tailed distributions and OOD regimes (Lee et al., 20 Aug 2025).
  • Monitoring parameter-space convergence (\|\theta^{(k)}_\mathrm{avg} - \theta^{(k-1)}_\mathrm{avg}\|_2) provides a data-free stopping criterion (Schmidt et al., 2023).
  • For moving-average protocols, regular snapshotting (every 1–2% of steps) post-warmup is sufficient; performance plateaus after a handful of averaged runs (Schmidt et al., 2023, Lu et al., 2022).
  • When model architectures or data sources are non-uniform, prefer adaptive, entropy-minimizing or Bayesian weighting (Park, 28 May 2025, Huang et al., 2019).
  • For foundation model ensembling, OMA offers scalable, low-cost incorporation of new models and stronger OOD robustness than output averaging or BMA (Park, 28 May 2025).
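The data-free stopping criterion above is cheap to monitor between averaging steps. A sketch on dict-of-array weights, with an illustrative tolerance:

```python
import numpy as np

def weight_distance(sd_a, sd_b):
    """L2 distance between two weight dicts, flattened across all
    parameters (assumes matching keys and shapes)."""
    return np.sqrt(sum(np.sum((sd_a[k] - sd_b[k]) ** 2) for k in sd_a))

def converged(prev_avg, curr_avg, tol=1e-3):
    """Data-free stopping rule: stop folding in checkpoints once the
    running average barely moves. tol is an illustrative threshold."""
    return weight_distance(prev_avg, curr_avg) < tol

# Toy check: two nearly identical running averages trigger the rule.
a = {"w": np.array([1.0, 2.0])}
b = {"w": np.array([1.0, 2.0 + 1e-5])}
print(converged(a, b))  # -> True
```

Because the criterion needs no labeled data, it is usable during pre-training or cross-lingual transfer where validation sets may be unavailable.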

PMA strategies have reshaped best practices in transfer learning, pre-training, model selection, domain generalization, continual learning, and scalable deployment, making them a cornerstone for reliable machine learning system design in the era of large-scale and heterogeneous models.
