
Model Soup Parameter Averaging

Updated 25 November 2025
  • Model soup parameter averaging is a technique that forms a convex combination of neural network weights to harness ensemble-like performance within a single model.
  • It includes methods such as uniform, greedy, pruned, and learned soups that optimize validation performance without increasing inference or memory costs.
  • Applications span vision, NLP, diffusion models, and RLHF, demonstrating empirical gains in accuracy and robustness under diverse conditions.

Model soup parameter averaging is a family of techniques in which the parameters (weights) of multiple neural network models are combined by a weighted average to form a single model. This paradigm, introduced in the context of fine-tuned foundation models, enables the benefits of ensemble-like generalization and robustness without incurring increased inference or memory cost. The approach has since diversified into several algorithmic regimes, including checkpoint averaging, greedy/learned soups, adaptive meta-ensembling, domain-specialized parameter mergers, and continual learning architectures. Its practical impact is observed across vision, NLP, diffusion models, and LLM alignment, providing an efficient, data-free means to synthesize improved models from heterogeneous training or fine-tuning runs.

1. Formal Definitions and Core Soup Construction Methods

Let $\{\theta_i\}_{i=1}^{N}$ denote $N$ independently trained or fine-tuned neural network checkpoints (typically obtained from a shared initialization under different hyperparameter configurations or data orders). The canonical model soup is a convex combination of these weights,

$$\theta_{\text{soup}} = \sum_{i=1}^{N} \alpha_i \theta_i, \qquad \sum_{i=1}^{N} \alpha_i = 1,\ \alpha_i \geq 0.$$

  • Uniform soup: All $\alpha_i = 1/N$ (Wortsman et al., 2022).
  • Greedy soup: Ingredients are sequentially added from a sorted list (e.g., by validation accuracy), retaining each only if the averaged validation accuracy does not decrease, ensuring a monotonic or plateau behavior (Wortsman et al., 2022).
  • Pruned soup: Begin from the full uniform soup, iteratively remove ingredients whose exclusion does not decrease validation performance, yielding sparser and sometimes superior averages (Dansereau et al., 2023).
  • Learned soup: Optimize $\{\alpha_i\}$ using cross-entropy or other surrogate loss on a held-out validation set; typically solved with gradient-based methods (Wortsman et al., 2022).

All these procedures create a single merged parameter vector that can be directly deployed, using the same inference-time budget as any single constituent model. Empirically, effective soup construction requires that all $\theta_i$ reside in a common low-loss basin of the loss landscape (i.e., exhibit mode connectivity).
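
As a concrete illustration of the uniform and greedy recipes above, the following is a minimal PyTorch-style sketch. The names `checkpoints` (a list of `state_dict`s from fine-tuning runs that share one architecture) and `evaluate` (a callable returning held-out validation accuracy for a `state_dict`) are hypothetical placeholders, not a reference API.

```python
# Minimal sketch of uniform and greedy soup construction over PyTorch state_dicts.
# Assumes all checkpoints share the same architecture and parameter keys.
import copy
import torch

def average_state_dicts(state_dicts):
    """Uniform soup: element-wise mean of every parameter/buffer tensor."""
    avg = copy.deepcopy(state_dicts[0])
    for key in avg:
        # Non-float buffers (e.g., BatchNorm counters) may need special handling.
        avg[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return avg

def greedy_soup(checkpoints, evaluate):
    """Greedy soup: rank checkpoints by validation accuracy and add each one
    only if the averaged model's validation accuracy does not decrease."""
    ranked = sorted(checkpoints, key=evaluate, reverse=True)
    ingredients = [ranked[0]]
    best_acc = evaluate(average_state_dicts(ingredients))
    for candidate in ranked[1:]:
        trial_acc = evaluate(average_state_dicts(ingredients + [candidate]))
        if trial_acc >= best_acc:
            ingredients.append(candidate)
            best_acc = trial_acc
    return average_state_dicts(ingredients)
```

The learned-soup variant replaces the greedy loop with gradient-based optimization of the mixing weights $\{\alpha_i\}$ on the held-out validation loss.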

2. Theoretical Principles and Equivalence to Ensembling

Model soups operationalize two key theoretical insights:

  • Loss landscape geometry: In over-parameterized networks, fine-tuned models (from a strong common initialization) tend to occupy a connected low-loss region, allowing linear interpolations to approximately preserve in-distribution and sometimes out-of-distribution performance (Wortsman et al., 2022).
  • First-order equivalence to ensemble averages: For sufficiently aligned checkpoints (close in weight space), the loss of the weight-averaged model approximates the loss of a prediction (logits or probability) ensemble to first-order via Taylor expansion (Menes et al., 31 Jan 2024). That is,

$$\mathcal{L}\big(f(x; \theta_{\text{soup}}),\, y\big) \;\approx\; \mathcal{L}\Big(\sum_i \alpha_i f(x; \theta_i),\, y\Big),$$

up to corrections that are quadratic in the deviations $\|\theta_i - \theta_0\|$ from the shared initialization $\theta_0$.

Model soups often perform similarly to or slightly below full probability/logit ensembles in-distribution, but can exceed ensemble performance under distribution shift due to robustness conferred by averaging multiple training trajectories (Wortsman et al., 2022).
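
The first-order claim can be checked numerically on a toy model. The NumPy sketch below is an illustration constructed here, not an experiment from the cited work: two weight vectors perturbed from a shared initialization give a weight-averaged loss that nearly matches the prediction-averaged loss, with the gap shrinking as the perturbation scale decreases.

```python
# Toy check: loss of the weight average vs. loss of the prediction average
# for two checkpoints close to a shared initialization.
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.normal(size=(64, 8)), rng.normal(size=64)

def predict(w, x):
    return np.tanh(x @ w[:8]) * w[8]          # tiny nonlinear model: scaled tanh unit

def mse(pred, y):
    return np.mean((pred - y) ** 2)

w0 = rng.normal(size=9)                        # shared "pretrained" initialization
w1 = w0 + 0.01 * rng.normal(size=9)            # two nearby "fine-tuned" checkpoints
w2 = w0 + 0.01 * rng.normal(size=9)

loss_soup = mse(predict(0.5 * (w1 + w2), x), y)             # weight averaging
loss_ens = mse(0.5 * (predict(w1, x) + predict(w2, x)), y)  # prediction averaging
print(loss_soup, loss_ens)                     # nearly identical for small perturbations
```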

3. Algorithmic Variants and Generalizations

The parameter-averaging paradigm supports a spectrum of methods beyond naive uniform averaging:

  • Resource-Adjusted Souping (RADIN): Given the prohibitive $2^N$ subset-selection space for the optimal soup, RADIN leverages approximated validation scores via cached logits to propose promising soups, followed by a small number of true evaluations, interpolating between uniform, greedy, and oracle selection costs (Menes et al., 31 Jan 2024).
  • PopulAtion Parameter Averaging (PAPA): Enforces model alignment during training by attracting each member’s weights towards the population mean at fixed intervals. This preserves diversity for ensemble-like generalization yet keeps parameters close enough to be directly averaged, closing most of the gap between ensembles and soups (Jolicoeur-Martineau et al., 2023); a sketch of the attraction step follows this list.
  • Amortized Model Ensembling (AME): Models the selection of an average as an explicit meta-optimization, in which model differences serve as pseudogradients, enabling adaptive or momentum-based updates and repeated meta-epochs for potentially improved neural averaging (Lee et al., 20 Aug 2025).
  • Specialist/Domain-Targeted SoE: The "Soup-of-Experts" architecture parameterizes the model as a sum of a shared core and learned expert offsets, combining them at inference with MLP-generated mixing coefficients tailored to arbitrary domain mixture weights (Ablin et al., 3 Feb 2025).
  • Pruning-Compatible Souping (SMS): Sparse Model Soups enforce a shared mask across all fine-tuned/pruned models, allowing valid parameter averaging without sacrificing target sparsity and achieving robustness beyond vanilla IMP (Zimmer et al., 2023).
  • Diffusion Soup: Average models fine-tuned on disjoint data shards, theoretically interpolating geometric-mean distributions and supporting continual learning, unlearning, and style blending (Biggs et al., 12 Jun 2024).
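
A minimal sketch of the PAPA-style attraction step referenced in the list above, read as a periodic in-place pull of each population member toward the parameter-wise mean (an interpretation of the idea, not the reference implementation; the attraction rate and call interval are assumptions):

```python
# Every K optimizer steps, nudge each population member toward the population
# mean so members stay average-compatible while remaining diverse.
import torch

@torch.no_grad()
def papa_attraction(models, rate=0.05):
    """Pull every model's parameters slightly toward the population mean, in place."""
    for params in zip(*[m.parameters() for m in models]):
        mean = torch.stack([p.data for p in params]).mean(dim=0)
        for p in params:
            p.data.lerp_(mean, rate)           # p <- (1 - rate) * p + rate * mean
```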

4. Empirical Performance and Domain Applications

Model soups have demonstrated consistent empirical gains across vision, NLP, and generative models:

  • Vision (ImageNet, ViT/CLIP/ALIGN): Greedy and learned soups exceed the best single checkpoint by up to +1 pp Top-1 on ImageNet; robustness gains (+3 pp on average under domain shifts) are observed over standard ensembling, all at single-model inference cost (Wortsman et al., 2022).
  • NLP (GLUE tasks, cross-lingual transfer): Fine-tuned model soups improve GLUE benchmark tasks (roughly +0.8% on MRPC, RTE) and token classification. In ZS-XLT settings, accumulative and per-run checkpoint averaging yield clear gains, closely tracking "oracle" performance based on unavailable target-dev sets (Schmidt et al., 2023).
  • RLHF and LLM alignment (SALSA): Replacing the PPO KL-reference with a soup (typically an average of two independent SFTs) yields superior exploration, reward, and alignment, with consistent +4 to +14 ppt win-rate improvements on Llama2-7B, Mistral-7B, and Gemma-2B (Chegini et al., 4 Nov 2024); a sketch of the soup-anchored KL term follows this list.
  • Diffusion Models: Averaged diffusion soups outperform monolithic training on domain-and-style compositionality metrics (e.g., TIFA, IR, CLIP) while preserving continual/unlearning properties (Biggs et al., 12 Jun 2024).
  • Sparse Regimes: SMS strictly improves out-of-distribution accuracy and often in-distribution test accuracy over both standard IMP and individual pruned models (Zimmer et al., 2023).
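
To make the SALSA setup above concrete, the sketch below shows a generic PPO-style KL penalty with the single SFT reference swapped for a two-checkpoint soup; it is an assumption-laden illustration, not the paper's code. `average_state_dicts` is the helper sketched in Section 1, and `sft_a`, `sft_b` are hypothetical SFT checkpoint state_dicts.

```python
import torch

def soup_reference(ref_model, sft_a, sft_b):
    """Load a uniform soup of two SFT checkpoints into the (frozen) reference model."""
    ref_model.load_state_dict(average_state_dicts([sft_a, sft_b]))
    ref_model.eval()
    return ref_model

def kl_penalty(policy_logits, ref_logits):
    """Per-token KL(policy || soup reference) computed over the vocabulary."""
    logp = torch.log_softmax(policy_logits, dim=-1)
    logq = torch.log_softmax(ref_logits, dim=-1)
    return (logp.exp() * (logp - logq)).sum(dim=-1)   # shape: [batch, seq_len]
```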

A summary of selected quantitative gains is as follows:

| Task/Domain | Soup Type | Best Single Model | Soup | Ensemble | Reference |
|---|---|---|---|---|---|
| ImageNet (ViT-B/32) | Greedy | 80.38% | 81.03% | — | (Wortsman et al., 2022) |
| ImageNet (CLIP) | Uniform/Greedy | 80.38% | 79.97% | — | (Wortsman et al., 2022) |
| OOD shift (ViT-B/32, avg) | Greedy | 47.83% | 50.75% | — | (Wortsman et al., 2022) |
| RLHF avg win-rate (Llama2-7B) | SALSA (α=0.5) | 47.50% | 52.50% | — | (Chegini et al., 4 Nov 2024) |
| CIFAR-10 (ResNet-18) | PAPA AvgSoup | 96.8% | 97.4% | 97.5% | (Jolicoeur-Martineau et al., 2023) |
| CIFAR-100 (WideResNet20, IMP) | SMS | 75.54% | 76.59% | — | (Zimmer et al., 2023) |
| Diffusion, domain-sharded (IR) | Uniform soup (n=9) | 0.34 | 0.45 | — | (Biggs et al., 12 Jun 2024) |

5. Practical Constraints, Limitations, and Failure Modes

While model soup parameter averaging is empirically robust, its practical deployment is conditioned on several factors:

  • Common basin requirement: Uniform soups succeed only when all candidate checkpoints reside in the same low-loss landscape region; divergence, especially for models with batch norm or trained from scratch with different seeds, leads to collapse (random or poor performance) (Dansereau et al., 2023).
  • Batch norm sensitivity: Averaging models with inconsistent batch norm statistics can lead to ill-calibrated predictions. Re-estimating BN running statistics after soup formation is critical (Dansereau et al., 2023); a sketch of this re-estimation follows this list.
  • Sparse soups: Averaging models with different sparse masks destroys sparsity. SMS solves this by enforcing a shared mask at each pruning phase (Zimmer et al., 2023).
  • Subset selection complexity: The greedy approach is $O(N)$ in evaluations, and full combinatorial selection is intractable at nontrivial $N$; RADIN and learned soups enable cost-accuracy trade-offs (Menes et al., 31 Jan 2024, Wortsman et al., 2022).
  • Application specificity: For cross-lingual transfer, checkpoint averaging must maintain aligned head architectures, and gains may depend on task taxonomy (sentence, span, or token level) (Schmidt et al., 2023).
  • Computational/storage cost at soup formation: Constructing soups requires storage of $N$ full checkpoints and potentially their validation logits, but inference cost remains constant.
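
A minimal sketch of the batch-norm re-estimation step flagged above, assuming a PyTorch soup model and a loader of training-distribution batches; the function name and batch budget are illustrative:

```python
# Reset and re-estimate BatchNorm running statistics after forming the soup.
import torch

@torch.no_grad()
def recalibrate_batchnorm(soup_model, data_loader, num_batches=100):
    for module in soup_model.modules():
        if isinstance(module, torch.nn.modules.batchnorm._BatchNorm):
            module.reset_running_stats()       # clear stale running mean/var
    soup_model.train()                         # BN updates running stats only in train mode
    for i, (inputs, _) in enumerate(data_loader):  # assumes (inputs, labels) batches
        if i >= num_batches:
            break
        soup_model(inputs)                     # forward passes refresh the statistics
    soup_model.eval()
    return soup_model
```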

6. Extensions, Generalizations, and Domain-Specific Constructions

Model soup parameter averaging underpins a series of specialized and generalized frameworks:

  • Domain-adaptive SoE: "Soup-of-Experts" constructs a large meta-model of $n$ expert offsets and a shared core, producing specialized models for arbitrary domain mixtures by a single linear merge, parameterized by a two-layer MLP mapping domain weights to expert coefficients (Ablin et al., 3 Feb 2025).
  • Meta-ensembling/“neural averaging”: AME reframes soup computation as a data-free meta-optimization problem, wherein pseudogradients derived from expert differences guide adaptive neural weight updates, yielding generalizations (multiple epochs, Adam, AdaGrad) that empirically outperform both uniform and greedy soups, especially OOD (Lee et al., 20 Aug 2025).
  • Populational regularization: PAPA exposes a continuous or periodic attraction-to-mean strategy, proactively keeping training trajectories aligned and soup-compatible, increasing robustness against mode collapse and scaling to large populations (Jolicoeur-Martineau et al., 2023).
  • Diffusion geometric mixing: Diffusion Soup leverages the observation that uniform averaging in parameter space approximates geometric mean mixing of constituent data distributions under a first-order NTK approximation, yielding compositional, anti-memorization, and continual learning properties (Biggs et al., 12 Jun 2024); the corresponding score-averaging identity follows this list.
  • Alignment, RLHF, and policy anchoring: SALSA demonstrates that anchoring PPO policy updates to a soup of SFT models relaxes the KL-divergence constraint, enlarges the viable exploration region in policy-space, and yields superior alignment and robustness (Chegini et al., 4 Nov 2024).
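
As a worked statement of the geometric-mean observation above (a simplified restatement under the first-order view in which averaging denoiser weights acts as averaging the learned scores, not the paper's full derivation): the average of score estimates is exactly the score of the normalized geometric mean of the constituent densities,

$$\frac{1}{N}\sum_{i=1}^{N} \nabla_x \log p_i(x) \;=\; \nabla_x \log\!\left(\frac{1}{Z}\prod_{i=1}^{N} p_i(x)^{1/N}\right), \qquad Z = \int \prod_{i=1}^{N} p_i(x)^{1/N}\, dx,$$

so sampling with the averaged network approximately targets a geometric-mean mixture of the shard distributions.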

7. Recommendations and Best Practices

Deploying model soup parameter averaging requires the following considerations:

  • Initialization and fine-tuning regime: Use a strong common pretrained initialization and restrict hyperparameter variance sufficiently to maintain basin connectivity.
  • Selection strategy: Prefer greedy or learned soups over uniform when checkpoint performance varies; validate with a held-out set.
  • BN re-calibration: For architectures with batch norm, re-compute running statistics post-soup construction.
  • Evaluation budget: Use RADIN or similar approximate screening to reduce validation cost for large $N$ (Menes et al., 31 Jan 2024).
  • Sparse model soups: Ensure all averaged copies share the target mask to retain sparsity (e.g., via IMP or structured pruning (Zimmer et al., 2023)).
  • Combining with SWA, EMA, or bagging: These trajectory-based averaging methods are complementary to soup merging of independently trained checkpoints and often yield additive gains (Wortsman et al., 2022, Jolicoeur-Martineau et al., 2023).
  • Domain/task considerations: Tailor soup construction and selection to downstream tasks, data partitions, and desired specialization/generality ratio—e.g., SoE for rapid domain specialists or Diffusion Soup for continual model composition.

Model soup parameter averaging has evolved from a fine-tuning trick for foundation models into a principled, theoretically grounded methodology for efficient model synthesis, with broad impact across classification, generative modeling, RL, and transfer learning domains. Theoretical advances in meta-ensembling and neural averaging frameworks continue to broaden its applicability and robustness (Lee et al., 20 Aug 2025, Menes et al., 31 Jan 2024, Ablin et al., 3 Feb 2025).
