Model Souping for LLMs

Updated 18 November 2025
  • Model souping is a technique that creates a single model by convex averaging of isomorphic LLM weights, integrating diverse skills and improving generalization.
  • It employs various methodologies—including vanilla uniform, learnable, and expert-based (SoCE) soups—to tailor skill transfer and multi-modal fusion without added inference overhead.
  • Empirical results show that optimized soups can surpass individual model performance on benchmarks like math/code, multilingual tasks, and function calling.

Model souping is a family of methods for integrating multiple instances of LLMs of matching architecture, typically via parameter-wise weighted averaging, to improve aggregate generalization, robustness, or compositionality. Unlike traditional ensembling, souping produces a single set of weights, incurring no added inference cost. Recent advances have extended this paradigm from vanilla full-model averaging to fine-grained mixtures tailored to skill transfer, multi-modal integration, or benchmark-driven expert selection, yielding new state-of-the-art results across a spectrum of LLM tasks (Prabhakar et al., 16 Oct 2024, Bai et al., 11 Jul 2024, Maiti et al., 17 Nov 2025).

1. Foundational Concepts and Definitions

A model soup is defined as any convex linear combination of weight vectors from $n$ models, all isomorphic in architecture:

$$\theta^s = \sum_{i=1}^n \alpha^i \theta^i, \qquad \sum_{i=1}^n \alpha^i = 1$$

where $\theta^i$ denotes the parameters of model $i$, and $\alpha^i \geq 0$ are the soup coefficients. For pre-trained or fine-tuned LLMs, this strategy enables amalgamation of diverse competencies, training sources, or modalities in a single model artifact (Bai et al., 11 Jul 2024, Maiti et al., 17 Nov 2025).
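
A minimal sketch of this parameter-wise averaging, assuming PyTorch-style state dicts with identical keys and shapes (the `soup` helper and checkpoint names are illustrative, not from the cited papers):

```python
import torch

def soup(state_dicts, coeffs):
    """Convex combination of isomorphic weights: theta_s = sum_i alpha_i * theta_i."""
    assert abs(sum(coeffs) - 1.0) < 1e-6 and all(c >= 0 for c in coeffs)
    return {
        key: sum(c * sd[key].float() for c, sd in zip(coeffs, state_dicts))
        for key in state_dicts[0]
    }

# Uniform soup of three fine-tuned checkpoints (hypothetical file names):
# sds = [torch.load(p, map_location="cpu") for p in ("a.pt", "b.pt", "c.pt")]
# merged = soup(sds, coeffs=[1/3, 1/3, 1/3])
```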

A related construct for parameter-efficient fine-tuning is LoRA souping (Prabhakar et al., 16 Oct 2024), in which only the low-rank adaptation modules are amalgamated, not the full weight matrices. This allows modular skill recombination with minimal retraining.

2. Model Souping Methodologies

Three principal branches of souping have emerged:

  • Vanilla Uniform Soup: Direct parameter-wise averaging, typically $\alpha^i = 1/n$ for all $i$. This method provides out-of-the-box robustness improvements by averaging out idiosyncratic errors of individual models. Empirically, uniform soups often outperform any constituent on global metrics and are trivial to compute.
  • Learnable or Fine-Grained Souping: Learns soup weights per layer or sub-module ($\alpha_{s,\ell}$), optimizing on a small development set to minimize loss. This allows selective domain or modality retention and reduces destructive interference between skills. For example, in SoupLM, per-layer coefficients are tuned across base models such as Vicuna (text-only) and LLaVA (vision-language), yielding higher average task performance (Bai et al., 11 Jul 2024).
  • Category-Expert Souping (SoCE): Uses benchmark-based partitioning to select expert models for weakly correlated task clusters, and then optimizes non-uniform weights to maximize global performance. SoCE involves: (1) correlation analysis of per-model, per-task performance; (2) expert selection; (3) grid search for the optimal weights; and (4) formation of the soup $\sum_{i=1}^\ell w_i^* M_i^*$, where each $M_i^*$ is an expert model for a low-correlation benchmark category (Maiti et al., 17 Nov 2025).

Additionally, LoRA souping (CAT) (Prabhakar et al., 16 Oct 2024) enables skill composition for LLMs fine-tuned via LoRA by learning linear combinations over the low-rank adapters.

3. Algorithmic Formulations and Implementation

Vanilla and Learnable Soup

Given two models with weights $\theta^1, \theta^2$, vanilla soup computes:

$$\theta^s = \alpha \theta^1 + (1-\alpha)\theta^2$$

where typically $\alpha = 0.5$. For learnable soup, per-module or per-layer coefficients are tuned on a merge dataset using:

$$\theta^s_{s,\ell} = \alpha_{s,\ell} \theta^1_{s,\ell} + (1-\alpha_{s,\ell})\theta^2_{s,\ell}, \qquad 0 \leq \alpha_{s,\ell} \leq 1$$

Optimization proceeds with a modest learning rate for several epochs, on as few as 50 development examples per domain (Bai et al., 11 Jul 2024).
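
A minimal sketch of this learnable-soup loop, assuming two fine-tuned state dicts `sd1`/`sd2` over a shared `model`, a small merge set `dev_batches`, and a hypothetical `layer_of(name)` helper mapping parameter names to layer indices (all of these names are assumptions, not artifacts of the cited work):

```python
import torch
from torch.func import functional_call

def learn_layerwise_soup(model, sd1, sd2, dev_batches, layer_of, num_layers, epochs=3):
    """Tune per-layer soup coefficients on a small merge set; sd1/sd2 must share keys and shapes.
    `layer_of(name)` is a hypothetical helper mapping a parameter name to its layer index."""
    alpha_logits = torch.nn.Parameter(torch.zeros(num_layers))
    opt = torch.optim.Adam([alpha_logits], lr=1e-2)          # modest learning rate
    for _ in range(epochs):
        for input_ids, labels in dev_batches:                # ~50 examples per domain suffice
            a = torch.sigmoid(alpha_logits)                  # keeps each alpha_{s,l} in [0, 1]
            merged = {
                name: a[layer_of(name)] * sd1[name] + (1 - a[layer_of(name)]) * sd2[name]
                for name in sd1
            }
            # Assumes an HF-style model whose forward returns an object with .loss given labels.
            loss = functional_call(model, merged, (input_ids,), {"labels": labels}).loss
            opt.zero_grad()
            loss.backward()
            opt.step()
    return torch.sigmoid(alpha_logits).detach()
```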

LoRA Soups (CAT)

Let $k$ LoRA modules, each trained for a different skill, be represented by low-rank matrix pairs $(A_i, B_i)$. Per layer $l$, the CAT method constructs:

$$W^l_{\mathrm{merged}} = W_0^l + \sum_{i=1}^k \alpha^l_i B^l_i (A^l_i)^\top$$

The merging weights $\alpha^l_i$ are optimized via convex loss minimization on a small mixture validation set (typically using gradient descent for a single epoch at a low learning rate). Only the $\alpha$'s are trained in this stage; all LoRA parameters remain frozen. For $k=2$, the initialization $\alpha^l_i = 0.5$ is common, with $\alpha$ constrained to $[0,1]$ per layer (Prabhakar et al., 16 Oct 2024).
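
A minimal sketch of the per-layer CAT merge itself, with random tensors standing in for trained adapters (shapes follow the formula above; this is illustrative, not the authors' implementation):

```python
import torch

def cat_merge_layer(W0, loras, alphas):
    """W_merged^l = W0^l + sum_i alpha_i^l * B_i^l @ (A_i^l)^T; `loras` is a list of (A_i, B_i) pairs."""
    return W0 + sum(a * (B @ A.T) for a, (A, B) in zip(alphas, loras))

# Illustrative k = 2 case with alpha initialized to 0.5 per layer.
d_out, d_in, r = 1024, 1024, 16
W0 = torch.randn(d_out, d_in)
loras = [(torch.randn(d_in, r), torch.randn(d_out, r)) for _ in range(2)]
alphas = torch.full((2,), 0.5, requires_grad=True)   # only these are trained, then clamped to [0, 1]
W_merged = cat_merge_layer(W0, loras, alphas)
```

Training the $\alpha$'s then mirrors the learnable-soup loop above: a single epoch of gradient descent on the mixture validation set while the LoRA matrices stay frozen.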

SoCE: Soup of Category Experts

SoCE operates over a pool of candidate models and a benchmark with $k$ categories:

  1. Compute a $k \times n$ performance matrix $P^i_j$ and the Pearson correlation $\rho_{i,j}$ between categories.
  2. Identify weakly correlated clusters $L = \{C_i : \exists j \neq i,\ |\rho_{i,j}| < \tau\}$.
  3. For each $C_i \in L$, select the expert $M^*_i = \arg\max_j P^i_j$.
  4. Optimize weights $w = (w_1, \ldots, w_\ell)$ to maximize overall performance, subject to $\sum_i w_i = 1,\ w_i \geq 0$, using grid search.
  5. Construct $M_{\mathrm{soup}} = \sum_{i=1}^\ell w^*_i M^*_i$ (Maiti et al., 17 Nov 2025).
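
A compact sketch of steps 1–4, assuming a $k \times n$ score matrix `P` and a hypothetical `score_fn(weights, expert_indices)` that evaluates a candidate soup on the full benchmark (the threshold `tau` and grid `step` are illustrative defaults, not values from the paper):

```python
import itertools
import numpy as np

def soce_select(P, score_fn, tau=0.7, step=0.1):
    """P: k x n matrix of per-category, per-model scores. Returns expert indices and simplex weights."""
    rho = np.corrcoef(P)                                     # k x k Pearson correlation between categories
    weak = [i for i in range(P.shape[0])
            if any(abs(rho[i, j]) < tau for j in range(P.shape[0]) if j != i)]
    experts = sorted({int(P[i].argmax()) for i in weak})     # best model per weakly correlated category
    grid = np.round(np.arange(0.0, 1.0 + step, step), 10)
    best_w, best_score = None, -np.inf
    for w in itertools.product(grid, repeat=len(experts)):   # grid search over the simplex
        if abs(sum(w) - 1.0) > 1e-6:
            continue
        score = score_fn(np.array(w), experts)
        if score > best_score:
            best_w, best_score = np.array(w), score
    return experts, best_w
```

The resulting $M_{\mathrm{soup}}$ is then just the weighted parameter average of the selected expert checkpoints, e.g. via the `soup()` helper sketched in Section 1.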

4. Empirical Results and Comparative Analysis

Empirical evaluation across domains has demonstrated that souping—especially with learned or expert-weighted coefficients—consistently surpasses individual models and naive uniform averaging.

  • LoRA Soup (CAT) on GSM-Hard (math+code) yields 21.63% execution accuracy vs. 18.80% for joint LoRA retraining (DATA-MIX) and 14.18%/8.04% for skill-only LoRAs. CAT achieves a 43% relative boost over best merging baselines and 12% over data mixing (Prabhakar et al., 16 Oct 2024).
  • SoupLM achieves up to 1% further improvement over vanilla averaging with learnable per-module interpolation, and in multi-modal fusion the intermediate Transformer layers gravitate toward the modality most relevant to the evaluation task (Bai et al., 11 Jul 2024).
  • SoCE provides absolute gains up to 2.7% (70B models on BFCL), 4.1% (8B models), and outperforms uniform soups on multilingual and function-calling benchmarks. SoCE solves 8.4% of BFCL tasks that all parental models fail, indicating genuine compositional synergy (Maiti et al., 17 Nov 2025).

Table: Example Outcomes for Soup Variants

| Method | Domain | Best Individual | Uniform Soup | SoCE/CAT Soup |
|---|---|---|---|---|
| CAT (LoRA Soup) | GSM-Hard | 14.18% | 18.80%* | 21.63% |
| SoCE (BFCL, 70B) | Function calling | 78.56% | 68.33% | 80.68% |
| SoCE (MGSM, 7B) | Multilingual | 50.9% | 47.0% | 51.7% |

*Uniform soup for LoRA is joint retraining (DATA-MIX).

5. Strengths, Limitations, and Practical Guidance

Advantages of model souping include modular skill integration, zero inference overhead, ability to leverage small validation sets for tuning, and compatibility with both full-model and adapter-based finetuning regimes. LoRA soups and expert-based soups are compute-efficient alternatives to retraining or ensembling.

Documented limitations:

  • LoRA CAT merging degrades for $k > 2$: performance gains decline when combining more than two adapters, and DATA-MIX may outperform merging for $k = 3$ (Prabhakar et al., 16 Oct 2024).
  • Performance gains from SoCE shrink when task categories show high mutual correlation or when candidate models lack sufficient diversity (Maiti et al., 17 Nov 2025).
  • Souping is only well-posed for isomorphic architectures; architectural drift or structural mismatches (e.g., additional LoRA adapters, non-matching heads) invalidate the operation (Bai et al., 11 Jul 2024, Maiti et al., 17 Nov 2025).
  • Overfitting risk exists if soup weights are tuned on test rather than held-out dev sets; benchmarks without meaningful sub-categories limit expert selection.
  • Souping across very heterogeneous models or recipes can lead to regression unless expert selection is stringent.

Practitioners are advised to:

  • Default to uniform average when integrating two or more models of equal standing.
  • Use fine-grained learned $\alpha$ when a small domain-specific validation set is available.
  • For LoRA, merge only after the individual adapters have fully converged, initializing $\alpha$ uniformly and clipping it to $[0,1]$ after each update.
  • In category-expert scenarios, perform correlation partitioning, expert picking, and weight optimization as in SoCE.
  • Avoid mixing models with structural mismatches.

6. Impact and Outlook

Model souping constitutes an increasingly important post-training technique for LLM specialization, skill composition, and rapid domain adaptation. It enables practitioners to synthesize new functionalities by arithmetic manipulation of checkpoints, bypassing retraining expense. Empirical results on compositional math/code benchmarks, vision-language fusion, function calling, and multilingual reasoning confirm that—when appropriately optimized—souped models match or exceed the performance of the best constituent and even approach the upper bound set by retrieval augmentation in certain setups (Prabhakar et al., 16 Oct 2024, Bai et al., 11 Jul 2024, Maiti et al., 17 Nov 2025).

Research continues on multi-skill ($k > 2$) merging, cross-modal soups, and soup-aware evaluation benchmarks. The geometry of the soup parameter space, as probed by regularized interpolation and per-layer $\alpha$ heatmaps, is an active area for interpretability and model selection. Model souping stands as a practical, extensible, and robust method for aggregating capacity in large-scale language and multi-modal models.
