
LoRA Soups: Modular Skill Composition

Updated 8 February 2026
  • LoRA Soups are model-merging techniques that combine multiple skill-specific LoRA adapters to form a robust, modular update for LLMs.
  • They employ static (offline) and dynamic (online) approaches, such as CAT and token-level routing, to optimize parameter efficiency and performance.
  • Empirical results show significant accuracy and robustness improvements over classical data-mixing and ensemble methods in various benchmarks.

LoRA Soups are model-merging techniques for composing and deploying multiple skill-specific LoRA adapters within LLMs. Under the “LoRA soup” paradigm, pre-trained LoRA modules, each optimized for a distinct subtask or data source, are combined post hoc to enable robust skill composition for downstream settings where unified data or full retraining is impractical. Recent research establishes both static (offline, merged-once) and dynamic (online, input-conditional) LoRA soup approaches that significantly outperform classical data-mixing, vanilla ensemble, or gating-based compositions in parameter efficiency, robustness, and practical utility (Prabhakar et al., 2024, Belofsky, 2023, Lee et al., 10 Nov 2025).

1. Formalization and Operator Foundations

Let $W_0 \in \mathbb{R}^{d \times k}$ denote a frozen pre-trained parameter matrix. A LoRA adapter introduces a rank-$r$ update $\Delta W$, typically instantiated as $B A^\top$ where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{k \times r}$, and $r \ll \min(d, k)$. For $M$ independently trained skill-specific adapters $\{\Delta W_i\}_{i=1}^M$, a “LoRA soup” forms a composite update

$$\Delta W^{(\ell)} = \sum_{i=1}^M \alpha_i^{(\ell)} B_i^{(\ell)} A_i^{(\ell)\top}, \qquad 0 \leq \alpha_i^{(\ell)} \leq 1,$$

with coefficients typically normalized post hoc or regularized during calibration (Prabhakar et al., 2024). This merging mechanism is modular: each skill’s effect is preserved without cross-terms, and the merge can be performed layer-wise.
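
As a concrete illustration, a minimal PyTorch-style sketch of the static layer-wise merge is shown below; the tensor shapes, toy values, and the helper name merge_lora_soup are illustrative rather than taken from the cited papers.

```python
import torch

def merge_lora_soup(W0, adapters, alphas):
    """Form W = W0 + sum_i alpha_i * B_i @ A_i.T for a single layer.

    W0       : (d, k) frozen base weight
    adapters : list of (B_i, A_i) pairs with B_i of shape (d, r), A_i of shape (k, r)
    alphas   : per-adapter mixing coefficients in [0, 1]
    """
    delta = torch.zeros_like(W0)
    for (B, A), alpha in zip(adapters, alphas):
        delta += alpha * (B @ A.T)   # each skill contributes its own rank-r term; no cross-terms
    return W0 + delta

# Toy usage: two rank-8 "skill" adapters merged into one 1024x1024 layer.
d, k, r = 1024, 1024, 8
W0 = torch.randn(d, k)
adapters = [(torch.randn(d, r) * 0.01, torch.randn(k, r) * 0.01) for _ in range(2)]
W_merged = merge_lora_soup(W0, adapters, alphas=[0.6, 0.4])
```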

In dynamic settings, $\alpha_i$ may become an explicit function of the current context (prompt, token, or hidden state), as in token-level (Belofsky, 2023) or instance-level (Lee et al., 10 Nov 2025) soups.

2. Principal LoRA Soup Methodologies

2.1 CAT (Concatenation of LoRAs)

The CAT strategy (Prabhakar et al., 2024) merges several skill-specific LoRA adapters via a weighted sum, with coefficients $\alpha_i^{(\ell)}$ optimized on a small held-out calibration set $D_{\text{cal}}$. Formally, the merged weights are

$$W^{(\ell)} = W_0^{(\ell)} + \sum_{i=1}^M \alpha_i^{(\ell)} B_i^{(\ell)} A_i^{(\ell)\top}.$$

The $\alpha$ coefficients are tuned by minimizing task loss on $D_{\text{cal}}$, typically in a single epoch with modern optimizers. CAT outperforms naive parameter averaging, data-mix fine-tuning, MoE-style routers, and prior merge schemes (TIES, DARE) by significant margins. Unlike linear parameter averaging, CAT avoids cross-interaction terms $B_i A_j^\top$ for $i \neq j$, preserving adapter modularity.
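
A minimal sketch of the calibration step for a single layer is given below, assuming precomputed skill deltas $D_i = B_i A_i^\top$ and substituting a toy MSE regression loss for the downstream task loss on $D_{\text{cal}}$; the sigmoid parameterization, hyperparameters, and function name are illustrative assumptions, not the paper’s exact recipe.

```python
import torch
import torch.nn.functional as F

def calibrate_alphas(W0, deltas, X, Y, steps=200, lr=5e-2):
    """Tune CAT-style mixing coefficients on a small calibration set (one layer shown).

    W0     : (d, k) frozen base weight
    deltas : list of precomputed skill updates D_i = B_i @ A_i.T, each of shape (d, k)
    X, Y   : calibration inputs (n, k) and regression targets (n, d) -- stand-ins for D_cal
    """
    raw = torch.zeros(len(deltas), requires_grad=True)        # unconstrained parameters
    opt = torch.optim.Adam([raw], lr=lr)
    for _ in range(steps):
        alphas = torch.sigmoid(raw)                           # keeps 0 <= alpha_i <= 1
        W = W0 + sum(a * D for a, D in zip(alphas, deltas))   # merged layer weight
        loss = F.mse_loss(X @ W.T, Y)                         # stand-in for the task loss
        opt.zero_grad(); loss.backward(); opt.step()
    return torch.sigmoid(raw).detach()
```

In practice the same coefficients would be optimized against the actual task loss of the full model, typically for a single epoch over $D_{\text{cal}}$.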

2.2 Token-Level Dynamic LoRA Soups

Token-level LoRA soups (Belofsky, 2023) enact adaptive composition at each generation step. For prediction at token $x_t$, routing weights are computed as

$$\alpha_i(x_t) = \frac{\exp(s_i T_i)}{\sum_j \exp(s_j T_j)},$$

where $s_i = \cos(p(x_{<t}), a_i)$ is the similarity between the prefix embedding and adapter $i$’s centroid, with temperature $T_i = 4$ for $i = \arg\max_j s_j$ and $T_i = 1$ otherwise. The active LoRA update is

$$\Delta W(x_t) = \sum_{i=1}^M \alpha_i(x_t)\, \Delta W_i,$$

and the effective weights per generation step are $W(x_t) = W_0 + \Delta W(x_t)$. This allows on-the-fly “stirring” of adapters depending on token-level context similarity.
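
A sketch of the routing computation under these definitions is shown below; the embedding dimension, toy inputs, and helper name are illustrative, and obtaining prefix embeddings and adapter centroids is assumed to follow Belofsky (2023).

```python
import torch
import torch.nn.functional as F

def token_routing_weights(prefix_emb, centroids, t_max=4.0, t_base=1.0):
    """Per-token adapter weights from prefix/centroid cosine similarity.

    prefix_emb : (h,) embedding of the generated prefix x_<t
    centroids  : (M, h) one embedding centroid per skill adapter
    Returns an (M,) softmax weight vector, with the best-matching adapter
    sharpened by a larger temperature multiplier (T_i = 4 vs. 1).
    """
    s = F.cosine_similarity(prefix_emb.unsqueeze(0), centroids, dim=-1)  # (M,) similarities
    T = torch.full_like(s, t_base)
    T[s.argmax()] = t_max                      # boost the most similar adapter
    return torch.softmax(s * T, dim=-1)

# Toy usage: 3 adapters, 16-dimensional embedding space.
weights = token_routing_weights(torch.randn(16), torch.randn(3, 16))
```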

2.3 LoRA on the Go (LoGo): Instance-Level Dynamic Selection

LoGo (Lee et al., 10 Nov 2025) generalizes dynamic composition by extracting adapter activation scores $\{s_i\}$ from a single forward pass, using either the $\ell_2$ norm or the inverse entropy of the adapter outputs $\mathbf{o}_{i,T}$. The top-$k$ most relevant adapters (by $s_i$) form a set $\mathcal{S}$, and their outputs are mixed as

$$\mathbf{o}_{\mathrm{merge}} = \sum_{i \in \mathcal{S}} \tilde w_i\, \mathbf{o}_{i,T}, \qquad \tilde w_i = \frac{s_i}{\sum_{j \in \mathcal{S}} s_j}.$$

Output-based (mixture) and parameter-based (fusion) variants are both studied; mixture is preferred for efficiency. LoGo is entirely training-free and instance-adaptive.
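
The sketch below illustrates the training-free selection-and-mixture step; the probing details, tensor shapes, and function name are assumptions for illustration and simplify the procedure of Lee et al. (10 Nov 2025).

```python
import torch

def logo_mix(adapter_outputs, k=3, probe="norm"):
    """Instance-level, training-free mixing of adapter outputs (LoGo-style sketch).

    adapter_outputs : (M, d) per-adapter output vectors o_{i,T} from a single forward pass
    probe           : "norm" (l2 norm) or "entropy" (inverse entropy of softmaxed outputs)
    Returns the weighted mixture over the top-k scoring adapters.
    """
    if probe == "norm":
        s = adapter_outputs.norm(dim=-1)                              # (M,) l2-norm scores
    else:
        p = torch.softmax(adapter_outputs, dim=-1)
        entropy = -(p * p.clamp_min(1e-12).log()).sum(dim=-1)
        s = 1.0 / (entropy + 1e-8)                                    # inverse-entropy scores
    idx = torch.topk(s, k).indices                                    # top-k relevant adapters
    w = s[idx] / s[idx].sum()                                         # normalized weights
    return (w.unsqueeze(-1) * adapter_outputs[idx]).sum(dim=0)

# Toy usage: 10 candidate adapters with 64-dimensional outputs, keep the top 3.
mixed = logo_mix(torch.randn(10, 64), k=3)
```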

3. Theoretical and Practical Properties

Table: Comparison of LoRA Soup Schemes

| Approach | Mixing Granularity | $\alpha_i$ Selection | Calibration |
|---|---|---|---|
| CAT | Static (offline) | Optimized on $D_{\text{cal}}$ | Required (small) |
| Token-level | Per-token | Gradient-free, similarity-based | None (at inference) |
| LoGo | Per-instance/block | Forward-pass probe (norm/entropy) | None |

CAT preserves exact skill updates without cross-terms, ensuring skills remain modular. Dynamic soups (token-level, instance-level) allow context-driven, fine-grained composition—adapting skills to the evolving input.

All approaches reduce catastrophic forgetting typical of data-mixing baselines and are parameter-efficient. CAT’s $\alpha$-learning requires less than 1% of the compute of skill-LoRA fine-tuning, while LoGo adds negligible inference overhead (under ~1.87 sec/sample on LLaMA-3.1-8B, single GPU) for up to $k = 10$ adapters (Lee et al., 10 Nov 2025).

4. Empirical Performance and Benchmarking

4.1 CAT Results

On GSM-Hard (math+code composition), CAT achieves 21.11% execution accuracy versus 14.18% (math LoRA), 8.04% (code LoRA), 18.80% (DATA-MIX), and 16–18% for other merge baselines. This translates to a +48.8% relative improvement over the strongest single-skill LoRA and a 257% super-linear gain, demonstrating genuine compositional generalization (Prabhakar et al., 2024).

For proprietary manual Q&A, CAT yields 58% accuracy (closed-book, GPT-4 judge) compared to 27–54% for individual and mixed baselines, approaching open-book upper bounds while requiring zero retrieval.

On technical reading comprehension, CAT achieves ELO 210 (vs. 190 for DATA-MIX, 193 for MoE, 150–180 for single skills).

Prompt-format robustness also benefits from CAT: accuracy remains stable (≈80% across unseen formats) where data-mix baselines degrade sharply.

4.2 Dynamic and Instance-Level LoRA Soups

Token-level dynamic routing with adaptation interval $k = 2$ (re-routing every other token) achieves 48.3% average accuracy across ARC-Challenge, GSM8K, CodeAlpaca-20k, and SQuAD, outperforming both single-task adapters and other merging intervals ($k \neq 2$). The interval $k = 2$ balances noise and adaptability (Belofsky, 2023).

LoGo’s instance-level merging over 27 diverse datasets delivers an average accuracy (LLaMA-3.1-8B, top-20 adapters) of 40.0 with the entropy probe, comparable to the training-based LoRAHub (40.3) and a mixture-of-experts retriever (40.4) despite requiring no training. Struct-to-text and NLI tasks show gains of up to +3.6% over the base model; on CodeXGLUE code generation, LoGo (14.4 BLEU) exceeds LoRARetriever (13.3) (Lee et al., 10 Nov 2025).

5. Limitations, Design Choices, and Practical Recommendations

5.1 Limitations

  • CAT’s focus is on binary composition ($k = 2$ skills); when extending to $k \gg 2$, data-mix baselines can overtake merging (Prabhakar et al., 2024).
  • The linear-combination assumption may not capture nonlinear skill interactions.
  • Token-level and instance-level soups do not refine $\alpha$ via backpropagation; the gradient-free design is chosen for efficiency.

5.2 Design and Tuning

  • CAT: Use layer-wise $\alpha$ coefficients trained on 5% calibration data; gains are robust whether $\alpha$ is fixed or learned, but optimizing yields additional improvement.
  • Token-level mixes: The adaptation interval $k$ is a key hyperparameter; $k = 2$ offers the best tradeoff between adaptation and noise (Belofsky, 2023).
  • LoGo: The effect of the number of merged adapters plateaus around $k \geq 10$; probing with the “last” hidden state is marginally superior; output-based mixture is 2–3× faster than parameter fusion.

5.3 Practical Usage

  • CAT is recommended whenever skill decomposition is possible; compute cost is dominated by initial skill LoRA training.
  • Dynamic and instance-level soups require minimal or no additional data or tuning, making them practical for real-world deployment.

6. Research Directions and Applications

Current LoRA soup methodologies are particularly suited for modular skill composition, cross-domain transfer, and environments where retraining on union data is not feasible. Empirical evidence supports their superiority for composing distinct skills required in math+code QA, proprietary-domain Q&A, technical reading, and prompt robustness (Prabhakar et al., 2024).

Planned extensions include:

  • Generalization to $k \gg 2$ domains via smart calibration or hierarchical merging.
  • Theoretical studies of skill adapter subspace geometry.
  • Dynamic, input-adaptive weightings $\alpha(\cdot)$ via hybrid approaches (CAT+MoE).
  • Multimodal and RL agent applications (Prabhakar et al., 2024).

This suggests LoRA soups represent a robust, compute-efficient paradigm for skill composition in LLMs, bridging the efficacy of parameter-efficient tuning and the tractability of modular deployment.
