
LoRA Soups: Modular Skill Composition

Updated 8 February 2026
  • LoRA Soups are model-merging techniques that combine multiple skill-specific LoRA adapters to form a robust, modular update for LLMs.
  • They employ static (offline) and dynamic (online) approaches, such as CAT and token-level routing, to optimize parameter efficiency and performance.
  • Empirical results show significant accuracy and robustness improvements over classical data-mixing and ensemble methods in various benchmarks.

LoRA Soups are model-merging techniques for composing and deploying multiple skill-specific LoRA adapters within LLMs. Under the “LoRA soup” paradigm, pre-trained LoRA modules, each optimized for a distinct subtask or data source, are combined post hoc to enable robust skill composition for downstream settings where unified data or full retraining is impractical. Recent research establishes both static (offline, merged-once) and dynamic (online, input-conditional) LoRA soup approaches that significantly outperform classical data-mixing, vanilla ensemble, or gating-based compositions in parameter efficiency, robustness, and practical utility (Prabhakar et al., 2024, Belofsky, 2023, Lee et al., 10 Nov 2025).

1. Formalization and Operator Foundations

Let $W_0 \in \mathbb{R}^{d \times k}$ denote a frozen pre-trained parameter matrix. A LoRA adapter introduces a rank-$r$ update $\Delta W$, typically instantiated as $B A^\top$ where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{k \times r}$, and $r \ll \min(d, k)$. For $M$ independently trained skill-specific adapters $\{\Delta W_i\}_{i=1}^M$, a “LoRA soup” forms a composite update

$$\Delta W^{(\ell)} = \sum_{i=1}^M \alpha_i^{(\ell)} B_i^{(\ell)} A_i^{(\ell)\top}, \qquad 0 \leq \alpha_i^{(\ell)} \leq 1,$$

with coefficients typically normalized post hoc or regularized during calibration (Prabhakar et al., 2024). This merging mechanism is modular: each skill’s effect is preserved without cross-terms, and the merge can be performed layer-wise.
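
As a concrete illustration, a minimal PyTorch-style sketch of the static layer-wise merge is shown below; the tensor shapes, toy values, and the helper name merge_lora_soup are illustrative rather than taken from the cited papers.

```python
import torch

def merge_lora_soup(W0, adapters, alphas):
    """Form W = W0 + sum_i alpha_i * B_i @ A_i.T for a single layer.

    W0       : (d, k) frozen base weight
    adapters : list of (B_i, A_i) pairs with B_i of shape (d, r), A_i of shape (k, r)
    alphas   : per-adapter mixing coefficients in [0, 1]
    """
    delta = torch.zeros_like(W0)
    for (B, A), alpha in zip(adapters, alphas):
        delta += alpha * (B @ A.T)   # each skill contributes its own rank-r term; no cross-terms
    return W0 + delta

# Toy usage: two rank-8 "skill" adapters merged into one 1024x1024 layer.
d, k, r = 1024, 1024, 8
W0 = torch.randn(d, k)
adapters = [(torch.randn(d, r) * 0.01, torch.randn(k, r) * 0.01) for _ in range(2)]
W_merged = merge_lora_soup(W0, adapters, alphas=[0.6, 0.4])
```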

In dynamic settings, $\alpha_i$ may become an explicit function of the current context (prompt, token, or hidden state), as in token-level (Belofsky, 2023) or instance-level (Lee et al., 10 Nov 2025) soups.

2. Principal LoRA Soup Methodologies

2.1 CAT (Concatenation of LoRAs)

The CAT strategy (Prabhakar et al., 2024) merges several skill-specific LoRA adapters via a weighted sum, with coefficients $\alpha_i^{(\ell)}$ optimized on a small held-out calibration set $D_{\text{cal}}$. Formally, the merged weights are

$$W^{(\ell)} = W_0^{(\ell)} + \sum_{i=1}^M \alpha_i^{(\ell)} B_i^{(\ell)} A_i^{(\ell)\top}.$$

The $\alpha$ coefficients are tuned by minimizing task loss on $D_{\text{cal}}$, typically in a single epoch with modern optimizers. CAT outperforms naive parameter averaging, data-mix fine-tuning, MoE-style routers, and prior merge schemes (TIES, DARE) by significant margins. Unlike linear parameter averaging, CAT avoids cross-interaction terms $B_i A_j^\top$ for $i \neq j$, preserving adapter modularity.
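
A minimal sketch of the calibration step for a single layer is given below, assuming precomputed skill deltas $D_i = B_i A_i^\top$ and substituting a toy MSE regression loss for the downstream task loss on $D_{\text{cal}}$; the sigmoid parameterization, hyperparameters, and function name are illustrative assumptions, not the paper’s exact recipe.

```python
import torch
import torch.nn.functional as F

def calibrate_alphas(W0, deltas, X, Y, steps=200, lr=5e-2):
    """Tune CAT-style mixing coefficients on a small calibration set (one layer shown).

    W0     : (d, k) frozen base weight
    deltas : list of precomputed skill updates D_i = B_i @ A_i.T, each of shape (d, k)
    X, Y   : calibration inputs (n, k) and regression targets (n, d) -- stand-ins for D_cal
    """
    raw = torch.zeros(len(deltas), requires_grad=True)        # unconstrained parameters
    opt = torch.optim.Adam([raw], lr=lr)
    for _ in range(steps):
        alphas = torch.sigmoid(raw)                           # keeps 0 <= alpha_i <= 1
        W = W0 + sum(a * D for a, D in zip(alphas, deltas))   # merged layer weight
        loss = F.mse_loss(X @ W.T, Y)                         # stand-in for the task loss
        opt.zero_grad(); loss.backward(); opt.step()
    return torch.sigmoid(raw).detach()
```

In practice the same coefficients would be optimized against the actual task loss of the full model, typically for a single epoch over $D_{\text{cal}}$.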

2.2 Token-Level Dynamic LoRA Soups

Token-level LoRA soups (Belofsky, 2023) enact adaptive composition at each generation step. For prediction at token $x_t$, routing weights are computed as

$$\alpha_i(x_t) = \frac{\exp(s_i T_i)}{\sum_j \exp(s_j T_j)},$$

where $s_i = \cos(p(x_{<t}), a_i)$ is the similarity between the prefix embedding and adapter $i$’s centroid, with temperature $T_i = 4$ for $i = \arg\max_j s_j$ and $T_i = 1$ otherwise. The active LoRA update is

$$\Delta W(x_t) = \sum_{i=1}^M \alpha_i(x_t)\, \Delta W_i,$$

and the effective weights per generation step are $W(x_t) = W_0 + \Delta W(x_t)$. This allows on-the-fly “stirring” of adapters depending on token-level context similarity.
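
A sketch of the routing computation under these definitions is shown below; the embedding dimension, toy inputs, and helper name are illustrative, and obtaining prefix embeddings and adapter centroids is assumed to follow Belofsky (2023).

```python
import torch
import torch.nn.functional as F

def token_routing_weights(prefix_emb, centroids, t_max=4.0, t_base=1.0):
    """Per-token adapter weights from prefix/centroid cosine similarity.

    prefix_emb : (h,) embedding of the generated prefix x_<t
    centroids  : (M, h) one embedding centroid per skill adapter
    Returns an (M,) softmax weight vector, with the best-matching adapter
    sharpened by a larger temperature multiplier (T_i = 4 vs. 1).
    """
    s = F.cosine_similarity(prefix_emb.unsqueeze(0), centroids, dim=-1)  # (M,) similarities
    T = torch.full_like(s, t_base)
    T[s.argmax()] = t_max                      # boost the most similar adapter
    return torch.softmax(s * T, dim=-1)

# Toy usage: 3 adapters, 16-dimensional embedding space.
weights = token_routing_weights(torch.randn(16), torch.randn(3, 16))
```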

2.3 LoRA on the Go (LoGo): Instance-Level Dynamic Selection

LoGo (Lee et al., 10 Nov 2025) generalizes dynamic composition by extracting adapter activation scores $\{s_i\}$ from a single forward pass, using either the $\ell_2$ norm or the inverse entropy of the adapter outputs $\mathbf{o}_{i,T}$. The top-$k$ most relevant adapters (by $s_i$) form a set $\mathcal{S}$, and their outputs are mixed as

$$\mathbf{o}_{\mathrm{merge}} = \sum_{i \in \mathcal{S}} \tilde w_i\, \mathbf{o}_{i,T}, \qquad \tilde w_i = \frac{s_i}{\sum_{j \in \mathcal{S}} s_j}.$$

Output-based (mixture) and parameter-based (fusion) variants are both studied; mixture is preferred for efficiency. LoGo is entirely training-free and instance-adaptive.
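
The sketch below illustrates the training-free selection-and-mixture step; the probing details, tensor shapes, and function name are assumptions for illustration and simplify the procedure of Lee et al. (10 Nov 2025).

```python
import torch

def logo_mix(adapter_outputs, k=3, probe="norm"):
    """Instance-level, training-free mixing of adapter outputs (LoGo-style sketch).

    adapter_outputs : (M, d) per-adapter output vectors o_{i,T} from a single forward pass
    probe           : "norm" (l2 norm) or "entropy" (inverse entropy of softmaxed outputs)
    Returns the weighted mixture over the top-k scoring adapters.
    """
    if probe == "norm":
        s = adapter_outputs.norm(dim=-1)                              # (M,) l2-norm scores
    else:
        p = torch.softmax(adapter_outputs, dim=-1)
        entropy = -(p * p.clamp_min(1e-12).log()).sum(dim=-1)
        s = 1.0 / (entropy + 1e-8)                                    # inverse-entropy scores
    idx = torch.topk(s, k).indices                                    # top-k relevant adapters
    w = s[idx] / s[idx].sum()                                         # normalized weights
    return (w.unsqueeze(-1) * adapter_outputs[idx]).sum(dim=0)

# Toy usage: 10 candidate adapters with 64-dimensional outputs, keep the top 3.
mixed = logo_mix(torch.randn(10, 64), k=3)
```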

3. Theoretical and Practical Properties

Table: Comparison of LoRA Soup Schemes

| Approach | Mixing Granularity | $\alpha_i$ Selection | Calibration |
|---|---|---|---|
| CAT | Static (offline) | Optimized on $D_{\text{cal}}$ | Required (small) |
| Token-level | Per-token | Gradient-free, similarity-based | None (at inference) |
| LoGo | Per-instance/block | Forward-pass probe (norm/entropy) | None |

CAT preserves exact skill updates without cross-terms, ensuring skills remain modular. Dynamic soups (token-level, instance-level) allow context-driven, fine-grained composition—adapting skills to the evolving input.

All approaches reduce catastrophic forgetting typical of data-mixing baselines and are parameter-efficient. CAT’s $\alpha$-learning requires less than 1% of the compute of skill-LoRA fine-tuning, while LoGo adds negligible inference overhead (under ~1.87 sec/sample on LLaMA-3.1-8B, single GPU) for up to $k = 10$ adapters (Lee et al., 10 Nov 2025).

4. Empirical Performance and Benchmarking

4.1 CAT Results

On GSM-Hard (math+code composition), CAT achieves 21.11% execution accuracy versus 14.18% (math LoRA), 8.04% (code LoRA), 18.80% (DATA-MIX), and 16–18% for other merge baselines. This translates to a +48.8% relative improvement over the strongest single-skill LoRA and a 257% super-linear gain, demonstrating genuine compositional generalization (Prabhakar et al., 2024).

For proprietary manual Q&A, CAT yields 58% accuracy (closed-book, GPT-4 judge) compared to 27–54% for individual and mixed baselines, approaching open-book upper bounds while requiring zero retrieval.

On technical reading comprehension, CAT achieves ELO 210 (vs. 190 for DATA-MIX, 193 for MoE, 150–180 for single skills).

Prompt-format robustness also benefits from CAT: accuracy remains stable (≈80% across unseen formats) where data-mix baselines degrade sharply.

4.2 Dynamic and Instance-Level LoRA Soups

Token-level dynamic routing with adaptation interval $k = 2$ (re-routing every other token) achieves 48.3% average accuracy across ARC-Challenge, GSM8K, CodeAlpaca-20k, and SQuAD, outperforming both single-task adapters and other merging intervals ($k \neq 2$). The interval $k = 2$ balances noise and adaptability (Belofsky, 2023).

LoGo’s instance-level merging over 27 diverse datasets delivers an average accuracy (LLaMA-3.1-8B, top-20 adapters) of 40.0 with the entropy probe, comparable to the training-based LoRAHub (40.3) and a mixture-of-experts retriever (40.4) despite requiring no training. Struct-to-text and NLI tasks show gains of up to +3.6% over the base model; on CodeXGLUE code generation, LoGo (14.4 BLEU) exceeds LoRARetriever (13.3) (Lee et al., 10 Nov 2025).

5. Limitations, Design Choices, and Practical Recommendations

5.1 Limitations

  • CAT’s focus is on binary composition ($k = 2$ skills); when extending to $k \gg 2$, data-mix baselines can overtake merging (Prabhakar et al., 2024).
  • The linear-combination assumption may not capture nonlinear skill interactions.
  • Token-level and instance-level soups do not refine $\alpha$ via backpropagation; the gradient-free design is chosen for efficiency.

5.2 Design and Tuning

  • CAT: Use layer-wise $\alpha$ coefficients trained on 5% calibration data; gains are robust whether $\alpha$ is fixed or learned, but optimizing yields additional improvement.
  • Token-level mixes: The adaptation interval $k$ is a key hyperparameter; $k = 2$ offers the best tradeoff between adaptation and noise (Belofsky, 2023).
  • LoGo: The effect of the number of merged adapters plateaus around $k \geq 10$; probing with the “last” hidden state is marginally superior; output-based mixture is 2–3× faster than parameter fusion.

5.3 Practical Usage

  • CAT is recommended whenever skill decomposition is possible; compute cost is dominated by initial skill LoRA training.
  • Dynamic and instance-level soups require minimal or no additional data or tuning, making them practical for real-world deployment.

6. Research Directions and Applications

Current LoRA soup methodologies are particularly suited for modular skill composition, cross-domain transfer, and environments where retraining on union data is not feasible. Empirical evidence supports their superiority for composing distinct skills required in math+code QA, proprietary-domain Q&A, technical reading, and prompt robustness (Prabhakar et al., 2024).

Planned extensions include:

  • Generalization to $k \gg 2$ domains via smart calibration or hierarchical merging.
  • Theoretical studies of skill adapter subspace geometry.
  • Dynamic, input-adaptive weightings $\alpha(\cdot)$ via hybrid approaches (CAT+MoE).
  • Multimodal and RL agent applications (Prabhakar et al., 2024).

This suggests LoRA soups represent a robust, compute-efficient paradigm for skill composition in LLMs, bridging the efficacy of parameter-efficient tuning and the tractability of modular deployment.
