
Parameter-Efficient Expert Composition

Updated 17 December 2025
  • Parameter-efficient expert composition is a modular approach that fuses low-cost, fine-tuned modules using linear operations and gating mechanisms.
  • It enables efficient multi-task transfer, distribution generalization, and continual learning by synthesizing diverse expert behaviors without full-model fine-tuning.
  • Advanced techniques such as direct summation, learned gating, and subspace merging address parameter conflicts while dynamically adapting to contextual needs.

Parameter-efficient expert composition refers to modular schemes enabling the combination or synthesis of multiple fine-tuned, lightweight parameter-efficient modules—each specialized for distinct domains, tasks, or skills—without incurring the memory or compute costs associated with full-model fine-tuning or joint training. Such approaches exploit linearity, subspace structure, or explicit gating to aggregate the expertise of multiple modules, thereby supporting efficient distribution generalization, continual learning, multi-task transfer, and composite behavior synthesis. Strategies for expert composition range from direct parameter-space arithmetic and soft gating to advanced subspace and importance-weighted merging, unified by the imperative of strict parameter- and data-efficiency.

1. Core Concepts: Parameter-Efficient Modules and Weight-Space Arithmetic

Parameter-efficient expert composition is predicated on the wide adoption of parameter-efficient fine-tuning (PEFT) methods such as Low-Rank Adaptation (LoRA) and (IA)³ (Infused Adapter by Inhibiting and Amplifying Inner Activations). Given a frozen large pretrained model, PEFT injects small, trainable modules—typically low-rank matrices or per-layer scaling vectors—whose footprint is ≤1% of the backbone parameter count. Each module, or Parameter-Efficient Module (PEM), is fine-tuned independently on distinct domains or tasks, resulting in a library of domain “experts” (Patel et al., 24 Jan 2025, Asadi et al., 23 Feb 2024).
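To make the notion of a PEM concrete, the following minimal sketch wraps a frozen linear layer with LoRA-style low-rank factors; only the factors $A$ and $B$ are trainable. Class and argument names (LoRALinear, rank, alpha) are illustrative and not drawn from any particular library.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank delta (a LoRA-style PEM)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # freeze the pretrained weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x):
        # y = W0 x + (alpha/r) * B A x  -- only A and B receive gradients
        return self.base(x) + self.scaling * (x @ self.A.T) @ self.B.T

layer = LoRALinear(nn.Linear(768, 768), rank=8)
out = layer(torch.randn(2, 768))  # same interface as the frozen layer
```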

The central operational insight is that these PEMs—due to their weight sharing and common initialization—lie in a shared low-dimensional subspace amenable to linear operations. Composer functions operate directly in parameter space, most commonly via weighted sums $\bm{\Theta}_{P} = \sum_{i=1}^{k} \lambda_i\,\bm{\theta}_{T_i}$ with $\sum_{i=1}^{k} \lambda_i = 1$, where $\bm{\theta}_{T_i}$ denotes the parameters of the $i$-th fine-tuned PEM. In the degenerate case of $\lambda_i = 1$ for all $i$ (dropping the normalization), the composition reduces to a simple additive merge.
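A minimal sketch of this weighted parameter-space composition, assuming each expert is stored as a state dict of tensors with identical keys and shapes (the helper name compose_pems is illustrative):

```python
import torch

def compose_pems(experts, lambdas):
    """Weighted sum of PEM parameters: Theta_P = sum_i lambda_i * theta_i."""
    assert abs(sum(lambdas) - 1.0) < 1e-6, "mixture weights should sum to 1"
    merged = {}
    for key in experts[0]:
        merged[key] = sum(lam * exp[key] for lam, exp in zip(lambdas, experts))
    return merged

# Uniform composition of three hypothetical LoRA experts
experts = [{"layer0.A": torch.randn(8, 768), "layer0.B": torch.randn(768, 8)}
           for _ in range(3)]
theta_P = compose_pems(experts, lambdas=[1/3, 1/3, 1/3])
```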

Uniform and learned composition have been applied in both vision (e.g., ViT-Base) and language (e.g., T5-Large) settings, demonstrating significant improvements in few-shot transfer accuracy versus full fine-tuning or single LoRA retraining (Asadi et al., 23 Feb 2024). These methods introduce no new parameters at composition time and do not require further data or gradient steps.

2. Composition Mechanisms: Direct Summation, Gated Mixtures, and Routing

2.1 Unweighted and Weighted Summation

Direct summation in parameter space, as exemplified by MBTI trait composition (Patel et al., 24 Jan 2025), exploits linear mode connectivity of fine-tuned modules, yielding composite experts that inherit the constituent behaviors of their parts. For instance, LoRA or (IA)³ PEMs trained on individual MBTI dichotomies are simply summed (or weighted-summed) to instantiate any of the $2^4 = 16$ composite personality types, achieving >50% accurate trait classification rates across most dichotomies without any post-merge fine-tuning.

Learned weighting permits data-driven optimization of the mixture coefficients $\lambda_i$ to maximize a downstream alignment score on a small held-out set. This two-step routine yields robustness to domain-specific strength and conflict, especially when expert behaviors are not entirely orthogonal (Asadi et al., 23 Feb 2024, Zhang et al., 2023).
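One way to realize this two-step routine, sketched under the assumption that a differentiable merged_forward function evaluates the backbone with a given merged PEM on a held-out batch (this helper is a placeholder, not a library API):

```python
import torch

def learn_lambdas(experts, heldout_batches, merged_forward, steps=100, lr=1e-2):
    # Trainable logits; the softmax keeps the mixture convex (weights sum to 1)
    logits = torch.zeros(len(experts), requires_grad=True)
    opt = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        lambdas = torch.softmax(logits, dim=0)
        merged = {k: sum(lam * e[k] for lam, e in zip(lambdas, experts))
                  for k in experts[0]}
        loss = sum(merged_forward(merged, x, y) for x, y in heldout_batches)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.softmax(logits, dim=0).detach()
```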

2.2 Arithmetic Operators for Modular Fusion and Unlearning

Compositional operators extend beyond simple addition. Negation (⊖) of a PEM enables subtraction of learned behaviors: e.g., detoxification of LLMs proceeds by subtracting the toxic component LoRA (via inverting its low-rank factors) from a base LoRA module (Zhang et al., 2023). Arbitrary chains of addition, scalar-weighted mixing, and negation can express tasks such as domain analogy, sequential unlearning, and multi-trait blending.

These operators depend on explicit knowledge of the PEFT scheme’s parameterization:

  • LoRA: for a module with low-rank factors $(A, B)$, $\text{negate}(A, B) = (-A, B)$.
  • (IA)³: for a scaling vector $m$, $\text{negate}(m) = 2 - m$.
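A compact sketch of both negation operators; the tensor layouts follow common conventions and are assumptions rather than a prescribed format:

```python
import torch

def negate_lora(A: torch.Tensor, B: torch.Tensor):
    # Flipping the sign of one low-rank factor negates the delta B @ A
    return -A, B

def negate_ia3(m: torch.Tensor):
    # (IA)^3 scales activations by m; reflecting around 1 inverts the effect
    return 2.0 - m
```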

3. Context-Conditioned and Dynamic Composition

Static weight-space mixing suffices for domain generalization and multi-trait fusion in settings with known or fixed expertise requirements. Emerging methods introduce dynamic gating networks that compute data-dependent mixture weights at inference time, enabling context-sensitive expert composition (Liu et al., 9 Feb 2025). In PSEC, for instance, a small MLP router maps the observed state to skill coefficients, which are then used to synthesize a parameter-weighted policy in real time. Such gating networks are typically lightweight—a two-layer MLP suffices in practice, with negligible parameter overhead.
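A minimal sketch of such a context-conditioned router, in the spirit of PSEC: a two-layer MLP maps the observed state to convex coefficients over a library of LoRA “skills”, which can then drive parameter-space merging as in Section 1. Class and argument names are illustrative.

```python
import torch
import torch.nn as nn

class SkillRouter(nn.Module):
    """Two-layer MLP producing convex mixture weights over k skill modules."""
    def __init__(self, state_dim: int, num_skills: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_skills),
        )

    def forward(self, state):
        return torch.softmax(self.net(state), dim=-1)

router = SkillRouter(state_dim=17, num_skills=3)
weights = router(torch.randn(17))          # e.g. three state-dependent coefficients
# merged_delta[key] = sum_i weights[i] * skill_i[key]   (parameter-space synthesis)
```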

Expert composition via dynamic routers is particularly powerful for RL and continual learning: state-dependent convex combinations of LoRA “skills” enable on-the-fly interpolation between reward-seeking and safety-constrained behaviors, outperforming both fixed-weight and “action-space” fusion in multi-objective settings. This arrangement sustains both forward transfer and resistance to catastrophic forgetting under limited data, as evidenced by forward-transfer and area-under-curve gains on challenging Lifelong Robot Learning benchmarks (Liu et al., 9 Feb 2025, Lei et al., 6 Jun 2025).

4. Parameter-Efficient MoE Architectures and Hybrid Schemes

Parameter-efficient composition also underpins modern Mixture-of-Experts (MoE) designs. Lightweight experts—each a LoRA or (IA)³ module of sub-1% size—are managed by trainable routers that allocate mixture weights per token or task (Zadouri et al., 2023, Liu et al., 2023, Liu et al., 12 Nov 2024). Several further optimizations enable extreme parameter saving:

  • Central-tensor sharing via MPO: Matrix product operator decomposition isolates a core “central” tensor, shared among all experts, and small expert-specific auxiliary tensors—reducing parameter costs by >80% vs. conventional MoE (Gao et al., 2022).
  • Single-task or universal gating: MOELoRA trains a global task-vs.-expert gate, fusing per-task LoRA outputs with static base weights, achieving full-finetuning quality at >99.9% parameter savings (Liu et al., 2023).
  • Routed PEFT (PERFT): Modularizes adapters per expert and inserts a separate router per MoE or PEFT block, maintaining parallel parameter efficiency and allowing both MoE-style sparse gating and PEFT pruning (Liu et al., 12 Nov 2024).

Soft-merge routing, rather than hard top-k, is universally favored in the extreme PEFT regime for stability and zero-shot cross-domain performance (Zadouri et al., 2023).
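The following sketch illustrates soft-merge routing over lightweight LoRA experts on a single linear layer: every token receives a dense convex combination of the expert deltas rather than a hard top-k selection. This is a simplified stand-in for MoLoRA-style designs; class names and shapes are assumptions.

```python
import torch
import torch.nn as nn

class SoftMoLoRALinear(nn.Module):
    """Frozen linear layer plus a soft mixture of per-expert LoRA deltas."""
    def __init__(self, base: nn.Linear, num_experts: int = 4, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(num_experts, rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(num_experts, base.out_features, rank))
        self.router = nn.Linear(base.in_features, num_experts)

    def forward(self, x):                                    # x: (batch, seq, in)
        gate = torch.softmax(self.router(x), dim=-1)         # (b, s, E) soft weights
        low = torch.einsum("bsd,erd->bser", x, self.A)       # per-expert rank-r codes
        delta = torch.einsum("bser,eor->bseo", low, self.B)  # per-expert deltas
        mixed = (gate.unsqueeze(-1) * delta).sum(dim=2)      # soft merge over experts
        return self.base(x) + mixed

layer = SoftMoLoRALinear(nn.Linear(256, 256))
y = layer(torch.randn(2, 10, 256))
```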

5. Advanced Merging: Subspace Alignment, Sensitivity, and Selective Compression

Robust expert composition faces three fundamental obstacles: (a) parameter conflict among idiosyncratically specialized experts, (b) loss of “secondary” singular directions on merging, and (c) parameter budget constraints for inference or storage.

Recent merging schemes incorporate structural priors:

  • CoPA-Merging (Zeng et al., 24 Feb 2025): Prunes and rescales adapter matrices along main singular directions, and then cross-normalizes per-task gains to avoid task suppression.
  • Sub-MoE (Li et al., 29 Jun 2025): Adopts adaptive expert clustering (via output similarity), then joint-SVD subspace merging; shared $U$ matrices eliminate conflicting bases, and frequency-weighted $V$ alignment controls expert influence, permitting up to a 50% reduction in experts with <15% drop in accuracy (a simplified sketch follows this list).
  • Expert Merging++ (Zhang et al., 30 Sep 2025): Learns O(KL) layer-wise (and chunk-wise importance-weighted) mixture coefficients using only unlabeled calibration data and hidden/logit alignment objectives, concentrating merging capacity on high-impact layers.
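A loose sketch of the joint-SVD subspace-merging idea, heavily simplified from the Sub-MoE recipe: concatenate two expert weight matrices, keep a shared top-$k$ left basis $U$, and average the experts' coefficients in that basis. The helper name subspace_merge is an assumption.

```python
import torch

def subspace_merge(W1: torch.Tensor, W2: torch.Tensor, k: int):
    """Merge two (out, in) expert matrices through a shared left subspace."""
    stacked = torch.cat([W1, W2], dim=1)             # (out, 2*in)
    U, _, _ = torch.linalg.svd(stacked, full_matrices=False)
    U_k = U[:, :k]                                   # shared top-k left basis
    V1, V2 = U_k.T @ W1, U_k.T @ W2                  # per-expert coefficients
    return U_k @ ((V1 + V2) / 2)                     # merged (out, in) weight

W_merged = subspace_merge(torch.randn(64, 32), torch.randn(64, 32), k=16)
```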

In sensitivity-driven expert allocation (LoRA-SMoE), squared-gradient statistics over a calibration set infer the blocks most critical for task performance under a parameter budget, adaptively assigning the number and location of experts. This approach achieves state-of-the-art accuracy at 1–2% trainable parameter fraction, with negligible computational overhead (Xu et al., 6 May 2025).
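A sketch of the squared-gradient sensitivity statistic, under the assumption that model, calib_loader, and loss_fn are supplied by the user; it illustrates the allocation principle rather than the exact LoRA-SMoE procedure.

```python
import torch

def rank_blocks_by_sensitivity(model, calib_loader, loss_fn, budget: int):
    """Return the names of the `budget` most sensitive weight matrices."""
    scores = {n: torch.zeros_like(p) for n, p in model.named_parameters()
              if p.dim() == 2}                       # score only weight matrices
    for x, y in calib_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if n in scores and p.grad is not None:
                scores[n] += p.grad.detach() ** 2    # squared-gradient statistic
    totals = {n: s.sum().item() for n, s in scores.items()}
    # Spend the expert budget on the most sensitive blocks
    return sorted(totals, key=totals.get, reverse=True)[:budget]
```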

6. Applications: Distribution Generalization, Continual Learning, and Modular AI Systems

Parameter-efficient composition enables several high-impact use-cases:

  • Distribution generalization: Additive LoRA composition interpolates trait, task, or domain specialties with no further gradient steps, enabling, for example, trait-to-personality composition in psychometrics or unseen label-set fusion in vision (Patel et al., 24 Jan 2025, Asadi et al., 23 Feb 2024).
  • Unlearning and detoxification: Subtracting “toxic” or unwanted module effects via negation achieves target detoxification while preserving pre-existing capabilities (Zhang et al., 2023).
  • Multi-task and continual learning: Dynamically composed expert mixtures, optionally with coefficient-replay for router stability, allow open-ended skill expansion and lifelong adaptation, avoiding explicit task identifiers and yielding negligible forgetting (Lei et al., 6 Jun 2025).
  • Composite AI systems: Frameworks like CoE implement global routers that select among a bank of expert LLMs under a parameter budget with task-specific routing, drastically reducing average active parameter counts while maintaining competitive performance on large-scale language and reasoning benchmarks (Jain et al., 2 Dec 2024).

Sophisticated multimodal fusion (e.g., Graft’s CAPS) employs both local functional attribution and global distributional gating to merge experts from heterogeneous domains (e.g., code and mathematics for MLLMs), exhibiting reconfigurable plug-and-play compositionality underlying scalable, domain-adaptive AI (Dai et al., 30 Jun 2025).

7. Trade-offs, Failure Modes, and Frontiers

While purely arithmetic or gating-based expert composition is fast, interpretable, and extremely parameter-efficient, open challenges remain:

  • Conflict and cancellation: Parameter overlap or destructive interference can degrade performance, especially when experts are not aligned in representation space or are highly correlated; full grid search over weights or learned merging mitigates but does not eliminate this (Patel et al., 24 Jan 2025, Zeng et al., 24 Feb 2025).
  • Calibration and regularization: Data-light methods (e.g., unsupervised alignment) require careful coefficient initialization and regularization to prevent drift. In practice, a small calibration set (<10 samples per task) suffices for effective merging (Zhang et al., 30 Sep 2025).
  • Expressivity and heterogeneity: Extending composition beyond LoRA or (IA)³ to heterogeneous adapter types, architectures, or dimensionalities remains largely unexplored.
  • Scalability: Multi-expert or multi-domain merging at hundreds of adapter modules may incur quadratic complexity in some advanced fusions (e.g., Sub-MoE).
  • Open directions: Data-free or Fisher-weighted merges, nonlinear or manifold-based composition, dynamic chunking, and extension to cross-architecture merging represent active research frontiers.

Parameter-efficient expert composition has established itself as a cornerstone for compositional, adaptable, and resource-efficient machine learning, enabling modular design and expert re-use at unprecedented scale and flexibility (Patel et al., 24 Jan 2025, Liu et al., 9 Feb 2025, Asadi et al., 23 Feb 2024, Zeng et al., 24 Feb 2025, Zhang et al., 2023, Zhang et al., 30 Sep 2025, Jain et al., 2 Dec 2024).
