
Continual Mixture-of-Experts Adapter

Updated 14 November 2025
  • The paper introduces a modular architecture that preserves learned knowledge while dynamically adding adapter experts upon detecting distribution shifts.
  • It leverages autoencoder-based reconstruction errors and a trainable router to detect out-of-distribution samples and selectively expand model capacity.
  • Empirical results demonstrate sub-linear parameter growth with improved accuracy on standard continual learning benchmarks compared to per-task expansion methods.

Continual Mixture-of-Experts Adapter (CMoE-Adapter) refers to a modular architecture for continual learning in neural networks, specifically designed to manage the stability–plasticity dilemma—preserving previously acquired knowledge when adapting to new, non-stationary data distributions. The CMoE-Adapter, as exemplified by the SEMA framework (Wang et al., 27 Mar 2024), leverages parallel adapter experts, explicit distribution-shift detection at multiple representation levels, and a trainable mixture router, enabling adaptive sub-linear model expansion and enhanced parameter efficiency without memory rehearsal.

1. Core Architectural Components

The CMoE-Adapter inserts a series of modular adapter blocks into a frozen pre-trained model (PTM; e.g., a Vision Transformer), augmenting each Transformer block $\ell$ with a pool of $M^\ell$ adapters. Each adapter block consists of:

  • Functional Adapter: A bottleneck MLP of the form down-project $\to$ activation $\to$ up-project, operating on an input $x^\ell \in \mathbb{R}^d$:

$$f(x^\ell) = \mathrm{ReLU}(x^\ell W_\downarrow)\,W_\uparrow, \qquad W_\downarrow \in \mathbb{R}^{d \times r},\;\; W_\uparrow \in \mathbb{R}^{r \times d}$$

Multiple adapters $f_i(\cdot)$ can be instantiated within a layer.

  • Representation Descriptor: Each functional adapter is paired with an autoencoder $g_i = (p_\theta, q_\phi)$ trained to reconstruct its input $x^\ell$:

$$\mathcal{L}_\mathrm{RD} = \mathbb{E}_{x^\ell} \left\| x^\ell - g_i(x^\ell) \right\|^2$$

The test-time reconstruction error $e_\text{recon} = \| x^\ell - g_i(x^\ell) \|^2$ serves as a distribution-shift indicator.

These modules remain “frozen” once trained for a given task, serving as immutable knowledge experts for future routing and reuse.
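
As a concrete illustration of these two components, here is a minimal PyTorch sketch of a bottleneck functional adapter and its paired representation descriptor; the module names, dimensions, and single-hidden-layer autoencoder are illustrative assumptions, not the SEMA implementation.

```python
import torch
import torch.nn as nn


class FunctionalAdapter(nn.Module):
    """Bottleneck adapter: down-project -> ReLU -> up-project."""

    def __init__(self, d: int, r: int):
        super().__init__()
        self.down = nn.Linear(d, r, bias=False)  # W_down in R^{d x r}
        self.up = nn.Linear(r, d, bias=False)    # W_up   in R^{r x d}

    def forward(self, x):  # x: (..., d)
        return self.up(torch.relu(self.down(x)))


class RepresentationDescriptor(nn.Module):
    """Autoencoder g_i trained to reconstruct the layer input; its
    reconstruction error later serves as the distribution-shift signal."""

    def __init__(self, d: int, h: int):
        super().__init__()
        self.encoder = nn.Linear(d, h)
        self.decoder = nn.Linear(h, d)

    def forward(self, x):
        return self.decoder(torch.relu(self.encoder(x)))

    def recon_error(self, x):
        # e_recon = ||x - g_i(x)||^2, averaged over the feature dimension
        return ((x - self.forward(x)) ** 2).mean(dim=-1)
```

Once a task finishes, both modules would be frozen and kept as reusable experts, matching the description above.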

2. Mixture Routing and Expert Composition

A crucial aspect is the dynamically expandable soft-weighting router at each injection point:

  • Router: A linear head $h^\ell(x^\ell; W_r): \mathbb{R}^d \to \mathbb{R}^{M^\ell}$, parameterized by $W_r \in \mathbb{R}^{d \times M^\ell}$, producing softmax weights $w_i^\ell = \mathrm{softmax}(h^\ell(x^\ell))_i$.
  • Adapter Output: The block output aggregates the frozen layer value and all adapters via the mixture weights:

$$y^\ell = \mathrm{MLP}(x^\ell) + \sum_{i=1}^{M^\ell} w_i^\ell f_i(x^\ell)$$

or, generically, $y = \sum_{i=1}^N g_i(h)\,\mathrm{Adapter}_i(h)$.

When new experts are added (in response to detected distribution shifts), the router is expanded by an additional output dimension, whose parameters are trained on the present task, while previous columns and adapters are strictly frozen. This mechanism enforces hard knowledge preservation.
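
The expansion-and-mixing mechanics can be sketched in PyTorch as follows; the class `ExpandableMoEAdapterBlock`, its method names, and the gradient-masking trick for the copied router rows are illustrative assumptions rather than the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ExpandableMoEAdapterBlock(nn.Module):
    """Pool of adapter experts mixed by a soft router whose output dimension
    grows by one each time a new expert is appended; older experts and the
    copied router rows stay frozen."""

    def __init__(self, d: int):
        super().__init__()
        self.d = d
        self.experts = nn.ModuleList()
        self.router = nn.Linear(d, 0, bias=False)  # starts empty, grows per expert

    def add_expert(self, adapter: nn.Module):
        # Freeze everything trained so far (existing experts and router rows).
        for p in self.parameters():
            p.requires_grad_(False)
        self.experts.append(adapter)  # the new adapter remains trainable

        # Expand the router by one zero-initialized output dimension.
        old_w = self.router.weight.data                        # (M, d)
        new_router = nn.Linear(self.d, len(self.experts), bias=False)
        with torch.no_grad():
            new_router.weight.zero_()
            new_router.weight[: old_w.size(0)] = old_w
        # Keep the copied rows effectively frozen by zeroing their gradients.
        n_old = old_w.size(0)
        if n_old > 0:
            new_router.weight.register_hook(
                lambda g, n=n_old: torch.cat(
                    [torch.zeros_like(g[:n]), g[n:]], dim=0
                )
            )
        self.router = new_router

    def forward(self, x, frozen_mlp_out):
        # y = MLP(x) + sum_i w_i * f_i(x), with w = softmax(router(x))
        if len(self.experts) == 0:
            return frozen_mlp_out
        w = F.softmax(self.router(x), dim=-1)                  # (..., M)
        mix = sum(w[..., i : i + 1] * f(x) for i, f in enumerate(self.experts))
        return frozen_mlp_out + mix
```

Freezing earlier experts and masking gradients on the copied router rows is what enforces the hard knowledge preservation described above; only the newest adapter and router row receive updates.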

3. Automatic Self-Expansion via Shift Detection

Distribution-shift detection underpins adaptive model growth:

  • Descriptor Statistics: For each representation descriptor $g_i$, running means and standard deviations $(\mu_\text{recon}, \sigma_\text{recon})$ of reconstruction errors are maintained.
  • Trigger z-score: $z_i = \frac{e_\text{recon} - \mu_\text{recon}}{\sigma_\text{recon}}$. If all experts yield $z_i > z'$ (for some threshold $z'$), the layer input is deemed OOD by all adapters.
  • Expansion Protocol: At expansion, a new functional adapter $f_\text{new}$ and descriptor $g_\text{new}$ are appended to the layer, the router $W_r$ gains an extra column (initialized at zero), and only the new expert/column is trained. The mechanism stops after the first expansion each task to minimize growth.

$$\text{Expand at layer } \ell \quad \text{if} \quad \min_{i \leq M^\ell} D(\mathcal{D}_t, \phi_i^\ell) > \tau \quad \Leftrightarrow \quad \min_i z_i > z'$$

where $\phi_i^\ell$ is the $i$-th adapter's descriptor; $D$ denotes the divergence proxy.
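
The trigger can be tracked with simple per-descriptor running statistics. The sketch below is an assumed bookkeeping scheme, not the paper's implementation: it uses Welford's online algorithm for the running mean and variance and an arbitrary default threshold inside the 1.0 to 2.0 range that the ablations report as stable.

```python
import math


class ShiftDetector:
    """Per-layer detector: keeps running mean/std of each descriptor's
    reconstruction error and triggers expansion only when *all* experts
    report a z-score above the threshold z'."""

    def __init__(self, z_threshold: float = 1.5):
        # Threshold chosen arbitrarily inside the [1.0, 2.0] range reported
        # as stable; treat it as a tunable assumption.
        self.z_threshold = z_threshold
        self.stats = []  # one [count, mean, M2] Welford triple per expert

    def add_expert(self):
        self.stats.append([0, 0.0, 0.0])

    def update(self, expert_idx: int, recon_error: float):
        # Welford's online update of the running mean/variance for one expert.
        n, mean, m2 = self.stats[expert_idx]
        n += 1
        delta = recon_error - mean
        mean += delta / n
        m2 += delta * (recon_error - mean)
        self.stats[expert_idx] = [n, mean, m2]

    def should_expand(self, recon_errors):
        """recon_errors[i] is the current error of expert i's descriptor."""
        if not self.stats:
            return True  # assumption: with no experts yet, always add one
        z_scores = []
        for (n, mean, m2), e in zip(self.stats, recon_errors):
            std = math.sqrt(m2 / (n - 1)) if n > 1 else 0.0
            # With too little data, treat the expert as in-distribution so
            # that unreliable statistics cannot trigger spurious growth.
            z_scores.append((e - mean) / std if std > 0 else 0.0)
        return min(z_scores) > self.z_threshold
```

During training on a task, `update` would be called with each batch's reconstruction errors so that the statistics reflect the distributions the existing experts have already absorbed.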

4. Training Objectives and Stability–Plasticity Balance

The training objective on new task $t$ is split:

  • Classification Loss: Trains new adapters/routers:

$$\mathcal{L}_\text{CE} = \mathbb{E}_{(x,y)\in \mathcal{D}^t}\, \ell(f_\theta(x), y)$$

$f_\theta$ comprises all frozen experts (mixed via the router) plus the newly added adapter.

  • Descriptor Loss: The autoencoder descriptor is trained on reconstruction only:

$$\mathcal{L}_\mathrm{RD} = \mathbb{E}_{x^\ell} \left\| x^\ell - g_\text{new}(x^\ell) \right\|^2$$

There is no need for explicit regularization on frozen experts—stability is guaranteed by freezing parameters, while all plasticity is confined to the new adapter/router.
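
A minimal sketch of how the two objectives separate in practice, assuming PyTorch and an autoencoder module for the new descriptor (the helper name and signature are illustrative):

```python
import torch.nn.functional as F


def new_expert_losses(logits, targets, layer_input, new_descriptor):
    """Losses used when a new expert has been added for the current task.

    logits:         output of the full model (frozen experts mixed by the
                    router plus the newly added adapter)
    layer_input:    features x^l entering the expanded layer
    new_descriptor: autoencoder g_new paired with the new adapter
    """
    # L_CE trains the new adapter and the new router column.
    loss_ce = F.cross_entropy(logits, targets)

    # L_RD trains the descriptor on reconstruction only; detaching the input
    # keeps this objective from influencing the backbone or the adapters.
    x = layer_input.detach()
    loss_rd = ((x - new_descriptor(x)) ** 2).mean()
    return loss_ce, loss_rd
```

Because every previously trained parameter has `requires_grad=False`, optimizing these losses touches only the new adapter, the new router column, and the new descriptor.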

5. Continual Learning Protocol and Sub-linear Parameter Growth

The overall protocol advances sequentially:

  • Task Sequence: For each new task, detection is run top-down over layers; expansion occurs only at the first layer (if any) that triggers. If triggered, only the new adapter/router column is trained; otherwise, no parameters are updated.
  • No Rehearsal: The method foregoes memory-based rehearsal strategies; prior data are never revisited.
  • Parameter Efficiency: Because expansion occurs strictly on genuine distribution shift, parameter growth is sub-linear: for 20 tasks, total new parameters are 0.5–1M vs. 4M+ in per-task expansion methods.

Adapters are reused across incoming tasks wherever feature distributions are similar, maximizing parameter efficiency and computational resource sharing.
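
The per-task protocol can be summarized as a short driver loop. The sketch below is schematic: the callables it receives (error collection, expert construction, the per-batch training step) are hypothetical placeholders rather than functions from the paper or any library.

```python
def run_task(task_loader, blocks, detectors, collect_errors, make_expert, train_step):
    """One task of the no-rehearsal protocol: scan layers top-down, expand at
    the first layer whose detector fires (at most once per task), then train
    only the newly added adapter, router column, and descriptor."""
    expanded = False
    for layer_idx in reversed(range(len(blocks))):   # top-down over layers
        errors = collect_errors(task_loader, blocks[layer_idx])
        if detectors[layer_idx].should_expand(errors):
            blocks[layer_idx].add_expert(make_expert())
            detectors[layer_idx].add_expert()
            expanded = True
            break                                    # at most one expansion per task
    if expanded:
        for batch in task_loader:
            train_step(batch)                        # updates only the new parameters
    # No rehearsal: data from earlier tasks is never revisited.
```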

6. Empirical Validation and Ablation Analyses

Performance metrics on standard CL benchmarks substantiate the approach:

Final accuracy ($\mathcal{R}_N$):

  • CIFAR-100 (20-task Class-IL): 86.98% (vs. 85–86% for competing methods)
  • ImageNet-R: 74.53% (vs. 69–70%)
  • ImageNet-A: 53.32% (vs. 49–50%)
  • VTAB: 89.64% (vs. 83–84%)

Ablation findings:

  • Mixture design: Learned routing is superior to uniform/random weighting, which incurs a 5–10% drop in $\mathcal{R}_N$.
  • Expansion logic: Forcing per-task adapter addition doubles parameter cost and can decrease accuracy via router interference.
  • Thresholding: Accuracy is stable for $z$-score thresholds in $[1.0, 2.0]$, although the number of adapters instantiated varies.
  • Adapter type: Both “AdaptFormer” (linear) and “Convpass” (convolutional) designs perform comparably.
  • Multi-layer vs. single-layer adaptation: Expanding in the last 3 blocks yields a 2–3% accuracy improvement.

7. Design Implications, Limitations, and Extensions

The CMoE-Adapter paradigm enables strict non-interference (zero forgetting) of learned tasks, robust adaptation to genuine shifts, and efficient parameter utilization. Its modular architecture with representation descriptors provides a principled expansion trigger, precluding unnecessary growth and computational bloat.

A plausible implication is that multi-level self-expansion (beyond the final block) systematically improves continual learning outcomes. The strong performance without replay suggests functional adapters plus OOD descriptors constitute a powerful mechanism for balancing stability and plasticity in frozen PTM frameworks.

Current limitations include the computational cost of autoencoder-based shift detection and the finite expansion window per layer/task. Extensions could explore alternative descriptor types, hybrid replay triggers, or more granular router architectures for fine-grained knowledge allocation.

In conclusion, the CMoE-Adapter instantiated in SEMA (Wang et al., 27 Mar 2024) demonstrates a parameter-efficient, modular mixture strategy for continual learning, achieving state-of-the-art results in strict no-rehearsal settings and establishing a template for future expansion- and reuse-based CL frameworks in deep neural architectures.
