
MoE-Adapter4CL: Adapter-Based Continual Learning

Updated 29 December 2025
  • MoE-Adapter4CL is an adapter-based framework that integrates task-specific and shared LoRA adapters with a gating network for continual and multi-task learning.
  • It employs adversarial regularization to ensure task-agnostic shared pathways, effectively mitigating catastrophic forgetting while promoting knowledge transfer.
  • Empirical results highlight improved accuracy and reduced operational costs, validating its scalability and efficiency in diverse industrial applications.

Mixture-of-Experts Adapters for Continual Learning (MoE-Adapter4CL) refers to the integration of Mixture-of-Experts (MoE) architectures—particularly those built from adapter modules (notably LoRA adapters)—within the context of continual learning (CL) and multi-task adaptation in LLMs. Although the precise term "MoE-Adapter4CL" does not appear as a canonical system in the current literature, this designation accurately summarizes a family of recent approaches that meld LoRA-style lightweight adaptation, modular expert routing, and continual learning via dynamic expert composition and gating. The primary objectives are mitigating catastrophic forgetting, enabling efficient knowledge transfer across tasks, and scaling to diverse application domains with minimized memory and compute overhead.

1. Architectural Principle: Adapter-Based MoE for Continual and Multi-Task Learning

Adapter-based MoE architectures reconfigure each transformer layer's feed-forward network (FFN) or select sub-components, replacing or augmenting them with a set of expert modules—each an adapter trained for a specific domain, task, or data regime. These experts are typically LoRA adapters: low-rank parameter-efficient modules that alter the frozen base model's subspace via incremental, trainable matrices. In MoE-CL designs, two forms of adapters are instantiated:

  • Task-specific LoRA experts ($\theta_t$): One instantiated per task or domain, trained and then frozen to preserve task-unique information and guarantee zero forgetting on previously seen tasks.
  • Shared LoRA expert ($\theta_s$): Trained across all tasks, intended to facilitate generic representation learning and forward transfer.

A lightweight gating network at each adapter composition point dynamically computes mixture coefficients $\beta_s, \beta_t$ (with $\beta_s + \beta_t = 1$) for combining the hidden-state updates from the shared and task-specific experts. The resulting output at each block is $z_{i+1} = \beta_s z_s + \beta_t z_t$, where $z_s$ and $z_t$ are the outputs of the shared and current task-specific adapters, respectively (Kang et al., 14 Sep 2025).

2. Mathematical Formalism and Training Objective

For an input hidden state $z_i$, the adapterized update is defined by:

$$z_s = \text{LoRA}(z_i; \theta_s), \quad z_t = \text{LoRA}(z_i; \theta_t), \quad (\beta_s, \beta_t) = G(z_i), \quad z_{i+1} = \beta_s z_s + \beta_t z_t$$

where $G$ is the gating network. LoRA applies the parameter-efficient low-rank modification using per-adapter matrices $A$, $B$:

$$\Delta W = \frac{\alpha}{r} B A, \quad W_{\text{adapted}} = W + \Delta W$$

with only $A$ and $B$ trainable and $W$ frozen.
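
The update above can be sketched in a few lines of numpy; the dimensions, random initializations, and gating weights below are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 16, 4, 8          # toy dimensions; real models use e.g. d=4096, r=8

def lora(z, A, B):
    # Low-rank update z @ (BA)^T scaled by alpha/r; the frozen base weight W is omitted
    return (alpha / r) * z @ (B @ A).T

def gate(z, W_g):
    # Two-way softmax gate G(z_i) -> (beta_s, beta_t)
    logits = z @ W_g.T
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Shared (theta_s) and task-specific (theta_t) adapters, plus gating weights
A_s, B_s = rng.normal(size=(r, d)), rng.normal(size=(d, r))
A_t, B_t = rng.normal(size=(r, d)), rng.normal(size=(d, r))
W_g = rng.normal(size=(2, d))

z_i = rng.normal(size=(1, d))
beta = gate(z_i, W_g)                             # mixture coefficients, sum to 1
z_s, z_t = lora(z_i, A_s, B_s), lora(z_i, A_t, B_t)
z_next = beta[:, 0:1] * z_s + beta[:, 1:2] * z_t  # z_{i+1} = beta_s z_s + beta_t z_t
```

In a full model this composition is repeated at every adapter insertion point, with $W$ left untouched throughout.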

A critical innovation for continual learning is adversarial regularization. The output of the shared expert $z_s$ is fed to a task discriminator $D_\phi$, trained to predict the current task label:

$$\hat{l} = \text{softmax}(D_\phi(z_s)), \quad L_{\text{disc}} = -\sum_k \mathbf{1}\{k = t\} \log \hat{l}_k$$

The shared expert is adversarially trained to minimize the ability of $D_\phi$ to recover the task identity from $z_s$, enforcing that $z_s$ remains largely task-agnostic. The overall training loss is:

$$L = L_{\text{SFT}} - \alpha \cdot L_{\text{disc}}$$

where $L_{\text{SFT}}$ is the standard instruction-tuning loss and $\alpha$ balances adversarial strength (Kang et al., 14 Sep 2025).
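
A minimal numeric sketch of this objective follows; the discriminator logits, SFT loss value, and $\alpha$ are placeholder assumptions chosen purely for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def disc_loss(logits, task_id):
    # Cross-entropy of the task discriminator D_phi on the shared-expert output z_s
    return -np.log(softmax(logits)[task_id])

logits = np.array([2.0, 0.5, -1.0])   # hypothetical D_phi(z_s) scores over 3 seen tasks
L_sft = 1.7                           # placeholder instruction-tuning loss
alpha = 0.1                           # adversarial strength

L_disc = disc_loss(logits, task_id=1)
L_total = L_sft - alpha * L_disc      # minimizing L rewards a confused discriminator
```

Because $L_{\text{disc}}$ enters with a negative sign, gradient descent on $L$ pushes the shared expert to *raise* the discriminator's loss, i.e. to scrub task-identifying signal from $z_s$.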

3. Routing Mechanisms and Adapter Mixture Variants

Mixture composition can be controlled in several ways:

  • Fixed ("gate-free") routing: Equal weighting of all experts; practical when using very few, large adapters.
  • Noisy or random top-$K$ routing: Linear gating scores with additive white Gaussian noise, followed by top-$K$ expert selection and uniform weighting among the selected experts; avoids costly routing models.
  • Learned ("router-trained") routing: A small learnable MLP or linear layer generates gating logits, mixed via softmax or top-$K$ softmax with temperature scaling; enables data-driven expert selection but increases tuning complexity.
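
The noisy top-$K$ variant from the list above can be sketched as follows (the function name and noise scale are illustrative assumptions):

```python
import numpy as np

def noisy_topk_route(scores, k, noise_std=1.0, rng=None):
    # Perturb linear gating scores with white Gaussian noise, select the top-k
    # experts, and weight the selected experts uniformly (no learned router needed)
    rng = rng or np.random.default_rng(0)
    noisy = scores + rng.normal(scale=noise_std, size=scores.shape)
    selected = np.argsort(noisy)[-k:]
    weights = np.zeros_like(scores)
    weights[selected] = 1.0 / k
    return selected, weights

selected, weights = noisy_topk_route(np.array([0.2, 1.5, -0.3, 0.8]), k=2)
```

The noise acts as cheap exploration, so experts with slightly lower scores are still occasionally activated during training.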

MoE-CL specifically uses the two-expert gating mechanism outlined above, with a trainable, shallow gating network and adversarial regularization on the shared pathway. The dual-adapter per-layer architecture does not preclude extension to multi-way mixtures, but practical empirical protocols restrict $E = 2$ for memory and compute reasons (Lee et al., 2024, Kang et al., 14 Sep 2025).

4. Algorithmic Protocol for Continual Adaptation

At every new task $t$, the protocol is:

  • Instantiate a new task-specific LoRA adapter $\theta_t$; train it only on task data and freeze it upon task completion.
  • Continue to fine-tune the shared expert $\theta_s$ using both $L_{\text{SFT}}$ and the adversarial $L_{\text{disc}}$, promoting cross-task knowledge and robustness.
  • All prior adapters remain dormant but accessible, and only the shared plus current adapters are active at any point, keeping inference/memory overhead minimal.
  • The discriminator is reset per task and only trained on the current task's data.

No replay buffer or data augmentation is required; catastrophic forgetting is controlled by parameter isolation via the frozen adapters, while transfer and continual generalization stem from the shared expert and adversarial filtering (Kang et al., 14 Sep 2025).
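
The per-task bookkeeping implied by this protocol can be sketched as below; the class and field names are illustrative, and adapter parameters are reduced to string labels:

```python
class MoECLState:
    """Minimal bookkeeping sketch for the per-task MoE-CL protocol (hypothetical names)."""

    def __init__(self):
        self.task_adapters = {}       # task id -> {"params": ..., "frozen": bool}
        self.shared = "theta_s"       # shared expert, fine-tuned across all tasks

    def begin_task(self, t):
        # Instantiate a new task-specific LoRA expert; the discriminator is reset per task
        self.task_adapters[t] = {"params": f"theta_{t}", "frozen": False}
        self.discriminator = f"D_phi(task={t})"

    def end_task(self, t):
        # Parameter isolation: freezing the task expert guarantees zero forgetting
        self.task_adapters[t]["frozen"] = True

    def active_adapters(self, t):
        # Only the shared and current task adapters are evaluated at any point
        return [self.shared, self.task_adapters[t]["params"]]

state = MoECLState()
state.begin_task(1); state.end_task(1)
state.begin_task(2)
```

Note that `active_adapters` always returns exactly two entries, which is why inference cost stays flat as the task count grows.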

5. Empirical Evaluation in Industrial and Research Contexts

MoE-Adapter4CL frameworks (as instantiated in MoE-CL) yield state-of-the-art performance on both public and industrial continual instruction-tuning benchmarks:

  • On MTL5 (four classification tasks), MoE-CL achieves an average accuracy of $80.5\% \pm 1.5$ (Llama-2 base), outperforming MoCL, O-LoRA, and per-task fine-tuning baselines. Backward-transfer (BwT) and forward-transfer (FwT) metrics demonstrate improved retention and future transfer (Kang et al., 14 Sep 2025).
  • On Tencent3 (three binary content-compliance tasks), MoE-CL achieves $63.42\% \pm 0.74$, surpassing all tested alternatives with the lowest forgetting and highest positive transfer.
  • Ablation experiments show that omitting the adversarial (GAN-style) component reduces final accuracy by 1–2 points and weakens the forgetting and transfer metrics.

Industrial deployment on Tencent Video yielded a $15.3\%$ reduction in manual review costs and a $15.3$ percentage point increase in the content-stripping rate, confirming large-scale practicality (Kang et al., 14 Sep 2025).

6. Implementation Properties, Overhead, and Scalability

Each new task adds $2$ LoRA adapters per transformer layer: one task-specific (frozen post-training), one shared (continually updated). For model dimension $d_{\text{model}}$ and LoRA rank $r$ (e.g., $r = 8$, $d_{\text{model}} = 4096$), this results in:

  • $\sim$131k parameters/layer ($\sim$4.2M for 32 layers) per task for storage.
  • Inference latency ($6.3$ ms/sample for MoE-CL) is modestly increased (by $\sim$40–50%) due to double adapter evaluation and gating.
  • GPU memory remains constant during inference, since only the shared and current task adapters are loaded; prior adapters are dormant (Kang et al., 14 Sep 2025).
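
The per-layer storage figure can be reproduced by counting the $A$ ($r \times d_{\text{model}}$) and $B$ ($d_{\text{model}} \times r$) matrices of both adapters; this sketch assumes one LoRA pair per layer, as the arithmetic in the bullet above implies:

```python
d_model, r, n_layers = 4096, 8, 32

per_adapter = 2 * d_model * r      # A (r x d_model) + B (d_model x r)
per_layer = 2 * per_adapter        # shared + task-specific adapter
total = n_layers * per_layer

assert per_layer == 131_072        # ~131k parameters per layer
assert total == 4_194_304          # ~4.2M across 32 layers
```

Adapting more weight matrices per layer (e.g., both attention and FFN projections) would scale these counts proportionally.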

Alternatives include:

  • Toolkit-based MoE with adapters or FFN experts: Multiple experts with various router choices; generalizes to domain mixtures, but lacks explicit continual learning protocols or adversarial routing (Lee et al., 2024).
  • SVD-based orthogonal MoE adapters (MoORE): Mixture of orthogonal rank-one experts per SVD direction, with sample- and task-conditioned router, focused on conflict- and oblivion-resistance by orthogonality and per-sample gating (Yuan et al., 17 Jun 2025).
  • Non-adversarial parameter isolation (e.g., O-LoRA, MoCL): Only task-specific adapters, no transfer via shared mechanism, higher risk of knowledge ossification.

A plausible implication is that MoE-Adapter4CL, as formalized in MoE-CL, synthesizes benefits of modular expert composition (for robustness and flexibility), parameter isolation (to mitigate forgetting), and adversarially regulated knowledge transfer (to optimize generalization and continual adaptation), distinguishing itself among continual learning frameworks for LLMs.
