
MoE-Adapter4CL: Adapter-Based Continual Learning

Updated 29 December 2025
  • MoE-Adapter4CL is an adapter-based framework that integrates task-specific and shared LoRA adapters with a gating network for continual and multi-task learning.
  • It employs adversarial regularization to ensure task-agnostic shared pathways, effectively mitigating catastrophic forgetting while promoting knowledge transfer.
  • Empirical results highlight improved accuracy and reduced operational costs, validating its scalability and efficiency in diverse industrial applications.

Mixture-of-Experts Adapters for Continual Learning (MoE-Adapter4CL) refers to the integration of Mixture-of-Experts (MoE) architectures—particularly those built from adapter modules (notably LoRA adapters)—within the context of continual learning (CL) and multi-task adaptation in LLMs. Although the precise term "MoE-Adapter4CL" does not appear as a canonical system in the current literature, this designation accurately summarizes a family of recent approaches that meld LoRA-style lightweight adaptation, modular expert routing, and continual learning via dynamic expert composition and gating. The primary objectives are mitigating catastrophic forgetting, enabling efficient knowledge transfer across tasks, and scaling to diverse application domains with minimized memory and compute overhead.

1. Architectural Principle: Adapter-Based MoE for Continual and Multi-Task Learning

Adapter-based MoE architectures reconfigure each transformer layer's feed-forward network (FFN) or select sub-components, replacing or augmenting them with a set of expert modules—each an adapter trained for a specific domain, task, or data regime. These experts are typically LoRA adapters: low-rank parameter-efficient modules that alter the frozen base model's subspace via incremental, trainable matrices. In MoE-CL designs, two forms of adapters are instantiated:

  • Task-specific LoRA experts ($\theta_t$): One instantiated per task or domain, trained and then frozen to preserve task-unique information and guarantee zero forgetting on previously seen tasks.
  • Shared LoRA expert ($\theta_s$): Trained across all tasks, intended to facilitate generic representation learning and forward transfer.

A lightweight gating network at each adapter composition point dynamically computes mixture coefficients $\beta_s, \beta_t$ (with $\beta_s + \beta_t = 1$) for combining the hidden-state updates from the shared and task-specific experts. The resulting output at each block is $z_{i+1} = \beta_s z_s + \beta_t z_t$, where $z_s$ and $z_t$ are the outputs of the shared and current task-specific adapters, respectively (Kang et al., 14 Sep 2025).

2. Mathematical Formalism and Training Objective

For an input hidden state $z_i$, the adapterized update is defined by:

$$z_s = \text{LoRA}(z_i; \theta_s), \quad z_t = \text{LoRA}(z_i; \theta_t), \quad (\beta_s, \beta_t) = G(z_i), \quad z_{i+1} = \beta_s z_s + \beta_t z_t$$

where $G$ is the gating network. LoRA applies the parameter-efficient low-rank modification using per-adapter matrices $A$, $B$:

$$\Delta W = \frac{\alpha}{r} B A, \quad W_{\text{adapted}} = W + \Delta W$$

with only $A$ and $B$ trainable and $W$ frozen.
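
The update above can be sketched in a few lines of numpy; the dimensions, random initializations, and gating weights below are illustrative assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, alpha = 16, 4, 8          # toy dimensions; real models use e.g. d=4096, r=8

def lora(z, A, B):
    # Low-rank update z @ (BA)^T scaled by alpha/r; the frozen base weight W is omitted
    return (alpha / r) * z @ (B @ A).T

def gate(z, W_g):
    # Two-way softmax gate G(z_i) -> (beta_s, beta_t)
    logits = z @ W_g.T
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Shared (theta_s) and task-specific (theta_t) adapters, plus gating weights
A_s, B_s = rng.normal(size=(r, d)), rng.normal(size=(d, r))
A_t, B_t = rng.normal(size=(r, d)), rng.normal(size=(d, r))
W_g = rng.normal(size=(2, d))

z_i = rng.normal(size=(1, d))
beta = gate(z_i, W_g)                             # mixture coefficients, sum to 1
z_s, z_t = lora(z_i, A_s, B_s), lora(z_i, A_t, B_t)
z_next = beta[:, 0:1] * z_s + beta[:, 1:2] * z_t  # z_{i+1} = beta_s z_s + beta_t z_t
```

In a full model this composition is repeated at every adapter insertion point, with $W$ left untouched throughout.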

A critical innovation for continual learning is adversarial regularization. The output of the shared expert $z_s$ is fed to a task discriminator $D_\phi$, trained to predict the current task label:

$$\hat{l} = \text{softmax}(D_\phi(z_s)), \quad L_{\text{disc}} = -\sum_k \mathbf{1}\{k = t\} \log \hat{l}_k$$

The shared expert is adversarially trained to minimize the ability of $D_\phi$ to recover the task identity from $z_s$, enforcing that $z_s$ remains largely task-agnostic. The overall training loss is:

$$L = L_{\text{SFT}} - \alpha \cdot L_{\text{disc}}$$

where $L_{\text{SFT}}$ is the standard instruction-tuning loss and $\alpha$ balances adversarial strength (Kang et al., 14 Sep 2025).
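
A minimal numeric sketch of this objective follows; the discriminator logits, SFT loss value, and $\alpha$ are placeholder assumptions chosen purely for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def disc_loss(logits, task_id):
    # Cross-entropy of the task discriminator D_phi on the shared-expert output z_s
    return -np.log(softmax(logits)[task_id])

logits = np.array([2.0, 0.5, -1.0])   # hypothetical D_phi(z_s) scores over 3 seen tasks
L_sft = 1.7                           # placeholder instruction-tuning loss
alpha = 0.1                           # adversarial strength

L_disc = disc_loss(logits, task_id=1)
L_total = L_sft - alpha * L_disc      # minimizing L rewards a confused discriminator
```

Because $L_{\text{disc}}$ enters with a negative sign, gradient descent on $L$ pushes the shared expert to *raise* the discriminator's loss, i.e. to scrub task-identifying signal from $z_s$.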

3. Routing Mechanisms and Adapter Mixture Variants

Mixture composition can be controlled in several ways:

  • Fixed ("gate-free") routing: Equal weighting of all experts; practical when using very few, large adapters.
  • Noisy or random top-$K$ routing: Linear gating scores with additive white Gaussian noise, followed by top-$K$ expert selection and uniform weighting among the selected experts; avoids costly routing models.
  • Learned ("router-trained") routing: A small learnable MLP or linear layer generates gating logits, mixed via softmax or top-$K$ softmax with temperature scaling; enables data-driven expert selection but increases tuning complexity.
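
The noisy top-$K$ variant from the list above can be sketched as follows (the function name and noise scale are illustrative assumptions):

```python
import numpy as np

def noisy_topk_route(scores, k, noise_std=1.0, rng=None):
    # Perturb linear gating scores with white Gaussian noise, select the top-k
    # experts, and weight the selected experts uniformly (no learned router needed)
    rng = rng or np.random.default_rng(0)
    noisy = scores + rng.normal(scale=noise_std, size=scores.shape)
    selected = np.argsort(noisy)[-k:]
    weights = np.zeros_like(scores)
    weights[selected] = 1.0 / k
    return selected, weights

selected, weights = noisy_topk_route(np.array([0.2, 1.5, -0.3, 0.8]), k=2)
```

The noise acts as cheap exploration, so experts with slightly lower scores are still occasionally activated during training.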

MoE-CL specifically uses the two-expert gating mechanism outlined above, with a trainable, shallow gating network and adversarial regularization on the shared pathway. The dual-adapter per-layer architecture does not preclude extension to multi-way mixtures, but practical empirical protocols restrict $E = 2$ for memory and compute reasons (Lee et al., 2024, Kang et al., 14 Sep 2025).

4. Algorithmic Protocol for Continual Adaptation

At every new task $t$, the protocol is:

  • Instantiate a new task-specific LoRA adapter $\theta_t$; train it only on task data and freeze it upon task completion.
  • Continue to fine-tune the shared expert $\theta_s$ using both $L_{\text{SFT}}$ and the adversarial $L_{\text{disc}}$, promoting cross-task knowledge and robustness.
  • All prior adapters remain dormant but accessible, and only the shared plus current adapters are active at any point, keeping inference/memory overhead minimal.
  • The discriminator is reset per task and only trained on the current task's data.

No replay buffer or data augmentation is required; catastrophic forgetting is controlled by parameter isolation via the frozen adapters, while transfer and continual generalization stem from the shared expert and adversarial filtering (Kang et al., 14 Sep 2025).
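
The per-task bookkeeping implied by this protocol can be sketched as below; the class and field names are illustrative, and adapter parameters are reduced to string labels:

```python
class MoECLState:
    """Minimal bookkeeping sketch for the per-task MoE-CL protocol (hypothetical names)."""

    def __init__(self):
        self.task_adapters = {}       # task id -> {"params": ..., "frozen": bool}
        self.shared = "theta_s"       # shared expert, fine-tuned across all tasks

    def begin_task(self, t):
        # Instantiate a new task-specific LoRA expert; the discriminator is reset per task
        self.task_adapters[t] = {"params": f"theta_{t}", "frozen": False}
        self.discriminator = f"D_phi(task={t})"

    def end_task(self, t):
        # Parameter isolation: freezing the task expert guarantees zero forgetting
        self.task_adapters[t]["frozen"] = True

    def active_adapters(self, t):
        # Only the shared and current task adapters are evaluated at any point
        return [self.shared, self.task_adapters[t]["params"]]

state = MoECLState()
state.begin_task(1); state.end_task(1)
state.begin_task(2)
```

Note that `active_adapters` always returns exactly two entries, which is why inference cost stays flat as the task count grows.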

5. Empirical Evaluation in Industrial and Research Contexts

MoE-Adapter4CL frameworks (as instantiated in MoE-CL) yield state-of-the-art performance on both public and industrial continual instruction-tuning benchmarks:

  • On MTL5 (four classification tasks), MoE-CL achieves an average accuracy of $80.5\% \pm 1.5$ (Llama-2 base), outperforming MoCL, O-LoRA, and per-task fine-tuning baselines. Backward-transfer (BwT) and forward-transfer (FwT) metrics demonstrate improved retention and future transfer (Kang et al., 14 Sep 2025).
  • On Tencent3 (three binary content-compliance tasks), MoE-CL achieves $63.42\% \pm 0.74$, surpassing all tested alternatives with the lowest forgetting and highest positive transfer.
  • Ablation experiments show that omitting the adversarial (GAN-style) component reduces final accuracy by 1–2 points and weakens the forgetting and transfer metrics.

Industrial deployment on Tencent Video yielded a $15.3\%$ reduction in manual review costs and a $15.3$ percentage point increase in the content-stripping rate, confirming large-scale practicality (Kang et al., 14 Sep 2025).

6. Implementation Properties, Overhead, and Scalability

Each new task adds $2$ LoRA adapters per transformer layer: one task-specific (frozen post-training), one shared (continually updated). For model dimension $d_{\text{model}}$ and LoRA rank $r$ (e.g., $r = 8$, $d_{\text{model}} = 4096$), this results in:

  • $\sim$131k parameters/layer ($\sim$4.2M for 32 layers) per task for storage.
  • Inference latency ($6.3$ ms/sample for MoE-CL) is modestly increased (by $\sim$40–50%) due to double adapter evaluation and gating.
  • GPU memory remains constant during inference, since only the shared and current task adapters are loaded; prior adapters are dormant (Kang et al., 14 Sep 2025).
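
The per-layer storage figure can be reproduced by counting the $A$ ($r \times d_{\text{model}}$) and $B$ ($d_{\text{model}} \times r$) matrices of both adapters; this sketch assumes one LoRA pair per layer, as the arithmetic in the bullet above implies:

```python
d_model, r, n_layers = 4096, 8, 32

per_adapter = 2 * d_model * r      # A (r x d_model) + B (d_model x r)
per_layer = 2 * per_adapter        # shared + task-specific adapter
total = n_layers * per_layer

assert per_layer == 131_072        # ~131k parameters per layer
assert total == 4_194_304          # ~4.2M across 32 layers
```

Adapting more weight matrices per layer (e.g., both attention and FFN projections) would scale these counts proportionally.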

Alternatives include:

  • Toolkit-based MoE with adapters or FFN experts: Multiple experts with various router choices; generalizes to domain mixtures, but lacks explicit continual learning protocols or adversarial routing (Lee et al., 2024).
  • SVD-based orthogonal MoE adapters (MoORE): Mixture of orthogonal rank-one experts per SVD direction, with sample- and task-conditioned router, focused on conflict- and oblivion-resistance by orthogonality and per-sample gating (Yuan et al., 17 Jun 2025).
  • Non-adversarial parameter isolation (e.g., O-LoRA, MoCL): Only task-specific adapters, no transfer via shared mechanism, higher risk of knowledge ossification.

A plausible implication is that MoE-Adapter4CL, as formalized in MoE-CL, synthesizes benefits of modular expert composition (for robustness and flexibility), parameter isolation (to mitigate forgetting), and adversarially regulated knowledge transfer (to optimize generalization and continual adaptation), distinguishing itself among continual learning frameworks for LLMs.
