Self-Expanding Adapters
- Self-Expanding Adapters (SEMA) are a dynamic mechanism that augments continual learning in Transformers by selectively expanding capacity with modular adapters and frozen descriptors.
- They integrate lightweight functional adapters with autoencoder-based descriptors that detect data distribution shifts, triggering minimal and efficient expansion.
- A learned routing network composes outputs from existing and new adapters, balancing stability and plasticity while achieving sub-linear parameter growth and mitigating forgetting.
Self-Expanding Adapters (SEMA) constitute a mechanism for dynamically augmenting the adaptation capacity of pre-trained Transformer models (“PTMs”) in continual learning (CL) scenarios. SEMA enables selective, data-driven growth by equipping each Transformer layer with modular adapters and descriptors, coupled with a routing infrastructure to compose their outputs. This strategy addresses the stability–plasticity dilemma endemic to continual learning by freezing prior modules and instantiating new ones only when significant shifts in data distribution are detected, leading to sub-linear parameter growth and improved knowledge reuse across tasks (Wang et al., 2024).
1. Motivation and Conceptual Foundations
Continual learning with PTMs targets accumulation of knowledge from a non-stationary task sequence while obviating catastrophic forgetting. Pre-trained Transformers, when naively fine-tuned per task, risk parameter interference and eventual capacity exhaustion. Existing adapter or prompt-based PTM approaches typically fix a priori either a budget or a schedule for parameter expansion, which leads to a trade-off: constrained capacity impairs adaptation, whereas routine addition of task-specific modules incurs linear growth and poor knowledge reuse.
SEMA introduces a self-expansion paradigm, in which the architecture detects, per layer, whether its existing representations suffice for the incoming data. New adapters are introduced only if the current modules' descriptors signal that the batch is “out-of-distribution.” This enables a principled reconciliation of the stability–plasticity trade-off, since older modules are retained (guaranteeing retention of previous knowledge), and only the minimum sufficient capacity is added.
2. Modular Adapter and Descriptor Architecture
At each layer $l$, adapters are organized as pairs $(A_l^{(k)}, D_l^{(k)})$, $k = 1, \dots, K_l$:
- Functional Adapter ($A_l^{(k)}$): Implements a lightweight adaptation via a bottleneck design inspired by AdaptFormer. Given an input $x \in \mathbb{R}^{d}$,
  $$A_l^{(k)}(x) = W_{\text{up}}\,\mathrm{ReLU}(W_{\text{down}}\, x),$$
  with $W_{\text{down}} \in \mathbb{R}^{r \times d}$, $W_{\text{up}} \in \mathbb{R}^{d \times r}$, and bottleneck dimension $r \ll d$. Only these matrices are trained per new adapter.
- Representation Descriptor ($D_l^{(k)}$): Each adapter is equipped with an autoencoder acting as a distribution-shift detector. The autoencoder is trained by minimizing the reconstruction loss
  $$\mathcal{L}_{\text{rec}}(x) = \lVert x - \mathrm{dec}(\mathrm{enc}(x)) \rVert_2^2.$$
  After training, the descriptor is frozen; the batch-wise mean $\mu$ and standard deviation $\sigma$ of its reconstruction error are maintained to calculate standardized scores
  $$z = \frac{\mathcal{L}_{\text{rec}}(x) - \mu}{\sigma}.$$
  Expansion at layer $l$ is triggered if all descriptors report $z > \tau$, where $\tau$ is a preset threshold.
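A minimal PyTorch-style sketch of one adapter–descriptor pair and the resulting expansion test is given below. It assumes a ReLU bottleneck, a one-layer autoencoder, and a scalar threshold; the class names, bottleneck rank, latent size, scaling factor, and default threshold are illustrative choices, not values from the reference implementation.

```python
# Sketch of a SEMA-style adapter-descriptor pair (illustrative hyperparameters).
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """AdaptFormer-style functional adapter: down-project, ReLU, up-project, scale."""

    def __init__(self, dim: int, rank: int = 64, scale: float = 0.1):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)   # W_down
        self.up = nn.Linear(rank, dim, bias=False)     # W_up
        self.scale = scale                             # assumed scaling factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.scale * self.up(torch.relu(self.down(x)))


class Descriptor(nn.Module):
    """Autoencoder distribution-shift detector over layer inputs."""

    def __init__(self, dim: int, latent: int = 32):
        super().__init__()
        self.enc = nn.Linear(dim, latent)
        self.dec = nn.Linear(latent, dim)
        # Statistics of the reconstruction error, fixed after descriptor training.
        self.register_buffer("mu", torch.zeros(1))
        self.register_buffer("sigma", torch.ones(1))

    def recon_error(self, x: torch.Tensor) -> torch.Tensor:
        return ((x - self.dec(torch.relu(self.enc(x)))) ** 2).mean(dim=-1)

    def z_score(self, x: torch.Tensor) -> torch.Tensor:
        # Standardized batch-level reconstruction error.
        return (self.recon_error(x).mean() - self.mu) / (self.sigma + 1e-8)


def expansion_triggered(descriptors, x, tau: float = 2.0) -> bool:
    """Expand only if *all* existing descriptors flag the batch as out-of-distribution."""
    return all(d.z_score(x).item() > tau for d in descriptors)
```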
3. Expandable Router Mechanism
Multiple adapters at a given layer are composed via a learned, expandable router:
- Routing Network: Computes routing logits from the layer input and normalizes them via softmax, producing mixture weights for each of the $K$ adapters:
  $$\alpha = \mathrm{softmax}(W_r^{\top} x), \qquad W_r \in \mathbb{R}^{d \times K},$$
  where each adapter corresponds to one column of $W_r$. Model output for the next layer:
  $$x_{l+1} = f_l(x_l) + \sum_{k=1}^{K} \alpha_k\, A_l^{(k)}(x_l),$$
  where $f_l$ denotes the frozen pre-trained sub-layer.
- Expansion Protocol: When a new adapter is added, $W_r$ gains a new trainable column; prior columns are frozen, ensuring stable routing for past tasks.
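Below is a matching sketch of the expandable soft router; the pooling of the routing input and the exact shape of the routing weights are assumptions, and the column-freezing logic mirrors the expansion protocol described above.

```python
# Sketch of an expandable soft router (assumed shapes and pooling).
import torch
import torch.nn as nn


class ExpandableRouter(nn.Module):
    """Softmax mixture over a growing set of adapters; earlier routing columns stay frozen."""

    def __init__(self, dim: int):
        super().__init__()
        self.dim = dim
        self.columns = nn.ParameterList()  # one routing vector (column of W_r) per adapter

    def add_adapter_column(self) -> None:
        # Freeze all previous routing columns, then append a new trainable one.
        for p in self.columns:
            p.requires_grad_(False)
        self.columns.append(nn.Parameter(torch.zeros(self.dim)))

    def forward(self, x: torch.Tensor, adapter_outputs: list) -> torch.Tensor:
        # x: (batch, tokens, dim); adapter_outputs: K tensors of the same shape.
        W = torch.stack(list(self.columns))              # (K, dim)
        logits = x.mean(dim=1) @ W.t()                   # (batch, K), mean-pooled routing input
        alphas = torch.softmax(logits, dim=-1)           # mixture weights per sample
        stacked = torch.stack(adapter_outputs, dim=-1)   # (batch, tokens, dim, K)
        return (stacked * alphas[:, None, None, :]).sum(dim=-1)
```

The mixed adapter output is added to the frozen layer’s output, reproducing the composition rule above.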
4. Self-Expansion Algorithm
The algorithm proceeds as follows:
- Initialization ($t = 1$): One adapter–descriptor pair per layer is jointly trained for the first task.
- Subsequent Tasks ($t > 1$):
- Freeze all current adapters, descriptors, and routing parameters.
- For selected layers and mini-batches, compute descriptor $z$-scores.
- If expansion is triggered at a layer, instantiate a new adapter–descriptor pair and enrich the router.
- Train only the new adapter, new router column, and new descriptor for a fixed number of epochs.
- Newly added modules are then frozen; the process repeats for deeper layers.
- At task end, (re)train classification head.
Pseudocode and implementation details are available as Algorithm 1 in the referenced work.
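For orientation, the outline below paraphrases these steps as a single per-task routine; it is not the paper’s Algorithm 1, and the layer hooks (`descriptors`, `freeze`, `add_pair`, `router.add_adapter_column`) and training callables are hypothetical placeholders.

```python
# Illustrative per-task self-expansion routine (placeholder hooks, not Algorithm 1).
def sema_task_step(layers, layer_inputs, train_new_modules, train_head, tau=2.0):
    """layers: adapted Transformer layers, each assumed to expose .descriptors,
    .freeze(), .add_pair(), and .router; layer_inputs: per-layer feature batches
    drawn from the current task."""
    for layer in layers:
        layer.freeze()                            # freeze existing adapters, descriptors, router
    for layer, feats in zip(layers, layer_inputs):
        # Expand only if every existing descriptor flags the batch as out-of-distribution.
        if all(d.z_score(feats).item() > tau for d in layer.descriptors):
            layer.add_pair()                      # new adapter-descriptor pair
            layer.router.add_adapter_column()     # new trainable routing column
    train_new_modules()                           # train only the newly added modules
    train_head()                                  # (re)train the classification head at task end
```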
5. Stability–Plasticity Balance and Growth Properties
SEMA’s module freezing enforces strict stability: parameters encoding prior knowledge are left unmodified, preventing forgetting. Plasticity is delivered by inserting new, compact adapters only at statistically validated distribution shifts, as detected by the frozen descriptors. This yields sub-linear parameter growth, since many tasks reuse existing adapters when their distributions do not cross the expansion threshold. Empirical results show that on ImageNet-A and VTAB fewer than one adapter is added per task on average, in contrast to expansion-by-task baselines whose parameter count grows linearly (Wang et al., 2024).
Ablation studies indicate that disabling expansion (“No Expansion”) significantly degrades performance, and naive per-task expansion leads to parameter inefficiency with only marginal gains in accuracy.
6. Experimental Performance Across Benchmarks
SEMA’s effectiveness has been validated on multiple datasets such as CIFAR-100, ImageNet-R, ImageNet-A, and VTAB, following a class-incremental protocol (10 classes per task, 20 tasks). The backbone is ViT-B/16 pre-trained on IN1K, and no rehearsal buffer is used.
Notable results:
- Average accuracy: SEMA outperforms competitive baselines; for example, 53.32% (SEMA) vs. 49.24% (best baseline) on ImageNet-A, and 89.64% vs. 83.61% on VTAB.
- Forgetting: Incremental accuracy decay curves are gentler, indicating reduced forgetting.
- Ablations:
- “Soft” router mixing outperforms average, random, or top-1 strategies.
- Choice between AdaptFormer and ConvPass adapters has minimal effect under SEMA.
- The expansion threshold $\tau$ affords stable accuracy with a smooth trade-off against parameter count.
- More extensive multi-layer expansion (last 4 blocks) yields only modest additional gains.
7. Complexity Analysis and Resource Efficiency
SEMA exhibits markedly lower parameter overhead compared to expansion-by-task or prompt-based rehearsal-free methods. For instance:
- ImageNet-A (20 tasks, cumulative): SEMA adds ≈0.56 M parameters, compared to ≈1.90 M for expansion-by-task and ≈3.9 M for DualPrompt/CODA-P.
- Runtime and Memory: During inference, descriptors are discarded; only adapters and routers are active. SEMA features per-image latency of 4.48 ms (CIFAR-100) to 9.01 ms (ImageNet-A), roughly 2–3× faster than prompting methods, since multiple full PTM passes are unnecessary. Only the compact adapter matrices $W_{\text{down}}$, $W_{\text{up}}$ and router columns accumulate in memory.
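A back-of-the-envelope count illustrates why cumulative overhead stays small. The hidden size matches ViT-B/16 ($d = 768$); the bottleneck rank and descriptor latent size are illustrative assumptions rather than reported values.

```python
# Parameter cost of a single SEMA expansion step (illustrative sizes).
d, r, latent = 768, 64, 32                 # ViT-B/16 hidden size; assumed rank and latent dim

adapter_params = 2 * d * r                 # W_down (r x d) + W_up (d x r)
router_column = d                          # one new routing column per added adapter
descriptor_params = 2 * d * latent         # autoencoder encoder + decoder (dropped at inference)

print(f"kept at inference per expansion: {adapter_params + router_column:,} params")
print(f"during training per expansion:   {adapter_params + router_column + descriptor_params:,} params")
```

Under these illustrative sizes, roughly 0.1 M parameters are retained per expansion, so the reported ≈0.56 M cumulative overhead on ImageNet-A would correspond to only a handful of expansions across 20 tasks.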
These properties position SEMA as an efficient, architecture-conscious approach for PTM-based continual learning, achieving state-of-the-art performance without rehearsal and with minimal expansion (Wang et al., 2024).