Catastrophic forgetting is a major challenge in continual learning (CL): neural networks struggle to learn new tasks sequentially without losing performance on previously learned ones. This is particularly problematic with large pre-trained models such as Transformers, which are expensive to train from scratch and require significant data. Existing CL methods, such as replay-based (memory-intensive), regularization-based (hard to tune), and parameter-isolation approaches (unbounded parameter growth), often face practical limitations in memory, computation, or performance scaling.
The paper validates the Adaptive Distillation of Adapters (ADA) approach, originally developed for text classification (Ermis et al., 2022), for continual image classification tasks using pre-trained Vision Transformers (ViT) (Dosovitskiy et al., 2020) and Data-Efficient Image Transformers (DeiT) (Touvron et al., 2020). The core idea of ADA is to enable continual learning by adding and managing a small, constant number of task-specific parameters (Adapters) while keeping the large pre-trained base model frozen. This avoids the need to store data from old tasks or retrain the entire model, addressing key practical constraints.
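As a small illustration of this recipe (a sketch only; `trainable_parameters` is a placeholder name, and the commented-out timm call is simply one convenient way to obtain a pre-trained ViT-B backbone), the optimizer would only ever see the adapter and head parameters:

```python
import torch.nn as nn


def trainable_parameters(backbone: nn.Module, task_modules: nn.ModuleList):
    """Freeze every pre-trained backbone weight and return only the small
    task-specific parameters (adapter + head) for the optimizer."""
    for p in backbone.parameters():
        p.requires_grad = False
    return [p for m in task_modules for p in m.parameters()]


# Illustrative usage (module names are placeholders):
#   backbone = timm.create_model("vit_base_patch16_224", pretrained=True)
#   params = trainable_parameters(backbone, nn.ModuleList([adapter, head]))
#   optimizer = torch.optim.Adam(params, lr=1e-4)
```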
The ADA algorithm operates in two main steps for each new task t (a code sketch of the key components follows this list):
- Train a new Adapter and Head: A new, small adapter A_t and a task-specific classification head h_t are initialized and trained on the data for the current task t, while the parameters of the pre-trained base Vision Transformer remain frozen. An adapter is implemented as a 2-layer feed-forward network inserted into each transformer layer, adding only a small number of parameters relative to the base model (e.g., ~1.8M parameters for a ViT-B adapter, compared to 86M for the base).
- Consolidate Adapters (if the pool is full): ADA maintains a fixed-size pool of at most K adapters. If the pool is not yet full (fewer than K adapters), the newly trained adapter A_t is simply added to it. If the pool is full (K adapters), a selection process occurs:
  - Select Adapter for Consolidation: Transferability estimation methods, specifically Log Expected Empirical Prediction (LEEP) (Nguyen et al., 2020) or TransRate (Huang et al., 2021), measure the similarity between the new task t and the tasks represented by the adapters currently in the pool. The pool adapter corresponding to the task with the highest estimated transferability to task t is selected for consolidation.
  - Distillation: The selected old adapter A_s and the newly trained adapter A_t are consolidated into a single new adapter A_c through distillation. A_c is trained on unlabeled distillation data to minimize the difference between its outputs and the combined outputs (logits) of the frozen old and new models, transferring knowledge from both the old and new tasks into the consolidated adapter. A_c then replaces A_s in the pool.
- Mapping: A mapping is maintained to link each learned task to the specific adapter in the pool and its corresponding head required for inference.
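For concreteness, the sketch below illustrates three of the components just described: the bottleneck adapter block, a LEEP-style transferability score used to pick which pool adapter to consolidate, and a simple distillation loss against the combined old/new logits. The sizes, the names (`Adapter`, `leep_score`, `distillation_loss`), and the MSE-on-concatenated-logits choice are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Adapter(nn.Module):
    """Bottleneck adapter: a 2-layer feed-forward block with a residual
    connection, inserted into each transformer layer. With ViT-B's hidden
    size of 768 and a bottleneck of ~96 units, 12 layers come to roughly
    1.8M parameters (the exact width used in the paper may differ)."""

    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 96):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(F.gelu(self.down(x)))  # residual keeps the frozen path intact


def leep_score(pseudo_probs: torch.Tensor, target_labels: torch.Tensor,
               num_target_classes: int) -> torch.Tensor:
    """LEEP transferability estimate (Nguyen et al., 2020): how well a source
    model's soft predictions over its own classes explain the new task's labels.

    pseudo_probs:  (n, C_source) source-model class probabilities on new-task data
    target_labels: (n,) integer labels of the new task
    """
    n, c_src = pseudo_probs.shape
    joint = torch.zeros(num_target_classes, c_src)
    joint.index_add_(0, target_labels, pseudo_probs)   # empirical joint P(y, z)
    joint /= n
    cond = joint / joint.sum(dim=0, keepdim=True).clamp_min(1e-12)  # P(y | z)
    eep = (pseudo_probs * cond[target_labels]).sum(dim=1)           # sum_z P(y_i | z) * theta_z(x_i)
    return eep.clamp_min(1e-12).log().mean()


def distillation_loss(student_logits: torch.Tensor, old_logits: torch.Tensor,
                      new_logits: torch.Tensor) -> torch.Tensor:
    """Match the consolidated adapter's outputs to the combined logits of the
    frozen old-task and new-task models on unlabeled distillation data
    (MSE on concatenated logits is one simple choice, not necessarily the paper's)."""
    teacher = torch.cat([old_logits, new_logits], dim=-1).detach()
    return F.mse_loss(student_logits, teacher)


if __name__ == "__main__":
    torch.manual_seed(0)
    # Pick the pool adapter whose task transfers best to the new task.
    new_task_labels = torch.randint(0, 5, (256,))
    scores = []
    for _ in range(3):  # one score per adapter currently in the pool
        probs = torch.softmax(torch.randn(256, 5), dim=1)  # stand-in for pool-model predictions
        scores.append(leep_score(probs, new_task_labels, num_target_classes=5))
    print("selected adapter:", int(torch.stack(scores).argmax()))
```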
For inference on any task t, the system uses the pre-trained frozen base model, the specific adapter assigned to task t in the current pool, and the head h_t trained for task t.
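A deliberately simplified sketch of this routing: for brevity the adapter is applied on top of the frozen backbone's output, whereas ADA actually inserts adapters inside every transformer layer; `ADARouter` and its fields are hypothetical names.

```python
from typing import Dict

import torch
import torch.nn as nn


class ADARouter(nn.Module):
    """Hypothetical inference wrapper: route a task id to its pool adapter and
    task head while the pre-trained backbone stays frozen and shared."""

    def __init__(self, backbone: nn.Module, pool: nn.ModuleList,
                 heads: nn.ModuleDict, task_to_adapter: Dict[str, int]):
        super().__init__()
        self.backbone = backbone                 # frozen ViT/DeiT feature extractor
        self.pool = pool                         # at most K consolidated adapters
        self.heads = heads                       # one small classification head per task
        self.task_to_adapter = task_to_adapter   # mapping maintained by ADA

    @torch.no_grad()
    def forward(self, x: torch.Tensor, task_id: str) -> torch.Tensor:
        feats = self.backbone(x)
        adapted = self.pool[self.task_to_adapter[task_id]](feats)
        return self.heads[task_id](adapted)


# Toy usage with stand-in modules (768-dim features, 5 classes per task):
router = ADARouter(
    backbone=nn.Identity(),
    pool=nn.ModuleList([nn.Linear(768, 768) for _ in range(3)]),
    heads=nn.ModuleDict({"task_0": nn.Linear(768, 5)}),
    task_to_adapter={"task_0": 1},
)
print(router(torch.randn(2, 768), "task_0").shape)  # torch.Size([2, 5])
```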
The paper evaluates ADA against several baselines on CIFAR100 and MiniImageNet in two sequential-task scenarios: binary classification and multi-class classification (5 classes per task). Baselines include fine-tuning only the head (B1), fine-tuning the full model (B2), training a separate adapter for each task (Adapters), combining adapters with AdapterFusion (Pfeiffer et al., 2021), Experience Replay (ER), and Elastic Weight Consolidation (EWC). ADA is tested with a fixed pool size K > 1, and also with K = 1 for comparison.
Experimental results show that:
- Simple fine-tuning methods (B1, B2) suffer significantly from catastrophic forgetting.
- Adapters and AdapterFusion effectively prevent forgetting but require storing a number of adapters proportional to the number of tasks, leading to significant memory growth.
- EWC struggles to maintain performance as tasks accumulate, particularly in the multi-class setting.
- ADA with K = 1 performs similarly to ER, indicating that basic distillation alone, or a small replay memory, has limited effectiveness over many tasks.
- ADA-LEEP and ADA-TransRate with a larger pool (K > 1) achieve average accuracy comparable to the memory-intensive Adapters and AdapterFusion baselines, especially on the binary classification tasks. Their performance declines slightly on multi-class tasks as the number of tasks grows, but the number of parameters remains constant.
- Crucially, ADA keeps the total number of model parameters constant (or nearly constant, ignoring the small task heads) after the pool reaches its maximum size K. This provides significant memory efficiency compared to methods that add parameters per task.
Method | Trainable Parameters per Task (after K tasks) | Inference Parameters (after N tasks) | Total Parameters (after N tasks)
---|---|---|---
Fine-tuning (B1, B2), EWC | ~0 (B1); 86M (B2, EWC) | 86M | 86M
Adapters, AdapterFusion | 1.8M | 86M + F * 1.8M (F = 1 for Adapters, F = N for AdapterFusion) | 86M + N * 1.8M
ADA (pool of K adapters) | 1.8M (new) + 1.8M (consolidated) | 86M + 1.8M | 86M + (K + 1) * 1.8M

Note: Table values are approximate for ViT-B and exclude the small per-task head parameters.
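A quick sanity check of the growth behaviour in the table, using the approximate figures above with purely illustrative values N = 20 tasks and K = 3 (these two numbers are assumptions for the arithmetic, not the paper's settings):

```python
BASE = 86e6        # ViT-B backbone parameters (approximate)
ADAPTER = 1.8e6    # one adapter (approximate; small task heads ignored)
N, K = 20, 3       # illustrative task count and pool size, not the paper's settings

adapters_total = BASE + N * ADAPTER        # Adapters / AdapterFusion: grows with N
ada_total = BASE + (K + 1) * ADAPTER       # ADA: constant once the pool is full

print(f"Adapters/AdapterFusion after {N} tasks: {adapters_total / 1e6:.1f}M")  # 122.0M
print(f"ADA (K={K}) after {N} tasks:            {ada_total / 1e6:.1f}M")       # 93.2M
```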
ADA demonstrates that using adapters with pre-trained vision transformers for continual learning is viable. By strategically consolidating adapters based on transferability, ADA balances predictive performance with parameter efficiency, offering a practical solution for scenarios where data arrives sequentially and memory/computation resources are constrained. The approach is shown to work with both backbone architectures tested (ViT and DeiT). Future work could explore adapter architectures tailored for vision and dynamic control of the adapter pool size.