Continual Learning with Transformers for Image Classification
(2206.14085v1)
Published 28 Jun 2022 in cs.LG and cs.CV
Abstract: In many real-world scenarios, data to train machine learning models become available over time. However, neural network models struggle to continually learn new concepts without forgetting what has been learnt in the past. This phenomenon is known as catastrophic forgetting, and it is often difficult to prevent due to practical constraints, such as the amount of data that can be stored or the limited computation resources that can be used. Moreover, training large neural networks, such as Transformers, from scratch is very costly and requires a vast amount of training data, which might not be available in the application domain of interest. A recent trend indicates that dynamic architectures based on an expansion of the parameters can reduce catastrophic forgetting efficiently in continual learning, but they need complex tuning to balance the growing number of parameters and barely share any information across tasks. As a result, they struggle to scale to a large number of tasks without significant overhead. In this paper, we validate in the computer vision domain a recent solution called Adaptive Distillation of Adapters (ADA), which was developed to perform continual learning using pre-trained Transformers and Adapters on text classification tasks. We empirically demonstrate on different classification tasks that this method maintains good predictive performance without retraining the model or increasing the number of model parameters over time. Besides, it is significantly faster at inference time compared to state-of-the-art methods.
Catastrophic forgetting is a major challenge in continual learning (CL), where neural networks struggle to learn new tasks sequentially without losing performance on previously learned ones. This is particularly problematic with large pre-trained models such as Transformers, which are expensive to train from scratch and require significant data. Existing CL methods, such as replay-based methods (memory-intensive), regularization-based methods (hard to tune), and parameter-isolation methods (parameter growth), often face practical limitations regarding memory, computation, or performance scaling.
The paper validates the Adaptive Distillation of Adapters (ADA) approach, originally developed for text classification (Ermis et al., 2022), for continual image classification tasks using pre-trained Vision Transformers (ViT) (Dosovitskiy et al., 2020) and Data-Efficient Image Transformers (DeiT) (Touvron et al., 2020). The core idea of ADA is to enable continual learning by adding and managing a small, constant number of task-specific parameters (Adapters) while keeping the large pre-trained base model frozen. This avoids the need to store data from old tasks or retrain the entire model, addressing key practical constraints.
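To make this setup concrete, below is a minimal PyTorch-style sketch of the two ingredients: a 2-layer bottleneck adapter with a residual connection, and a frozen pre-trained backbone. The module name, the hidden/bottleneck dimensions, and the `freeze_backbone` helper are illustrative assumptions, not code from the paper.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Illustrative 2-layer feed-forward adapter: down-project, nonlinearity,
    up-project, plus a residual connection around the whole block."""
    def __init__(self, hidden_dim: int = 768, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_dim, bottleneck_dim)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual path keeps the frozen pre-trained representation intact.
        return x + self.up(self.act(self.down(x)))

def freeze_backbone(backbone: nn.Module) -> None:
    """Freeze every parameter of the pre-trained ViT/DeiT backbone;
    only adapters and task heads receive gradients."""
    for p in backbone.parameters():
        p.requires_grad = False
```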
The ADA algorithm operates in two main steps for each new task Tn:
Train a new Adapter and Head: A new, small Adapter (Φn) and a task-specific classification head (hn) are initialized and trained using the data for the current task Tn. The parameters of the pre-trained base Vision Transformer (Θ) remain frozen. An Adapter is implemented as a 2-layer feed-forward network inserted into each transformer layer, adding a small number of parameters relative to the base model (e.g., ~1.8M parameters for a ViT-B adapter, compared to 86M for the base).
Consolidate Adapters (if pool is full): ADA maintains a fixed-size pool of K adapters. If this pool is not yet full (n≤K), the newly trained adapter Φn is added to the pool. If the pool is full (n>K), a selection process occurs:
Select Adapter for Consolidation: Transferability estimation methods, specifically Log Expected Empirical Prediction (LEEP) (Nguyen et al., 2020) or TransRate (Huang et al., 2022), are used to measure the similarity between the new task Tn and the tasks represented by the adapters currently in the pool. The adapter f_old,j* in the pool corresponding to the task with the highest estimated transferability to Tn is selected for consolidation.
Distillation: The selected old adapter (f_old,j*) and the newly trained adapter (fn) are consolidated into a new adapter Φc through a distillation process. This involves training Φc using unlabeled distillation data (Ddistill) to minimize the L2 difference between its outputs and the combined outputs (logits) of the frozen old and new models. This distillation step aims to transfer knowledge from both the old and new tasks into the consolidated adapter. The newly trained consolidated adapter replaces f_old,j* in the pool.
Mapping: A mapping is maintained to link each learned task to the specific adapter in the pool and its corresponding head required for inference (see the sketch below).
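The sketch below walks through this per-task procedure (PyTorch-style, reusing the `BottleneckAdapter` sketch above). The helpers `forward_with_adapter` and `transferability_score`, the feature-level distillation loss, and all hyperparameters are simplifying assumptions for illustration: in the paper, adapters are inserted in every transformer layer, selection uses LEEP or TransRate, and distillation matches the combined logits of the old and new models, whereas here a single adapter is applied to pooled backbone features and two L2 terms stand in for the combined target.

```python
import torch
import torch.nn.functional as F

def forward_with_adapter(backbone, adapter, x):
    """Simplified forward pass: frozen backbone features, then one adapter.
    (In ADA an adapter is actually inserted into every transformer layer.)"""
    with torch.no_grad():
        feats = backbone(x)          # e.g. the [CLS] embedding of a frozen ViT
    return adapter(feats)

def transferability_score(backbone, adapter, loader):
    """Placeholder for a transferability estimator such as LEEP or TransRate:
    scores how well the task behind `adapter` transfers to the data in `loader`."""
    raise NotImplementedError

def ada_learn_task(backbone, pool, heads, task_to_adapter, task_id, num_classes,
                   train_loader, distill_loader, K=4, epochs=1, lr=1e-3, dim=768):
    # Step 1: train a fresh adapter and head on the new task; the backbone stays frozen.
    new_adapter, new_head = BottleneckAdapter(dim), torch.nn.Linear(dim, num_classes)
    opt = torch.optim.Adam(list(new_adapter.parameters()) + list(new_head.parameters()), lr=lr)
    for _ in range(epochs):
        for x, y in train_loader:
            loss = F.cross_entropy(new_head(forward_with_adapter(backbone, new_adapter, x)), y)
            opt.zero_grad(); loss.backward(); opt.step()
    heads[task_id] = new_head

    # Step 2a: pool not yet full -> simply keep the new adapter.
    if len(pool) < K:
        pool.append(new_adapter)
        task_to_adapter[task_id] = len(pool) - 1
        return

    # Step 2b: pool full -> pick the most transferable adapter and consolidate into it.
    j_star = max(range(K), key=lambda j: transferability_score(backbone, pool[j], train_loader))

    consolidated = BottleneckAdapter(dim)
    opt_c = torch.optim.Adam(consolidated.parameters(), lr=lr)
    for x, _ in distill_loader:                      # unlabeled distillation data
        with torch.no_grad():
            z_old = forward_with_adapter(backbone, pool[j_star], x)
            z_new = forward_with_adapter(backbone, new_adapter, x)
        z_c = forward_with_adapter(backbone, consolidated, x)
        loss = F.mse_loss(z_c, z_old) + F.mse_loss(z_c, z_new)   # L2 distillation
        opt_c.zero_grad(); loss.backward(); opt_c.step()

    pool[j_star] = consolidated                      # replaces the selected old adapter
    task_to_adapter[task_id] = j_star                # tasks mapped to j_star keep using it
```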
For inference on any task i, the system uses the pre-trained frozen base model, the specific adapter assigned to task i from the current pool, and the head hi trained for task i.
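Continuing the same sketch (with the helpers and data structures assumed above), inference reduces to a lookup of the adapter and head assigned to the requested task:

```python
def ada_predict(backbone, pool, heads, task_to_adapter, task_id, x):
    """Route input x through the frozen backbone, the pool adapter currently
    mapped to task_id, and that task's own classification head."""
    adapter = pool[task_to_adapter[task_id]]
    head = heads[task_id]
    with torch.no_grad():
        return head(forward_with_adapter(backbone, adapter, x))
```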
The paper evaluates ADA against several baselines on the CIFAR100 and MiniImageNet datasets in two sequential-task scenarios: binary classification and multi-class classification (5 classes per task). Baselines include fine-tuning only the head (B1), fine-tuning the full model (B2), training separate adapters for each task (Adapters), combining adapters with AdapterFusion (Pfeiffer et al., 2021), Experience Replay (ER), and Elastic Weight Consolidation (EWC). ADA is tested with a pool size K=4, and also with K=1 for comparison.
Experimental results show that:
Simple fine-tuning methods (B1, B2) suffer significantly from catastrophic forgetting.
Adapters and AdapterFusion effectively prevent forgetting but require storing a number of adapters proportional to the number of tasks, leading to significant memory growth.
EWC struggles to maintain performance as tasks accumulate, particularly in the multi-class setting.
ADA with K=1 performs similarly to ER, indicating that distillation into a single adapter, much like a small replay memory, has limited effectiveness over many tasks.
ADA-LEEP and ADA-TransRate with K=4 achieve average accuracy comparable to the memory-intensive Adapters and AdapterFusion baselines, especially for binary classification tasks. While their performance declines slightly in the multi-class setting as the number of tasks grows, the number of parameters remains constant.
Crucially, ADA keeps the total number of model parameters constant (or nearly constant, ignoring the small task heads) after the pool reaches its maximum size K. This provides significant memory efficiency compared to methods that add parameters per task.
| Method | Trainable parameters per task (after K tasks) | Inference parameters (after N tasks) | Total parameters (after N tasks) |
| --- | --- | --- | --- |
| Fine-tuning (B1, B2), EWC | 0 (B1); 86M (B2, EWC) | 86M | 86M |
| Adapters, AdapterFusion | 1.8M | 86M + F * 1.8M (F = number of fused adapters; F = 1 for plain Adapters) | 86M + N * 1.8M |
| ADA (pool of K adapters) | 1.8M (new) + 1.8M (consolidated) | 86M + 1.8M | 86M + (K + 1) * 1.8M |

Note: Table values are approximate for ViT-B, excluding the small per-task head parameters.
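As a quick back-of-the-envelope reading of the table (the task count N = 20 below is an arbitrary illustration, not an experiment from the paper):

```python
# Approximate totals for ViT-B (~86M backbone, ~1.8M per adapter), ignoring task heads.
N, K = 20, 4                                  # illustrative number of tasks and pool size
backbone, adapter = 86e6, 1.8e6

adapters_total = backbone + N * adapter       # per-task adapters: ~122M, grows with N
ada_total = backbone + (K + 1) * adapter      # ADA: ~95M, constant once the pool is full
print(f"Adapters/AdapterFusion: {adapters_total/1e6:.0f}M vs ADA: {ada_total/1e6:.0f}M")
```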
ADA demonstrates that using adapters with pre-trained vision transformers for continual learning is viable. By strategically consolidating adapters based on transferability, ADA balances predictive performance with parameter efficiency, offering a practical solution for scenarios where data arrives sequentially and memory/computation resources are constrained. The approach is shown to work with different ViT architectures (ViT and DeiT). Future work could explore adapter architectures tailored for vision and dynamic control of the adapter pool size.