
Adapt Your Teacher: Improving Knowledge Distillation for Exemplar-free Continual Learning (2308.09544v3)

Published 18 Aug 2023 in cs.LG, cs.AI, and cs.CV

Abstract: In this work, we investigate exemplar-free class incremental learning (CIL) with knowledge distillation (KD) as a regularization strategy, aiming to prevent forgetting. KD-based methods are successfully used in CIL, but they often struggle to regularize the model without access to exemplars of the training data from previous tasks. Our analysis reveals that this issue originates from substantial representation shifts in the teacher network when dealing with out-of-distribution data. This causes large errors in the KD loss component, leading to performance degradation in CIL models. Inspired by recent test-time adaptation methods, we introduce Teacher Adaptation (TA), a method that concurrently updates the teacher and the main models during incremental training. Our method seamlessly integrates with KD-based CIL approaches and allows for consistent enhancement of their performance across multiple exemplar-free CIL benchmarks. The source code for our method is available at https://github.com/fszatkowski/cl-teacher-adaptation.

Authors (6)
  1. Filip Szatkowski (9 papers)
  2. Mateusz Pyla (3 papers)
  3. Marcin Przewięźlikowski (10 papers)
  4. Sebastian Cygert (18 papers)
  5. Bartłomiej Twardowski (37 papers)
  6. Tomasz Trzciński (116 papers)
Citations (7)

Summary

  • The paper introduces a method that adapts the teacher’s Batch Normalization statistics to the current task, enhancing knowledge transfer in exemplar-free settings.
  • It demonstrates that updating BN statistics mitigates catastrophic forgetting, ensuring better retention of previously learned tasks.
  • It underscores the approach's dependency on BN-enabled architectures, validated through experiments on datasets like CIFAR100 and DomainNet.

This paper, "Adapt Your Teacher: Improving Knowledge Distillation for Exemplar-free Continual Learning" (Szatkowski et al., 2023 ), addresses the challenging problem of catastrophic forgetting in exemplar-free continual learning (CIL). Exemplar-free CIL requires a model to learn a sequence of tasks without storing or revisiting data from previous tasks, making it particularly susceptible to losing performance on previously learned tasks.

The core idea proposed is to improve the effectiveness of knowledge distillation (KD) in this setting by "adapting the teacher" model. In a typical KD setup for CIL, the teacher is the model trained on previous tasks, and the student is the model being trained on the current task. The student is encouraged to mimic the output or representations of the teacher to retain knowledge of past tasks. The challenge is that the teacher model's internal statistics (like batch normalization statistics) and outputs are tuned to the data distribution of previous tasks, which may differ significantly from the current task's data distribution. This mismatch can make the teacher less effective at guiding the student on the new data.
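
For context, the distillation term in such setups is typically a temperature-softened KL divergence between teacher and student logits. Below is a minimal PyTorch-style sketch of that standard loss; the specific loss formulation, temperature value, and T² scaling shown here are common conventions and assumptions for illustration, not details taken from the paper.

```python
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Standard logit-based distillation loss: KL divergence between
    temperature-softened teacher and student output distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=1)
    # The T^2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```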

The paper proposes a specific method for teacher adaptation: updating the running statistics of the Batch Normalization (BN) layers in the teacher model using data from the current task. When the current task's data is passed through the teacher model, the BN layers' running mean and variance are updated with the statistics of the current batch. This process helps align the teacher's internal representations and outputs more closely with the distribution of the data being currently processed, making the knowledge distilled from the teacher more relevant and effective for the student.
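
A minimal sketch of what this adaptation could look like in PyTorch is shown below, assuming a teacher network built with standard BatchNorm layers. The helper name `enable_teacher_adaptation` is hypothetical, and the exact mode-switching in the authors' released code may differ.

```python
import torch.nn as nn

def enable_teacher_adaptation(teacher: nn.Module) -> None:
    """Freeze the teacher's weights but keep its BatchNorm layers in train mode,
    so their running mean/variance are refreshed by current-task batches."""
    teacher.eval()  # default: frozen statistics, no dropout
    for p in teacher.parameters():
        p.requires_grad_(False)  # teacher weights are never updated
    for m in teacher.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
            m.train()  # train mode makes BN update its running statistics
```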

Key aspects and practical implications highlighted:

  1. Improved Knowledge Transfer: By adapting the teacher's BN statistics, the teacher's output becomes better aligned with the current task's data distribution. This facilitates more effective knowledge distillation, helping the student model learn the new task while retaining more information about previous ones, thereby mitigating catastrophic forgetting.
  2. Exemplar-Free Focus: The method is specifically designed for and evaluated in the exemplar-free setting, which is more memory-constrained and difficult than settings allowing exemplars. This makes the technique particularly valuable for resource-limited environments or privacy-sensitive applications where storing old data is not feasible.
  3. Dependence on Batch Normalization: The proposed teacher adaptation method explicitly relies on models that use Batch Normalization layers, as it leverages the running statistics maintained by these layers. It is noted that the method is not applicable to architectures without BN, such as those using Group Normalization, which does not maintain running statistics.
  4. Implementation: In practice, adapting the teacher likely amounts to keeping the teacher's BN layers in "training" (or a dedicated adaptation) mode during training on a new task: the teacher's weights remain frozen, but feeding current-task batches through it updates the BN running mean and variance. This adaptation could happen during a specific warmup phase or throughout the student's training on the current task (see the sketch after this list).
  5. Experimental Validation: The method was evaluated on various datasets including CIFAR100, TinyImageNet200, DomainNet, fine-grained datasets, and corrupted CIFAR100, demonstrating its effectiveness under different data shifts and complexities. Experiments included ablation studies validating the choice of adapting normalization statistics and confirming performance gains across different architectures (likely including ResNets due to BN reliance) and batch sizes.
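
Putting these pieces together, the sketch below shows one hypothetical training step on current-task data with an adapted teacher, reusing the `kd_loss` and `enable_teacher_adaptation` helpers sketched above. The loss weighting `lambda_kd` and the handling of current-task classes are illustrative choices, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def incremental_training_step(student, teacher, images, labels, optimizer,
                              lambda_kd=1.0, temperature=2.0):
    """One training step on current-task data with an adapted teacher.
    Assumes enable_teacher_adaptation(teacher) was called beforehand, so the
    teacher forward pass below also refreshes its BN running statistics."""
    with torch.no_grad():  # no gradients for the teacher; BN stats still update
        teacher_logits = teacher(images)
    student_logits = student(images)

    # Cross-entropy on the new task plus the distillation term; how labels and
    # logits are restricted to current-task classes depends on the CIL method.
    loss = F.cross_entropy(student_logits, labels) \
        + lambda_kd * kd_loss(student_logits, teacher_logits, temperature)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```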

In summary, the paper introduces a practical technique to enhance knowledge distillation for exemplar-free continual learning by adapting the teacher model's batch normalization statistics to the distribution of the current task data. This adaptation helps the teacher provide more relevant guidance to the student, leading to improved performance in mitigating catastrophic forgetting, particularly for models employing Batch Normalization.
