SLCA: Slow Learner with Classifier Alignment for Continual Learning on a Pre-trained Model (2303.05118v4)

Published 9 Mar 2023 in cs.CV, cs.AI, and cs.LG

Abstract: The goal of continual learning is to improve the performance of recognition models in learning sequentially arrived data. Although most existing works are established on the premise of learning from scratch, growing efforts have been devoted to incorporating the benefits of pre-training. However, how to adaptively exploit the pre-trained knowledge for each incremental task while maintaining its generalizability remains an open question. In this work, we present an extensive analysis for continual learning on a pre-trained model (CLPM), and attribute the key challenge to a progressive overfitting problem. Observing that selectively reducing the learning rate can almost resolve this issue in the representation layer, we propose a simple but extremely effective approach named Slow Learner with Classifier Alignment (SLCA), which further improves the classification layer by modeling the class-wise distributions and aligning the classification layers in a post-hoc fashion. Across a variety of scenarios, our proposal provides substantial improvements for CLPM (e.g., up to 49.76%, 50.05%, 44.69% and 40.16% on Split CIFAR-100, Split ImageNet-R, Split CUB-200 and Split Cars-196, respectively), and thus outperforms state-of-the-art approaches by a large margin. Based on such a strong baseline, critical factors and promising directions are analyzed in-depth to facilitate subsequent research. Code has been made available at: https://github.com/GengDavid/SLCA.


Summary

  • The paper introduces SLCA as a two-stage method that decouples slow backbone adaptation from post-hoc classifier realignment to mitigate catastrophic forgetting.
  • It employs differential learning rates—using a slow update for the backbone and a faster rate for the classifier—to preserve general representations while adapting to new tasks.
  • Post-hoc classifier alignment using synthetic feature sampling further corrects classifier bias; combined with the slow learner, SLCA gains up to roughly 50 percentage points over standard sequential fine-tuning.

The paper introduces SLCA (Slow Learner with Classifier Alignment), a minimalist yet highly effective recipe for class-incremental learning when starting from a large pre-trained vision model.


1. Core Problem

In class-incremental continual learning you want to:

  1. Adapt the pre-trained representation to new tasks (plasticity).
  2. Retain the generic knowledge that future tasks will need (stability).
  3. Balance predictions across the ever-growing label set.

Conventional sequential fine-tuning (the same learning rate for all layers) fails mainly because:

  • Progressive over-fitting: the representation drifts toward the current task and loses generality.
  • Mis-calibrated classifier: the last fully-connected layer is trained on an imbalanced stream and ends up biased toward recent classes.

Prompt-based methods (L2P, DualPrompt) avoid the drift by freezing the backbone, but they sacrifice adaptability and still need custom architectural additions.


2. Proposed Solution: SLCA

SLCA has two completely decoupled stages that can be added to any fine-tuning baseline.

2.1 Slow Learner (SL)

Goal: keep the representation useful for future tasks while still letting it adapt.

Trick: use a much smaller learning rate for the backbone than for the classifier.

lr_backbone   = 1e-4   # 50x-100x smaller than a typical fine-tuning LR
lr_classifier = 1e-2
optimizer = torch.optim.SGD([           # the paper uses SGD; prompt-based methods use Adam
    {'params': backbone.parameters(), 'lr': lr_backbone},
    {'params': head.parameters(),     'lr': lr_classifier},
])

You apply these two LRs during the standard training loop for every task. No extra parameters, no replay buffer needed.

Why it works:

  • Small updates ≈ regularisation that discourages catastrophic drift.
  • Classifier still learns fast enough to fit the current task.

2.2 Classifier Alignment (CA)

Even with SL, the FC layer is biased toward the last tasks. CA is a post-hoc correction run only after the final task has been learned and doesn’t touch the backbone.

Step-by-step:

  1. While training task t, store the mean (μᶜ) and covariance (Σᶜ) of embeddings for each new class c ∈ C_t.

# after the forward pass for task t
feats = backbone(x)                  # [N, D] feature embeddings
cls_stats[class_id].update(feats)    # accumulate running mean / variance per class

In practice saving μᶜ ∈ ℝᴰ and diagonal var σ²ᶜ is enough (∼0.2 % of ViT-B parameters for 100 classes).
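
For concreteness, a per-class accumulator compatible with the cls_stats[class_id].update(feats) / update_stats(stats, feats, y) calls used in the snippets here is sketched below; the class name and storage layout are illustrative assumptions, not the released implementation.

import torch

class ClassStats:
    """Running mean and diagonal variance of D-dimensional features for one class."""
    def __init__(self, dim):
        self.n = 0
        self.sum = torch.zeros(dim)
        self.sq_sum = torch.zeros(dim)

    def update(self, feats):                     # feats: [N, D]
        feats = feats.detach().float().cpu()
        self.n += feats.shape[0]
        self.sum += feats.sum(dim=0)
        self.sq_sum += feats.pow(2).sum(dim=0)

    def finalize(self):                          # -> (mean, diagonal variance)
        mu = self.sum / self.n
        var = self.sq_sum / self.n - mu.pow(2)   # E[x^2] - E[x]^2
        return mu, var

def update_stats(stats, feats, y, dim=768):      # dim = 768 for ViT-B
    """Route each feature of a mixed-class batch to its class accumulator."""
    for f, c in zip(feats, y.tolist()):
        stats.setdefault(c, ClassStats(dim)).update(f.unsqueeze(0))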

  2. At evaluation time:

a. Sample synthetic features f̃ᶜ ~ 𝒩(μᶜ, Σᶜ) (256 samples per class in the paper).

b. Freeze the backbone, fine-tune only the last linear layer on these synthetic features using logit-normalised cross-entropy to curb over-confidence:

logits = head(feat_samples)                            # [B, C] logits on synthetic features
scale  = (1.0 / tau) / logits.norm(dim=1, keepdim=True)
loss   = F.cross_entropy(scale * logits, targets)      # tau = 0.1 works well

5–20 epochs are enough; the cost is <5 % of total runtime.
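
Putting the sampling and alignment steps together, here is a minimal sketch of a head_alignment routine matching the call in the cheatsheet below; the mini-batch size, probe learning rate, and variance clamping are assumptions, not the authors' exact settings.

import torch
import torch.nn.functional as F

def head_alignment(head, stats, tau=0.1, samples_per_class=256, epochs=10, lr=1e-2):
    """Re-train only the linear head on synthetic features drawn from the stored
    per-class Gaussians (mean + diagonal variance), with logit normalisation."""
    feats, labels = [], []
    for class_id, cs in stats.items():
        mu, var = cs.finalize()                                     # [D], [D]
        std = var.clamp_min(1e-8).sqrt()
        f = mu + std * torch.randn(samples_per_class, mu.numel())   # f ~ N(mu, diag(var))
        feats.append(f)
        labels.append(torch.full((samples_per_class,), class_id, dtype=torch.long))
    feats, labels = torch.cat(feats), torch.cat(labels)

    opt = torch.optim.SGD(head.parameters(), lr=lr)
    for _ in range(epochs):
        for idx in torch.randperm(len(feats)).split(128):           # mini-batches of 128
            logits = head(feats[idx])                               # [B, C]
            scale = (1.0 / tau) / logits.norm(dim=1, keepdim=True)
            loss = F.cross_entropy(scale * logits, labels[idx])     # logit-normalised CE
            loss.backward(); opt.step(); opt.zero_grad()
    return head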


3. Empirical Findings

  • Supervised pre-training (ImageNet-21K): SLCA improves final accuracy over sequential fine-tuning by roughly +49.8 pp on Split CIFAR-100, +50.0 pp on Split ImageNet-R, +44.7 pp on Split CUB-200 and +40.2 pp on Split Cars-196.
  • Self-supervised pre-training (MoCo v3): SLCA closes the gap to joint training to <4 %.
  • SL alone removes most representation-level forgetting.
  • CA adds 2–20 pp, especially on fine-grained datasets where class overlap is high.
  • Outperforms prompt-based SOTA (DualPrompt) by 5–15 pp while adding zero extra inference FLOPs.

4. Implementation Cheatsheet

import torch
import torch.nn.functional as F
from torch.optim import SGD

# both the backbone and the classification head remain trainable
backbone.requires_grad_(True)
head.requires_grad_(True)

# Slow Learner: small LR for the backbone, larger LR for the head
opt = SGD([
    {'params': backbone.parameters(), 'lr': 1e-4},
    {'params': head.parameters(),     'lr': 1e-2}
])

for epoch in range(E):
    for x, y in loader_t:                      # loader_t: data of the current task t
        logits = head(backbone(x))
        loss = F.cross_entropy(logits, y)
        loss.backward()
        opt.step()
        opt.zero_grad()

# after the task is finished: collect per-class feature statistics
with torch.no_grad():
    for x, y in loader_t:
        feats = backbone(x)
        update_stats(stats, feats, y)          # accumulate mean / diagonal variance per class

# post-hoc Classifier Alignment on synthetic features
head_alignment(head, stats, tau=0.1, samples_per_class=256, epochs=10)

Memory footprint: storing μ and diagonal σ² for D = 768 (ViT-B) costs 2×768 floats per class, about 6 KB in float32 (≈0.6 MB for 100 classes). Alignment runtime amounts to #classes × S synthetic samples (S = 256) passed through the C-way linear head for a few epochs.


5. Practical Take-aways

  1. Tune LR before designing fancy modules. A two-LR schedule can recover >40 pp.
  2. Post-hoc classifier fixes are cheap and powerful. You don’t always need replay buffers.
  3. Self-supervised pre-training is not inherently better for CL. Methods whose representations require fewer updates (e.g., MoCo v3) pair better with SL.
  4. Fine-grained tasks magnify classifier bias. Always check with a linear probe (see the sketch after this list); if the probe beats your model, add CA.
  5. Scales well: constant time per task, negligible extra memory.
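
For takeaway 4, a generic linear-probe diagnostic can be sketched as follows; this is not part of SLCA itself, and the loaders, class count, and training schedule are placeholders.

import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_features(backbone, loader):
    feats, labels = [], []
    for x, y in loader:
        feats.append(backbone(x).cpu())
        labels.append(y)
    return torch.cat(feats), torch.cat(labels)

def linear_probe_accuracy(backbone, num_classes, train_loader, test_loader, epochs=20):
    """Fit a fresh linear head on frozen features and report test accuracy."""
    tr_f, tr_y = extract_features(backbone, train_loader)
    te_f, te_y = extract_features(backbone, test_loader)
    probe = torch.nn.Linear(tr_f.shape[1], num_classes)
    opt = torch.optim.SGD(probe.parameters(), lr=1e-2)
    for _ in range(epochs):
        for idx in torch.randperm(len(tr_f)).split(128):
            loss = F.cross_entropy(probe(tr_f[idx]), tr_y[idx])
            loss.backward(); opt.step(); opt.zero_grad()
    with torch.no_grad():
        return (probe(te_f).argmax(dim=1) == te_y).float().mean().item()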

6. Limitations & Open Directions

  • Does not address upstream continual pre-training.
  • Evaluated only on ViT-B/16 classification; extension to detection/segmentation or CNN backbones is future work.
  • CA assumes unimodal (Gaussian) class distributions; might need mixtures for highly multi-modal classes.

Still, SLCA offers a near-free performance boost and a solid new baseline for continual learning on pre-trained vision models.