SLCA: Slow Learner with Classifier Alignment for Continual Learning on a Pre-trained Model (2303.05118v4)

Published 9 Mar 2023 in cs.CV, cs.AI, and cs.LG

Abstract: The goal of continual learning is to improve the performance of recognition models in learning sequentially arrived data. Although most existing works are established on the premise of learning from scratch, growing efforts have been devoted to incorporating the benefits of pre-training. However, how to adaptively exploit the pre-trained knowledge for each incremental task while maintaining its generalizability remains an open question. In this work, we present an extensive analysis for continual learning on a pre-trained model (CLPM), and attribute the key challenge to a progressive overfitting problem. Observing that selectively reducing the learning rate can almost resolve this issue in the representation layer, we propose a simple but extremely effective approach named Slow Learner with Classifier Alignment (SLCA), which further improves the classification layer by modeling the class-wise distributions and aligning the classification layers in a post-hoc fashion. Across a variety of scenarios, our proposal provides substantial improvements for CLPM (e.g., up to 49.76%, 50.05%, 44.69% and 40.16% on Split CIFAR-100, Split ImageNet-R, Split CUB-200 and Split Cars-196, respectively), and thus outperforms state-of-the-art approaches by a large margin. Based on such a strong baseline, critical factors and promising directions are analyzed in-depth to facilitate subsequent research. Code has been made available at: https://github.com/GengDavid/SLCA.


Summary

  • The paper introduces SLCA as a two-stage method that decouples slow backbone adaptation from post-hoc classifier realignment to mitigate catastrophic forgetting.
  • It employs differential learning rates—using a slow update for the backbone and a faster rate for the classifier—to preserve general representations while adapting to new tasks.
  • Post-hoc classifier alignment using synthetic feature sampling further corrects classifier bias; combined with the slow learner, SLCA gains up to roughly 50 percentage points over standard sequential fine-tuning.

The paper introduces SLCA (Slow Learner with Classifier Alignment), a minimalist yet highly effective recipe for class-incremental learning when starting from a large pre-trained vision model.


1. Core Problem

In class-incremental continual learning you want to:

  1. Adapt the pre-trained representation to new tasks (plasticity).
  2. Retain the generic knowledge that future tasks will need (stability).
  3. Balance predictions across the ever-growing label set.

Conventional sequential fine-tuning (the same learning rate for all layers) fails mainly because:

  • Progressive over-fitting: the representation drifts toward the current task and loses generality.
  • Mis-calibrated classifier: the last fully-connected layer is trained on an imbalanced stream and ends up biased toward recent classes.

Prompt-based methods (L2P, DualPrompt) avoid the drift by freezing the backbone, but they sacrifice adaptability and still need custom architectural additions.


2. Proposed Solution: SLCA

SLCA has two completely decoupled stages that can be added to any fine-tuning baseline.

2.1 Slow Learner (SL)

Goal: keep the representation useful for future tasks while still letting it adapt.

Trick: use a much smaller learning rate for the backbone than for the classifier.

lr_backbone   = 1e-4   # 50x-100x smaller than a typical fine-tuning LR
lr_classifier = 1e-2
optimizer = torch.optim.SGD([           # the paper uses SGD; prompt-based methods use Adam
    {'params': backbone.parameters(), 'lr': lr_backbone},
    {'params': head.parameters(),     'lr': lr_classifier},
])

You apply these two LRs during the standard training loop for every task. No extra parameters, no replay buffer needed.

Why it works:

  • Small updates ≈ regularisation that discourages catastrophic drift.
  • Classifier still learns fast enough to fit the current task.

2.2 Classifier Alignment (CA)

Even with SL, the FC layer is biased toward the last tasks. CA is a post-hoc correction run only after the final task has been learned and doesn’t touch the backbone.

Step-by-step:

  1. While training task t, store the mean (μᶜ) and covariance (Σᶜ) of embeddings for each new class c ∈ C_t.

# after the forward pass for task t
feats = backbone(x)                  # [N, D] feature embeddings
cls_stats[class_id].update(feats)    # accumulate running mean / variance per class

In practice saving μᶜ ∈ ℝᴰ and diagonal var σ²ᶜ is enough (∼0.2 % of ViT-B parameters for 100 classes).
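
For concreteness, a per-class accumulator compatible with the cls_stats[class_id].update(feats) / update_stats(stats, feats, y) calls used in the snippets here is sketched below; the class name and storage layout are illustrative assumptions, not the released implementation.

import torch

class ClassStats:
    """Running mean and diagonal variance of D-dimensional features for one class."""
    def __init__(self, dim):
        self.n = 0
        self.sum = torch.zeros(dim)
        self.sq_sum = torch.zeros(dim)

    def update(self, feats):                     # feats: [N, D]
        feats = feats.detach().float().cpu()
        self.n += feats.shape[0]
        self.sum += feats.sum(dim=0)
        self.sq_sum += feats.pow(2).sum(dim=0)

    def finalize(self):                          # -> (mean, diagonal variance)
        mu = self.sum / self.n
        var = self.sq_sum / self.n - mu.pow(2)   # E[x^2] - E[x]^2
        return mu, var

def update_stats(stats, feats, y, dim=768):      # dim = 768 for ViT-B
    """Route each feature of a mixed-class batch to its class accumulator."""
    for f, c in zip(feats, y.tolist()):
        stats.setdefault(c, ClassStats(dim)).update(f.unsqueeze(0))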

  2. At evaluation time:

a. Sample synthetic features f̃ᶜ ~ 𝒩(μᶜ, Σᶜ) (256 samples per class in the paper).

b. Freeze the backbone, fine-tune only the last linear layer on these synthetic features using logit-normalised cross-entropy to curb over-confidence:

logits = head(feat_samples)                            # [B, C] logits on synthetic features
scale  = (1.0 / tau) / logits.norm(dim=1, keepdim=True)
loss   = F.cross_entropy(scale * logits, targets)      # tau = 0.1 works well

5–20 epochs are enough; the cost is <5 % of total runtime.
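
Putting the sampling and alignment steps together, here is a minimal sketch of a head_alignment routine matching the call in the cheatsheet below; the mini-batch size, probe learning rate, and variance clamping are assumptions, not the authors' exact settings.

import torch
import torch.nn.functional as F

def head_alignment(head, stats, tau=0.1, samples_per_class=256, epochs=10, lr=1e-2):
    """Re-train only the linear head on synthetic features drawn from the stored
    per-class Gaussians (mean + diagonal variance), with logit normalisation."""
    feats, labels = [], []
    for class_id, cs in stats.items():
        mu, var = cs.finalize()                                     # [D], [D]
        std = var.clamp_min(1e-8).sqrt()
        f = mu + std * torch.randn(samples_per_class, mu.numel())   # f ~ N(mu, diag(var))
        feats.append(f)
        labels.append(torch.full((samples_per_class,), class_id, dtype=torch.long))
    feats, labels = torch.cat(feats), torch.cat(labels)

    opt = torch.optim.SGD(head.parameters(), lr=lr)
    for _ in range(epochs):
        for idx in torch.randperm(len(feats)).split(128):           # mini-batches of 128
            logits = head(feats[idx])                               # [B, C]
            scale = (1.0 / tau) / logits.norm(dim=1, keepdim=True)
            loss = F.cross_entropy(scale * logits, labels[idx])     # logit-normalised CE
            loss.backward(); opt.step(); opt.zero_grad()
    return head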


3. Empirical Findings

  • Supervised pre-training (ImageNet-21K): SLCA improves final accuracy over sequential fine-tuning by roughly +49.8 pp on Split CIFAR-100, +50.0 pp on Split ImageNet-R, +44.7 pp on Split CUB-200 and +40.2 pp on Split Cars-196.
  • Self-supervised pre-training (MoCo v3): SLCA closes the gap to joint training to <4 %.
  • SL alone removes most representation-level forgetting.
  • CA adds 2–20 pp, especially on fine-grained datasets where class overlap is high.
  • Outperforms prompt-based SOTA (DualPrompt) by 5–15 pp while adding zero extra inference FLOPs.

4. Implementation Cheatsheet

import torch
import torch.nn.functional as F
from torch.optim import SGD

# both the backbone and the classification head remain trainable
backbone.requires_grad_(True)
head.requires_grad_(True)

# Slow Learner: small LR for the backbone, larger LR for the head
opt = SGD([
    {'params': backbone.parameters(), 'lr': 1e-4},
    {'params': head.parameters(),     'lr': 1e-2}
])

for epoch in range(E):
    for x, y in loader_t:                      # loader_t: data of the current task t
        logits = head(backbone(x))
        loss = F.cross_entropy(logits, y)
        loss.backward()
        opt.step()
        opt.zero_grad()

# after the task is finished: collect per-class feature statistics
with torch.no_grad():
    for x, y in loader_t:
        feats = backbone(x)
        update_stats(stats, feats, y)          # accumulate mean / diagonal variance per class

# post-hoc Classifier Alignment on synthetic features
head_alignment(head, stats, tau=0.1, samples_per_class=256, epochs=10)

Memory footprint: storing μ and diagonal σ² for D = 768 (ViT-B) costs 2×768 floats per class, about 6 KB in float32 (≈0.6 MB for 100 classes). Alignment runtime amounts to #classes × S synthetic samples (S = 256) passed through the C-way linear head for a few epochs.


5. Practical Take-aways

  1. Tune LR before designing fancy modules. A two-LR schedule can recover >40 pp.
  2. Post-hoc classifier fixes are cheap and powerful. You don’t always need replay buffers.
  3. Self-supervised pre-training is not inherently better for CL. Methods whose representations require fewer updates (e.g., MoCo v3) pair better with SL.
  4. Fine-grained tasks magnify classifier bias. Always check with a linear probe (see the sketch after this list); if the probe beats your model, add CA.
  5. Scales well: constant time per task, negligible extra memory.
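
For takeaway 4, a generic linear-probe diagnostic can be sketched as follows; this is not part of SLCA itself, and the loaders, class count, and training schedule are placeholders.

import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_features(backbone, loader):
    feats, labels = [], []
    for x, y in loader:
        feats.append(backbone(x).cpu())
        labels.append(y)
    return torch.cat(feats), torch.cat(labels)

def linear_probe_accuracy(backbone, num_classes, train_loader, test_loader, epochs=20):
    """Fit a fresh linear head on frozen features and report test accuracy."""
    tr_f, tr_y = extract_features(backbone, train_loader)
    te_f, te_y = extract_features(backbone, test_loader)
    probe = torch.nn.Linear(tr_f.shape[1], num_classes)
    opt = torch.optim.SGD(probe.parameters(), lr=1e-2)
    for _ in range(epochs):
        for idx in torch.randperm(len(tr_f)).split(128):
            loss = F.cross_entropy(probe(tr_f[idx]), tr_y[idx])
            loss.backward(); opt.step(); opt.zero_grad()
    with torch.no_grad():
        return (probe(te_f).argmax(dim=1) == te_y).float().mean().item()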

6. Limitations & Open Directions

  • Does not address upstream continual pre-training.
  • Evaluated only on ViT-B/16 classification; extension to detection/segmentation or CNN backbones is future work.
  • CA assumes unimodal (Gaussian) class distributions; might need mixtures for highly multi-modal classes.

Still, SLCA offers a near-free performance boost and a solid new baseline for continual learning on pre-trained vision models.