Shallow Classifier-Level Forgetting

Updated 8 January 2026
  • Shallow classifier-level forgetting is the degradation in performance at the classifier layer after new class learning despite maintaining linearly decodable deep features.
  • It primarily arises from classifier parameter drift, feature-classifier geometric misalignment, and limited output-level regularization in class-incremental scenarios.
  • Mitigation strategies include modular classifier architectures, output regularization techniques, and discrete bottlenecks that preserve prior decision boundaries.

Shallow classifier-level forgetting refers to the degradation in performance of neural networks at the output layer (classifier/head) on previously learned classes or tasks after learning new information, even when the deep feature representations remain linearly decodable. It manifests primarily in class-incremental and continual learning scenarios, where model updates meant for new classes interfere with the decision boundaries for previously learned classes due to parameter sharing, drift, or insufficiently separated classifier geometries.

1. Mathematical Foundations and Characterization

Shallow forgetting is formally defined as the drop in test accuracy attributable to the classifier layer itself, distinct from representation-level (deep) forgetting. Given a model trained on sequential tasks, let $A_{ij}$ denote the accuracy on task $j$ after completing training on task $i$; then shallow forgetting for task $j$ is

$$F^{\mathrm{shallow}}_{i \to j} = A_{jj} - A_{ij}$$

while deep forgetting is measured by fitting an optimal linear probe on frozen features:

$$F^{\mathrm{deep}}_{i \to j} = A^*_{jj} - A^*_{ij}$$

where $A^*$ is the accuracy as measured by the probe. Typically $F^{\mathrm{shallow}}_{i \to j} \geq F^{\mathrm{deep}}_{i \to j}$, reflecting the greater susceptibility of the classifier layer to drift (Lanzillotta et al., 8 Dec 2025).
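
To make the distinction concrete, here is a small illustrative sketch (assumed helper setup, not code from the cited papers) that computes shallow forgetting from the model's own head predictions and deep forgetting from freshly fitted linear probes on frozen features:

```python
# Illustrative sketch (hypothetical inputs): shallow vs. deep forgetting
# for one previously learned task j, following the definitions above.
import numpy as np
from sklearn.linear_model import LogisticRegression

def accuracy(preds, labels):
    return float(np.mean(preds == labels))

def shallow_forgetting(head_preds_at_j, head_preds_at_i, y_test):
    """A_{jj} - A_{ij}: drop measured through the model's own classifier head."""
    return accuracy(head_preds_at_j, y_test) - accuracy(head_preds_at_i, y_test)

def deep_forgetting(feats_train_at_j, feats_test_at_j,
                    feats_train_at_i, feats_test_at_i, y_train, y_test):
    """A*_{jj} - A*_{ij}: drop measured by refitting an optimal linear probe on
    frozen features extracted from the checkpoints after task j and task i."""
    probe_j = LogisticRegression(max_iter=2000).fit(feats_train_at_j, y_train)
    probe_i = LogisticRegression(max_iter=2000).fit(feats_train_at_i, y_train)
    return (probe_j.score(feats_test_at_j, y_test)
            - probe_i.score(feats_test_at_i, y_test))
```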

From a geometric perspective, shallow forgetting is tightly linked to the classifier’s inability to capture or retain second-order feature statistics (covariances) as new data causes class means to shift and buffer covariances to degrade (often becoming rank-deficient), leading to inflated means and impaired population boundary recovery (Lanzillotta et al., 8 Dec 2025). Analytically, in sequential binary classification, if $\hat{w}$ denotes the equivalent one-class vector for the classifier, then drift in feature space amplifies forgetting in direct proportion to $\|\hat{w}\|$ and the correlation $\langle \hat{w}^A, \hat{w}^B \rangle$ between old and new task classifiers (Huang et al., 2024).
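
A toy numeric illustration of this scaling (purely illustrative, not taken from the cited analysis):

```python
# A feature drift delta_h shifts a linear head's logit by w . delta_h, bounded
# by ||w|| * ||delta_h||; a feature update that helps a new head w_B perturbs
# an old head w_A's logit in proportion to their correlation <w_A, w_B>.
import numpy as np

rng = np.random.default_rng(0)
d = 64
delta_h = 0.05 * rng.normal(size=d)            # drift from learning a new task

for scale in (1.0, 5.0, 25.0):                 # larger head norm => larger margin shift
    w_A = scale * rng.normal(size=d) / np.sqrt(d)
    bound = np.linalg.norm(w_A) * np.linalg.norm(delta_h)
    print(f"||w_A||={np.linalg.norm(w_A):6.2f}  "
          f"|logit shift|={abs(w_A @ delta_h):.4f}  bound={bound:.4f}")

# Interference between heads: moving features along w_B (to raise its logit)
# changes w_A's logit by <w_A, w_B> per unit step.
w_A = rng.normal(size=d)
v = rng.normal(size=d)
w_B_orth = v - (v @ w_A) / (w_A @ w_A) * w_A   # decorrelated new head
w_B_aligned = 0.8 * w_A + 0.2 * v              # correlated new head
step = 0.1
print("aligned interference:", float(w_A @ (step * w_B_aligned)))
print("orthogonal interference:", float(w_A @ (step * w_B_orth)))   # ~0
```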

2. Mechanisms and Causes of Shallow Classifier-Level Forgetting

The principal drivers of shallow forgetting include:

  • Classifier parameter drift: When learning new classes, shared classifier weights (i.e., a monolithic linear readout or softmax head) are updated. This can “wash out” prior optima for old classes, especially if no previous samples are replayed, causing “old-logit collapse.” Conversely, new classes may induce “new-logit explosion,” producing excessively large logits that overshadow old ones (Liu et al., 2023).
  • Feature-classifier geometric misalignment: Small changes in feature distributions due to new task learning can translate into large logit margin shifts if the classifier has a large norm or if heads are colinear (Huang et al., 2024).
  • Statistical artifacts under small replay buffers: With minimal experience replay, buffer-based covariances become rank-deficient, yielding unstable solution manifolds for the head and exacerbating shallow forgetting even when linear separability persists in feature space (Lanzillotta et al., 8 Dec 2025).
  • Lack of output-level regularization: Without mechanisms enforcing agreement or scale alignment across heads and sessions, the classifier layer exhibits scale drift and inter-task incongruence (Liu et al., 2023).

3. Key Approaches to Mitigating Shallow Forgetting

Several classes of algorithms specifically target shallow classifier-level forgetting:

3.1 Disjoint and Modular Classifier Architectures

Training a growing sequence of task- or session-specific heads, each responsible only for its associated classes, and freezing past heads is highly effective; a minimal sketch of this scheme follows the list below. At inference, the outputs are concatenated and a global softmax is applied (Bobiev et al., 2021):

  • At session $t$, only the new classifier head $W_t$ is trained, while the old heads $W_1, \ldots, W_{t-1}$ are frozen.
  • Ensemble prediction mitigates drift by preventing parameter overwrites in older heads.
  • Bias-correction layers further improve old-new class calibration.
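
A minimal sketch of this growing-head scheme (an illustrative assumption, not the reference implementation):

```python
# Sketch: a frozen backbone with a growing list of per-session heads;
# old heads are frozen and per-session logits are concatenated at inference.
import torch
import torch.nn as nn

class GrowingHeadClassifier(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int):
        super().__init__()
        self.backbone = backbone
        self.feat_dim = feat_dim
        self.heads = nn.ModuleList()          # one linear head per session

    def add_session(self, num_new_classes: int):
        # Freeze all previously trained heads before starting the new session.
        for head in self.heads:
            for p in head.parameters():
                p.requires_grad_(False)
        self.heads.append(nn.Linear(self.feat_dim, num_new_classes))

    def forward(self, x):
        f = self.backbone(x)
        # Concatenate per-session logits; a single softmax spans all classes seen so far.
        return torch.cat([head(f) for head in self.heads], dim=1)
```

During session $t$, only the most recently added head (plus any bias-correction layer) receives gradient updates, so earlier decision boundaries cannot be overwritten.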

3.2 Output Regularization and Logit Constraints

Constraining classifier outputs at training time directly reduces drift:

  • ICE (Individual Classifiers with Frozen Extractor): Each session’s classifier is independent and frozen after training. Logit constraints are enforced by introducing:
    • Previous logit alignment (ICE-PL): exposes new heads to old-logit scales.
    • Constant “Other-class” logits (ICE-O): constrains each session’s outputs relative to a fixed threshold (Liu et al., 2023).
  • Margin Dampening: Enforces that the predicted probability for any new class must exceed the highest past-class probability by a safety margin, penalizing only violations. This is combined with KL-based knowledge distillation on buffer examples to preserve the output shape (Pomponi et al., 2024); see the sketch after this list.
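
One plausible form of the margin-dampening penalty and the buffer distillation term described above (a sketch under assumed notation, not the authors' exact losses):

```python
# For a sample whose ground-truth label y is a new class, require p(y) to exceed
# the highest past-class probability by `margin`; penalize only violations.
import torch
import torch.nn.functional as F

def margin_dampening_penalty(logits, y, old_class_idx, margin=0.1):
    """logits: (B, C) over all classes seen so far; old_class_idx: LongTensor of
    indices of previously learned classes."""
    probs = logits.softmax(dim=1)
    p_true = probs.gather(1, y.unsqueeze(1)).squeeze(1)      # p(y)
    p_old_max = probs[:, old_class_idx].max(dim=1).values    # highest past-class prob
    return F.relu(p_old_max + margin - p_true).mean()

def buffer_distillation(student_logits, teacher_logits, T=2.0):
    """KL distillation on replayed buffer examples to preserve the output shape."""
    p_teacher = (teacher_logits / T).softmax(dim=1)
    log_p_student = (student_logits / T).log_softmax(dim=1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
```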

3.3 Shallow Bottleneck and Masking Interventions

Introducing discrete, sparse architectural bottlenecks atop the backbone enables “plug-and-forget” interventions:

  • Discrete Key–Value Bottlenecks (DKVB): Each sample activates a sparse subset of discrete slot indices. Forgetting a class is achieved by masking high-usage keys corresponding to that class, instantly and completely severing class information without retraining (Shah et al., 2023); a usage-mask sketch follows this list.
  • This intervention is agnostic to gradient-based optimization and acts purely at the lookup level in the classifier.
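
A hypothetical sketch of the usage-based key mask (names and the quantile threshold are assumptions, not the published procedure):

```python
# Count how often each discrete key fires for the class to be forgotten, then
# zero out its most-used keys so that class's lookup path is severed.
import torch

def build_forget_mask(key_indices_per_sample, labels, forget_class,
                      num_keys, usage_quantile=0.9):
    """key_indices_per_sample: (N, k) LongTensor of keys activated per sample.
    Assumes at least one sample of the forget class is present."""
    usage = torch.zeros(num_keys)
    sel = labels == forget_class
    idx, counts = key_indices_per_sample[sel].flatten().unique(return_counts=True)
    usage[idx] = counts.float()
    threshold = torch.quantile(usage[usage > 0], usage_quantile)
    mask = (usage < threshold).float()   # 0 for high-usage keys of the forget class
    return mask                          # multiply retrieved values by mask[key]
```

Because the mask acts on the bottleneck lookup rather than on weights, no gradient-based retraining is required.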

3.4 Fixed and Orthogonal Classifiers

Constraining the classifier norm and decorrelating classifiers across tasks can reduce drift-induced amplification:

  • Fixed Random Classifier Rearrangement (FRCR): Heads are randomly initialized from isotropic Gaussian distributions and then frozen; a rearrangement step (greedy permutation) minimizes the dot product with prior heads, thereby limiting gradient subspace overlap (Huang et al., 2024); a sketch of the rearrangement step follows below.
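
An illustrative sketch of the fixed-random-head idea with a greedy rearrangement step (an assumption about the procedure's details, not the authors' exact algorithm):

```python
# Sample a frozen random head for the new task, then greedily assign its rows
# (class vectors) to minimize absolute correlation with the previous task's head.
import numpy as np

def frcr_new_head(prev_head, num_new_classes, feat_dim, rng):
    """prev_head: (C_prev, d) frozen head of the previous task, or None."""
    cand = rng.normal(scale=1.0 / np.sqrt(feat_dim),
                      size=(num_new_classes, feat_dim))
    if prev_head is None:
        return cand
    order, used = [], set()
    for w_old in prev_head[:num_new_classes]:
        scores = np.abs(cand @ w_old)
        scores[list(used)] = np.inf          # do not reuse an assigned row
        j = int(np.argmin(scores))
        order.append(j); used.add(j)
    order += [j for j in range(num_new_classes) if j not in used]
    return cand[order]   # rows stay frozen; only the backbone adapts afterwards
```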

3.5 Self-Distillation and Memory Prioritization

Leveraging shallow-layer features as soft teachers for deeper classifier branches enhances generalizability under buffer-induced memory constraints:

  • Self-distillation across layers: Minimizing the KL divergence between similarity distributions computed from shallow and deep features maintains transferable structure (Nagata et al., 2024); see the sketch after this list.
  • Prioritized replay: Maintaining a buffer of maximally-confused or least-confident examples focuses the model on instances most at risk of shallow forgetting.
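
A minimal sketch of the cross-layer self-distillation term (assumed formulation, not the exact loss from the cited work):

```python
# The batch-similarity distribution of shallow features acts as a soft teacher
# for the similarity distribution of deep features.
import torch
import torch.nn.functional as F

def cross_layer_self_distillation(shallow_feats, deep_feats, T=1.0, eps=1e-12):
    """shallow_feats: (B, d_s), deep_feats: (B, d_d) features from the same batch."""
    def sim_dist(f):
        f = F.normalize(f, dim=1)
        sim = (f @ f.t()) / T                                   # pairwise cosine similarities
        mask = torch.eye(f.size(0), dtype=torch.bool, device=f.device)
        sim = sim.masked_fill(mask, float("-inf"))              # drop self-similarity
        return sim.softmax(dim=1)
    p = sim_dist(shallow_feats).detach()                        # teacher: no gradient
    q = sim_dist(deep_feats)                                    # student
    return (p * ((p + eps).log() - (q + eps).log())).sum(dim=1).mean()
```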

3.6 Parameter-Selective Class Unlearning

For class unlearning with limited data, identifying class-relevant parameters via gradient prominence and selectively updating only those while freezing all others allows targeted, efficient output-layer unlearning (Singh et al., 2022).
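
A hedged sketch of this idea, approximating gradient prominence by accumulated gradient magnitude on forget-class examples (not the exact ERwP procedure):

```python
# Rank parameters by |grad| accumulated on forget-class batches, keep the top
# fraction as "class-relevant", and restrict later updates to that subset by
# zeroing gradients everywhere else.
import torch

def select_prominent_params(model, forget_loader, loss_fn, top_frac=0.01):
    """Return boolean update masks marking the most forget-class-relevant parameters."""
    grads = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    for x, y in forget_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                grads[n] += p.grad.abs()
    flat = torch.cat([g.flatten() for g in grads.values()])
    k = max(1, int(top_frac * flat.numel()))
    threshold = flat.topk(k).values.min()
    return {n: (g >= threshold) for n, g in grads.items()}

def apply_masks(model, masks):
    """Call after loss.backward() and before optimizer.step() to freeze the rest."""
    for n, p in model.named_parameters():
        if p.grad is not None:
            p.grad.mul_(masks[n].to(p.grad.dtype))
```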

| Method/Class | Architectural Principle | Memory | Output Calibration | Freeze Past Heads | Explicit Unlearning |
|---|---|---|---|---|---|
| ICE / Partial Heads (Liu et al., 2023; Bobiev et al., 2021) | Modular, disjoint classifiers | Optional | Logit constraints | Yes | No |
| DKVB (Shah et al., 2023) | Discrete sparse bottleneck | None | N/A | N/A | Yes |
| FRCR (Huang et al., 2024) | Fixed/orthogonal heads | None | Implicit | Yes | No |
| Margin Dampening + CSC (Pomponi et al., 2024) | Soft output constraints + gated cascading | Optional | Margin + KD | Yes (in head) | No |
| ERwP (Singh et al., 2022) | Parameter-selective retraining | Few-shot | KD on retained | Partial | Yes |

4. Empirical Insights, Benchmarking, and Theoretical Guarantees

Multiple studies provide quantitative evidence of these mechanisms:

  • Partial head methods on CIFAR-100 with modest memory ($B = 1000$–$2000$) achieve $\sim 10$–$15$ pp gains over strong replay baselines; ablations confirm that freezing each per-task classifier is essential for minimizing shallow forgetting (Bobiev et al., 2021).
  • ICE-O reaches within $2$–$4$ F1 points of joint training upper bounds and outperforms rehearsal methods by up to $44.7\%$ absolute F1 in early sessions on information extraction benchmarks (Liu et al., 2023).
  • FRCR achieves $<1\%$ average maximum forgetting on 5-split-MNIST, with robust accuracy gains over EWC and stable SGD (Huang et al., 2024).
  • DKVB masking achieves complete forgetting ($100\%$ suppression of the target class) on CIFAR-10/100/LACUNA-100 in $1.6$–$14$ seconds, with retention accuracy loss $<0.5\%$, compared to hundreds of seconds for SCRUB (Shah et al., 2023).
  • Margin Dampening + CSC improves accuracy over strong buffer-based methods by $5$–$10$ points while matching or improving backward transfer (BWT) (Pomponi et al., 2024).
  • Class unlearning via ERwP drives forget-class accuracy to zero ($\mathrm{FA}_e = 0.00\%$) while preserving retained-class accuracy ($69.32\%$ vs. $69.88\%$ for the original model on CIFAR-100 ResNet-56), requiring only $10$ epochs and a small fraction of the data (Singh et al., 2022).

5. Replay Efficiency, Buffer Artifacts, and Calibration

Analysis within the Neural Collapse framework reveals a replay efficiency gap: even minimal buffer replay suffices for retaining linearly separable representations, but shallow forgetting (output-level performance) remains high unless large buffers are used (Lanzillotta et al., 8 Dec 2025). This arises because estimator variance and rank-deficiency in small buffers distort the mean/covariance structure, undermining the head’s capacity to recover boundaries, even when deep features remain informative.

Several techniques correct for statistical artifacts:

  • Covariance regularization ($\hat{\Sigma}_B + \alpha I$; sketched after this list),
  • Mean-norm calibration $f(\pi)$ to counteract "North-Star" inflation,
  • Feature subspace augmentation,
  • Proxy-anchored or ETF-based classifiers to ensure population-level alignment.
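
A small sketch of the first correction: a shrinkage-regularized buffer covariance feeding a class-mean (linear-discriminant-style) head. The setup and parameter names are assumptions, not code from the cited work:

```python
# Class-mean classifier built from a small replay buffer, with shrinkage
# regularization so the rank-deficient covariance still yields a usable
# shared precision matrix.
import numpy as np

def regularized_gaussian_head(feats, labels, alpha=1e-1):
    """feats: (N, d) buffer features; labels: (N,) ints.
    Returns class means and the inverse of Sigma_hat_B + alpha * I."""
    classes = np.unique(labels)
    d = feats.shape[1]
    means = {c: feats[labels == c].mean(axis=0) for c in classes}
    centered = np.concatenate([feats[labels == c] - means[c] for c in classes])
    sigma = centered.T @ centered / max(len(centered) - 1, 1)
    precision = np.linalg.inv(sigma + alpha * np.eye(d))  # shrinkage makes this well-posed
    return means, precision

def predict(x, means, precision):
    """Linear-discriminant style scoring with the regularized covariance."""
    scores = {c: x @ precision @ mu - 0.5 * mu @ precision @ mu
              for c, mu in means.items()}
    return max(scores, key=scores.get)
```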

6. Limitations and Extensions

While many shallow-forgetting mitigation strategies are effective in frozen-backbone, modular-head settings (e.g., class-incremental learning), extensions to deep continual learning with non-frozen backbones or highly entangled representations require additional regularization (e.g., experience replay, weight consolidation). Trade-offs may emerge in final accuracy if classifier norm constraints are too strict (Huang et al., 2024), and incremental buffer scheduling presents further challenges (Pomponi et al., 2024). Modular approaches depend on suitable bias calibration (e.g., BiC layer) to resolve across-head output scales (Bobiev et al., 2021). DKVB masking currently presumes the existence of a discrete bottleneck, which may not be present in generic architectures (Shah et al., 2023).

Future directions include designing multi-class ETF frames for generalized orthogonality (Huang et al., 2024), combining modular and gating-based heads with parameter-efficient adapters, and refining output-level calibration techniques for deeper continual networks.

7. Synthesis and Prescriptions

The study of shallow classifier-level forgetting demonstrates that a large share of measured forgetting in class-incremental settings is attributable to the classifier layer rather than the representation: deep features often remain linearly decodable while the output layer drifts due to parameter sharing, scale miscalibration, and buffer-induced statistical artifacts.

Best practices include evaluating with linear probes to separate shallow from deep forgetting, calibrating output scales across incremental heads or tasks, maintaining high-capacity and robust backbones, and leveraging soft or modular classifier designs to achieve a stability-plasticity balance.
