Shallow Classifier-Level Forgetting
- Shallow classifier-level forgetting is the degradation in classifier-layer performance after new-class learning, even when deep features remain linearly decodable.
- It primarily arises from classifier parameter drift, feature-classifier geometric misalignment, and limited output-level regularization in class-incremental scenarios.
- Mitigation strategies include modular classifier architectures, output regularization techniques, and discrete bottlenecks that preserve prior decision boundaries.
Shallow classifier-level forgetting refers to the degradation in performance of neural networks at the output layer (classifier/head) on previously learned classes or tasks after learning new information, even when the deep feature representations remain linearly decodable. It manifests primarily in class-incremental and continual learning scenarios, where model updates meant for new classes interfere with the decision boundaries for previously learned classes due to parameter sharing, drift, or insufficiently separated classifier geometries.
1. Mathematical Foundations and Characterization
Shallow forgetting is formally defined as the drop in test accuracy attributable to the classifier layer itself, distinct from representation-level (deep) forgetting. Given a model trained on sequential tasks, let $a_{i,j}$ denote the accuracy on task $i$ after completing training on task $j$; then shallow forgetting for task $i$ after the final task $T$ is

$$F^{\text{shallow}}_i = a_{i,i} - a_{i,T},$$

while deep forgetting is measured by fitting an optimal linear probe on frozen features:

$$F^{\text{deep}}_i = \tilde{a}_{i,i} - \tilde{a}_{i,T},$$

where $\tilde{a}_{i,j}$ is the accuracy as measured by the probe. Typically $F^{\text{shallow}}_i \gg F^{\text{deep}}_i$, reflecting the greater susceptibility of the classifier layer to drift (Lanzillotta et al., 8 Dec 2025).
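The distinction is simple to measure in practice. The sketch below is a simplification rather than the exact protocol of the cited papers: shallow forgetting is read off the network's own head, deep forgetting refits a linear probe on the frozen features of each checkpoint; the checkpoint dictionary keys are illustrative.

```python
# A minimal sketch, assuming checkpoints saved after task i and after the final task T.
import numpy as np
from sklearn.linear_model import LogisticRegression

def head_accuracy(logits, labels):
    """Accuracy of the network's own classifier outputs on task i's test set."""
    return float((logits.argmax(axis=1) == labels).mean())

def probe_accuracy(train_feats, train_labels, test_feats, test_labels):
    """Accuracy of a linear probe fit on the checkpoint's frozen features."""
    probe = LogisticRegression(max_iter=2000).fit(train_feats, train_labels)
    return float((probe.predict(test_feats) == test_labels).mean())

def shallow_and_deep_forgetting(ckpt_i, ckpt_T, train_labels, test_labels):
    """ckpt_*: dicts with 'test_logits', 'train_feats', 'test_feats' for task i's data."""
    shallow = head_accuracy(ckpt_i["test_logits"], test_labels) \
            - head_accuracy(ckpt_T["test_logits"], test_labels)
    deep = probe_accuracy(ckpt_i["train_feats"], train_labels,
                          ckpt_i["test_feats"], test_labels) \
         - probe_accuracy(ckpt_T["train_feats"], train_labels,
                          ckpt_T["test_feats"], test_labels)
    return shallow, deep  # typically shallow >> deep (Lanzillotta et al., 8 Dec 2025)
```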
From a geometric perspective, shallow forgetting is tightly linked to the classifier’s inability to capture or retain second-order feature statistics (covariances) as new data causes class means to shift and buffer covariances to degrade (often becoming rank-deficient), leading to inflated means and impaired population boundary recovery (Lanzillotta et al., 8 Dec 2025). Analytically, in sequential binary classification, if $w$ denotes the equivalent one-class vector for the classifier, then drift in feature space amplifies forgetting in direct proportion to the norm $\|w\|$ and to the correlation between old and new task classifiers (Huang et al., 2024).
2. Mechanisms and Causes of Shallow Classifier-Level Forgetting
The principal drivers of shallow forgetting include:
- Classifier parameter drift: When learning new classes, shared classifier weights (i.e., a monolithic linear readout or softmax head) are updated. This can “wash out” prior optima for old classes, especially if no previous samples are replayed, causing “old-logit collapse.” Conversely, new classes may induce “new-logit explosion,” producing excessively large logits that overshadow old ones (Liu et al., 2023); a simple diagnostic sketch for these two symptoms follows this list.
- Feature-classifier geometric misalignment: Small changes in feature distributions due to new task learning can translate into large logit margin shifts if the classifier has a large norm or if heads are colinear (Huang et al., 2024).
- Statistical artifacts under small replay buffers: With minimal experience replay, buffer-based covariances become rank-deficient, yielding unstable solution manifolds for the head and exacerbating shallow forgetting even when linear separability persists in feature space (Lanzillotta et al., 8 Dec 2025).
- Lack of output-level regularization: Without mechanisms enforcing agreement or scale alignment across heads and sessions, the classifier layer exhibits scale drift and inter-task incongruence (Liu et al., 2023).
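The logit-scale symptoms above can be checked directly before applying any mitigation. The diagnostic below is illustrative rather than taken from the cited papers: it compares the average maximum logit assigned to old-class columns versus new-class columns, where a widening gap signals old-logit collapse or new-logit explosion.

```python
# Illustrative diagnostic: mean of the per-sample maximum logit over old vs. new classes.
import torch

@torch.no_grad()
def logit_scale_gap(model, loader, old_class_ids, new_class_ids, device="cpu"):
    old_ids = torch.tensor(old_class_ids, device=device)
    new_ids = torch.tensor(new_class_ids, device=device)
    old_sum, new_sum, n = 0.0, 0.0, 0
    for x, _ in loader:
        logits = model(x.to(device))                       # [batch, num_classes_so_far]
        old_sum += logits[:, old_ids].max(dim=1).values.sum().item()
        new_sum += logits[:, new_ids].max(dim=1).values.sum().item()
        n += x.size(0)
    return old_sum / n, new_sum / n                        # mean max-logit per group
```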
3. Key Approaches to Mitigating Shallow Forgetting
Several classes of algorithms specifically target shallow classifier-level forgetting:
3.1 Disjoint and Modular Classifier Architectures
Training a growing sequence of task- or session-specific heads, each only responsible for its associated classes, and freezing past heads is highly effective. At inference, the outputs are concatenated and a global softmax is applied (Bobiev et al., 2021):
- At session $t$, only the new classifier head $h_t$ is trained, while the old heads $h_1, \dots, h_{t-1}$ are frozen.
- Ensemble prediction mitigates drift by preventing parameter overwrites in older heads.
- Bias-correction layers further improve old-new class calibration; a minimal sketch of this head layout follows below.
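The PyTorch sketch below shows session-specific heads on a shared backbone; the class name `SessionHeads` and the optimizer setup are illustrative assumptions, not the cited implementations.

```python
# Minimal sketch: per-session heads, old heads frozen, one global softmax at inference.
import torch
import torch.nn as nn

class SessionHeads(nn.Module):
    def __init__(self, feat_dim):
        super().__init__()
        self.feat_dim = feat_dim
        self.heads = nn.ModuleList()

    def add_session(self, num_new_classes):
        # Freeze every previously trained head before a new session begins.
        for head in self.heads:
            for p in head.parameters():
                p.requires_grad_(False)
        self.heads.append(nn.Linear(self.feat_dim, num_new_classes))

    def forward(self, feats):
        # Concatenate per-session logits; a single global softmax is applied downstream.
        return torch.cat([head(feats) for head in self.heads], dim=1)

heads = SessionHeads(feat_dim=512)
heads.add_session(10)   # session 1: classes 0-9
heads.add_session(10)   # session 2: classes 10-19; the first head is now frozen
optimizer = torch.optim.SGD(heads.heads[-1].parameters(), lr=0.01)  # only the new head trains
```

Because earlier heads receive no gradients, their decision boundaries cannot be overwritten during later sessions.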
3.2 Output Regularization and Logit Constraints
Constraining classifier outputs at training time directly reduces drift:
- ICE (Individual Classifiers with Frozen Extractor): Each session’s classifier is independent and frozen after training. Logit constraints are enforced by introducing:
- Previous logit alignment (ICE-PL): exposes new heads to old-logit scales.
- Constant “Other-class” logits (ICE-O): constrains each session’s outputs relative to a fixed threshold (Liu et al., 2023).
- Margin Dampening: Enforces that the predicted probability for any new class must exceed the highest past-class probability by a safety margin, penalizing only violations. This is combined with KL-based knowledge distillation on buffer examples for output-shape preservation (Pomponi et al., 2024); a sketch of both terms follows this list.
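A hedged sketch of a margin-dampening-style objective is given below; the margin value, temperature, and function names are assumptions rather than the authors' implementation, but the structure follows the description above: a hinge on the gap between the new-class probability and the strongest past-class probability, plus KL distillation on buffer examples.

```python
import torch
import torch.nn.functional as F

def margin_dampening_loss(logits, targets, num_old_classes, margin=0.1):
    probs = logits.softmax(dim=1)
    p_true = probs.gather(1, targets.unsqueeze(1)).squeeze(1)     # probability of the new-class label
    p_old_max = probs[:, :num_old_classes].max(dim=1).values      # strongest past-class probability
    return F.relu(p_old_max + margin - p_true).mean()             # penalize only margin violations

def buffer_distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL between the old model's and current model's output distributions on buffer samples,
    # preserving the shape of past-class outputs.
    return F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
```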
3.3 Shallow Bottleneck and Masking Interventions
Introducing discrete, sparse architectural bottlenecks atop the backbone enables “plug-and-forget” interventions:
- Discrete Key–Value Bottlenecks (DKVB): Each sample activates a sparse subset of discrete slot indices. Forgetting a class is achieved by masking high-usage keys corresponding to that class, instantly and completely severing class information without retraining (Shah et al., 2023).
- This intervention is agnostic to gradient-based optimization and acts purely at the lookup level in the classifier, as illustrated in the sketch below.
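The sketch below illustrates the idea under an assumed bottleneck interface (a per-class key-usage matrix and a retrieval mask); it is not taken from the DKVB codebase, but shows why the intervention needs no gradient steps.

```python
# Illustrative "forgetting by masking" on a discrete key-value bottleneck.
import torch

def build_forget_mask(key_usage_counts, forget_class, top_fraction=0.05):
    """key_usage_counts: [num_classes, num_keys] counts of how often each key fires per class."""
    usage = key_usage_counts[forget_class]
    k = max(1, int(top_fraction * usage.numel()))
    top_keys = usage.topk(k).indices            # keys most associated with the class to forget
    mask = torch.ones(usage.numel(), dtype=torch.bool)
    mask[top_keys] = False                      # masked keys are simply never retrieved again
    return mask
```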
3.4 Fixed and Orthogonal Classifiers
Constraining the classifier norm and decorrelating classifiers across tasks can reduce drift-induced amplification:
- Fixed Random Classifier Rearrangement (FRCR): Heads are randomly initialized from isotropic Gaussian distributions and then frozen; a rearrangement step (greedy permutation) minimizes the dot product with prior heads, thereby limiting gradient subspace overlap (Huang et al., 2024). A sketch of this selection step follows below.
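The following sketch constructs fixed random heads in the spirit of FRCR; the candidate pool and the greedy criterion (smallest worst-case absolute dot product against already-fixed rows) are plausible assumptions, not the paper's exact rearrangement procedure.

```python
import torch

def new_fixed_head(prior_heads, num_new_classes, feat_dim, candidates=256, scale=0.05):
    """prior_heads: list of [n_i, feat_dim] frozen weight matrices from earlier tasks."""
    rows = []
    for _ in range(num_new_classes):
        cand = torch.randn(candidates, feat_dim) * scale        # isotropic Gaussian proposals
        fixed = list(prior_heads) + [r.unsqueeze(0) for r in rows]
        if not fixed:
            rows.append(cand[0])                                # nothing to decorrelate against yet
            continue
        used = torch.cat(fixed, dim=0)
        # Greedy choice: smallest worst-case |dot product| with all already-fixed rows,
        # limiting overlap of the gradient subspaces induced by different heads.
        overlap = (cand @ used.t()).abs().max(dim=1).values
        rows.append(cand[overlap.argmin()])
    return torch.stack(rows)                                    # frozen after creation
```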
3.5 Self-Distillation and Memory Prioritization
Leveraging shallow-layer features as soft teachers for deeper classifier branches enhances generalizability under buffer-induced memory constraints:
- Self-distillation across layers: Minimizing the KL divergence between similarity distributions in shallow and deep features maintains transferable structure (Nagata et al., 2024); a sketch follows this list.
- Prioritized replay: Maintaining a buffer of maximally-confused or least-confident examples focuses the model on instances most at risk of shallow forgetting.
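The cross-layer distillation term can be sketched as follows; the temperature and the choice of similarity distribution are assumptions, not the cited recipe: within a batch, the pairwise-similarity distribution of shallow features acts as a soft teacher for the deep features feeding the classifier.

```python
import torch
import torch.nn.functional as F

def similarity_log_probs(feats, tau=0.1):
    z = F.normalize(feats, dim=1)
    return F.log_softmax(z @ z.t() / tau, dim=1)    # row-wise similarity distribution

def cross_layer_distillation(shallow_feats, deep_feats, tau=0.1):
    teacher = similarity_log_probs(shallow_feats, tau).exp().detach()  # shallow layer = teacher
    student = similarity_log_probs(deep_feats, tau)                    # deep layer = student
    return F.kl_div(student, teacher, reduction="batchmean")
```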
3.6 Parameter-Selective Class Unlearning
For class unlearning with limited data, identifying class-relevant parameters via gradient prominence and selectively updating only those while freezing all others allows targeted, efficient output-layer unlearning (Singh et al., 2022).
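An illustrative sketch of the selection step appears below; the quantile threshold and the name `select_class_relevant_params` are assumptions rather than the ERwP recipe, but the mechanism is the same: rank parameters by gradient magnitude on forget-class data and leave only the most prominent ones trainable.

```python
import torch

def select_class_relevant_params(model, forget_loader, loss_fn, quantile=0.99, device="cpu"):
    model.zero_grad()
    for x, y in forget_loader:
        loss_fn(model(x.to(device)), y.to(device)).backward()   # accumulate forget-class gradients
    masks = {}
    for name, p in model.named_parameters():
        if p.grad is None:
            masks[name] = torch.zeros_like(p, dtype=torch.bool)
            continue
        g = p.grad.abs()
        masks[name] = g >= torch.quantile(g.flatten(), quantile)  # True = class-relevant
    model.zero_grad()
    return masks

# During unlearning, zero the gradients of non-selected parameters before each optimizer step:
#   for name, p in model.named_parameters():
#       if p.grad is not None:
#           p.grad *= masks[name]
```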
| Method/Class | Architectural Principle | Memory | Output Calibration | Freeze Past Heads | Explicit Unlearning |
|---|---|---|---|---|---|
| ICE / Partial Heads (Liu et al., 2023, Bobiev et al., 2021) | Modular, disjoint classifiers | Optional | Logit constraints | Yes | No |
| DKVB (Shah et al., 2023) | Discrete sparse bottleneck | None | N/A | N/A | Yes |
| FRCR (Huang et al., 2024) | Fixed/orthogonal heads | None | Implicit | Yes | No |
| Margin Dampening + CSC (Pomponi et al., 2024) | Soft output constraints + gated cascading | Optional | Margin + KD | Yes (in head) | No |
| ERwP (Singh et al., 2022) | Parameter-selective retraining | Few-shot | KD on retained | Partial | Yes |
4. Empirical Insights, Benchmarking, and Theoretical Guarantees
Multiple studies provide quantitative evidence of these mechanisms:
- Partial head methods on CIFAR-100 with a modest memory budget achieve gains of up to $15$ pp over strong replay baselines; ablations confirm that freezing each per-task classifier is essential for minimizing shallow forgetting (Bobiev et al., 2021).
- ICE-O reaches within $2$–$4$ F1 points of joint-training upper bounds and outperforms rehearsal methods by a sizable absolute-F1 margin in early sessions on information extraction benchmarks (Liu et al., 2023).
- FRCR substantially reduces average maximum forgetting on 5-split-MNIST, with robust accuracy gains over EWC and stable SGD (Huang et al., 2024).
- DKVB masking drives forget-class accuracy to near zero (i.e., full suppression of the class) on CIFAR-10/100/LACUNA-100 in $1.6$–$14$ seconds, with minimal retention-accuracy loss, compared to hundreds of seconds for SCRUB (Shah et al., 2023).
- Margin Dampening + CSC improves accuracy over strong buffer-based methods by $5$–$10$ points while matching or improving backward transfer (BWT) (Pomponi et al., 2024).
- Class-unlearning via ERwP drives forget-class accuracy to zero (FA $= 0$) while keeping retained-class accuracy close to that of the original model on CIFAR-100 ResNet-56, requiring only $10$ epochs and a small fraction of the data (Singh et al., 2022).
5. Replay Efficiency, Buffer Artifacts, and Calibration
Analysis within the Neural Collapse framework reveals a replay efficiency gap: even minimal buffer replay suffices for retaining linearly separable representations, but shallow forgetting (output-level performance) remains high unless large buffers are used (Lanzillotta et al., 8 Dec 2025). This arises because estimator variance and rank-deficiency in small buffers distort the mean/covariance structure, undermining the head’s capacity to recover boundaries, even when deep features remain informative.
Several techniques correct for statistical artifacts:
- Covariance regularization to stabilize rank-deficient buffer estimates (see the sketch after this list),
- Mean-norm calibration to counteract "North-Star" inflation,
- Feature subspace augmentation,
- Proxy-anchored or ETF-based classifiers to ensure population-level alignment.
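A minimal sketch of the first two corrections, under assumed shrinkage and normalization rules (not the papers' exact calibration), is shown below: the rank-deficient buffer covariance is shrunk toward the identity and inflated class means are rescaled to a common norm before the head is refit.

```python
import torch

def regularized_class_stats(buffer_feats, buffer_labels, shrinkage=0.1, target_mean_norm=1.0):
    stats = {}
    for c in buffer_labels.unique():
        fc = buffer_feats[buffer_labels == c]                 # [n_c, d], with n_c possibly << d
        mu = fc.mean(dim=0)
        centered = fc - mu
        cov = centered.t() @ centered / max(fc.size(0) - 1, 1)
        cov = (1 - shrinkage) * cov + shrinkage * torch.eye(fc.size(1))  # covariance shrinkage
        mu = mu * (target_mean_norm / (mu.norm() + 1e-8))                # mean-norm calibration
        stats[int(c)] = (mu, cov)
    return stats
```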
6. Limitations and Extensions
While many shallow-forgetting mitigation strategies are effective in frozen-backbone, modular-head settings (e.g., class-incremental learning), extensions to deep continual learning with non-frozen backbones or highly entangled representations require additional regularization (e.g., experience replay, weight consolidation). Trade-offs may emerge in final accuracy if classifier norm constraints are too strict (Huang et al., 2024), and incremental buffer scheduling presents further challenges (Pomponi et al., 2024). Modular approaches depend on suitable bias calibration (e.g., BiC layer) to resolve across-head output scales (Bobiev et al., 2021). DKVB masking currently presumes the existence of a discrete bottleneck, which may not be present in generic architectures (Shah et al., 2023).
Future directions include designing multi-class ETF frames for generalized orthogonality (Huang et al., 2024), combining modular and gating-based heads with parameter-efficient adapters, and refining output-level calibration techniques for deeper continual networks.
7. Synthesis and Prescriptions
The study of shallow classifier-level forgetting demonstrates that:
- Output-layer drift, not loss of deep separability, is the dominant source of catastrophic forgetting in many class-incremental scenarios (Lanzillotta et al., 8 Dec 2025, Davari et al., 2022),
- Modular/divided classifier architectures and output-layer constraints can nearly eradicate forgetting without heavy replay or architectural complexity (Liu et al., 2023, Bobiev et al., 2021, Pomponi et al., 2024),
- Discrete bottlenecks and parameter-selective retraining provide near-instant, fully-compliant class erasure for privacy or regulatory compliance (Shah et al., 2023, Singh et al., 2022),
- Buffer-induced statistical artifacts must be explicitly modeled and corrected for robust shallow performance under constrained memory (Lanzillotta et al., 8 Dec 2025).
Best practices include always evaluating with linear probes, calibrating output scales across incremental heads or tasks, maintaining high-capacity/robust backbones, and leveraging soft or modular classifier designs to achieve stability-plasticity balance.