
Class-Continual Learning Experiments

Updated 22 January 2026
  • Class-continual learning is a framework where models learn new classes sequentially, using replay buffers, knowledge distillation, and regularization to mitigate catastrophic forgetting.
  • Experimental protocols employ disjoint class splits on datasets like CIFAR-100 and TinyImageNet, evaluating performance without task identifiers and under realistic constraints.
  • Algorithmic paradigms such as memory-based replay, gradient projection, and contrastive methods provide practical strategies to balance acquiring new knowledge with retaining previous information.

Class-continual learning (class-incremental learning, class-IL) investigates the sequential adaptation of models to streams of classification tasks where new classes are introduced incrementally. The objective is to enable models to learn newly arriving classes while maintaining performance on all previously encountered classes, without access to the full data history. Catastrophic forgetting, where performance on earlier classes degrades upon learning new ones, is the central challenge. Research in this field develops methods, benchmarks, and evaluation metrics for addressing this challenge under increasingly realistic constraints, with a significant body of work emerging on arXiv and related venues.

1. Experimental Protocols and Benchmarking Paradigms

Most class-IL research adopts a sequential task protocol where, at each step $t$, a model receives data from a disjoint set of classes $C_t$, with $C_t \cap C_{t'} = \emptyset$ for $t \neq t'$. After each step, the learner must classify among all observed classes $\bigcup_{s \le t} C_s$, operating without explicit task identifiers at inference time.
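This protocol can be sketched in a few lines. The following minimal illustration (class counts and the shuffling seed are placeholders, not tied to any specific benchmark) partitions class IDs into disjoint tasks and tracks the growing evaluation set:

```python
import random

def make_class_incremental_splits(num_classes=100, classes_per_task=10, seed=0):
    """Partition class IDs into disjoint, equally sized tasks (class-IL protocol)."""
    rng = random.Random(seed)
    classes = list(range(num_classes))
    rng.shuffle(classes)
    return [classes[i:i + classes_per_task]
            for i in range(0, num_classes, classes_per_task)]

splits = make_class_incremental_splits()

# Disjointness: no class ID appears in two tasks.
assert sum(len(s) for s in splits) == len({c for s in splits for c in s})

# After task t, the model must classify among the union of all classes seen so far.
seen = set()
for t, task_classes in enumerate(splits):
    seen |= set(task_classes)
```

With a 100-class dataset such as CIFAR-100 this yields the common 10-task split; after the final task, `seen` covers all classes, and evaluation uses no task identity.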

Representative datasets and splits include disjoint class partitions of CIFAR-100 and TinyImageNet, as well as fine-grained benchmarks such as CUB-200, Stanford Cars, and Caltech-256.

Protocols vary along several axes:

  • Strictness of constraints: Whether a memory buffer is allowed (replay), whether any exemplars or features from old tasks can be stored, constant model capacity, presence/absence of pretraining (Hu et al., 2023, Cotogni et al., 2022, Rymarczyk et al., 2023).
  • Task structure realism: RealCL evaluates in non-uniform, purely random class-incremental streams (Nasri et al., 2024); OTFCL (Online Task-Free CL) removes precise task boundaries (Dong et al., 2024).
  • Class repetition: Class-Incremental with Repetition (CIR) allows class reappearance in later tasks and the use of unlabeled external data (Kim et al., 18 Aug 2025).

Evaluation: At the end of each task or task sequence, models are evaluated on the union of all test sets seen so far, with no information about task identity.

2. Algorithmic Paradigms

Techniques for class-continual learning fall into several main paradigms, with numerous methodological innovations in recent literature:

A. Memory-based Replay

Models maintain a buffer (of data, features, or prototypes) and replay past examples during new-task learning, typically interleaving stored samples with current-task batches. Strategies differ in how buffer examples are selected, balanced across classes, and replayed (Li et al., 2023).
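A standard buffer-management strategy is reservoir sampling, which keeps a uniform random sample of the stream in a fixed-size memory. A minimal sketch (capacity and the `(input, label)` item format are illustrative, not from any particular paper):

```python
import random

class ReservoirBuffer:
    """Fixed-size replay buffer filled by reservoir sampling over a data stream."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.data = []
        self.n_seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        self.n_seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)
        else:
            # Replace a stored item with probability capacity / n_seen, which
            # keeps every stream element equally likely to remain in memory.
            j = self.rng.randrange(self.n_seen)
            if j < self.capacity:
                self.data[j] = example

    def sample(self, k):
        """Draw a replay minibatch (without replacement) from the buffer."""
        return self.rng.sample(self.data, min(k, len(self.data)))

buf = ReservoirBuffer(capacity=200)
for x in range(10_000):
    buf.add((x, x % 10))  # (input, label) placeholder items
replay_batch = buf.sample(32)
```

During training on task $t$, the replay minibatch would be mixed into each gradient step alongside current-task data.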

B. Regularization-based Methods

These methods introduce penalties that discourage drift in parameters critical for previous tasks, constraining updates so that weights important for old classes change slowly while less important weights remain free to adapt.
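A representative instance is an EWC-style quadratic penalty anchored at the previous-task parameters and weighted by a per-parameter importance estimate. A sketch in plain Python (in practice `anchor` and `importance` would come from the model after the previous task, e.g. via a Fisher-information estimate; the toy values here are illustrative):

```python
def ewc_penalty(params, anchor, importance, lam=1.0):
    """Quadratic drift penalty: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    params:     current parameter values theta
    anchor:     parameter values theta* after the previous task
    importance: per-parameter importance weights F (e.g. Fisher diagonal)
    """
    return 0.5 * lam * sum(
        f * (p - a) ** 2
        for p, a, f in zip(params, anchor, importance)
    )

# Toy example: the first parameter is 4x more important than the second,
# and only the first has drifted from its anchor.
penalty = ewc_penalty(params=[1.5, 2.0], anchor=[1.0, 2.0], importance=[4.0, 1.0])
# 0.5 * (4.0 * 0.5**2 + 1.0 * 0.0**2) = 0.5
```

The penalty is added to the new-task loss, so highly important parameters pay a steep cost for moving away from their anchored values.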

C. Gradient and Subspace Methods

These methods orthogonalize parameter updates with respect to subspaces identified as important for previous tasks, so that new-task gradient steps interfere minimally with stored knowledge (e.g., CGP; Chen et al., 2023).
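The core operation can be sketched with NumPy: given an orthonormal basis `M` of the subspace protected for old tasks, the new-task gradient is projected onto its orthogonal complement before the update (a GPM-style sketch; the 3-dimensional toy basis is a placeholder for a basis extracted from past-task representations):

```python
import numpy as np

def project_orthogonal(grad, basis):
    """Remove the component of grad lying in span(basis).

    basis: (d, k) matrix with orthonormal columns spanning the protected
           subspace. Returns g - M (M^T g), orthogonal to every basis vector.
    """
    return grad - basis @ (basis.T @ grad)

# Toy example in R^3: protect the x-axis direction.
M = np.array([[1.0], [0.0], [0.0]])      # single orthonormal column
g = np.array([3.0, 2.0, -1.0])           # raw new-task gradient
g_proj = project_orthogonal(g, M)        # x-component removed: [0, 2, -1]
```

Updating with `g_proj` instead of `g` leaves activations in the protected subspace unchanged to first order, which is the mechanism these methods use to limit interference.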

D. Feature-Space and Contrastive Methods

Leverage embedding alignment and contrastive learning:

  • Feature Propagation & Contrastive Alignment: Replay and alignment of feature extractors via contrastive losses and embedding propagation (Han et al., 2021).
  • Diversified Feature Augmentation: MOCA creates intra-class diversity via perturbations, both model-agnostic and model-aware, to mitigate collapse (Yu et al., 2022).
  • Specialized Video CIL: Temporal consistency loss prepares representations for under-sampled and downsampled video frames (Villa et al., 2022).
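The contrastive-alignment idea underlying these methods can be illustrated with a supervised contrastive loss over normalized embeddings, where same-class samples act as positives. A NumPy sketch (the temperature, toy features, and labels are illustrative, not taken from the cited papers):

```python
import numpy as np

def sup_con_loss(features, labels, tau=0.1):
    """Supervised contrastive loss over L2-normalized features.

    For each anchor, positives are other samples sharing its label; the loss
    pulls same-class embeddings together and pushes different classes apart.
    """
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = z @ z.T / tau
    n = len(labels)
    total, count = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        pos = [j for j in others if labels[j] == labels[i]]
        if not pos:
            continue  # anchors with no positive are skipped
        log_denom = np.log(np.exp(sim[i, others]).sum())
        total += sum(log_denom - sim[i, j] for j in pos) / len(pos)
        count += 1
    return total / max(count, 1)

labels = [0, 0, 1, 1]
tight = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])  # clustered
mixed = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])  # scrambled
loss_tight = sup_con_loss(tight, labels)
loss_mixed = sup_con_loss(mixed, labels)  # much larger: classes overlap
```

The loss is near zero when classes form tight, separated clusters and grows when class embeddings intermix, which is the collapse these methods aim to prevent across incremental steps.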

E. Exemplar-Free and Interpretability-Preserving Approaches

  • ICICLE: Exemplar-free, interpretable class-incremental learning with interpretability regularization, prototype proximity-based initialization, and logit bias compensation (Rymarczyk et al., 2023).
  • Gated Class Attention: Parameter-masking mechanisms in ViTs for class-IL without exemplars, plus drift compensation (Cotogni et al., 2022).

F. Exemplar-Free/Fully Unsupervised Methods

  • I²CANSAY: Non-exemplar, task-free CL with memoryless online adaptation using inter-class analogical pseudo-features and per-class significance reweighting (Dong et al., 2024).

G. Knowledge Distillation and External Data Use

  • Multi-level knowledge distillation: Models maintain EMA snapshots for feature/logit distillation on unlabeled/external data, with dynamic SSL regularization and class-repetition (Kim et al., 18 Aug 2025).
  • Discriminative distillation: Additional loss for similar (confused) class pairs (Zhong et al., 2021).
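The logit-distillation component common to these methods can be illustrated with a temperature-scaled KL term between a frozen (e.g., EMA snapshot) teacher's logits and the student's logits. A NumPy sketch (the temperature and toy logits are illustrative):

```python
import numpy as np

def softmax(x, tau):
    """Temperature-softened softmax (numerically stabilized)."""
    e = np.exp((x - x.max()) / tau)
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, tau=2.0):
    """KL(teacher || student) on softened distributions, scaled by tau^2
    so the gradient magnitude stays comparable across temperatures."""
    p = softmax(teacher_logits, tau)   # soft targets from the old/EMA model
    q = softmax(student_logits, tau)
    return tau ** 2 * float(np.sum(p * (np.log(p) - np.log(q))))

teacher = np.array([2.0, 0.5, -1.0])
loss_same = distill_loss(teacher, teacher)              # identical logits: 0
loss_diff = distill_loss(np.array([0.0, 2.0, 1.0]), teacher)  # penalized drift
```

Minimizing this term alongside the new-task cross-entropy keeps the student's predictions on old classes close to the snapshot's, which is how distillation-based methods curb forgetting.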

3. Key Metrics and Their Role

Several metrics have evolved to expose both average and worst-case behavior of continual learners:

  • Average Accuracy (AAC / ACC): Mean accuracy across all classes seen so far, after each task and finally (Hu et al., 2023, Chen et al., 2023).
  • Backward Transfer (BWT): Measures forgetting (negative values) or positive transfer to earlier tasks: $\mathrm{BWT} = \frac{1}{T-1}\sum_{i=1}^{T-1}(A_{T,i} - A_{i,i})$ (Chen et al., 2023).
  • Forgetting (FGT / AF): Drop in per-task accuracy since the task was first learned (Han et al., 2021, Michel et al., 2023).
  • Minimal Incremental Class Accuracy (MICA): Worst-class accuracy after each task; critical for industrial and safety-sensitive applications (Abbas et al., 2024).
  • Weighted Aggregate MICA (WAMICA): A stability–fairness scalar penalizing large swings in classwise minima (Abbas et al., 2024).
  • Rescaled Accuracy (RAA) and Forgetting (RAF): Normalize for task difficulty, correcting misleading impressions due to growing class set size (Michel et al., 2023).
  • Supervised/unsupervised class-based metrics: For methods without access to labels during replay, cluster alignment, and proxy-based accuracy are used (Aghasanli et al., 9 Apr 2025).

A key insight is that average accuracy can systematically overestimate true performance, masking the presence of unlearned classes or especially poor worst-case class performance (Abbas et al., 2024). Adoption of MICA, WAMICA, and rescaled metrics addresses this oversight and provides a more rigorous assessment.
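Several of these metrics derive from a single accuracy matrix $A$, where $A_{t,i}$ is the accuracy on task $i$ after training through task $t$. A sketch of the standard computations (the toy matrix is illustrative; class-level metrics such as MICA would additionally require per-class accuracies):

```python
import numpy as np

def continual_metrics(A):
    """A: (T, T) lower-triangular matrix, A[t, i] = accuracy on task i after task t."""
    T = A.shape[0]
    acc = A[T - 1].mean()                                   # final average accuracy
    bwt = np.mean([A[T - 1, i] - A[i, i] for i in range(T - 1)])        # BWT
    fgt = np.mean([A[:, i].max() - A[T - 1, i] for i in range(T - 1)])  # forgetting
    return acc, bwt, fgt

A = np.array([[0.9, 0.0, 0.0],
              [0.7, 0.8, 0.0],
              [0.6, 0.7, 0.8]])
acc, bwt, fgt = continual_metrics(A)
# acc = 0.70, bwt = -0.20 (net forgetting), fgt = 0.20
```

Note that `acc` here averages over tasks, which is exactly the quantity that can mask a single collapsed class; worst-case metrics like MICA instead take the minimum over per-class accuracies.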

4. Controlled Studies, Ablations, and Model Insights

Ablation and sensitivity analysis play a crucial role in evaluating the contributions of individual components:

  • Time-aware regularization ablations: Show improved sample quality and accuracy only if both α, β decay dynamically by class age (Hu et al., 2023).
  • Gradient basis refinement and contrastive regularization: Each adds significant accuracy and reduces forgetting (CGP) (Chen et al., 2023).
  • Replay buffer size and class-balance sensitivity: Adaptive replay and entropy balancing maintain performance even with smaller buffers and imbalanced classes (Li et al., 2023).
  • Discriminative and feature-space distillation ablations: Demonstrate that targeted separation of confused class pairs further shrinks error rates vs. generic distillation alone (Zhong et al., 2021).
  • Unsupervised prototype and pseudo-feature selection: Eliminating label dependence from the buffer or replay yields only modest accuracy loss, and can surpass standard replay-based baselines in some domains (Aghasanli et al., 9 Apr 2025, Dong et al., 2024).
  • Interpretability regularization and prototype initialization strategies: Simultaneous deployment reduces concept drift and enhances both accuracy and explanatory consistency in CUB-200 class splits (Rymarczyk et al., 2023).

A consistent theme is that strategies combining plasticity (for new-class acquisition) with adaptive regularization, memory diversity, and explicit intra- or inter-class separation yield the most robust, stable performance across realistic incremental learning streams.

5. Open Challenges and Trends

Major open challenges and trends in class-continual learning experiments include:

  • Exemplar-free and privacy-preserving methods: Driven by privacy and resource constraints, research increasingly targets exemplar-free (or fully label-free) replay and representation methods, with mechanisms such as pseudo-feature generation, generative replay, and compressed feature buffering (Hu et al., 2023, Rymarczyk et al., 2023, Aghasanli et al., 9 Apr 2025, Dong et al., 2024).
  • Realism in experimental design: More complex scenarios, such as RealCL (random/imbalanced class streams) (Nasri et al., 2024), CIR (class-incremental with repetition and large external unlabeled data) (Kim et al., 18 Aug 2025), and concept drift adaptation (reactive subspace buffers) (Korycki et al., 2021), expose the limitations of methods tuned for idealized, task-structured benchmarks.
  • Evaluation best practices: There is a continuing shift away from reporting only average accuracy, with newer work rigorously tracking worst-case accuracy, fairness across classes, and proper normalization for expanding class sets (Abbas et al., 2024, Michel et al., 2023).
  • Modular, plug-and-play components: Model-agnostic regularization (e.g., MOCA), orthogonalization (CGP, GPM), and interpretability-aware layers (ICICLE, Gated Class-Attention) are being adopted to dissociate stability–plasticity tradeoffs from dataset specifics (Yu et al., 2022, Chen et al., 2023, Cotogni et al., 2022).

6. Domain-Specific and Multi-Domain Class-Continual Learning

Recent research evaluates class-IL in domains where class definitions, data distributions, and dynamics differ substantially from standard vision tasks:

  • Medical imaging CL: Scenario-based evaluation reveals that hybrid approaches—memory-based replay with regularization—reduce forgetting and perform robustly in inter-hospital, cross-specialty, and intra-specialty shifts (Singh et al., 2023).
  • Video CL: Temporal frame-subsampling, memory constraints, and action-class imbalances create unique continual learning bottlenecks, addressed by temporal consistency losses and frame-level replay (Villa et al., 2022).
  • Fine-grained and high-class-count domains: Datasets such as CUB-200, Stanford Cars, and Caltech-256 are used with interpretable and prototype-based methods, requiring new regularization strategies for both performance and explanation stability (Rymarczyk et al., 2023, Aghasanli et al., 9 Apr 2025).

This diversification of domain and protocol emphasizes the need for flexible, general-purpose class-IL methodology and rigorous, task-structure-agnostic evaluation.


These converging advances collectively define the state of class-continual learning experiments as of 2026: an active field characterized by methodological breadth, increasingly realistic benchmarks, and a robust, metric-driven evaluation culture.
