
Class-Incremental Learning

Updated 15 July 2025
  • Class-incremental learning is a continual learning paradigm where models train on sequential classification tasks without accessing all past data.
  • It employs methods like regularization, rehearsal, and bias correction to balance between retaining old knowledge and learning new classes.
  • Recent advancements use synthetic data and prototype learning to overcome catastrophic forgetting and improve performance in dynamic environments.

Class-incremental learning (CIL) is a continual learning paradigm in which a model must learn a sequence of classification tasks, each introducing new classes, while retaining performance on previously learned classes—without being provided task identifiers at test time or jointly accessing all previously encountered data. This scenario is central to real-world deployments in dynamic environments where the number of classes evolves and is complicated by catastrophic forgetting, domain shifts, data imbalance, and constraints on memory or data retention.

1. Core Principles and Challenges

Class-incremental learning requires a balance between stability (retaining knowledge about old classes) and plasticity (rapidly learning new classes). The major challenge is catastrophic forgetting, i.e., a severe reduction in performance on previously learned classes when the model acquires new information. In CIL, the model must:

  • Discriminate among all previously seen classes at inference, without relying on a task or phase identifier.
  • Achieve this with limited or no access to data from prior classes (due to privacy, storage, or computational constraints).
  • Address associated phenomena such as weight drift, activation drift, bias toward recent classes (task–recency bias), and inter-task confusion (2010.15277).

The stability-plasticity dilemma is fundamental—methods must preserve old knowledge while allowing efficient integration of new classes without excessive retraining or storage.

2. Methodological Taxonomy

CIL solutions can be broadly categorized as follows (2010.15277, 2011.01844):

  1. Regularization-Based Methods. These approaches introduce additional loss terms, such as:
    • Weight regularization (e.g., Elastic Weight Consolidation, Memory Aware Synapses), which penalizes updates to parameters deemed important for prior tasks.
    • Activation or output regularization (e.g., Learning without Forgetting, Less-forgetting Learning), which uses knowledge distillation to maintain previous activations.
  2. Rehearsal or Exemplar-Based Approaches. A small buffer stores exemplar samples for each class. When new classes arrive, both the new data and these exemplars are used for joint training, typically combined with knowledge distillation. Exemplar selection strategies include random sampling and herding (which selects samples close to class means) (2011.01844).
  3. Bias Correction and Output Rectification. Approaches like BiC and IL2M introduce bias correction layers or statistics-driven rectification of the prediction logits to mitigate the tendency of CIL models to favor recently learned classes due to data imbalance.
  4. Architectural Approaches. These include fixed classifiers (with weights pre-allocated as regular polytopes), component-wise modularization, and classifier expansion strategies (2010.08657).
  5. Experience Replay and Generative Replay. These methods either store past examples (experience replay) or regenerate pseudo-exemplars using generative models.
  6. Novel Class Detection and Incremental Expansion. Recent work in open-world CIL formalizes the tasks of unknown class detection and subsequent model expansion, introducing curriculum clustering, prototype-based representations, and robust regularization (2008.13351).
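
As a concrete instance of the regularization family, the EWC-style quadratic penalty can be sketched as follows (a minimal NumPy sketch; the toy parameter values are hypothetical):

```python
import numpy as np

def ewc_penalty(theta, theta_old, fisher, lam=1.0):
    """Quadratic EWC penalty: (lam/2) * sum_i F_i * (theta_i - theta*_i)^2.

    theta:     current parameters
    theta_old: parameters after the previous task
    fisher:    per-parameter importance (diagonal Fisher estimate)
    """
    return 0.5 * lam * np.sum(fisher * (theta - theta_old) ** 2)

# Parameters with high Fisher importance are penalized more for drifting.
theta_old = np.array([1.0, -2.0, 0.5])
theta     = np.array([1.5, -2.0, 1.5])
fisher    = np.array([10.0, 10.0, 0.1])  # first two weights matter for old tasks
print(ewc_penalty(theta, theta_old, fisher))
```

The third parameter moves twice as far as the first, yet contributes far less to the penalty because its importance weight is low.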

3. Model Update Mechanisms

CIL model updates proceed incrementally as follows (2011.01844, 2010.15277):

  • The model is updated for each batch of new classes, using only their data and (if allowed) a limited set of exemplars from previous classes.
  • Loss functions typically blend a classification loss on the current batch with a distillation loss preserving outputs (or intermediate features) on earlier classes.
  • Additional mechanisms such as bias correction, exemplar memory management (e.g., herding), and selective feature consolidation are often incorporated.
  • In some settings, model parameters may be partitioned—e.g., frozen feature extractor + trainable classifier head (see Section 4).

Formally, the update at incremental step $t$ may be written as

$$L = \lambda \, L^{\text{cls}} + (1 - \lambda) \, L^{\text{distill}}$$

where $L^{\text{cls}}$ is a cross-entropy loss over all classes seen by step $t$, and $L^{\text{distill}}$ is a distillation loss (e.g., KL divergence) aligning old predictions/features.
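
A minimal sketch of this blended objective, assuming softmax outputs and a forward-KL distillation term over the old classes (papers differ in KL direction, temperature scaling, and how the old-class probabilities are normalized):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def blended_loss(logits, label, old_logits, lam=0.5):
    """L = lam * L_cls + (1 - lam) * L_distill.

    logits:     current model outputs over all classes seen by step t
    old_logits: previous model's outputs over the old classes only
    """
    l_cls = -np.log(softmax(logits)[label])            # cross-entropy, all classes
    k = len(old_logits)
    p_old, p_new = softmax(old_logits), softmax(logits[:k])
    l_distill = np.sum(p_old * np.log(p_old / p_new))  # KL(old || new)
    return lam * l_cls + (1 - lam) * l_distill
```

When the current model reproduces the old model's distribution on the old classes, the distillation term vanishes and only the classification term remains.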

4. Knowledge Preservation Strategies

Exemplar Management and Replay

A widely adopted mechanism for knowledge preservation is rehearsal—retaining a fixed memory buffer of representative exemplars for previous classes. Effective exemplar selection strategies like herding aim to approximate the class mean within the feature space, which has been shown to enhance old class retention under memory constraints (2011.01844).
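
Herding can be sketched as a greedy selection whose running mean tracks the class mean in feature space (a minimal NumPy sketch, not any particular paper's implementation):

```python
import numpy as np

def herding_select(features, m):
    """Greedy herding: pick m samples whose running mean best tracks the class mean."""
    mu = features.mean(axis=0)
    selected, total = [], np.zeros_like(mu)
    for k in range(1, m + 1):
        # Candidate running means if each remaining sample were added next.
        gaps = np.linalg.norm(mu - (total + features) / k, axis=1)
        gaps[selected] = np.inf          # never pick the same sample twice
        i = int(np.argmin(gaps))
        selected.append(i)
        total += features[i]
    return selected
```

Because each pick minimizes the distance between the running mean of the buffer and the class mean, the exemplar set approximates the class prototype even under tight memory budgets.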

Distillation and Feature-Level Constraints

Knowledge distillation is prevalent; outputs or intermediate features of the current model are encouraged to match those of the previous model (2012.08129, 2204.00895). Feature-graph preservation and adaptive feature consolidation extend this by regularizing changes in feature representations, weighted according to their importance as estimated from their sensitivity to loss changes (2204.00895). Such adaptive strategies have been shown to mitigate the loss increases caused by model updates.

Output and Representation Bias Correction

Output bias, especially against old classes, is addressed with scaling or offset corrections at the classifier output (BiC (2010.15277)), or by normalization techniques such as rectified cosine normalization (2012.08129).
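
The BiC-style correction amounts to rescaling only the new-class logits with two scalars learned on a small balanced validation set; a minimal sketch (the toy values are illustrative):

```python
import numpy as np

def bic_correct(logits, new_class_mask, alpha, beta):
    """BiC-style rectification: apply a learned affine map (alpha, beta)
    to the new-class logits only, leaving old-class logits untouched."""
    out = logits.copy()
    out[new_class_mask] = alpha * out[new_class_mask] + beta
    return out

# With alpha < 1, the recency bias toward newly added classes is damped.
logits = np.array([1.0, 1.2, 3.0, 3.5])        # classes 2-3 are newly added
mask = np.array([False, False, True, True])
print(bic_correct(logits, mask, alpha=0.6, beta=0.0))
```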

Frozen Feature Extractors and Prototype Learning

Recent exemplar-free CIL (EFCIL) research demonstrates that freezing the feature extractor after a diverse initial training phase (ideally on a wide class set) and updating only the classifier head significantly reduces catastrophic forgetting and computational cost (2404.03200). To counteract poor generalization when the initial class set is small, text-to-image diffusion models can generate synthetic samples of “future” classes for feature extractor pre-training, a strategy shown to outperform both random real images and less diverse synthetic data (2404.03200).

Prototype learning—where each class is represented by a prototype vector in latent space—offers another way to maintain decision boundaries without retaining raw data (2308.02346). When the encoder is fixed (preferably via self-supervised pre-training to maximize intrinsic dimensionality of the feature space), learning new prototypes and using stop-gradient prevents old class drift and classifier distortion.
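
A nearest-prototype classifier illustrates why adding a class does not disturb old decision boundaries: each new class contributes one vector, and existing prototypes are untouched (a minimal sketch with hypothetical 2-D prototypes):

```python
import numpy as np

def prototype_predict(x, prototypes):
    """Assign x to the class whose prototype (e.g., class mean) is nearest in feature space."""
    labels = sorted(prototypes)
    dists = [np.linalg.norm(x - prototypes[c]) for c in labels]
    return labels[int(np.argmin(dists))]

# Incremental expansion: a new class only adds one prototype vector.
protos = {0: np.array([0.0, 0.0]), 1: np.array([5.0, 5.0])}
protos[2] = np.array([0.0, 5.0])               # new class arrives
print(prototype_predict(np.array([0.5, 4.0]), protos))
```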

5. Advancements in Specialized CIL Settings

Audio-Visual CIL

In multi-modal video CIL, where both audio and visual streams are available, joint learning must preserve not only intra-modal but also inter-modal semantic correlations. The AV-CIL approach enforces both instance-level and class-level constraints between audio and visual features, alongside attention distillation to preserve cross-modal mappings as classes are added (2308.11073). This framework is evaluated on the newly introduced AVE-CI, K-S-CI, and VS100-CI datasets.

Multi-Label and Open-World CIL

For multi-label classification, where each instance may have multiple class labels (often with class overlap and imbalance), CIL systems must preserve high-dimensional output consistency and avoid exclusive dependence on problem-specific constraints. The combination of cosine similarity-based and KL-divergence based distillation losses is effective in these settings, achieving minimal performance degradation over incremental phases (2401.04447).
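
One way to combine the two distillation terms is a weighted sum of a cosine-similarity feature term and a KL output term (a simplified single-vector sketch; the cited work defines its own multi-label formulation and weighting):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def combined_distill(old_feat, new_feat, old_logits, new_logits, w=0.5):
    """Weighted sum of a cosine feature-alignment term and a KL output term."""
    cos = np.dot(old_feat, new_feat) / (
        np.linalg.norm(old_feat) * np.linalg.norm(new_feat))
    l_cos = 1.0 - cos                                  # 0 when features align
    p, q = softmax(old_logits), softmax(new_logits)
    l_kl = np.sum(p * np.log(p / q))                   # 0 when outputs match
    return w * l_cos + (1 - w) * l_kl
```

Both terms vanish when the updated model exactly reproduces the old model's features and outputs, so the loss directly measures drift across incremental phases.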

Open-world CIL further requires detection of novel (unseen) classes and on-the-fly model expansion, motivating curriculum clustering, prototype-based regularization, and decoupled intra- and inter-class embedding objectives (2008.13351).

Synthetic Data for Pre-training and Replay

Building on advances in generative diffusion models, synthetic images can be generated for unseen classes and used for either initial feature extractor training or as replay data during incremental phases. Strategies that employ text-to-image models for anticipated “future” classes have demonstrated substantial performance improvements in EFCIL scenarios over real auxiliary datasets, especially when diversity in synthetic samples is emphasized (2404.03200, 2306.17560).

Task Prediction via Likelihood Ratio

In traditional CIL, the absence of a task identifier at inference poses task prediction challenges. The TPL (Task-id Prediction based on Likelihood Ratio) method frames task prediction as a Neyman–Pearson likelihood ratio test between the probability of a sample under the current task’s distribution and the union of the others. This score, when combined with Mahalanobis and KNN-based density estimators, robustly solves this problem and achieves negligible catastrophic forgetting (2309.15048).
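
The likelihood-ratio idea can be illustrated with per-task diagonal Gaussians standing in for TPL's Mahalanobis/KNN density estimators (a toy sketch, not the published method):

```python
import numpy as np

def log_gauss(x, mu, var):
    """Log-density of a diagonal Gaussian (Mahalanobis-style distance plus normalizer)."""
    return -0.5 * np.sum((x - mu) ** 2 / var + np.log(2 * np.pi * var))

def predict_task(x, task_stats):
    """Pick the task whose likelihood ratio against the best alternative is largest."""
    logps = np.array([log_gauss(x, mu, var) for mu, var in task_stats])
    # Log-ratio of each task's density to the strongest competing task.
    scores = [lp - max(np.delete(logps, t)) for t, lp in enumerate(logps)]
    return int(np.argmax(scores))
```

A sample drawn near one task's class distribution yields a large positive log-ratio for that task and negative ratios for the rest, so the task identity can be recovered without being given at inference.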

Active Selection and Class Balance

Label-efficient CIL methods are moving from random or few-shot sample annotation toward active selection from large unsupervised pools (Active CIL or ACIL). The CBS (Class-Balanced Selection) strategy clusters the feature space and greedily selects samples to minimize the KL divergence between the selected and unlabeled feature distributions within each cluster. This procedure ensures class balance and informativeness, leading to better downstream classification performance and robustness to class imbalance, particularly when integrated with prompt-based pretrained models (2412.06642).
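
The cluster-level balancing idea can be sketched at the histogram level (a simplified stand-in for CBS's per-cluster feature-distribution KL objective; the function names and greedy criterion are illustrative, not the published algorithm):

```python
import numpy as np

def kl(p, q):
    """KL divergence between two discrete distributions (zeros in p are skipped)."""
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def balanced_select(cluster_ids, budget):
    """Greedily draw the budget so the selected cluster histogram tracks the pool's."""
    ids = np.asarray(cluster_ids)
    clusters = np.unique(ids)
    pool = np.array([(ids == c).sum() for c in clusters], float)
    pool /= pool.sum()
    counts = np.zeros(len(clusters))
    picks = []
    for _ in range(budget):
        best, best_kl = None, np.inf
        for j, c in enumerate(clusters):
            if counts[j] >= (ids == c).sum():
                continue                       # cluster exhausted
            trial = counts.copy()
            trial[j] += 1
            d = kl(trial / trial.sum(), pool)
            if d < best_kl:
                best, best_kl = j, d
        counts[best] += 1
        # Take the next unused index from the chosen cluster.
        idx = np.where(ids == clusters[best])[0][int(counts[best]) - 1]
        picks.append(int(idx))
    return picks
```

Each greedy step adds the sample whose cluster most reduces the divergence between the selected and pool histograms, so the annotation budget is spread in proportion to cluster mass rather than concentrated on dominant clusters.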

6. Applications and Evaluation Protocols

CIL methodologies are evaluated across numerous datasets, including CIFAR-100, ImageNet, fine-grained benchmarks (CUB-200, Flowers-102), video datasets (AVE, Kinetics-Sounds, VGGSound), and audio benchmarks (Audioset). Scenarios consider both balanced and imbalanced splits, small and large domain shifts, disjoint and overlapping classes, and resource-constrained regimes. Performance is reported as average incremental accuracy, forgetting rates, and balanced accuracy on old and new classes (2011.01844, 2208.03767). High-performing approaches demonstrate robust mitigation of forgetting, scalability to thousands of classes, and practical applicability in streaming data and privacy-sensitive environments.


This synthesis reviews the principal challenges, methodological divisions, update mechanisms, preservation strategies, specialized settings, and recent advancements characterizing class-incremental learning, as well as the state-of-the-art empirical results and evaluation standards used in the literature.