
Exemplar-Free Class-Incremental Learning

Updated 13 January 2026
  • The paper introduces TACLE, an exemplar-free class-incremental learning framework that mitigates catastrophic forgetting without replaying past exemplars.
  • It employs analytic classifier updates, regularization, synthetic data generation, and adaptive reweighting to balance stability and plasticity.
  • Empirical results on CIFAR100 and ImageNet-100 demonstrate significant accuracy gains under privacy constraints and low-label conditions.

An exemplar-free class-incremental learning (CIL) framework is an approach in which a model learns a sequence of classification tasks, each introducing new classes, without storing or replaying any samples (“exemplars”) from previous tasks. This paradigm addresses long-standing challenges of catastrophic forgetting while adhering to stringent memory, privacy, or legal constraints that prohibit preservation of raw data from past classes. Exemplars are strictly disallowed: only model weights, auxiliary class statistics, or synthetic proxies may be used, making the prevention of forgetting (stability) while learning new concepts (plasticity) an inherently difficult problem. Below, core principles, methodologies, and representative frameworks for exemplar-free CIL are outlined, focusing on advanced contributions such as TACLE (Kalla et al., 2024), as well as recent advances in multimodal, vision, video, and graph-based domains.

1. Problem Foundations and Motivation

Exemplar-free CIL formalizes the scenario where, at each increment (task) $t$ in a sequence of $T$ tasks, one receives only data for new classes $\mathcal{Y}^{(t)}$ and is prohibited from storing data from $\bigcup_{k=1}^{t-1} \mathcal{Y}^{(k)}$. The setting is motivated by applications where privacy or storage constraints forbid rehearsal, such as in medical or federated settings.

Mathematical structure: At task $t$, the model receives:

  • A labeled dataset $\mathcal{D}_L^{(t)} = \{(x_i, y_i) : y_i \in \mathcal{Y}^{(t)}\}$ of size $N_L^t$.
  • An optional (in semi-supervised settings) unlabeled dataset $\mathcal{D}_U^{(t)}$.
  • The model’s prediction space expands to $\mathcal{Y}_{1:t} = \bigcup_{k=1}^t \mathcal{Y}^{(k)}$.

The goal is to maximize accuracy across all classes seen so far, balancing the need to learn new concepts (plasticity) and to prevent overwriting old knowledge (stability), without any form of explicit replay memory.

Catastrophic forgetting emerges acutely under these constraints. Baseline fine-tuning or gradient-based training quickly erases features required for earlier classes. Consequently, a large body of research has developed regularization, analytic updating, synthetic replay, and non-parametric strategies that attempt to approximate stability without exemplars (Kalla et al., 2024, He et al., 2024, Huang et al., 2024, Petit et al., 2023, Sun et al., 2022).

2. Core Methodological Components

The architecture and training strategies of exemplar-free CIL methods differ markedly from exemplar-based systems. Major research themes are:

2.1. Analytic, Ridge, or Recursive Classifier Updates

Gradient-free analytic solutions, such as Recursive Least Squares (RLS) or ridge regression classifiers, propagate new-class statistics via closed-form updates, obviating the need for explicit replay (He et al., 2024, Yang et al., 2024, You et al., 7 Sep 2025). This approach freezes the feature backbone after a base phase, then classifies by updating and growing a linear (or Gaussian) classifier using only compact statistics (mean, covariance) of the new classes.
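As an illustrative sketch (not the exact update of any cited method), a ridge-style analytic classifier over a frozen backbone can be grown incrementally from per-task feature statistics alone; the function names and toy features below are assumptions for illustration:

```python
import numpy as np

def init_analytic_classifier(feat_dim, lam=1e-3):
    # Regularized Gram matrix A and feature-target correlation B;
    # no raw data is ever stored, only these accumulated statistics.
    return lam * np.eye(feat_dim), np.zeros((feat_dim, 0))

def update_analytic_classifier(A, B, feats, labels, num_classes):
    # feats: (n, d) frozen-backbone features for the current task only.
    Y = np.zeros((feats.shape[0], num_classes))
    Y[np.arange(feats.shape[0]), labels] = 1.0
    # Grow the target matrix to cover newly introduced classes.
    if B.shape[1] < num_classes:
        B = np.hstack([B, np.zeros((B.shape[0], num_classes - B.shape[1]))])
    A = A + feats.T @ feats          # accumulate second-moment statistics
    B = B + feats.T @ Y              # accumulate class correlations
    W = np.linalg.solve(A, B)        # closed-form ridge solution
    return A, B, W
```

Because $A$ and $B$ are simple sums, adding a task's statistics yields exactly the classifier that joint ridge regression over all tasks would produce, which is the sense in which such updates avoid forgetting without exemplars.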

2.2. Regularization and Alignment Penalties

Forcing model outputs or internal representations to stay close to their pre-update values is central to many approaches. Typical regularizers include knowledge-distillation penalties that match the current model's logits or features to those of the previous-task model, and parameter-importance constraints that penalize drift in weights deemed critical for earlier classes.

2.3. Synthetic Data Generation

Exemplar-free frameworks sometimes sidestep memory constraints through synthetic data, produced either by self-distilled generative models (e.g., SKD (Ye et al., 2022)) or by text-to-image diffusion pipelines (Jodelet et al., 2024, Meng et al., 2024). Synthetic samples serve as surrogates for real past examples during training, enabling a degree of rehearsal or consolidation despite zero access to the real past data.
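A minimal sketch of this pattern, assuming some generator is available: `sample_synthetic(c, n)` is a hypothetical stand-in for whatever data-free generator or diffusion pipeline produces surrogates for past class `c`; the batch-building logic is generic:

```python
import numpy as np

def build_rehearsal_batch(new_X, new_y, sample_synthetic, old_classes, n_per_class=4):
    # Mix real data for the current task with generated surrogates for
    # each previously seen class; no real past samples are stored.
    xs, ys = [new_X], [new_y]
    for c in old_classes:
        xs.append(sample_synthetic(c, n_per_class))   # hypothetical generator API
        ys.append(np.full(n_per_class, c))
    return np.vstack(xs), np.concatenate(ys)
```

Training on such a batch lets a standard cross-entropy or distillation loss "see" all classes at once, at the cost of depending on generator fidelity (a limitation returned to in Section 7).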

2.4. Augmentation and Manifold Coverage

Augmentation strategies such as rotation, mixup, or more advanced synthetic mixings are deployed to increase the coverage of the learned feature space and ensure class boundaries remain robust as the model is updated for new tasks (Huang et al., 2024, Chen et al., 2023). These augmentations may be combined with 1-NN prototype classification to prevent classifier drift.
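The prototype side of this idea can be sketched as follows (a simplified illustration, not any specific cited method): only one mean feature per class is kept, and augmented views can be appended to the feature set before averaging:

```python
import numpy as np

def update_prototypes(protos, feats, labels):
    # Store only one mean feature vector per class; raw samples are discarded.
    # Augmented views can simply be concatenated into `feats` before this call.
    for c in np.unique(labels):
        protos[int(c)] = feats[labels == c].mean(axis=0)
    return protos

def nearest_prototype(protos, x):
    # 1-NN over class prototypes: predict the class with the closest mean.
    classes = sorted(protos)
    dists = [np.linalg.norm(x - protos[c]) for c in classes]
    return classes[int(np.argmin(dists))]
```

Because old prototypes are never revisited by gradient descent, this classifier cannot drift for past classes; the open risk is instead drift of the feature extractor underneath them.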

2.5. Adaptive Reweighting and Balancing

To prevent bias toward new or majority classes, frameworks often employ class-aware or task-aware loss weighting:

  • Weighting cross-entropy losses by the inverse frequency of each class, with adjustable “rare class” exponent (Kalla et al., 2024)
  • Balancing representation via normalization, temperature reweighting, or prototype translation (Hogea et al., 2024, Kalla et al., 2024)

3. TACLE: Task and Class-aware Semi-Supervised Exemplar-Free CIL

TACLE extends the classic exemplar-free CIL paradigm to the challenging semi-supervised setting, where at each task $t$ only a very small fraction of the new-class examples is labeled and most task data is unlabeled. The innovations of TACLE are:

3.1. Task-Adaptive Threshold for Pseudo-Labeling

TACLE defines a dynamic threshold $\tau_t$ for pseudo-label selection that decays over tasks, formulated as:

$\tau_t = \frac{\alpha}{1 + \exp(\alpha t)} + \beta$

Only unlabeled samples whose maximum predicted class probability exceeds $\tau_t$ are included in the pseudo-labeled set, controlling the plasticity–stability balance (Kalla et al., 2024).
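The threshold and the selection rule can be sketched directly from the formula; the defaults $\alpha = 0.5$, $\beta = 0.65$ follow the hyperparameters reported later in this article, and a task index starting at 0 is an assumption made so that the first threshold comes out near 0.9:

```python
import numpy as np

def task_threshold(t, alpha=0.5, beta=0.65):
    # tau_t = alpha / (1 + exp(alpha * t)) + beta, decaying toward beta
    # as more tasks arrive (looser pseudo-labeling in later increments).
    return alpha / (1.0 + np.exp(alpha * t)) + beta

def select_pseudo_labels(probs, t):
    # Keep only unlabeled samples whose max class probability exceeds tau_t.
    tau = task_threshold(t)
    conf = probs.max(axis=1)
    keep = conf > tau
    return np.where(keep)[0], probs.argmax(axis=1)[keep]
```

Early tasks thus admit only very confident pseudo-labels (protecting stability), while later tasks admit more (restoring plasticity as the label budget stays small).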

3.2. Class-Aware Weighted Cross-Entropy Loss

The supervised and pseudo-labeled samples are used with a weighted cross-entropy loss, up-weighting under-represented classes by:

$w_c = \left( \frac{1}{n_c + \varepsilon} \right)^{\gamma}$

where $n_c$ is the count for class $c$, $\gamma$ adjusts the reweighting degree, and $\varepsilon$ avoids division by zero. This prevents domination by head classes and enhances performance for rare classes.
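A direct sketch of this weighting, with a weighted cross-entropy that scales each sample's loss by the weight of its true class (the helper names are assumptions):

```python
import numpy as np

def class_weights(labels, num_classes, gamma=1.0, eps=1e-8):
    # w_c = (1 / (n_c + eps)) ** gamma for each class c;
    # rarer classes receive larger weights.
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    return (1.0 / (counts + eps)) ** gamma

def weighted_cross_entropy(probs, labels, w):
    # Per-sample negative log-likelihood scaled by the true class's weight.
    nll = -np.log(probs[np.arange(len(labels)), labels] + 1e-12)
    return float(np.mean(w[labels] * nll))
```

Setting $\gamma = 0$ recovers uniform weighting, so $\gamma$ interpolates between standard and fully frequency-inverse cross-entropy.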

3.3. Classifier Alignment/Consistency Regularizer

A consistency loss is applied over two random perturbations $x'$ and $x''$ of each unlabeled sample:

$L_{\text{align}}^{(t)} = \mathbb{E}_{x \in \mathcal{D}_U} \left\| p(\cdot \mid x'; \theta, W) - p(\cdot \mid x''; \theta, W) \right\|_2^2$

This further regularizes the feature space and improves boundary smoothness as new classes are incorporated.
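The loss itself is a squared L2 distance between predicted distributions on two perturbed views, averaged over the batch; a minimal sketch (the perturbations themselves are produced elsewhere and passed in as two sets of logits):

```python
import numpy as np

def softmax(z):
    # Numerically stable row-wise softmax.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def consistency_loss(logits_a, logits_b):
    # Squared L2 distance between the predictive distributions for two
    # perturbed views of each unlabeled sample, averaged over the batch.
    diff = softmax(logits_a) - softmax(logits_b)
    return float(np.mean(np.sum(diff ** 2, axis=1)))
```

The loss is zero exactly when the two views yield identical predictions, so minimizing it flattens the decision function in the neighborhood of each unlabeled point.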

3.4. Unified Objective

At each task, the full loss is:

$L^{(t)}(\theta, W) = L_{\text{sup}}^{(t)} + \lambda_p L_{\text{pseudo}}^{(t)} + \lambda_a L_{\text{align}}^{(t)}$

Hyperparameters $\lambda_p$ and $\lambda_a$ are tuned to control the influence of each term.

Algorithmic Outline (high-level):

  1. Compute task-adaptive threshold $\tau_t$;
  2. Warm-up the model on the few labeled samples;
  3. Alternate batches of labeled and pseudo-labeled unlabeled data, applying weighted cross-entropy and consistency losses;
  4. Optionally reduce the learning rate for parameters associated with earlier classes;
  5. Progress through all tasks without storing any exemplars.
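The outline above can be condensed into a single increment step; this is a structural sketch under stated assumptions, with `predict_proba` standing in for the warmed-up model of step 2 and the TACLE defaults $\alpha = 0.5$, $\beta = 0.65$ assumed, the task index starting at 0:

```python
import numpy as np

def run_task(t, labeled_X, labeled_y, unlabeled_X, predict_proba,
             alpha=0.5, beta=0.65):
    # One increment of the outlined loop: compute the task-adaptive
    # threshold, pseudo-label confident unlabeled samples, and return
    # the combined training set for this task (no stored exemplars).
    tau = alpha / (1.0 + np.exp(alpha * t)) + beta          # step 1
    probs = predict_proba(unlabeled_X)                       # warmed-up model (step 2)
    conf, pseudo = probs.max(axis=1), probs.argmax(axis=1)
    keep = conf > tau                                        # feeds step 3
    X = np.vstack([labeled_X, unlabeled_X[keep]])
    y = np.concatenate([labeled_y, pseudo[keep]])
    return X, y, tau
```

The returned `(X, y)` would then be consumed by the weighted cross-entropy and consistency losses of step 3 before moving to the next task.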

4. Empirical Results and Benchmark Validation

TACLE and related frameworks have demonstrated significant advances on classical vision benchmarks:

| Dataset / Setting | Method | Avg. Cumulative Acc. (%) |
|---|---|---|
| CIFAR100, 10 tasks, 0.8% labeled | SLCA | 63.7 |
| | SLCA + fixed $\tau$ | 88.2 |
| | TACLE | 92.35 |
| ImageNet-100, 1-shot, 20 tasks | SLCA | 59.48 |
| | SLCA + fixed $\tau$ | 61.23 |
| | TACLE | 67.73 |

TACLE improves performance even in highly imbalanced unlabeled data configurations, with accuracy degradation less than 5% (vs. >10% for baselines in skewed head-tail settings). Experiments validate the effectiveness of MoCo v3/ViT backbones, task-adaptive thresholding, and class-aware weighting in extreme low-label and imbalanced regimes (Kalla et al., 2024).

5. Architectural and Implementation Considerations

Common architectural elements include:

  • Large pre-trained backbone networks (ViT, ResNet) for high-quality feature representations;
  • Freezing early layers for initial epochs to stabilize learning in low-data settings;
  • SGD with momentum, adjustable batch sizes, and learning rate decay adapted to each dataset’s scale.

Hyperparameters are tuned for the specific labeled data fraction and backbone. For example, $\alpha = 0.5$, $\beta = 0.65$ yield an initial pseudo-label threshold $\tau_1 \approx 0.9$ decaying to $\tau_T \approx 0.65$. Pseudo-loss and alignment-loss weights are selected from the ranges $[0.5, 1.0]$ and $[0.1, 0.5]$, respectively. Batch sizes and optimizer details are adapted to the memory limits of the task and model.

The framework is compatible with both supervised ImageNet and contrastive MoCo pre-trained weights, with flexibility for subsampling large unlabeled pools for compute reasons.

6. Methodological Landscape and Trade-offs

Exemplar-free CIL comprises a diverse methodological spectrum, spanning analytic classifier updates, regularization and distillation, synthetic replay, augmentation, and prototype-based schemes. These methods vary in their trade-offs:

  • Analytic and prototype-based methods offer strong stability, but may lose some plasticity for fine-grained new classes.
  • Regularization- and distillation-based approaches can interpolate between stability and plasticity but typically require well-calibrated hyperparameters and high-quality backbone initialization.

7. Open Challenges, Limitations, and Future Directions

Current limitations and open avenues in exemplar-free class-incremental learning include:

  • Scalability to very large numbers of incremental tasks, with growing classifier size or memory requirements for storing analytic matrices/statistics;
  • Expressivity of fixed/frozen backbones as domains shift or as more diverse and “out-of-distribution” tasks are introduced;
  • Reliance on high-quality synthetic data when using data-free generators or diffusion models—domain shift and generative quality can limit effectiveness;
  • Extension to more complex modalities (e.g., video, graph, 3D, audio) requiring consistent multimodal alignment and domain adaptation;
  • Robustness to hard class imbalance and extremely low supervision regimes (e.g., true one-shot learning).

Recent results, including robust operation under extreme data imbalance and state-of-the-art performance on ImageNet-100, suggest that task- and class-aware thresholding, analytic solution updating, and carefully tuned regularization provide a path toward scalable, privacy-compliant continual learning (Kalla et al., 2024, Jodelet et al., 2024, He et al., 2024, Ma et al., 2022, Yang et al., 2024). Ongoing research extends these frameworks to new modalities, resource-constrained deployment, and open-world settings without any pre-specified class order.
