Generative Kernel Continual Learning

Updated 13 March 2026

GKCL is a continual learning paradigm that integrates generative replay with kernel classification to effectively address catastrophic forgetting.
The approach leverages a conditional VAE to synthesize samples, eliminating explicit memory buffers and ensuring constant memory cost across tasks.
Empirical benchmarks demonstrate that GKCL achieves higher accuracy and lower forgetting with significantly reduced memory requirements relative to conventional methods.

Generative Kernel Continual Learning (GKCL) is a continual learning paradigm that fuses non-parametric kernel-based classification with generative replay to efficiently address catastrophic forgetting and memory scalability in sequential task settings. Unlike conventional episodic memory-based methods where explicit sample buffers are maintained for replay, GKCL leverages conditional generative models to synthesize representative samples, enabling kernel classifiers to achieve high performance with substantially reduced or constant memory cost. This framework is particularly effective in class-incremental and lifelong learning regimes where the learner must robustly accommodate a growing corpus of tasks without retraining from scratch or accessing all past data (Derakhshani et al., 2021, Derakhshani et al., 2021).

1. Formal Framework

GKCL operates over a sequence of tasks, each providing a dataset $\mathcal{D}_t = \{(x_i^t, y_i^t)\}_{i=1}^{N_t}$ sampled from an evolving, non-stationary distribution. For each task, the system must (i) retain performance on all previously seen tasks, and (ii) efficiently incorporate new task data, under strict constraints on memory and compute. To this end, GKCL maintains two key components:

- A conditional variational auto-encoder (VAE) with encoder $q_\phi(z|x)$ and decoder $p_\theta(x|z, y, t)$, equipped with learnable class-conditional priors $p(z|y)$ (mixtures of Gaussians).
- A per-task kernel ridge regression (KRR) classifier operating on embedded features $\psi(x) = \mathbb{E}_{q_\phi(z|x)}[z]$ extracted by the VAE encoder or by an auxiliary kernel network.

For each new task $t$, GKCL (a) updates the VAE on both the incoming task data and synthetic replayed data for all previous tasks, sampled from the VAE itself, and (b) generates a synthetic coreset $\mathcal{C}_t$ of $m$ samples per class by decoding $z \sim p(z|y)$. The kernel classifier is then trained on the union $\mathcal{C}_t \cup \mathcal{D}_t$, with features $\Psi(X) = [\psi(x_1), \ldots, \psi(x_{N_c})]$ stacked column-wise over the $N_c$ training points and labels $Y$ in one-hot encoding. The kernel prediction is computed via

$f_c^t(\psi(x')) = \text{softmax}\left(Y(\lambda I + K)^{-1} \tilde K\right),$

where $K = \Psi(X)^\top \Psi(X)$ and $\tilde K = \Psi(X)^\top \psi(x')$ (Derakhshani et al., 2021).
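The prediction rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `krr_predict`, the default `lam`, and the toy features are assumptions, and the linear kernel $K = \Psi(X)^\top \Psi(X)$ is taken directly from the definition above.

```python
import numpy as np

def krr_predict(Psi, Y, psi_query, lam=1.0):
    """Class probabilities softmax(Y (lam*I + K)^{-1} K_tilde).

    Psi:       (d, N_c) matrix whose columns are embedded features psi(x_i)
    Y:         (C, N_c) one-hot label matrix
    psi_query: (d,) embedded query psi(x')
    """
    K = Psi.T @ Psi                       # (N_c, N_c) linear kernel matrix
    K_tilde = Psi.T @ psi_query           # (N_c,) kernel between coreset and query
    # Solve the regularized system instead of forming an explicit inverse.
    alpha = np.linalg.solve(lam * np.eye(K.shape[0]) + K, K_tilde)
    logits = Y @ alpha                    # (C,)
    logits = logits - logits.max()        # numerical stability for the softmax
    expd = np.exp(logits)
    return expd / expd.sum()

# Toy data (hypothetical): two classes with orthogonal features.
Psi = np.array([[1.0, 1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0, 1.0]])
Y = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
probs = krr_predict(Psi, Y, psi_query=np.array([1.0, 0.0]))
```

Using `np.linalg.solve` rather than an explicit matrix inverse is a common numerical choice; the source does not say which solver is used.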

2. Kernel Learning and Generative Replay

In standard kernel continual learning (KCL), each task retains an explicit buffer (coreset) of representative samples, and task-specific classifiers are trained using kernel ridge regression on this buffer. GKCL eliminates the need for explicit memory buffers by employing a VAE-based generative model to produce representative synthetic samples on demand. This generative replay ensures constant memory with respect to the number of tasks $T$, in contrast to the $\mathcal{O}(T M)$ scaling for KCL, where $M$ is the per-task coreset size (Derakhshani et al., 2021).

The VAE is optimized with a combined objective:

- Generative evidence lower bound (ELBO) per sample: $\mathcal{L}_{\rm ELBO}(x, y, t) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z, y, t)] - D_{\mathrm{KL}}(q_\phi(z|x) \Vert p(z|y))$
- Supervised contrastive regularization to enhance discriminability: $\mathcal{L}_{\rm SC} = \sum_{a=1}^{2N} \left[ -\frac{1}{|P(a)|}\sum_{p\in P(a)}\log \frac{\exp\left(s_a \cdot s_p/\tau\right)}{\sum_{u\neq a} \exp\left(s_a \cdot s_u/\tau\right)} \right]$, where $s_a$ is the embedding of anchor $a$, $P(a)$ is the set of other samples in the batch that share its label, and $\tau$ is a temperature.
- The overall VAE + contrastive loss: $\mathcal{L}_{\rm VAE} = \mathcal{L}_{\rm gen}(\theta,\phi) + \alpha \mathcal{L}_{\rm SC}$, where $\mathcal{L}_{\rm gen}$ is the generative loss (the negative ELBO, since the ELBO is maximized while the loss is minimized) and $\alpha=1$ in practice.
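As an illustration of the supervised contrastive term, the following NumPy sketch evaluates $\mathcal{L}_{\rm SC}$ for a batch of embeddings. The function name, the default temperature, the L2 normalization of the embeddings, and the choice to skip anchors without positives are assumptions made for this example; the source does not specify them.

```python
import numpy as np

def supervised_contrastive_loss(S, labels, tau=0.1):
    """Sum over anchors of the mean negative log-probability of their positives.

    S:      (n, d) embeddings s_a (normalized here; an assumption)
    labels: (n,) integer class labels
    """
    S = S / np.linalg.norm(S, axis=1, keepdims=True)
    sim = S @ S.T / tau                               # s_a . s_u / tau
    n = sim.shape[0]
    not_self = ~np.eye(n, dtype=bool)
    # Log of the denominator: log-sum-exp over all u != a.
    masked = np.where(not_self, sim, -np.inf)
    row_max = masked.max(axis=1, keepdims=True)
    log_denom = row_max + np.log(np.exp(masked - row_max).sum(axis=1, keepdims=True))
    log_prob = sim - log_denom
    # P(a): other samples with the same label as the anchor.
    pos = (labels[:, None] == labels[None, :]) & not_self
    n_pos = pos.sum(axis=1)
    valid = n_pos > 0                                 # anchors with no positives are skipped
    per_anchor = -np.where(pos, log_prob, 0.0).sum(axis=1)[valid] / n_pos[valid]
    return float(per_anchor.sum())

# Toy embeddings (hypothetical): two tight clusters.
S = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
loss_aligned = supervised_contrastive_loss(S, np.array([0, 0, 1, 1]))
loss_shuffled = supervised_contrastive_loss(S, np.array([0, 1, 0, 1]))
```

When labels agree with the cluster structure the loss is close to zero, and it grows when same-label samples are far apart, which is the pressure toward discriminative embeddings described above.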

The KRR classifier is further fine-tuned by cross-entropy on $\mathcal{D}_t$, yielding the total loss $\mathcal{L}_{\rm tot} = \mathcal{L}_{\rm VAE}(\theta,\phi) + \beta \mathcal{L}_{\rm KCL}(\phi,\gamma)$, with $\beta=1$ (Derakhshani et al., 2021).

3. Algorithmic Workflow and Computational Aspects

The main steps of GKCL are:

Training (for each task $t$):
- If $t>1$, sample synthetic replay batches for all previous tasks by decoding from $p_\theta(x|z, y_{\leq t-1}, t_{\leq t-1})$.
- Update the VAE using both current and replayed data batches with ELBO and supervised contrastive losses.
- For each class $y$, generate $m$ samples by sampling $z \sim p(z|y)$ and decoding through $p_\theta$.
- Train the task-specific kernel classifier (linear or non-linear kernel) using the union of real and synthetic samples.
Inference:
- For a query $x'$ and task $t'$, generate class-conditional pseudo-coresets as needed, then compute the embedded features and kernel response as in training.
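The training steps listed above can be arranged into a small orchestration function. This is only a structural sketch: `gkcl_train_task` and the four injected callables (`replay_previous`, `update_vae`, `generate_for_class`, `fit_kernel_classifier`) are hypothetical placeholders for the VAE and classifier components, which the source does not describe at the code level.

```python
import numpy as np

def gkcl_train_task(t, real_x, real_y, task_classes, m,
                    replay_previous, update_vae,
                    generate_for_class, fit_kernel_classifier):
    """One pass of the per-task training procedure (structure only)."""
    # Step 1: for t > 1, draw synthetic replay batches for earlier tasks.
    batches = [(real_x, real_y)]
    if t > 1:
        batches.append(replay_previous(t))
    # Step 2: update the VAE on current plus replayed batches.
    update_vae(batches)
    # Step 3: generate m synthetic samples for each class of the current task.
    syn_x = np.concatenate([generate_for_class(y, m) for y in task_classes])
    syn_y = np.repeat(np.asarray(task_classes), m)
    # Step 4: train the task-specific classifier on real and synthetic samples.
    X = np.concatenate([real_x, syn_x])
    Y = np.concatenate([real_y, syn_y])
    return fit_kernel_classifier(X, Y)
```

Passing the components in as callables keeps the sketch independent of any particular VAE or kernel implementation; in a real system these would be methods of the trained models.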

Memory is dominated by the generative model parameters $\{\theta,\phi\}$ and is independent of the task count $T$. In practical settings, kernel ridge regression solves a linear system of size at most $100 \times 100$ per task (with $m=20$ and 5 classes per task, so $N_c = 100$). Notably, GKCL maintains or exceeds the accuracy of KCL with memory budgets one-tenth as large, and supports dynamic scaling of synthetic coreset size at inference, further improving accuracy without retraining (Derakhshani et al., 2021).
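The arithmetic behind these figures is simple enough to check directly. The helper names below are hypothetical, and the 20-task configuration is the SplitCIFAR100 setting reported in the benchmarks section.

```python
def krr_system_size(classes_per_task, m):
    """Side length N_c of the per-task KRR linear system."""
    return classes_per_task * m

def kcl_stored_samples(num_tasks, classes_per_task, m):
    """Samples an explicit per-task coreset buffer keeps after all tasks, O(T * M)."""
    return num_tasks * classes_per_task * m

n_c = krr_system_size(classes_per_task=5, m=20)                          # 100
kcl_buffer = kcl_stored_samples(num_tasks=20, classes_per_task=5, m=20)  # 2000
kcl_buffer_small = kcl_stored_samples(num_tasks=20, classes_per_task=5, m=2)  # 200
```

The explicit buffer grows linearly with the number of tasks, whereas the generative approach stores only the model parameters regardless of $T$.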

4. Theoretical Properties and Forgetting Mitigation

By allocating a separate kernel classifier (with independent trainable coefficients) for each task, and never overwriting these parameters when incorporating new tasks, GKCL avoids parameter interference that typifies catastrophic forgetting in standard neural networks. The VAE encoder, which is shared across all tasks, is pressured to maintain useful and discriminative representations through the joint ELBO and contrastive losses, while the synthetic coreset mechanism ensures all past tasks remain accessible for evaluation and adaptation (Derakhshani et al., 2021, Derakhshani et al., 2021).

The mixture-of-Gaussians class-conditional latent prior $p(z|y)$ in the generative model aligns the latent space with semantic class boundaries, increasing the sample representativeness and utility for replay-driven kernel training (Derakhshani et al., 2021). Decoder gating (by class or task) further isolates generative pathways, suppressing inter-task interference.
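Sampling from a class-conditional mixture-of-Gaussians prior can be sketched as follows. The parameter layout (mixture weights, component means, log standard deviations) and the diagonal covariance are assumptions for illustration; the source only states that $p(z|y)$ is a learnable mixture of Gaussians.

```python
import numpy as np

def sample_class_prior(rng, weights, means, log_stds, n):
    """Draw n latent codes z ~ p(z|y) for one class.

    weights:  (K,) mixture weights summing to 1
    means:    (K, d) component means
    log_stds: (K, d) per-dimension log standard deviations (diagonal covariance)
    """
    comp = rng.choice(len(weights), size=n, p=weights)   # pick a component per sample
    eps = rng.standard_normal((n, means.shape[1]))       # standard normal noise
    return means[comp] + np.exp(log_stds[comp]) * eps    # shift and scale

# Hypothetical prior for a single class with two components.
rng = np.random.default_rng(0)
weights = np.array([0.5, 0.5])
means = np.array([[-5.0, 0.0], [5.0, 0.0]])
log_stds = np.full((2, 2), -10.0)
z = sample_class_prior(rng, weights, means, log_stds, n=50)
```

The sampled codes would then be passed through the decoder $p_\theta$ to produce the synthetic coreset for that class.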

5. Empirical Performance and Benchmarks

GKCL’s efficacy has been established across challenging continual learning benchmarks, including SplitCIFAR100, RotatedMNIST, and PermutedMNIST. On SplitCIFAR100 (20 tasks, 5 classes per task), with $m=20$ samples/class, GKCL attains $72.79 \pm 0.68\%$ accuracy and $0.04$ forgetting, compared to $62.7 \pm 0.89\%$ and $0.06$ for KCL with variational random features. With $m=2$ samples/class, GKCL matches KCL at $m=20$, achieving a 10-fold memory reduction. Increasing the pseudo-coreset size at inference further yields a $3$–$4\%$ accuracy gain (Derakhshani et al., 2021).

Supervised contrastive regularization contributes $+0.9\%$ accuracy on SplitCIFAR100 and $+2.7\%$ on RotatedMNIST. Across tasks, GKCL outperforms EWC, AGEM, iCaRL, GEM, and ER-Reservoir, approaching the multitask upper bound in catastrophic forgetting metrics (Derakhshani et al., 2021, Derakhshani et al., 2021).

Method / Setting	Accuracy (%)	Forgetting	Coreset Size
GKCL (linear, m=20)	72.79 ± 0.68	0.04	20/class
KCL (RF, m=20)	62.7 ± 0.89	0.06	20/class
GKCL (linear, m=2)	≈ KCL (m=20)	–	2/class
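The text does not spell out how the accuracy and forgetting columns are computed. A definition commonly used in the continual learning literature, assumed here for illustration, takes a matrix of accuracies where entry $(i, j)$ is the accuracy on task $j$ after training on task $i$; forgetting for a task is the drop from its best earlier accuracy to its final accuracy. The matrix below is toy data, not a reported result.

```python
import numpy as np

def average_accuracy(acc):
    """Mean accuracy over all tasks after training on the final task."""
    acc = np.asarray(acc, dtype=float)
    return float(acc[-1].mean())

def average_forgetting(acc):
    """Mean drop from each earlier task's best accuracy to its final accuracy."""
    acc = np.asarray(acc, dtype=float)
    T = acc.shape[0]
    drops = []
    for j in range(T - 1):
        best = acc[j:T - 1, j].max()      # best accuracy on task j before the last task
        drops.append(best - acc[T - 1, j])
    return float(np.mean(drops))

# Toy accuracy matrix (hypothetical): rows = after training task i, cols = task j.
acc = [[0.90, 0.00, 0.00],
       [0.80, 0.85, 0.00],
       [0.70, 0.80, 0.90]]
avg_acc = average_accuracy(acc)
avg_forget = average_forgetting(acc)
```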

6. Limitations and Extensions

GKCL’s primary constraint is the cubic complexity of kernel ridge regression per task, $\mathcal{O}(N_c^3)$, as the synthetic coreset grows. However, this is mitigated by the finding that representative pseudo-coresets can be kept small ($m=2$ suffices on SplitCIFAR100), and by potential kernel approximation techniques (Nyström, incremental Cholesky, or budgeted solvers) (Derakhshani et al., 2021, Derakhshani et al., 2021).
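To make the Nyström option concrete, the sketch below approximates the regularized solve $(\lambda I + K)^{-1} v$ using a low-rank factorization $K \approx C W^{-1} C^\top$ built from a subset of landmark points, combined with the Woodbury identity. The RBF kernel, the function names, and the landmark selection are assumptions for this example; the source names the technique only as a possible mitigation.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """RBF kernel matrix between rows of A and rows of B."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom_krr_solve(X, v, landmarks, lam=1.0, gamma=0.5):
    """Approximate (lam*I + K)^{-1} v with K ~= C W^{-1} C^T.

    Woodbury gives (lam*I + C W^{-1} C^T)^{-1} v
        = (v - C (lam*W + C^T C)^{-1} C^T v) / lam,
    so only an r x r system is solved, with r = number of landmarks.
    """
    C = rbf_kernel(X, X[landmarks], gamma)    # (N, r) kernel to landmarks
    W = C[landmarks]                          # (r, r) kernel among landmarks
    inner = lam * W + C.T @ C
    return (v - C @ np.linalg.solve(inner, C.T @ v)) / lam

# Toy check (hypothetical data): with every point as a landmark the result is exact.
rng = np.random.default_rng(0)
X = rng.standard_normal((30, 5))
v = rng.standard_normal(30)
exact = np.linalg.solve(np.eye(30) + rbf_kernel(X, X), v)
full = nystrom_krr_solve(X, v, landmarks=np.arange(30))
approx = nystrom_krr_solve(X, v, landmarks=np.arange(10))
```

With $r$ landmarks the dominant cost drops from $\mathcal{O}(N_c^3)$ to roughly $\mathcal{O}(N_c r^2)$, at the price of an approximation error that depends on how well the landmarks cover the coreset.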

The framework assumes the ability to identify the relevant task or class at inference to generate the correct synthetic coreset, and relies on the modularity between the feature extractor and the kernel classifier. Extensions may include dynamic random feature expansion/pruning, hypernetworks for kernel basis adaptation, hierarchical kernels, or adaptive allocation of synthetic replay per task.

7. Relationship to Other Continual Learning Paradigms

The non-parametric nature of GKCL roots it in the kernel-based continual learning family (Derakhshani et al., 2021) while integrating the generative replay technique central to VAE-based lifelong learning. By decoupling feature learning and kernel inference, GKCL inherits robustness to task interference without the memory scaling drawbacks of episodic memory methods. It contrasts with parametric continual learners that either regularize shared parameters (EWC, SI) or exercise explicit replay (GEM, ER), and it does not require replaying through the entire network as in generative rehearsal approaches. The synergy between synthetic sample generation and kernel learning enables unique trade-offs in memory, scalability, and empirical performance (Derakhshani et al., 2021).


References (2)

Kernel Continual Learning (2021)

Generative Kernel Continual Learning (2021)
