Generative Kernel Continual Learning
- GKCL is a continual learning paradigm that integrates generative replay with kernel classification to effectively address catastrophic forgetting.
- The approach leverages a conditional VAE to synthesize samples, eliminating explicit memory buffers and ensuring constant memory cost across tasks.
- Empirical benchmarks demonstrate that GKCL achieves higher accuracy and lower forgetting with significantly reduced memory requirements relative to conventional methods.
Generative Kernel Continual Learning (GKCL) is a continual learning paradigm that fuses non-parametric kernel-based classification with generative replay to efficiently address catastrophic forgetting and memory scalability in sequential task settings. Unlike conventional episodic memory-based methods where explicit sample buffers are maintained for replay, GKCL leverages conditional generative models to synthesize representative samples, enabling kernel classifiers to achieve high performance with substantially reduced or constant memory cost. This framework is particularly effective in class-incremental and lifelong learning regimes where the learner must robustly accommodate a growing corpus of tasks without retraining from scratch or accessing all past data (Derakhshani et al., 2021, Derakhshani et al., 2021).
1. Formal Framework
GKCL operates over a sequence of tasks, each providing a dataset sampled from an evolving, non-stationary distribution. For each task, the system must (i) retain performance on all previously seen tasks, and (ii) efficiently incorporate new task data, under strict constraints on memory and compute. To this end, GKCL maintains two key components:
- A conditional variational auto-encoder (VAE) with an encoder and a decoder, whose latent space is governed by learnable class-conditional priors (mixtures of Gaussians).
- A per-task kernel ridge regression (KRR) classifier operating on embedded features extracted by the VAE encoder or by an auxiliary kernel network.
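For each new task $t$, GKCL (a) updates the VAE on both the incoming task data and synthetic replayed data for all previous tasks, sampled from the VAE itself, and (b) generates a synthetic coreset of $m$ samples per class by decoding latent codes drawn from the class-conditional prior. The kernel classifier is then trained on the union of the real task data and the synthetic coreset, using embedded features as inputs and one-hot labels as targets. The kernel prediction takes the standard kernel ridge regression form
$$\hat{y}(x) = Y^\top (K + \lambda I)^{-1}\, \mathbf{k}(x),$$
where $K_{ij} = k(x_i, x_j)$ is the Gram matrix over the training features, $\mathbf{k}(x)$ is the vector with entries $k(x_i, x)$ for a query $x$, $Y$ stacks the one-hot labels, and $\lambda$ is the ridge regularizer (Derakhshani et al., 2021).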
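The kernel ridge regression classifier described above can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' implementation: the RBF kernel, the function names, the hyperparameters `lam` and `gamma`, and the toy two-cluster features are all assumptions made for the example.

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """Gram matrix of an RBF kernel between the rows of a and the rows of b."""
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def krr_fit(feats, labels, num_classes, lam=1e-3, gamma=1.0):
    """Solve (K + lam * I) alpha = Y for the dual coefficients alpha."""
    Y = np.eye(num_classes)[labels]              # one-hot labels, shape (n, C)
    K = rbf_kernel(feats, feats, gamma)          # Gram matrix, shape (n, n)
    return np.linalg.solve(K + lam * np.eye(len(feats)), Y)

def krr_predict(alpha, feats, query, gamma=1.0):
    """Class scores k(x)^T alpha for every query row."""
    return rbf_kernel(query, feats, gamma) @ alpha

# Toy features (hypothetical): two well-separated clusters, one per class.
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(-2, 0.3, (10, 4)), rng.normal(2, 0.3, (10, 4))])
labels = np.array([0] * 10 + [1] * 10)

alpha = krr_fit(feats, labels, num_classes=2)
scores = krr_predict(alpha, feats, np.array([[-2.0] * 4, [2.0] * 4]))
pred = scores.argmax(1)
print(pred)
```

The solve is a single dense linear system in the number of training points, which is why the size of the coreset, rather than the size of the original dataset, drives the cost.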
2. Kernel Learning and Generative Replay
In standard kernel continual learning (KCL), each task retains an explicit buffer (coreset) of representative samples, and task-specific classifiers are trained using kernel ridge regression on this buffer. GKCL eliminates the need for explicit memory buffers by employing a VAE-based generative model to produce representative synthetic samples on demand. This generative replay keeps memory constant with respect to the number of tasks $T$, in contrast to the $O(Tm)$ scaling of KCL, where $m$ is the per-task coreset size (Derakhshani et al., 2021).
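A back-of-the-envelope count makes the scaling contrast concrete. Every number below is hypothetical: the sample dimension 3072 corresponds to a flattened 32×32×3 image, the 5 classes per task and 20 samples per class are illustrative settings, and the generator parameter count is an arbitrary placeholder rather than a figure from the source.

```python
def kcl_buffer_floats(num_tasks, classes_per_task, m, sample_dim):
    """Floats held by an explicit coreset: grows linearly with the task count."""
    return num_tasks * classes_per_task * m * sample_dim

def generative_floats(generator_params):
    """Floats held by a generative replay model: fixed, whatever the task count."""
    return generator_params

GENERATOR_PARAMS = 1_000_000  # hypothetical placeholder

task_counts = (1, 10, 20)
buffer_sizes = [kcl_buffer_floats(t, 5, 20, 3072) for t in task_counts]
for t, size in zip(task_counts, buffer_sizes):
    print(f"T={t:2d}  buffer={size:>9,d}  generator={generative_floats(GENERATOR_PARAMS):,d}")
```

Only the buffer column changes with the number of tasks; where the two curves cross depends entirely on the assumed sizes.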
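The VAE is optimized with a combined objective:
- A generative evidence lower bound (ELBO), evaluated per sample.
- A supervised contrastive regularizer that enhances the discriminability of the learned features.
- An overall VAE loss that combines the ELBO term and the contrastive term through a weighting coefficient.
The KRR classifier is further fine-tuned with a cross-entropy loss on its predictions, and the total loss is a weighted combination of this cross-entropy term and the VAE loss (Derakhshani et al., 2021).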
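For reference, the terms of this objective can be written in their standard textbook forms. The notation is assumed for illustration and may differ from the source: $q_\phi$ is the encoder, $p_\theta$ the decoder, $p(z \mid y)$ the class-conditional prior, $\tau$ a temperature, $P(i)$ the set of other samples in the batch sharing the label of sample $i$, and $\beta$, $\gamma$ are weighting coefficients whose values are left unspecified. The placement of the weights in the last two lines is one plausible arrangement, not a confirmed one.

```latex
\begin{align}
\mathcal{L}_{\mathrm{ELBO}}(x, y)
  &= \mathbb{E}_{q_\phi(z \mid x, y)}\!\left[\log p_\theta(x \mid z, y)\right]
   - \mathrm{KL}\!\left(q_\phi(z \mid x, y) \,\|\, p(z \mid y)\right) \\
\mathcal{L}_{\mathrm{SC}}
  &= \sum_{i} \frac{-1}{|P(i)|} \sum_{p \in P(i)}
     \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \neq i} \exp(z_i \cdot z_a / \tau)} \\
\mathcal{L}_{\mathrm{VAE}}
  &= -\mathcal{L}_{\mathrm{ELBO}} + \beta\, \mathcal{L}_{\mathrm{SC}} \\
\mathcal{L}_{\mathrm{total}}
  &= \mathcal{L}_{\mathrm{CE}} + \gamma\, \mathcal{L}_{\mathrm{VAE}}
\end{align}
```

The ELBO is a quantity to maximize, so it enters the loss with a negative sign, while the contrastive and cross-entropy terms are minimized directly.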
3. Algorithmic Workflow and Computational Aspects
The main steps of GKCL are:
- Training (for each task $t$):
  - If $t > 1$, sample synthetic replay batches for all previous tasks by decoding latent codes drawn from their class-conditional priors.
  - Update the VAE using both current and replayed data batches with ELBO and supervised contrastive losses.
  - For each class in the task, generate $m$ samples by sampling latent codes from its class-conditional prior and decoding them through the VAE decoder.
  - Train the task-specific kernel classifier (linear or non-linear kernel) using the union of real and synthetic samples.
- Inference:
  - For a query input and its task, generate class-conditional pseudo-coresets as needed. Compute the embedded features and kernel response as in training.
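The workflow above can be sketched end to end with a deliberately simplified generator. The sketch swaps the conditional VAE for a toy stand-in that stores one diagonal Gaussian per class, uses a linear kernel, and works on raw 3-dimensional inputs instead of learned embeddings. The class `ToyGenerator`, the helper names, the class centres, and all hyperparameters are hypothetical and chosen only to keep the example runnable.

```python
import numpy as np

class ToyGenerator:
    """Stand-in for the conditional VAE: one diagonal Gaussian per class."""
    def __init__(self):
        self.stats = {}                                  # class id -> (mean, std)

    def update(self, x, y):
        for c in np.unique(y):
            xc = x[y == c]
            self.stats[int(c)] = (xc.mean(0), xc.std(0) + 1e-6)

    def sample(self, c, m, rng):
        mean, std = self.stats[c]
        return mean + std * rng.standard_normal((m, mean.shape[0]))

def pseudo_coreset(gen, classes, m, rng):
    """Generate m synthetic samples for each class of a task."""
    xs = [gen.sample(c, m, rng) for c in classes]
    ys = [np.full(m, c) for c in classes]
    return np.vstack(xs), np.concatenate(ys)

def krr_scores(core_x, core_y, classes, query, lam=1e-2):
    """Linear-kernel ridge regression fitted on the pseudo-coreset."""
    Y = (core_y[:, None] == np.array(classes)[None, :]).astype(float)
    K = core_x @ core_x.T
    alpha = np.linalg.solve(K + lam * np.eye(len(core_x)), Y)
    return (query @ core_x.T) @ alpha

rng = np.random.default_rng(0)
gen, task_classes = ToyGenerator(), []
centres = {0: [3, 0, 0], 1: [0, 3, 0], 2: [0, 0, 3], 3: [-3, -3, 0]}

# Training: two tasks with two classes each, seen one after the other.
for classes in ([0, 1], [2, 3]):
    x = np.vstack([np.array(centres[c]) + 0.3 * rng.standard_normal((50, 3))
                   for c in classes])
    y = np.repeat(classes, 50)
    # Stand-in for the VAE update. A real VAE would also train on replayed
    # samples from earlier tasks here; per-class Gaussians do not need that.
    gen.update(x, y)
    task_classes.append(classes)
    # The real data x, y is not stored past this point.

def predict(task_id, query, m=20):
    """Inference: regenerate the task's pseudo-coreset, then apply the kernel classifier."""
    classes = task_classes[task_id]
    core_x, core_y = pseudo_coreset(gen, classes, m, rng)
    scores = krr_scores(core_x, core_y, classes, query)
    return [classes[i] for i in scores.argmax(1)]

preds_task0 = predict(0, np.array([[3.0, 0.0, 0.0], [0.0, 3.0, 0.0]]))
preds_task1 = predict(1, np.array([[0.0, 0.0, 3.0], [-3.0, -3.0, 0.0]]))
print(preds_task0, preds_task1)
```

Because the pseudo-coreset is regenerated on demand, the only state carried across tasks is the generator and the list of classes per task, and the argument `m` can be changed at inference time without retraining anything.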
Memory is dominated by the generative model parameters and is independent of the task count $T$. Kernel ridge regression solves only small linear systems per task in practical settings, with size set by the pseudo-coreset (for example, $m = 20$ samples for each of 5 classes gives 100 points). Notably, GKCL maintains or exceeds the accuracy of KCL with memory budgets one-tenth as large, and supports dynamic scaling of synthetic coreset size at inference, further improving accuracy without retraining (Derakhshani et al., 2021).
4. Theoretical Properties and Forgetting Mitigation
By allocating a separate kernel classifier (with independent trainable coefficients) for each task, and never overwriting these parameters when incorporating new tasks, GKCL avoids parameter interference that typifies catastrophic forgetting in standard neural networks. The VAE encoder, which is shared across all tasks, is pressured to maintain useful and discriminative representations through the joint ELBO and contrastive losses, while the synthetic coreset mechanism ensures all past tasks remain accessible for evaluation and adaptation (Derakhshani et al., 2021, Derakhshani et al., 2021).
The mixture-of-Gaussians class-conditional latent prior in the generative model aligns the latent space with semantic class boundaries, increasing the sample representativeness and utility for replay-driven kernel training (Derakhshani et al., 2021). Decoder gating (by class or task) further isolates generative pathways, suppressing inter-task interference.
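Sampling from a class-conditional mixture-of-Gaussians prior can be sketched as follows. The parameter layout (component means, log standard deviations, mixture weights), the function name, and the numeric values are assumptions for illustration; in the method itself these parameters are learned, and the sampled latents would then be passed through the decoder.

```python
import numpy as np

def sample_class_prior(means, log_stds, weights, m, rng):
    """Draw m latent codes from one class's Gaussian mixture prior.

    means, log_stds: arrays of shape (K, d); weights: shape (K,), summing to 1.
    """
    comp = rng.choice(len(weights), size=m, p=weights)   # pick a component per draw
    eps = rng.standard_normal((m, means.shape[1]))
    return means[comp] + np.exp(log_stds[comp]) * eps

rng = np.random.default_rng(0)

# Hypothetical prior parameters: 2 classes, K = 3 components, latent dimension d = 2.
priors = {
    0: (np.array([[4.0, 4.0], [5.0, 4.0], [4.0, 5.0]]),
        np.full((3, 2), -2.0),
        np.array([0.5, 0.25, 0.25])),
    1: (np.array([[-4.0, -4.0], [-5.0, -4.0], [-4.0, -5.0]]),
        np.full((3, 2), -2.0),
        np.array([0.5, 0.25, 0.25])),
}

z0 = sample_class_prior(*priors[0], m=20, rng=rng)
z1 = sample_class_prior(*priors[1], m=20, rng=rng)
print(z0.mean(0), z1.mean(0))
```

With well-separated component means, latents drawn for different classes occupy disjoint regions, which is the alignment between latent space and class boundaries that the paragraph above describes.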
5. Empirical Performance and Benchmarks
GKCL’s efficacy has been established across challenging continual learning benchmarks, including SplitCIFAR100, RotatedMNIST, and PermutedMNIST. On SplitCIFAR100 (20 tasks, 5 classes per task), with $m = 20$ samples/class, GKCL attains $72.79 \pm 0.68$\% accuracy and $0.04$ forgetting, compared to $62.7 \pm 0.89$\% and $0.06$ for KCL with variational random features. With $m = 2$ samples/class, GKCL matches KCL at $m = 20$, achieving a 10-fold memory reduction. Increasing the pseudo-coreset size at inference further yields a $3$–$4$\% accuracy gain (Derakhshani et al., 2021).
Supervised contrastive regularization yields additional accuracy gains on both SplitCIFAR100 and RotatedMNIST. Across tasks, GKCL outperforms EWC, AGEM, iCaRL, GEM, and ER-Reservoir, approaching the multitask upper bound in catastrophic forgetting metrics (Derakhshani et al., 2021, Derakhshani et al., 2021).
| Method / Setting | Accuracy (%) | Forgetting | Coreset Size |
|---|---|---|---|
| GKCL (linear, m=20) | 72.79 ± 0.68 | 0.04 | 20/class |
| KCL (RF, m=20) | 62.7 ± 0.89 | 0.06 | 20/class |
| GKCL (linear, m=2) | ≈ KCL (m=20) | – | 2/class |
6. Limitations and Extensions
GKCL’s primary constraint is the cubic complexity of kernel ridge regression per task ($O(n^3)$ in the number $n$ of coreset samples) as the synthetic coreset grows. However, this is mitigated by the finding that representative pseudo-coresets can be kept small (on SplitCIFAR100, 2 samples per class already match KCL with 20), and by potential kernel approximation techniques (Nyström, incremental Cholesky, or budgeted solvers) (Derakhshani et al., 2021, Derakhshani et al., 2021).
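As an illustration of the first mitigation named above, a Nyström-style (subset-of-regressors) approximation restricts the solution to a small set of landmark points, so the linear system is $p \times p$ for $p$ landmarks instead of $n \times n$. The sketch below is a generic version of that technique and not something taken from the source; the RBF kernel, the landmark choice, the jitter term, and the toy data are assumptions.

```python
import numpy as np

def rbf(a, b, gamma=0.5):
    """RBF Gram matrix between the rows of a and the rows of b."""
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom_krr_fit(x, Y, landmarks, lam=1e-2):
    """Solve (K_nm^T K_nm + lam * K_mm) beta = K_nm^T Y, a p x p system."""
    K_nm = rbf(x, landmarks)                       # shape (n, p)
    K_mm = rbf(landmarks, landmarks)               # shape (p, p)
    A = K_nm.T @ K_nm + lam * K_mm + 1e-8 * np.eye(len(landmarks))  # jitter for stability
    return np.linalg.solve(A, K_nm.T @ Y)

def nystrom_krr_predict(beta, landmarks, query):
    """Class scores from kernel evaluations against the landmarks only."""
    return rbf(query, landmarks) @ beta

# Toy data (hypothetical): 400 points in two clusters, one class per cluster.
rng = np.random.default_rng(0)
x = np.vstack([rng.normal(-2, 0.5, (200, 2)), rng.normal(2, 0.5, (200, 2))])
Y = np.eye(2)[np.repeat([0, 1], 200)]

# Ten landmarks, five taken from each cluster.
landmarks = np.vstack([x[:5], x[200:205]])

beta = nystrom_krr_fit(x, Y, landmarks)
query = np.array([[-2.0, -2.0], [2.0, 2.0]])
pred = nystrom_krr_predict(beta, landmarks, query).argmax(1)
print(pred)
```

Forming the system costs on the order of $n p^2$ operations and solving it costs $p^3$, so the cubic term depends on the number of landmarks rather than on the full coreset size.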
The framework assumes the ability to identify the relevant task or class at inference to generate the correct synthetic coreset, and relies on the modularity between the feature extractor and the kernel classifier. Extensions may include dynamic random feature expansion/pruning, hypernetworks for kernel basis adaptation, hierarchical kernels, or adaptive allocation of synthetic replay per task.
7. Relationship to Other Continual Learning Paradigms
The non-parametric nature of GKCL roots it in the kernel-based continual learning family (Derakhshani et al., 2021) while integrating the generative replay technique central to VAE-based lifelong learning. By decoupling feature learning and kernel inference, GKCL inherits robustness to task interference without the memory scaling drawbacks of episodic memory methods. It contrasts with parametric continual learners that either regularize shared parameters (EWC, SI) or exercise explicit replay (GEM, ER), and it does not require replaying through the entire network as in generative rehearsal approaches. The synergy between synthetic sample generation and kernel learning enables unique trade-offs in memory, scalability, and empirical performance (Derakhshani et al., 2021).