Continuation-Based Learning Framework

Updated 4 July 2026

Continuation-based learning is a paradigm that organizes adaptation as a structured progression of evolving objects—such as task streams, posteriors, or recurrent states.
It spans diverse settings including continual learning, sequential Bayesian and meta-learning, curriculum design, and optimization by continuation in reinforcement and analytic tasks.
The framework emphasizes a balance between retaining historical information and adapting to new data, addressing challenges like catastrophic forgetting and scenario variability.

A continuation-based learning framework is a family of learning formulations in which adaptation proceeds through a structured progression rather than a single static optimization. In current arXiv usage, the phrase appears in several technically distinct settings: continual learning under non-iid task streams, sequential Bayesian and meta-continual learning, curriculum-ordered online class-incremental learning, optimization by continuation in reinforcement learning, and supervised operator learning for analytic continuation in numerical quantum many-body physics (Douillard et al., 2021, Farquhar et al., 2019, Singh et al., 2022, Bolland et al., 2023, Xie et al., 2019). Across these uses, the common motif is that learning is mediated by an evolving object—such as a scenario, a posterior, a latent concept distribution, a recurrent state, or a smoothed objective—that carries information from earlier stages into later ones.

1. Conceptual range of the term

In the cited literature, “continuation-based” does not denote a single standardized architecture. Instead, it denotes a methodological pattern in which learning is organized along a path: over tasks, over data-distribution shifts, over latent posterior updates, or over progressively less smoothed objectives (Douillard et al., 2021, Lee et al., 2024, Lee et al., 2023, Bolland et al., 2023, Xie et al., 2019).

Setting	Continued object	Representative formulation
Continual learning infrastructure	Ordered task stream	Dataset $\rightarrow$ Taskset $\rightarrow$ Scenario
Bayesian/meta-continual learning	Posterior or sufficient statistics	Sequential Bayesian update over $\boldsymbol{\omega}$ or $z$
Replay/distillation frameworks	Latent concept distribution or past representation space	GMM-based replay, predictor-based SSL distillation
Curriculum and sequence modeling	Class order or recurrent state	Curriculum over classes; CL as forward pass of a sequence model
Optimization/analytic continuation	Smoothed objective or inverse operator	Mirror policies; $G(\tau)\mapsto A(\omega)$

A useful way to organize the literature is by asking what is being continued. In continual-learning systems, the continued object is usually a task-conditioned data stream or a compact memory of prior tasks. In Bayesian and meta-learning systems, it is a posterior distribution or a fixed-dimensional sufficient statistic. In reinforcement learning, it is a smoothed surrogate of the return. In analytic continuation, it is a learned map from noisy imaginary-time data to spectral density. This suggests that the expression names a shared research style rather than a single problem domain.

2. Scenario construction and experimental infrastructure

In continual learning, the central difficulty is that the training distribution is not static, drifts through time, and can induce interference with previously learned knowledge. “Continuum: Simple Management of Complex Continual Learning Scenarios” formalizes this engineering problem and treats data loading itself as a first-class research object (Douillard et al., 2021). Its core abstraction separates Datasets, Tasksets, and Scenarios: raw data are wrapped into task-specific subsets, and a scenario is an ordered sequence of tasks constituting the actual continual-learning curriculum.

This separation supports several common regimes. The framework provides incremental learning, in which new tasks bring new concepts or classes; lifelong learning, in which the class set stays fixed but the input distribution changes; mixed incremental and lifelong settings, also called NIC (New Instances and Classes); and transformation-based scenarios such as Rotation-MNIST and Permuted-MNIST. It also supports metadata-driven datasets such as CORe50, where sample identifiers can be used to construct NIC-style streams. The same base dataset can therefore be instantiated as a class-incremental, domain-incremental, transformed, or metadata-conditioned stream.
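The Dataset, Taskset, Scenario separation can be illustrated with a minimal sketch in plain Python. It is not the Continuum API: the function `class_incremental_scenario` and the dictionary layout are hypothetical names chosen for illustration, and the sketch only covers the class-incremental case with a fixed number of new classes per task.

```python
from collections import defaultdict


def class_incremental_scenario(labels, increment):
    """Split sample indices into an ordered list of tasks.

    Hypothetical helper (not part of any library): each task introduces
    `increment` previously unseen classes, in sorted class order.
    """
    classes = sorted(set(labels))
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    scenario = []
    for start in range(0, len(classes), increment):
        task_classes = classes[start:start + increment]
        indices = sorted(i for c in task_classes for i in by_class[c])
        scenario.append({"classes": task_classes, "indices": indices})
    return scenario


# Toy "dataset": only the labels matter for building the stream.
labels = [0, 1, 2, 3, 0, 1, 2, 3, 4, 5]
scenario = class_incremental_scenario(labels, increment=2)
for task_id, taskset in enumerate(scenario):
    print(task_id, taskset["classes"], taskset["indices"])
```

Changing only the partitioning rule (for example, grouping by a metadata field instead of by class) would yield a different scenario over the same underlying data, which is the point of keeping the three layers separate.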

The framework also standardizes evaluation. Its Logger includes accuracy, average incremental accuracy, online cumulative performance, backward transfer, remembering, positive backward transfer, and model size efficiency. A central trajectory-sensitive metric is average incremental accuracy,

$\frac{1}{T}\sum_{t=1}^{T} A^t,$

where $A^t$ is the accuracy after task $t$ and $T$ is the number of tasks. The framework further distinguishes performance metrics, behavior metrics, and computational metrics, reflecting the fact that continual learning is assessed over a sequence of models rather than a single terminal checkpoint.
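As a small worked example of the metric above, the following sketch averages a list of per-task accuracies. The function name and the accuracy values are made up for illustration; they are not results from any paper.

```python
def average_incremental_accuracy(accuracies):
    """Mean of the accuracies A^1..A^T measured after each task."""
    if not accuracies:
        raise ValueError("need at least one per-task accuracy")
    return sum(accuracies) / len(accuracies)


# Illustrative values only: accuracy measured after tasks 1, 2, and 3.
acc_after_task = [0.90, 0.80, 0.70]
aia = average_incremental_accuracy(acc_after_task)
final_acc = acc_after_task[-1]
print(f"average incremental accuracy = {aia:.3f}, final accuracy = {final_acc:.3f}")
```

The gap between the two printed numbers is why a trajectory-sensitive metric is reported alongside the terminal accuracy: two learners with the same final accuracy can differ in how quickly performance degraded along the way.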

A notable design point is reproducibility. The framework is explicitly motivated by the observation that small errors in preprocessing, task order, or test-set construction can significantly distort continual-learning results. It therefore adopts a deliberately narrow scope—mainly data loading and metrics—described in the paper as a Unix philosophy. The implementation is also extendable: each object can be inherited, and new datasets, task partitioning rules, transformations, or supervision signals can be added without rewriting the rest of the pipeline. In this literature, continuation is thus operationalized as controlled scenario construction rather than as a new forgetting-avoidance algorithm.

3. Sequential Bayesian, bilevel, and unified optimization formulations

A major theoretical line treats continual learning as sequential Bayesian inference. “A Unifying Bayesian View of Continual Learning” starts from the exact recursive update

$p(\boldsymbol{\omega}\mid \mathcal{D}_{1:t}) \propto p(\mathcal{D}_t \mid \boldsymbol{\omega})\, p(\boldsymbol{\omega}\mid \mathcal{D}_{1:t-1}),$

and argues that practical continual learning becomes difficult because the posterior is intractable for models such as Bayesian neural networks (Farquhar et al., 2019). The paper distinguishes prior-focused methods, which reuse the previous approximate posterior as the new prior—covering VCL, EWC, SI, Riemannian Walk, and related methods—from likelihood-focused methods, which preserve past-task likelihood contributions via replay or generative models, including DGR, pseudo-rehearsal, core-set replay, and the paper’s own VGR. It then derives a hybrid objective combining both views. An important empirical claim is that prior-focused methods can fail in realistic settings such as single-headed Split MNIST, where VCL performs much worse than replay-based approaches and exhibits poorly calibrated uncertainty, whereas VGR behaves more sensibly.
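When the posterior is tractable, the recursive update is exact and gives the same answer as processing all data at once. The sketch below shows this for a conjugate Beta-Bernoulli model, which is an assumption chosen for illustration and not a model used in the cited paper; the intractability discussed above arises precisely because neural-network posteriors lack such a closed form.

```python
def update_beta(alpha, beta, batch):
    """Exact posterior update for a Bernoulli likelihood with a Beta prior.

    The posterior after one task becomes the prior for the next task.
    """
    successes = sum(batch)
    return alpha + successes, beta + len(batch) - successes


prior = (1.0, 1.0)                      # uniform Beta(1, 1) prior
tasks = [[1, 1, 0], [0, 0, 1, 1], [1]]  # three "tasks" of binary observations

# Sequential (continual) inference: one task at a time.
posterior = prior
for batch in tasks:
    posterior = update_beta(*posterior, batch)

# Batch inference: all observations at once.
all_data = [x for batch in tasks for x in batch]
batch_posterior = update_beta(*prior, all_data)

print(posterior, batch_posterior)
```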

“Bilevel Continual Learning” recasts the same broad problem as a meta-learning-style online optimization with a dual memory system (Pham et al., 2020). It uses episodic memory $M^{er}$ for replay in the inner problem and generalization memory $M^{gm}$ for a validation-like outer problem. For each incoming mini-batch $B_t$, the method solves

$\min_{\theta}\ \mathcal{L}^{\text{outer}}\big(\phi^{*}(\theta);\, M^{gm}\big) \quad \text{s.t.} \quad \phi^{*}(\theta) = \arg\min_{\phi}\ \mathcal{L}^{\text{inner}}\big(\phi;\, B_t \cup M^{er}\big).$

The fast weights $\phi$ are initialized from the main model $\theta$, adapted on current data plus replay, and then used to update $\theta$ through a first-order approximation. A distillation regularizer reduces bias toward the current task when replay memory is small. On the reported benchmarks (Permuted MNIST, Split CIFAR-100, Split CUB, and Split miniImagenet), BCL-Dual is generally the strongest variant across ACC, FM, and LA.
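The fast-weight procedure described above can be sketched on a toy linear-regression problem. This is a generic first-order bilevel step under stated assumptions, not the authors' implementation: the function names, the mean-squared-error loss, the learning rates, and the synthetic data are all hypothetical, and the distillation regularizer is omitted.

```python
import numpy as np


def grad_mse(w, X, y):
    """Gradient of the mean squared error of a linear model X @ w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)


def bilevel_step(theta, batch, replay, gen_mem,
                 inner_lr=0.05, outer_lr=0.05, inner_steps=3):
    """One first-order bilevel update (illustrative sketch).

    Inner problem: adapt fast weights on current batch plus replay samples.
    Outer problem: evaluate the adapted weights on the generalization memory
    and apply that gradient to the main weights (first-order approximation).
    """
    Xb, yb = batch
    Xr, yr = replay
    Xg, yg = gen_mem
    X_in = np.vstack([Xb, Xr])
    y_in = np.concatenate([yb, yr])

    phi = theta.copy()                      # fast weights start from the main model
    for _ in range(inner_steps):
        phi -= inner_lr * grad_mse(phi, X_in, y_in)

    outer_grad = grad_mse(phi, Xg, yg)      # gradient taken at phi, applied to theta
    return theta - outer_lr * outer_grad


rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0])


def make_data(n):
    X = rng.normal(size=(n, 2))
    return X, X @ true_w


theta = np.zeros(2)
theta_new = bilevel_step(theta, make_data(8), make_data(8), make_data(8))
print(theta_new)
```

Keeping the replay samples and the generalization memory as separate arguments mirrors the dual-memory design: one memory shapes the adaptation, the other judges it.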

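“Learning to Continually Learn with the Bayesian Principle” pushes the Bayesian view further by freezing the neural network entirely during continual learning and allowing only a statistical latent-variable model to update (Lee et al., 2024). In Sequential Bayesian Meta-Continual Learning (SB-MCL), continual learning is represented by a posterior over an episode-specific latent variable $z$, while the neural network is meta-trained beforehand to map raw data into the likelihood terms required by that posterior. For a factorized Gaussian posterior, the sequential update is exact. Writing the posterior after $t-1$ examples as $\mathcal{N}(\mu_{t-1}, \sigma_{t-1}^{2})$ and the Gaussian likelihood term for example $t$ as having mean $\hat{z}_t$ and variance $\hat{\sigma}_t^{2}$ (all operations elementwise), the update is

$\sigma_t^{2} = \left(\frac{1}{\sigma_{t-1}^{2}} + \frac{1}{\hat{\sigma}_t^{2}}\right)^{-1}, \qquad \mu_t = \sigma_t^{2}\left(\frac{\mu_{t-1}}{\sigma_{t-1}^{2}} + \frac{\hat{z}_t}{\hat{\sigma}_t^{2}}\right).$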

Because the neural network performs only forward passes during the stream, the framework is explicitly protected from catastrophic forgetting by design. The paper reports substantial efficiency gains in meta-training time, including classification: OML 6.5 hr, TF 1.2 hr, SB-MCL 40 min, and analogous advantages in completion, VAE, and DDPM settings.
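The exactness of a factorized Gaussian update can be checked numerically with the standard precision-weighted fusion rule. In the sketch below the per-example means and variances are hard-coded stand-ins for what a meta-trained network would output; the function name and the numbers are assumptions for illustration only.

```python
import numpy as np


def gaussian_update(mu, var, obs_mu, obs_var):
    """Fuse a diagonal Gaussian posterior with one Gaussian likelihood term.

    Precisions add, and the new mean is the precision-weighted average.
    All operations are elementwise (factorized posterior).
    """
    post_var = 1.0 / (1.0 / var + 1.0 / obs_var)
    post_mu = post_var * (mu / var + obs_mu / obs_var)
    return post_mu, post_var


# Stand-ins for per-example likelihood terms (mean, variance).
stream = [
    (np.array([1.0, -1.0]), np.array([0.5, 2.0])),
    (np.array([3.0, 0.0]), np.array([1.0, 1.0])),
]


def run(examples):
    mu, var = np.zeros(2), np.ones(2)   # standard normal prior
    for obs_mu, obs_var in examples:
        mu, var = gaussian_update(mu, var, obs_mu, obs_var)
    return mu, var


mu_fwd, var_fwd = run(stream)
mu_rev, var_rev = run(stream[::-1])
print(mu_fwd, var_fwd)
```

Because the update only accumulates sufficient statistics, the result does not depend on the order in which examples arrive, which is one way to see why no forgetting occurs at this level.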

A separate unification is provided by “A Unified and General Framework for Continual Learning,” which expresses CL as

$\mathcal{L}^{CL} = \mathcal{L}_{CE}(x, y) + \alpha\, D_{\Phi}\big(h_{\theta}(x), z\big) + \beta\, D_{\Psi}\big(\theta, \theta_{old}\big),$

with output-space and weight-space Bregman divergences $D_{\Phi}$ and $D_{\Psi}$ (Wang et al., 2024). Under special choices of $\Phi$, $\Psi$, and the references $z$ or $\theta_{old}$, the framework recovers EWC, CPR, VCL, NCL, ER, and DER. Its additional refresh learning plug-in performs deliberate unlearning followed by relearning of the current mini-batch. Reported gains include ER on CIFAR-100 Class-IL and DER++ on CIFAR-100 Task-IL. These results formalize a recurring tension in the literature: retention is not always identical to preserving every earlier parameter configuration unchanged.
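To make the role of Bregman divergences concrete, the sketch below evaluates the generic definition $D_{\Phi}(p, q) = \Phi(p) - \Phi(q) - \langle \nabla\Phi(q), p - q\rangle$ for two standard generating functions. It is a textbook illustration, not code from the cited paper: half the squared norm yields a squared Euclidean penalty (the shape of a weight-space regularizer), and negative entropy yields the KL divergence between probability vectors (the shape of an output-space distillation term).

```python
import numpy as np


def bregman(phi, grad_phi, p, q):
    """Bregman divergence D_phi(p, q) for a differentiable convex phi."""
    return phi(p) - phi(q) - grad_phi(q) @ (p - q)


def half_sq_norm(x):
    return 0.5 * x @ x


def half_sq_norm_grad(x):
    return x


def neg_entropy(x):
    return np.sum(x * np.log(x))


def neg_entropy_grad(x):
    return np.log(x) + 1.0


p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

d_euclid = bregman(half_sq_norm, half_sq_norm_grad, p, q)
d_kl = bregman(neg_entropy, neg_entropy_grad, p, q)

print(d_euclid, 0.5 * np.sum((p - q) ** 2))
print(d_kl, np.sum(p * np.log(p / q)))
```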

4. Representation continuation through replay, distillation, and drift adaptation

Some continuation-based frameworks preserve knowledge primarily in representation space rather than in parameter priors. “Generative Continual Concept Learning” proposes ECLA, an autoencoder-classifier architecture in which new forms of a concept are coupled to previously learned forms in a shared latent space (Rostami et al., 2019). The encoder maps inputs into a discriminative embedding, a decoder makes the system generative, and a Gaussian mixture model (GMM) over latent clusters approximates a task-independent concept distribution. For later tasks, the method trains on both current labeled samples and pseudo-replayed samples generated from the latent GMM, while aligning current and historical latent distributions using Sliced Wasserstein Distance (SWD). The paper evaluates this on Permuted MNIST and sequential MNIST $\rightarrow$ USPS / USPS $\rightarrow$ MNIST, with only the first task fully labeled and about 10 labeled examples per class for the new task in the transfer setting. The reported effects include retained first-task performance, a jump-start on the second task, and embeddings that merge same-class instances across domains into shared clusters.

“Self-Supervised Models are Continual Learners” develops CaSSLe, which turns a self-supervised loss into a continual distillation mechanism by adding a predictor network $g$ that maps current features into the previous representation space (Fini et al., 2021). Instead of directly constraining the current features $z$ to equal the features $\bar{z}$ of the frozen previous model, CaSSLe minimizes the distillation loss

$\mathcal{L}_{D}(z, \bar{z}) = \mathcal{L}_{SSL}\big(g(z), \bar{z}\big)$

and uses the same SSL objective both for current-task learning and for temporal distillation. The method is applied to SimCLR, MoCoV2+, BYOL, SwAV, Barlow Twins, and VICReg. It uses a 2-layer MLP with 2048 hidden units and ReLU, sums the SSL and distillation losses without an additional weighting hyperparameter, and adds about 30% extra memory/time overhead. Reported average linear-evaluation gains are about 6.8% on class-incremental CIFAR100, about 4% on ImageNet100, and about 4.4% on domain-incremental DomainNet.

A different representation-level continuation appears in “Class-Incremental Experience Replay for Continual Learning under Concept Drift,” which explicitly combines continual learning with drift adaptation (Korycki et al., 2021). The method maintains a centroid-driven memory with replay buffers attached to multiple centroids per class and augments it with a Reactive Subspace Buffer (RSB) that tracks local label changes. Replay from each cluster is purity-aware, and the method also specifies a condition under which clusters are split.

The paper evaluates MNIST, FASHION, SVHN, CIFAR10, and IMAGENET10 under both stationary and drift settings. In the drift setting, reported normalized average accuracies for ER-RSB are 0.9938 on MNIST, 0.9745 on FASHION, 0.9722 on SVHN, 0.9545 on CIFAR10, and 0.9187 on IMAGENET10. A central implication is that a continuation-based learner need not preserve all historical information indiscriminately; it may instead preserve valid subspaces and revise obsolete ones.

5. Curriculum ordering and sequence-model formulations

Another research line treats continuation as the temporal ordering of classes or tasks. “Learning to Learn: How to Continuously Teach Humans and Machines” studies online class-incremental continual learning in which each example is shown once and the curriculum is the order of classes (Singh et al., 2022). Curriculum quality is evaluated by the average accuracy over all seen classes and by the drop in first-task accuracy between the first and last task, and the two quantities are combined into a single score.

The paper proposes a feature-based Curriculum Designer (CD) using SqueezeNet features pretrained on ImageNet, with 500 images per class to compute prototypes. The ranking heuristic favors a centrally located first class, larger inter-class distances in the early middle portion, and late classes similar to the first class as a replay-like reinforcement mechanism. Human experiments on the Novel Object Dataset (NOD) retained 169 MTurk subjects after filtering, for 34,848 test trials, and also included an in-lab study with 60 subjects. The supplement reports Spearman correlations on MNIST paradigm-I of 0.26 between algorithms, 0.08 between algorithm and CD, and 0.0002 between algorithm and random ranking; the curriculum discrepancy decreases from random-human to algorithm-human by about 0.13. In this literature, continuation is the deliberate progression of concepts through a teaching order.

“Recasting Continual Learning as Sequence Modeling” goes further by replacing inner-loop parameter updates with state updates inside a sequence model (Lee et al., 2023). A continual learner is written as a functional that maps a training stream to a predictor, and its stream-processing version consumes that stream one example at a time.

In the proposed formulation, the continual-learning process becomes the forward pass of a sequence model, with the hidden state $h_t$ carrying accumulated information from the stream. Decoder-only Transformers and kernel-based efficient Transformers such as Linear Transformer and Performer are then used as meta-continual learners. For kernel-based attention, each head maintains recurrent statistics

$S_t = S_{t-1} + \phi(k_t)\, v_t^{\top}, \qquad z_t = z_{t-1} + \phi(k_t),$

where $\phi$ is the kernel feature map and $k_t$, $v_t$ are the key and value at step $t$, yielding linear-time state updates. The paper reports experiments on seven benchmarks, covering both classification and regression, with a default number of tasks per episode and a larger number for longer episodes. This suggests a shift from parameter continuation to state continuation: the learner is not repeatedly re-optimized but instead accumulates episode-specific state through causal sequence processing.
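The idea that kernel-based attention reduces to a recurrent state update can be sketched with NumPy. The code below follows the commonly used linear-attention formulation with an `elu(x) + 1` feature map, which is an assumption here and not necessarily the exact configuration of the cited paper; all function names are illustrative. It compares the recurrent form with the equivalent causal quadratic form.

```python
import numpy as np


def feature_map(x):
    """elu(x) + 1, a positive feature map often used for linear attention."""
    return np.where(x > 0, x + 1.0, np.exp(x))


def linear_attention_recurrent(Q, K, V):
    """Causal linear attention computed with a running state (S, z)."""
    d_k, d_v = K.shape[1], V.shape[1]
    S = np.zeros((d_k, d_v))   # running sum of phi(k) v^T
    z = np.zeros(d_k)          # running sum of phi(k), used for normalization
    outputs = []
    for q, k, v in zip(Q, K, V):
        fk = feature_map(k)
        S += np.outer(fk, v)
        z += fk
        fq = feature_map(q)
        outputs.append(fq @ S / (fq @ z))
    return np.array(outputs)


def linear_attention_quadratic(Q, K, V):
    """Same computation written as masked attention over all past steps."""
    A = np.tril(feature_map(Q) @ feature_map(K).T)
    return (A @ V) / A.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 5, 4))   # 5 time steps, dimension 4

out_rec = linear_attention_recurrent(Q, K, V)
out_quad = linear_attention_quadratic(Q, K, V)
print(np.max(np.abs(out_rec - out_quad)))
```

The recurrent version touches each example once and stores only `S` and `z`, so memory does not grow with stream length; this fixed-size state is what plays the role of the continued object.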

6. Optimization by continuation and analytic continuation beyond standard continual learning

Outside continual-learning benchmarks, continuation-based learning also denotes objective smoothing and learned inverse operators. “Policy Gradient Algorithms Implicitly Optimize by Continuation” formulates direct policy optimization in the optimization by continuation framework (Bolland et al., 2023). Instead of optimizing a nonconvex objective $f$ directly, one optimizes a family of smoother surrogate objectives

$f_{\sigma}(x) = \mathbb{E}_{\epsilon \sim p_{\sigma}}\big[f(x + \epsilon)\big],$

where the perturbation distribution $p_{\sigma}$ is progressively narrowed. For reinforcement learning, the continuation of the policy return is defined by perturbing policy parameters over time, $\theta + \epsilon_t$, and the resulting smoothed objective is shown to equal the return of a mirror policy

$\pi^{*}_{\theta}(a \mid s) = \mathbb{E}_{\epsilon \sim p_{\sigma}}\big[\pi_{\theta + \epsilon}(a \mid s)\big].$

For deterministic policies $\mu_{\theta}$ that are affine in their parameters, under Gaussian perturbations $\epsilon \sim \mathcal{N}(0, \Sigma)$, the mirror policy becomes Gaussian with covariance

$\Sigma'(s) = \nabla_{\theta}\mu_{\theta}(s)\, \Sigma\, \nabla_{\theta}\mu_{\theta}(s)^{\top}.$

The paper interprets stochastic exploration and entropy regularization as mechanisms that maintain this smoothing, and argues that policy variance should be a history-dependent continuation parameter chosen to avoid poor local extrema rather than merely to maximize immediate return.
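The effect of optimizing smoothed surrogates before the original objective can be seen on a one-dimensional toy function. The sketch below is an illustration of optimization by continuation in general, not of the policy-gradient setting: the objective $f(x) = 0.1x^2 + \sin(3x)$, the smoothing schedule, and the step size are all assumptions, and the Gaussian-smoothed gradient is available in closed form for this particular $f$.

```python
import math


def f(x):
    """Toy nonconvex objective with several local minima."""
    return 0.1 * x * x + math.sin(3.0 * x)


def grad_smoothed(x, sigma):
    """Gradient of E[f(x + eps)] with eps ~ N(0, sigma^2).

    For this f, Gaussian smoothing damps the sine term by exp(-4.5 sigma^2);
    sigma = 0 recovers the gradient of f itself.
    """
    return 0.2 * x + 3.0 * math.cos(3.0 * x) * math.exp(-4.5 * sigma ** 2)


def descend(x, sigma, steps, lr=0.1):
    """Plain gradient descent on the sigma-smoothed objective."""
    for _ in range(steps):
        x -= lr * grad_smoothed(x, sigma)
    return x


x0 = 4.0

# Baseline: descend the original objective directly.
x_plain = descend(x0, sigma=0.0, steps=1200)

# Continuation: descend progressively less smoothed objectives,
# warm-starting each stage from the previous solution.
x_cont = x0
for sigma in (1.0, 0.5, 0.25, 0.0):
    x_cont = descend(x_cont, sigma, steps=300)

print(f"plain:        x = {x_plain:.3f}, f = {f(x_plain):.3f}")
print(f"continuation: x = {x_cont:.3f}, f = {f(x_cont):.3f}")
```

With this starting point the direct descent settles in a nearby local minimum, while the continuation schedule first follows the almost convex heavily smoothed objective and then tracks its minimizer as the smoothing is removed. The amount of smoothing here plays the role that policy variance plays in the discussion above.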

“Analytic Continuation of Noisy Data Using Adams Bashforth ResNet” uses the term in yet another sense: learning the inverse operator that maps noisy imaginary-time data to a spectral function in quantum many-body physics (Xie et al., 2019). The forward relation is the Fredholm integral equation

$G(\tau) = \int K(\tau, \omega)\, A(\omega)\, d\omega,$

where $K$ is the kernel relating the spectral function $A(\omega)$ to the imaginary-time data $G(\tau)$, and the baseline MaxEnt method minimizes

$Q[A] = \tfrac{1}{2}\chi^{2}[A] - \alpha\, S[A],$

where $\chi^{2}$ measures the misfit to the data, $S$ is the entropy relative to a default model, and $\alpha$ weights the two terms.

The proposed AB-ResNet replaces iterative inverse-problem solving with a supervised map $G(\tau)\mapsto A(\omega)$ learned by a residual network interpreted as an ODE discretization. Standard ResNet corresponds to AB1, and the paper proposes higher-order AB2 and AB3 multistep residual updates. Synthetic training data are generated from Gaussian-mixture spectra at three noise levels, the imaginary-time signal is represented by 64 Legendre polynomial coefficients, and the reported dataset sizes are 100,000 for training, 1000 for validation, and 1000 for testing. The paper reports mean absolute errors for AB1, AB2, and AB3 and compares CPU times for AB-ResNet against MaxEnt. Here, continuation does not mean task sequencing at all; it means learning a stable approximation to an ill-posed continuation operator.
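The correspondence between residual updates and Adams-Bashforth steps can be illustrated on a scalar ODE. In the sketch below, `f` stands in for a learned residual branch; the test equation $x' = -x$, the step size, and the function names are assumptions for illustration, and only the AB1 and AB2 rules are shown.

```python
import math


def integrate(f, x0, h, steps, order):
    """Integrate x' = f(x) with an Adams-Bashforth rule of order 1 or 2.

    AB1: x_{n+1} = x_n + h f(x_n)                      (forward Euler, the
                                                        standard residual update)
    AB2: x_{n+1} = x_n + h (3/2 f(x_n) - 1/2 f(x_{n-1}))
    The first AB2 step falls back to AB1 because no history exists yet.
    """
    if order not in (1, 2):
        raise ValueError("only AB1 and AB2 are sketched here")
    x = x0
    history = []
    for n in range(steps):
        history.append(f(x))
        if order == 1 or n == 0:
            x = x + h * history[-1]
        else:
            x = x + h * (1.5 * history[-1] - 0.5 * history[-2])
    return x


def decay(x):
    """Right-hand side of the test ODE x' = -x."""
    return -x


exact = math.exp(-1.0)                       # solution at t = 1 for x(0) = 1
x_ab1 = integrate(decay, 1.0, h=0.1, steps=10, order=1)
x_ab2 = integrate(decay, 1.0, h=0.1, steps=10, order=2)
err_ab1 = abs(x_ab1 - exact)
err_ab2 = abs(x_ab2 - exact)
print(f"AB1 error = {err_ab1:.5f}, AB2 error = {err_ab2:.5f}")
```

In a network, each step would be a block and the multistep rule amounts to reusing the previous block's residual output with fixed coefficients, so the higher-order update adds little computation per layer.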

Across these literatures, the term “continuation-based learning framework” refers to a common structural principle rather than a single canonical method. What is continued may be a task stream, a posterior, a latent concept model, a curriculum, a recurrent state, a smoothed control objective, or an inverse physical operator. A recurring point of disagreement concerns whether historical information should always be preserved: prior-focused Bayesian CL, replay systems, and infrastructure frameworks emphasize retention, whereas concept-drift adaptation and refresh learning explicitly allow outdated information to be discarded or unlearned (Farquhar et al., 2019, Korycki et al., 2021, Wang et al., 2024). The literature therefore presents continuation not as a uniform algorithmic recipe, but as a broad technical paradigm for organizing learning over structured progression.