Recovery Guarantees for Continual Learning of Dependent Tasks: Memory, Data-Dependent Regularization, and Data-Dependent Weights

Published 19 Apr 2026 in cs.LG and math.ST | (2604.17578v1)

Abstract: Continual learning (CL) is concerned with learning multiple tasks sequentially without forgetting previously learned tasks. Despite substantial empirical advances over recent years, the theoretical development of CL remains in its infancy. At the heart of developing CL theory lies the challenge that the data distribution varies across tasks, and we argue that properly addressing this challenge requires understanding this variation--dependency among tasks. To explicitly model task dependency, we consider nonlinear regression tasks and propose the assumption that these tasks are dependent in such a way that the data of the current task is a nonlinear transformation of previous data. With this model and under natural assumptions, we prove statistical recovery guarantees (more specifically, bounds on estimation errors) for several CL paradigms in practical use, including experience replay with data-independent regularization and data-independent weights that balance the losses of tasks, replay with data-dependent weights, and continual learning with data-dependent regularization (e.g., knowledge distillation). To the best of our knowledge, our bounds are informative in cases where prior work gives vacuous bounds.

Abstract PDF Upgrade to Chat

Authors (5)

Summary

The paper introduces a nonlinear regression framework that models autoregressive task dependencies, yielding non-vacuous recovery guarantees for deep models.
It rigorously analyzes experience replay, regularization, and data-dependent weight schemes, showing logarithmic error dependence on inter-task transformations.
The findings reveal that replay-based methods provide tighter error control compared to regularization approaches in sequential task learning.

Recovery Guarantees for Continual Learning of Dependent Tasks

Motivation and Modeling of Task Dependency in Continual Learning

The work addresses a central limitation in the theoretical understanding of continual learning (CL): the proper mathematical modeling of dependencies between tasks. Existing CL theory either focuses on linear or kernelized models, yielding bounds not extensible to deep or nonlinear models, or leverages generalization bounds from classic learning theory, which often become vacuous in the presence of significant distributional shift between tasks and fail to characterize the benefits of learning multiple tasks sequentially.

The authors propose a nonlinear regression framework grounded in the assumption that input data for each new task is generated as a (deterministic, unknown) nonlinear transformation of data from preceding tasks, linking the tasks in an explicit autoregressive dependency structure. Formally, all tasks share a predictor $f^*$ , and, for each $t>1$ , task- $t$ inputs are generated as $x_t = g_t(x_1, \ldots, x_{t-1})$ , with $y_t = f^*(x_t) + v_t$ .

This dependency model unifies and generalizes multiple practical scenarios, including permuted and rotated MNIST benchmarks and autoregressive dynamical systems. Critically, this modeling produces a one-to-one mapping between samples across tasks, circumnavigating analytic pathologies in prior work (such as vacuous or loose generalization bounds growing linearly with the “distance” between task distributions) by leveraging the continuity and structure of the $g_t$ transformations. The sample-level dependency is in stark contrast to both arbitrary (unstructured) dependencies across tasks as assumed in some prior CL bounds, and to the independence (i.i.d.) assumption within standard PAC-learning or multi-task learning with independent tasks.

Theoretical Guarantees: Replay, Regularization, and Data-Dependent Weights

The main technical contributions are the derivation of explicit statistical error bounds for learning $f^*$ in three foundational CL paradigms, under the proposed dependency model and minimal technical assumptions. Theoretical analysis covers experience replay methods, regularization-based methods (with both data-independent and data-dependent regularization), and replay with data-dependent weights.

Experience Replay with Data-Independent Regularization

Analysis of the replay-plus-regularization objective establishes that, for suitable weight selection across tasks and regularization parameter, the (weighted) estimation error

$\frac{1}{T}\sum_{t \in [T]} w_t \, \mathbb{E}\| f^*(x_t) - f_{\hat{\theta}_T}(x_t)\|^2$

admits the upper bound

$\tilde{O}\left( \frac{\nu^2 (p + \log(1/\delta)) + \operatorname{poly}(\sigma) }{ T } \max_{t\in [T]} \frac{w_t}{n_t} \right),$

where $n_t$ is the number of (stored) samples from task- $t>1$ 0, $t>1$ 1 and $t>1$ 2 are sub-Gaussian data and label noise variances, and $t>1$ 3 is the optimizer of the objective. The critical technical feature is that the error dependence on inter-task transformation strength (e.g., scaling parameter $t>1$ 4) is only logarithmic, in contrast to prior domain-discrepancy-based bounds which become vacuous as $t>1$ 5 increases [(2604.17578), Mansour-arXiv2009, Ye-NeurIPS2022]. Further, for uniform task weights and balanced memory, the error decays as $t>1$ 6, matching optimal sample complexity.

Recovery with Data-Dependent Regularization (Knowledge Distillation)

CL methods using explicit distillation-type regularization, which penalizes output discrepancies between the current and earlier predictors on stored samples, are accommodated in this analysis. For exponentially decaying regularization weights $t>1$ 7, the error bound demonstrates exponential down-weighting of errors on older tasks, confirming and quantifying the known phenomenon that regularization-based rehearsal-free approaches incur compounding error over time. The derived bound is

$t>1$ 8

whereas experience replay achieves strictly tighter control (non-exponential) over past task errors.

Recovery with Data-Dependent Weights

Building on frameworks such as dual-optimization and coreset selection, the authors show that the average estimation error for replay with arbitrary data-dependent weights, as long as the weights are bounded in $t>1$ 9 and $t$ 0, has the same asymptotic order as in the data-independent case:

$t$ 1

The requirement on the weight scale is necessary to prevent degenerate sample concentration; the result is robust to randomization and adversarial weight selection.

Implications, Limitations, and Theoretical Consolidation

This work fundamentally demonstrates that, under a task dependency model instantiated as “transformed sample trajectories”—a scenario that empirically matches both synthetic CL benchmarks and real-world dynamical systems—statistically meaningful recovery of the shared predictor is possible, even for deep nonlinear models. Notably:

Error bounds improve (decrease) as the number of tasks grows for balanced memory, and the analytic machinery avoids the multiplicative effects of $t$ 2 seen in previous approaches that invoke concentration inequalities per-task.
Replay-based methods are strictly preferable in error control over regularization-based rehearsal-free approaches under strong task dependency, in quantitative agreement with empirical practice [Prabhu-ECCV2020]. The exponential error decay for older tasks in regularization distillation, as explicitly established here, characterizes a limitation of CL methods relying solely on regularization.
Logarithmic dependence on inter-task transformation strength or distribution shift resolves a previously pathological aspect of CL theory, where increasing distance between task distributions always led to vacuous or uninformative error bounds.

Two explicit caveats in the presented theory are: (1) the requirement that the stored sample count per task grows at least as fast as the model parameter dimension, limiting immediate applicability to severely memory-constrained or highly overparameterized regimes, and (2) the dependency structure, while natural in CL benchmarks, still leaves open the more challenging problem of arbitrary or partially observable inter-task dependencies.

Future Directions and Theoretical Development in Continual Learning

Outlook for future research includes (i) extending the recovery bounds for regularization-based methods to tighter settings, perhaps exploiting linear or kernel structures, (ii) adaptation of the data-dependent weighting guarantees to richer dual-optimization and constrained-memory settings as in [Chamon-TIT2022], and (iii) analysis of the effects of specific algorithmic design (beyond the statistical estimation perspective).

The methodology opens routes to theoretical study in class/domain-incremental learning without task identities and to more general dynamical or reinforcement learning settings in which system states and observations are nonlinearly related over temporal transitions.

Conclusion

By formulating and analyzing continual learning as the recovery of a common predictor from nonlinearly dependent task data, this work provides tight, non-vacuous error bounds for a range of practical CL protocols, with explicit characterization of memory, regularization, and weighting schemes. These results clarify both the potential and intrinsic limitations of classical CL paradigms, and set a theoretical basis for further study of CL with structured task dependencies and beyond.

Markdown Report Issue