- The paper establishes that methods from adaptive filtering, such as LMS and APA, provide a solid mathematical foundation for balancing the learning of new data against the retention of previous knowledge.
- It demonstrates that iterative algorithms like LMS converge exponentially fast under repeated tasks, and that gradient projection methods in deep networks prevent forgetting by projecting updates onto the null space of past features.
- The study reveals that recursive approaches, including RLS and the Kalman filter, not only compute weighted averages of past task solutions but also offer positive backward transfer, underpinning the connection between linear models and their nonlinear adaptations.
This paper provides a tutorial overview of the mathematical connections between continual learning (CL) and adaptive filtering (AF), arguing that insights from AF can provide a stronger mathematical foundation for understanding and developing CL methods. The core idea is that both fields grapple with the problem of learning from sequential data while balancing new information with previously acquired knowledge.
Continual Learning and Adaptive Filtering
Continual learning aims to train a machine learning model on a sequence of tasks without suffering from catastrophic forgetting, where performance on previous tasks degrades significantly upon learning new ones. Modern deep learning approaches for CL include:
- Replay-based methods: Store a subset of past data and replay it during training on new tasks.
- Constrained optimization methods: Learn new tasks while ensuring performance on old tasks remains above a certain threshold, often formulated as optimization problems with constraints.
- Regularization-based methods: Add penalty terms to the objective function to keep the model parameters close to their values for previous tasks.
- Expansion-based methods: Add new parameters or network components for each new task.
Adaptive filtering, a classic field in signal processing, focuses on updating model parameters online as new data arrives. Early examples, like the Least Mean Squares (LMS) algorithm, embodied a "minimal disturbance principle" – adapting to reduce the current error while minimizing disruption to existing knowledge – which closely aligns with the goals of continual learning.
Despite differences (CL often deals with nonlinear models and uncorrelated task data, while classic AF focuses on linear models and temporally correlated data), the paper argues for deep connections based on three observations:
- Theoretical analysis of CL methods often relies on simplified linear models, where connections to AF become apparent.
- Core AF methods (LMS, APA, RLS, KF) can be extended to handle data without strong temporal correlation, fitting the CL setting.
- These AF methods can be adapted for nonlinear models and deep networks through linearization, layer-wise application, or focusing on training linear layers on features from pre-trained models.
The paper then reviews specific AF methods and their CL counterparts, highlighting these connections.
Least Mean Squares (LMS)
In the context of sequential tasks with one sample per task, LMS updates the model $\theta_{t-1}$ to $\theta_t$ to fit the current sample $(x_t, y_t)$ while staying close to $\theta_{t-1}$. Mathematically, it can be seen as a constrained optimization problem:

$$\min_{\theta} \|\theta - \theta_{t-1}\|_2^2 \quad \text{s.t.} \quad y_t - x_t^\top \theta = (1-\gamma)\,(y_t - x_t^\top \theta_{t-1})$$

This leads to the familiar SGD-like update: $\theta_t = \theta_{t-1} - \gamma\,(x_t^\top \theta_{t-1} - y_t)\,x_t$.
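As a concrete illustration, here is a minimal NumPy sketch of the LMS update on a toy stream of one-sample tasks (the helper name `lms_step` and the data-generation loop are ours, not the paper's):

```python
import numpy as np

def lms_step(theta, x, y, gamma):
    """One LMS update: reduce the error on (x, y) while staying close to theta."""
    error = x @ theta - y
    return theta - gamma * error * x

# Toy usage: one sample per task, realizable linear model.
rng = np.random.default_rng(0)
d = 5
theta_star = rng.normal(size=d)      # unknown true model
theta = np.zeros(d)
for t in range(100):
    x = rng.normal(size=d)
    x /= np.linalg.norm(x)           # normalized inputs, as in the analysis
    y = x @ theta_star
    theta = lms_step(theta, x, y, gamma=0.5)
```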
For linear models ($y_i = x_i^\top \theta^*$) and normalized inputs ($\|x_i\|_2 = 1$), the squared error on task $i$ using the model from task $j$ is $\epsilon_{ij}^2 = (y_i - x_i^\top \theta_j)^2$. Continual learning aims to keep the mean squared error (MSE) $E_t = \frac{1}{t}\sum_{i=1}^{t} \epsilon_{it}^2$ low.
The paper shows that for i.i.d. data or cyclically repeated tasks, LMS converges to the true model $\theta^*$ exponentially fast, meaning the MSE decreases over tasks. For example, under a 2-recurring task sequence, the MSE at task $T$ decays as $O(c^{T-1})$, where $c = (x_1^\top x_2)^2$ (2504.17963). Optimal stepsizes can even achieve perfect recovery after a few tasks (2504.17963). This theoretical analysis shows that LMS performs well when tasks are revisited, implicitly mitigating forgetting through repeated exposure.
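A quick numerical check of the 2-recurring case is easy to write down. The sketch below (our construction, under the paper's noiseless, normalized-input assumptions) cycles LMS between two fixed tasks and tracks the MSE, which should shrink by roughly a factor $c = (x_1^\top x_2)^2$ per revisit:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
theta_star = rng.normal(size=d)
x1, x2 = rng.normal(size=d), rng.normal(size=d)
x1 /= np.linalg.norm(x1)
x2 /= np.linalg.norm(x2)
tasks = [(x1, x1 @ theta_star), (x2, x2 @ theta_star)]

theta = np.zeros(d)
for t in range(40):                              # 2-recurring task sequence
    x, y = tasks[t % 2]
    theta = theta - (x @ theta - y) * x          # LMS with stepsize 1
    mse = np.mean([(yi - xi @ theta) ** 2 for xi, yi in tasks])
    print(t, mse)                                # decays geometrically in t
```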
Affine Projection Algorithm (APA)
APA extends LMS by using a memory buffer of size $b$ containing the $b$ most recent samples. For the current task $t$, it solves:

$$\min_{\theta} \|\theta - \theta_{t-1}\|_2^2 \quad \text{s.t.} \quad y_{t-i} = x_{t-i}^\top \theta, \quad i = 0, \ldots, b$$
This projects the previous model onto the common solution set of the current sample and the $b$ buffered ones.
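In code, one APA step is a least-squares projection. A minimal sketch (our naming; `np.linalg.lstsq` returns the minimum-norm correction, which is exactly the projection onto the constraint set):

```python
import numpy as np

def apa_step(theta_prev, X, y):
    """Project theta_prev onto {theta : X.T @ theta = y}.

    X is d x (b+1): the current sample and the b buffered ones as columns.
    """
    residual = y - X.T @ theta_prev
    # Minimum-norm correction satisfying the constraints.
    delta, *_ = np.linalg.lstsq(X.T, residual, rcond=None)
    return theta_prev + delta
```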
A special case, APA†, uses all past data ($b = t-1$). The paper proves that APA† (starting from $\theta_0 = 0$) implicitly solves the minimum-norm problem:

$$\min_{\theta} \|\theta\|_2^2 \quad \text{s.t.} \quad y_{:t} = X_{:t}^\top \theta$$
This highlights APA†'s implicit bias towards the minimum-norm solution consistent with all seen data (2504.17963).
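This claim can be verified numerically: running the APA† recursion from $\theta_0 = 0$ on an underdetermined problem should reproduce the batch minimum-norm solution. A sketch under those assumptions (our construction):

```python
import numpy as np

rng = np.random.default_rng(2)
d, t = 10, 4                        # fewer samples than dimensions
X = rng.normal(size=(d, t))         # columns are x_1, ..., x_t
y = rng.normal(size=t)

theta = np.zeros(d)                 # APA† starts from zero
for i in range(1, t + 1):
    Xi, yi = X[:, :i], y[:i]
    r = yi - Xi.T @ theta
    delta, *_ = np.linalg.lstsq(Xi.T, r, rcond=None)
    theta = theta + delta           # project onto the solution set of all data so far

theta_mn, *_ = np.linalg.lstsq(X.T, y, rcond=None)   # batch minimum-norm solution
assert np.allclose(theta, theta_mn)
```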
The Ideal Continual Learner (ICL) framework [Peng-ICML2023] for linear models is formulated as minimizing the current task's loss subject to previous tasks being solved optimally:
$$\min_{\theta} \;(y_t - x_t^\top \theta)^2 \quad \text{s.t.} \quad y_{:t-1} = X_{:t-1}^\top \theta$$
The paper shows that an online update derived for ICL is equivalent to APA†, further solidifying the connection and its implicit bias towards the minimum-norm solution (2504.17963). This update uses the current sample $x_t$ projected onto the null space of the previous data $X_{:t-1}^\top$.
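A minimal sketch of that online update (our naming; `P` is the orthogonal projector onto the null space of the past data, maintained by rank-one deflation):

```python
import numpy as np

def icl_step(theta, P, x, y, tol=1e-10):
    """One ICL/APA†-style update that preserves all past constraints."""
    px = P @ x                       # component of x unseen by previous tasks
    n2 = px @ px
    if n2 < tol:                     # x lies in the span of past data: skip
        return theta, P
    theta = theta + ((y - x @ theta) / n2) * px
    P = P - np.outer(px, px) / n2    # remove the new direction from the null space
    return theta, P

# Initialization: theta = 0, P = identity (empty past, full null space).
```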
Gradient Projection (GP) methods for deep networks [Saha-ICLR2021, Wang-CVPR2021] are presented as extensions of these ideas. GP updates layer parameters $\Theta^\ell$ by projecting the standard update direction $\Delta_t^\ell$ onto the null space of previously seen output features from the layer below ($P_{t-1}^{\ell-1} \Delta_t^\ell$). The paper proves that this ensures previous outputs $f_{\theta_t}^\ell(x_i)$ remain unchanged for all $i < t$, thus preventing forgetting on past inputs (2504.17963). Challenges arise in practice when feature spaces grow, making exact projections computationally expensive or impossible. Approximations (like low-rank projections) lose the exact non-forgetting guarantee, illustrating a trade-off. The idea can also be applied to nonlinear models by linearizing around the current estimate.
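A sketch of the core GP operation for one linear layer, under the simplification that past input features are stored explicitly (real implementations use compressed bases; names are ours):

```python
import numpy as np

def project_update(delta, F, rcond=1e-8):
    """Project a layer update onto the null space of past input features.

    delta : proposed update for layer weights W (out_dim x in_dim)
    F     : in_dim x m matrix whose columns are features from previous tasks
    After the projected update, (W + delta_proj) @ f == W @ f for every past
    feature f, so past layer outputs are unchanged.
    """
    U, s, _ = np.linalg.svd(F, full_matrices=False)
    U = U[:, s > rcond * s.max()]            # orthonormal basis of span(F)
    P_null = np.eye(F.shape[0]) - U @ U.T    # projector onto null(F^T)
    return delta @ P_null
```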
Recursive Least-Squares (RLS)
RLS minimizes a weighted sum of squared errors across all past tasks, with weights typically decaying exponentially via a forgetting factor $\beta \in (0,1]$:

$$\min_{\theta} \;\lambda\|\theta\|_2^2 + \sum_{i=1}^{t} \beta^{t-i}\,(y_i - x_i^\top \theta)^2$$
Online updates for the model $\theta_t$ and the inverse Hessian $\Phi_t$ are derived. The paper shows that RLS with $\beta \to 0$ becomes equivalent to the minimum-norm problem and the APA†/ICL family (2504.17963). When a single true model $\theta^*$ doesn't exist for all tasks, RLS computes a weighted average of task solutions. If task solutions are far apart, this average may perform poorly on all tasks.
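The standard recursion looks like the sketch below (our naming; the rank-one update of the inverse is the Sherman-Morrison identity):

```python
import numpy as np

def rls_step(theta, Phi, x, y, beta=0.99):
    """One RLS update with forgetting factor beta.

    Phi tracks the inverse of the exponentially weighted Hessian.
    """
    Phi_x = Phi @ x
    k = Phi_x / (beta + x @ Phi_x)           # gain vector
    theta = theta + k * (y - x @ theta)
    Phi = (Phi - np.outer(k, Phi_x)) / beta  # Sherman-Morrison rank-one update
    return theta, Phi

# Typical initialization: theta = 0, Phi = I / lambda.
```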
RLS concepts are applied to deep networks through layer-wise updates (e.g., Enhanced Back Propagation, Orthogonal Weight Modification [Zeng-NMI2019]) or by training a dynamically expanding linear classifier on features from a pre-trained model [Azimi-TNN1993, Zhuang-NeurIPS2022]. While empirical benefits are shown, theoretical non-forgetting guarantees for RLS with arbitrary β are less clear than for GP.
Kalman Filter (KF)
KF addresses a more general linear Gaussian model setting where each task $i$ has its own "true model" $\theta_i$, and these models evolve over time according to a linear transition model ($\theta_i = A_i \theta_{i-1} + w_i$) with Gaussian noise. Measurements for task $i$ relate to $\theta_i$ linearly ($y_i = X_i^\top \theta_i + v_i$) with Gaussian noise, potentially involving multiple samples per task.
The goal is to estimate the sequence of task models $\theta_1, \ldots, \theta_t$. Using the Maximum A Posteriori (MAP) principle, this amounts to estimating the conditional means $\mathbb{E}[\theta_i \mid y_{:t}]$ for $i \le t$. KF recursively computes the mean $\theta_{t|t}$ and covariance $\Sigma_{t|t}$ for the current task $t$ given all data seen so far ($y_{:t}$) via prediction and correction steps.
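One prediction/correction cycle, written out as a sketch of the standard KF equations with measurement matrix $X_i^\top$ (names are ours):

```python
import numpy as np

def kf_step(theta, Sigma, A, Q, X, y, R):
    """One Kalman filter cycle for the linear Gaussian task model.

    theta, Sigma : posterior mean/covariance from the previous task
    A, Q         : transition matrix / state-noise covariance (theta_i = A theta_{i-1} + w_i)
    X, y, R      : task data (d x n), targets (n,), measurement-noise covariance (n x n)
    """
    # Predict: propagate the previous estimate through the transition model.
    theta_pred = A @ theta
    Sigma_pred = A @ Sigma @ A.T + Q
    # Correct: condition on this task's measurements y = X^T theta + v.
    S = X.T @ Sigma_pred @ X + R              # innovation covariance
    K = Sigma_pred @ X @ np.linalg.inv(S)     # Kalman gain
    theta_new = theta_pred + K @ (y - X.T @ theta_pred)
    Sigma_new = Sigma_pred - K @ X.T @ Sigma_pred
    return theta_new, Sigma_new
```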
The paper shows that RLS is a special case of KF when the task transition matrix $A_i = I$ and the state noise $w_i = 0$, meaning $\theta_i = \theta_{i-1}$ (a single true model), and the measurement noise $v_i$ has a specific structure related to $\beta$.
A key advantage of the KF framework is the concept of positive backward transfer. While KF computes $\theta_{t|t}$, the Rauch-Tung-Striebel (RTS) smoother can be used to compute $\theta_{i|t}$ for $i < t$. The paper proves that conditioning on more data (up to task $t$ instead of task $s < t$) results in estimates with smaller error covariance: $\Sigma_{i|t} \preceq \Sigma_{i|s}$ for $i \le s < t$. This means the estimate for a past task $i$ improves as more subsequent tasks are learned, formally demonstrating positive backward transfer under the linear Gaussian model (2504.17963).
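A sketch of the RTS backward pass (our naming; it assumes the filtered and one-step-predicted moments were stored during the forward KF pass):

```python
import numpy as np

def rts_smooth(means, covs, pred_means, pred_covs, A):
    """Rauch-Tung-Striebel backward pass over t filtered estimates.

    means[i], covs[i]           : filtered theta_{i|i}, Sigma_{i|i}
    pred_means[i], pred_covs[i] : predicted theta_{i|i-1}, Sigma_{i|i-1}
    Returns smoothed theta_{i|t}, Sigma_{i|t}: every past task's estimate is
    revised using all tasks seen so far (positive backward transfer).
    """
    sm_means, sm_covs = list(means), list(covs)
    for i in range(len(means) - 2, -1, -1):
        G = covs[i] @ A.T @ np.linalg.inv(pred_covs[i + 1])   # smoother gain
        sm_means[i] = means[i] + G @ (sm_means[i + 1] - pred_means[i + 1])
        sm_covs[i] = covs[i] + G @ (sm_covs[i + 1] - pred_covs[i + 1]) @ G.T
    return sm_means, sm_covs
```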
Extensions of KF to nonlinear models (the Extended Kalman Filter, EKF), layer-wise applications, and training linear classifiers on pre-trained features are also discussed.
Conclusion
The paper concludes by emphasizing that adaptive filtering theory provides a strong mathematical lens for understanding and advancing continual learning. Key AF principles, such as the minimal disturbance principle and the structure of recursive updates, are highly relevant to CL algorithms designed to mitigate forgetting. The connections between the AF methods (LMS, APA, RLS, KF) and various CL methods (SGD for linear regression, ICL, Gradient Projection, layer-wise updates, methods using pre-trained models) are consolidated, highlighting how memory (explicit, or implicit through projections and covariances) and task relationships play crucial roles. The paper suggests future research directions based on these connections, such as exploring other AF areas, developing theoretical guarantees for nonlinear deep continual learning, and studying "prospective learning" (learning about future tasks).