- The paper establishes that methods from adaptive filtering, such as LMS and APA, provide a solid mathematical foundation for balancing the learning of new data against the retention of previous knowledge.
- It demonstrates that iterative algorithms like LMS converge exponentially fast under repeated tasks, and that gradient projection methods in deep networks prevent forgetting by projecting updates onto the null space of past features.
- The study reveals that recursive approaches, including RLS and the Kalman filter, not only compute weighted averages of past task solutions but also offer positive backward transfer, underpinning the connection between linear models and their nonlinear adaptations.
This paper provides a tutorial overview of the mathematical connections between continual learning (CL) and adaptive filtering (AF), arguing that insights from AF can provide a stronger mathematical foundation for understanding and developing CL methods. The core idea is that both fields grapple with the problem of learning from sequential data while balancing new information with previously acquired knowledge.
Continual Learning and Adaptive Filtering
Continual learning aims to train a machine learning model on a sequence of tasks without suffering from catastrophic forgetting, where performance on previous tasks degrades significantly upon learning new ones. Modern deep learning approaches for CL include:
- Replay-based methods: Store a subset of past data and replay it during training on new tasks.
- Constrained optimization methods: Learn new tasks while ensuring performance on old tasks remains above a certain threshold, often formulated as optimization problems with constraints.
- Regularization-based methods: Add penalty terms to the objective function to keep the model parameters close to their values for previous tasks.
- Expansion-based methods: Add new parameters or network components for each new task.
Adaptive filtering, a classic field in signal processing, focuses on updating model parameters online as new data arrives. Early examples, like the Least Mean Squares (LMS) algorithm, embodied a "minimal disturbance principle" – adapting to reduce the current error while minimizing disruption to existing knowledge – which closely aligns with the goals of continual learning.
Despite differences (CL often deals with nonlinear models and uncorrelated task data, while classic AF focuses on linear models and temporally correlated data), the paper argues for deep connections based on three observations:
- Theoretical analysis of CL methods often relies on simplified linear models, where connections to AF become apparent.
- Core AF methods (LMS, APA, RLS, KF) can be extended to handle data without strong temporal correlation, fitting the CL setting.
- These AF methods can be adapted for nonlinear models and deep networks through linearization, layer-wise application, or focusing on training linear layers on features from pre-trained models.
The paper then reviews specific AF methods and their CL counterparts, highlighting these connections.
Least Mean Squares (LMS)
In the context of sequential tasks with one sample per task, LMS updates the model $\theta_{t-1}$ to $\theta_t$ to fit the current sample $(x_t, y_t)$ while staying close to $\theta_{t-1}$. Mathematically, it can be seen as a constrained optimization problem:

$$\min_{\theta} \|\theta - \theta_{t-1}\|_2^2 \quad \text{s.t.} \quad y_t - x_t^\top \theta = (1-\gamma)\,(y_t - x_t^\top \theta_{t-1})$$

This leads to the familiar SGD-like update: $\theta_t = \theta_{t-1} - \gamma\,(x_t^\top \theta_{t-1} - y_t)\,x_t$.
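As a concrete illustration, here is a minimal NumPy sketch of the LMS update on a toy stream of one-sample tasks (the helper name `lms_step` and the data-generation loop are ours, not the paper's):

```python
import numpy as np

def lms_step(theta, x, y, gamma):
    """One LMS update: reduce the error on (x, y) while staying close to theta."""
    error = x @ theta - y
    return theta - gamma * error * x

# Toy usage: one sample per task, realizable linear model.
rng = np.random.default_rng(0)
d = 5
theta_star = rng.normal(size=d)      # unknown true model
theta = np.zeros(d)
for t in range(100):
    x = rng.normal(size=d)
    x /= np.linalg.norm(x)           # normalized inputs, as in the analysis
    y = x @ theta_star
    theta = lms_step(theta, x, y, gamma=0.5)
```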
For linear models ($y_i = x_i^\top \theta^*$) and normalized inputs ($\|x_i\|_2 = 1$), the squared error on task $i$ using the model from task $j$ is $\epsilon_{ij}^2 = (y_i - x_i^\top \theta_j)^2$. Continual learning aims to keep the mean squared error (MSE) $E_t = \frac{1}{t}\sum_{i=1}^{t} \epsilon_{it}^2$ low.
The paper shows that for i.i.d. data or cyclically repeated tasks, LMS converges to the true model $\theta^*$ exponentially fast, meaning the MSE decreases over tasks. For example, under a 2-recurring task sequence, the MSE at task $T$ decays as $O(c^{T-1})$, where $c = (x_1^\top x_2)^2$ (2504.17963). Optimal stepsizes can even achieve perfect recovery after a few tasks (2504.17963). This theoretical analysis shows that LMS performs well when tasks are revisited, implicitly mitigating forgetting through repeated exposure.
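A quick numerical check of the 2-recurring case is easy to write down. The sketch below (our construction, under the paper's noiseless, normalized-input assumptions) cycles LMS between two fixed tasks and tracks the MSE, which should shrink by roughly a factor $c = (x_1^\top x_2)^2$ per revisit:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
theta_star = rng.normal(size=d)
x1, x2 = rng.normal(size=d), rng.normal(size=d)
x1 /= np.linalg.norm(x1)
x2 /= np.linalg.norm(x2)
tasks = [(x1, x1 @ theta_star), (x2, x2 @ theta_star)]

theta = np.zeros(d)
for t in range(40):                              # 2-recurring task sequence
    x, y = tasks[t % 2]
    theta = theta - (x @ theta - y) * x          # LMS with stepsize 1
    mse = np.mean([(yi - xi @ theta) ** 2 for xi, yi in tasks])
    print(t, mse)                                # decays geometrically in t
```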
Affine Projection Algorithm (APA)
APA extends LMS by using a memory buffer of size $b$ containing the $b$ most recent samples. For the current task $t$, it solves:

$$\min_{\theta} \|\theta - \theta_{t-1}\|_2^2 \quad \text{s.t.} \quad y_{t-i} = x_{t-i}^\top \theta, \quad i = 0, \ldots, b$$
This projects the previous model onto the common solution set of the current sample and the $b$ buffered ones.
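In code, one APA step is a least-squares projection. A minimal sketch (our naming; `np.linalg.lstsq` returns the minimum-norm correction, which is exactly the projection onto the constraint set):

```python
import numpy as np

def apa_step(theta_prev, X, y):
    """Project theta_prev onto {theta : X.T @ theta = y}.

    X is d x (b+1): the current sample and the b buffered ones as columns.
    """
    residual = y - X.T @ theta_prev
    # Minimum-norm correction satisfying the constraints.
    delta, *_ = np.linalg.lstsq(X.T, residual, rcond=None)
    return theta_prev + delta
```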
A special case, APA†, uses all past data ($b = t-1$). The paper proves that APA† (starting from $\theta_0 = 0$) implicitly solves the minimum-norm problem:

$$\min_{\theta} \|\theta\|_2^2 \quad \text{s.t.} \quad y_{:t} = X_{:t}^\top \theta$$
This highlights APA†'s implicit bias towards the minimum-norm solution consistent with all seen data (2504.17963).
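This claim can be verified numerically: running the APA† recursion from $\theta_0 = 0$ on an underdetermined problem should reproduce the batch minimum-norm solution. A sketch under those assumptions (our construction):

```python
import numpy as np

rng = np.random.default_rng(2)
d, t = 10, 4                        # fewer samples than dimensions
X = rng.normal(size=(d, t))         # columns are x_1, ..., x_t
y = rng.normal(size=t)

theta = np.zeros(d)                 # APA† starts from zero
for i in range(1, t + 1):
    Xi, yi = X[:, :i], y[:i]
    r = yi - Xi.T @ theta
    delta, *_ = np.linalg.lstsq(Xi.T, r, rcond=None)
    theta = theta + delta           # project onto the solution set of all data so far

theta_mn, *_ = np.linalg.lstsq(X.T, y, rcond=None)   # batch minimum-norm solution
assert np.allclose(theta, theta_mn)
```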
The Ideal Continual Learner (ICL) framework [Peng-ICML2023] for linear models is formulated as minimizing the current task's loss subject to previous tasks being solved optimally:
$$\min_{\theta} \;(y_t - x_t^\top \theta)^2 \quad \text{s.t.} \quad y_{:t-1} = X_{:t-1}^\top \theta$$
The paper shows that an online update derived for ICL is equivalent to APA†, further solidifying the connection and its implicit bias towards the minimum-norm solution (2504.17963). This update uses the current sample $x_t$ projected onto the null space of the previous data $X_{:t-1}^\top$.
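A minimal sketch of that online update (our naming; `P` is the orthogonal projector onto the null space of the past data, maintained by rank-one deflation):

```python
import numpy as np

def icl_step(theta, P, x, y, tol=1e-10):
    """One ICL/APA†-style update that preserves all past constraints."""
    px = P @ x                       # component of x unseen by previous tasks
    n2 = px @ px
    if n2 < tol:                     # x lies in the span of past data: skip
        return theta, P
    theta = theta + ((y - x @ theta) / n2) * px
    P = P - np.outer(px, px) / n2    # remove the new direction from the null space
    return theta, P

# Initialization: theta = 0, P = identity (empty past, full null space).
```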
Gradient Projection (GP) methods for deep networks [Saha-ICLR2021, Wang-CVPR2021] are presented as extensions of these ideas. GP updates layer parameters $\Theta^\ell$ by projecting the standard update direction $\Delta_t^\ell$ onto the null space of previously seen output features from the layer below ($P_{t-1}^{\ell-1} \Delta_t^\ell$). The paper proves that this ensures previous outputs $f_{\theta_t}^\ell(x_i)$ remain unchanged for all $i < t$, thus preventing forgetting on past inputs (2504.17963). Challenges arise in practice when feature spaces grow, making exact projections computationally expensive or impossible. Approximations (like low-rank projections) lose the exact non-forgetting guarantee, illustrating a trade-off. The idea can also be applied to nonlinear models by linearizing around the current estimate.
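A sketch of the core GP operation for one linear layer, under the simplification that past input features are stored explicitly (real implementations use compressed bases; names are ours):

```python
import numpy as np

def project_update(delta, F, rcond=1e-8):
    """Project a layer update onto the null space of past input features.

    delta : proposed update for layer weights W (out_dim x in_dim)
    F     : in_dim x m matrix whose columns are features from previous tasks
    After the projected update, (W + delta_proj) @ f == W @ f for every past
    feature f, so past layer outputs are unchanged.
    """
    U, s, _ = np.linalg.svd(F, full_matrices=False)
    U = U[:, s > rcond * s.max()]            # orthonormal basis of span(F)
    P_null = np.eye(F.shape[0]) - U @ U.T    # projector onto null(F^T)
    return delta @ P_null
```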
Recursive Least-Squares (RLS)
RLS minimizes a weighted sum of squared errors across all past tasks, with weights typically decaying exponentially via a forgetting factor $\beta \in (0,1]$:

$$\min_{\theta} \;\lambda\|\theta\|_2^2 + \sum_{i=1}^{t} \beta^{t-i}\,(y_i - x_i^\top \theta)^2$$
Online updates for the model $\theta_t$ and the inverse Hessian $\Phi_t$ are derived. The paper shows that RLS with $\beta \to 0$ becomes equivalent to the minimum-norm problem and the APA†/ICL family (2504.17963). When a single true model $\theta^*$ doesn't exist for all tasks, RLS computes a weighted average of task solutions. If task solutions are far apart, this average may perform poorly on all tasks.
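The standard recursion looks like the sketch below (our naming; the rank-one update of the inverse is the Sherman-Morrison identity):

```python
import numpy as np

def rls_step(theta, Phi, x, y, beta=0.99):
    """One RLS update with forgetting factor beta.

    Phi tracks the inverse of the exponentially weighted Hessian.
    """
    Phi_x = Phi @ x
    k = Phi_x / (beta + x @ Phi_x)           # gain vector
    theta = theta + k * (y - x @ theta)
    Phi = (Phi - np.outer(k, Phi_x)) / beta  # Sherman-Morrison rank-one update
    return theta, Phi

# Typical initialization: theta = 0, Phi = I / lambda.
```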
RLS concepts are applied to deep networks through layer-wise updates (e.g., Enhanced Back Propagation, Orthogonal Weight Modification [Zeng-NMI2019]) or by training a dynamically expanding linear classifier on features from a pre-trained model [Azimi-TNN1993, Zhuang-NeurIPS2022]. While empirical benefits are shown, theoretical non-forgetting guarantees for RLS with arbitrary β are less clear than for GP.
Kalman Filter (KF)
KF addresses a more general linear Gaussian model setting where each task $i$ has its own "true model" $\theta_i$, and these models evolve over time according to a linear transition model ($\theta_i = A_i \theta_{i-1} + w_i$) with Gaussian noise. Measurements for task $i$ relate to $\theta_i$ linearly ($y_i = X_i^\top \theta_i + v_i$) with Gaussian noise, potentially involving multiple samples per task.
The goal is to estimate the sequence of task models $\theta_1, \ldots, \theta_t$. Using the Maximum A Posteriori (MAP) principle, this amounts to estimating the conditional means $\mathbb{E}[\theta_i \mid y_{:t}]$ for $i \le t$. KF recursively computes the mean $\theta_{t|t}$ and covariance $\Sigma_{t|t}$ for the current task $t$ given all data seen so far ($y_{:t}$) via prediction and correction steps.
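One prediction/correction cycle, written out as a sketch of the standard KF equations with measurement matrix $X_i^\top$ (names are ours):

```python
import numpy as np

def kf_step(theta, Sigma, A, Q, X, y, R):
    """One Kalman filter cycle for the linear Gaussian task model.

    theta, Sigma : posterior mean/covariance from the previous task
    A, Q         : transition matrix / state-noise covariance (theta_i = A theta_{i-1} + w_i)
    X, y, R      : task data (d x n), targets (n,), measurement-noise covariance (n x n)
    """
    # Predict: propagate the previous estimate through the transition model.
    theta_pred = A @ theta
    Sigma_pred = A @ Sigma @ A.T + Q
    # Correct: condition on this task's measurements y = X^T theta + v.
    S = X.T @ Sigma_pred @ X + R              # innovation covariance
    K = Sigma_pred @ X @ np.linalg.inv(S)     # Kalman gain
    theta_new = theta_pred + K @ (y - X.T @ theta_pred)
    Sigma_new = Sigma_pred - K @ X.T @ Sigma_pred
    return theta_new, Sigma_new
```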
The paper shows that RLS is a special case of KF when the task transition matrix $A_i = I$ and the state noise $w_i = 0$, meaning $\theta_i = \theta_{i-1}$ (a single true model), and the measurement noise $v_i$ has a specific structure related to $\beta$.
A key advantage of the KF framework is the concept of positive backward transfer. While KF computes $\theta_{t|t}$, the Rauch-Tung-Striebel (RTS) smoother can be used to compute $\theta_{i|t}$ for $i < t$. The paper proves that conditioning on more data (up to task $t$ instead of task $s < t$) results in estimates with smaller error covariance: $\Sigma_{i|t} \preceq \Sigma_{i|s}$ for $i \le s < t$. This means the estimate for a past task $i$ improves as more subsequent tasks are learned, formally demonstrating positive backward transfer under the linear Gaussian model (2504.17963).
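A sketch of the RTS backward pass (our naming; it assumes the filtered and one-step-predicted moments were stored during the forward KF pass):

```python
import numpy as np

def rts_smooth(means, covs, pred_means, pred_covs, A):
    """Rauch-Tung-Striebel backward pass over t filtered estimates.

    means[i], covs[i]           : filtered theta_{i|i}, Sigma_{i|i}
    pred_means[i], pred_covs[i] : predicted theta_{i|i-1}, Sigma_{i|i-1}
    Returns smoothed theta_{i|t}, Sigma_{i|t}: every past task's estimate is
    revised using all tasks seen so far (positive backward transfer).
    """
    sm_means, sm_covs = list(means), list(covs)
    for i in range(len(means) - 2, -1, -1):
        G = covs[i] @ A.T @ np.linalg.inv(pred_covs[i + 1])   # smoother gain
        sm_means[i] = means[i] + G @ (sm_means[i + 1] - pred_means[i + 1])
        sm_covs[i] = covs[i] + G @ (sm_covs[i + 1] - pred_covs[i + 1]) @ G.T
    return sm_means, sm_covs
```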
Extensions of KF to nonlinear models (the Extended Kalman Filter, EKF), layer-wise applications, and training linear classifiers on pre-trained features are also discussed.
Conclusion
The paper concludes by emphasizing that adaptive filtering theory provides a strong mathematical lens for understanding and advancing continual learning. Key AF principles, such as the minimal disturbance principle and the structure of recursive updates, are highly relevant to CL algorithms designed to mitigate forgetting. The connections between the AF methods (LMS, APA, RLS, KF) and various CL methods (SGD for linear regression, ICL, Gradient Projection, layer-wise updates, methods using pre-trained models) are consolidated, highlighting how memory (explicit, or implicit through projections and covariances) and task relationships play crucial roles. The paper suggests future research directions based on these connections, such as exploring other AF areas, developing theoretical guarantees for nonlinear deep continual learning, and studying "prospective learning" (learning about future tasks).