Pareto Continual Learning: A Multi-Objective Approach

Updated 11 March 2026
  • Pareto Continual Learning (ParetoCL) is a framework that redefines continual learning via multi-objective optimization to balance stability and plasticity.
  • It employs a preference-conditioned model architecture with a shared encoder and hypernetwork to dynamically adjust trade-offs at inference.
  • Empirical results on Seq-CIFAR10, Seq-CIFAR100, and Seq-TinyImageNet demonstrate superior anytime accuracy and computational efficiency compared to prior methods.

Pareto Continual Learning (ParetoCL) is a paradigm and algorithmic framework that addresses the continual learning problem through the lens of multi-objective optimization. The primary concern in continual learning is to strike a dynamic balance between retaining knowledge from previously encountered tasks (stability) and effectively adapting to new tasks (plasticity). ParetoCL operationalizes this stability-plasticity dilemma as a formal multi-objective problem, enabling the learning of a full set of Pareto-optimal solutions and supporting dynamic adaptation at inference through a preference-conditioned approach (Lai et al., 30 Mar 2025).

1. Multi-Objective Formulation of Continual Learning

The classical experience replay framework in continual learning maintains two distinct losses at every time step $t$:

  • Plasticity loss $L_{\text{plast}}(\theta) \equiv \mathcal{L}_{\text{new}}(f_\theta ; \mathcal{D}_t)$, i.e., the loss on the current batch from the new task data $\mathcal{D}_t$.
  • Stability loss $L_{\text{stab}}(\theta) \equiv \mathcal{L}_{\text{replay}}(f_\theta ; \mathcal{M}_t)$, i.e., the loss on the memory buffer $\mathcal{M}_t$ holding replay exemplars from past tasks.

ParetoCL formalizes learning as the simultaneous minimization of both objectives:

$$\min_\theta\, F(\theta) = (f_1(\theta), f_2(\theta))^\top$$

where

  • $f_1(\theta) = L_{\text{stab}}(\theta)$,
  • $f_2(\theta) = L_{\text{plast}}(\theta)$.

The solutions of this problem form a Pareto front, with each point representing a particular trade-off between stability and plasticity. Optimizing one objective without regard for the other leads either to catastrophic forgetting (plasticity only) or to a model that cannot learn new tasks (stability only), which motivates a principled multi-objective approach (Lai et al., 30 Mar 2025).
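To make the notion of a Pareto front concrete, the following minimal sketch filters a handful of loss pairs down to the non-dominated ones. The `(L_stab, L_plast)` values and the helper names `dominates` and `pareto_front` are illustrative assumptions, not quantities or code from the paper.

```python
def dominates(a, b):
    """True if loss vector a Pareto-dominates b (both objectives are minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Return the non-dominated subset of points, preserving input order."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]


# Hypothetical (L_stab, L_plast) pairs for five candidate parameter settings.
candidates = [(0.2, 1.5), (0.5, 0.9), (0.6, 1.0), (1.2, 0.3), (1.3, 0.8)]
front = pareto_front(candidates)
print(front)  # [(0.2, 1.5), (0.5, 0.9), (1.2, 0.3)]
```

Each surviving pair is a different stability-plasticity trade-off: none can be improved in one loss without getting worse in the other.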

2. Preference-Conditioned Model Architecture

To avoid training and storing multiple networks for different stability-plasticity trade-offs, ParetoCL employs a single model $f_\theta(x; \alpha)$, where $\alpha = (\alpha_1, \alpha_2) \in \Delta$ is a preference vector lying on the simplex $\Delta \equiv \{(\alpha_1, \alpha_2) : \alpha_1, \alpha_2 \geq 0,\ \alpha_1 + \alpha_2 = 1\}$.

The architecture comprises:

  • A shared encoder $h_\theta(\cdot)$ producing penultimate features $h \in \mathbb{R}^d$.
  • A hypernetwork $\Psi$ that, given $\alpha$, outputs the weights and bias for the final linear layer: $W(\alpha) = \Psi_W(\alpha)$, $b(\alpha) = \Psi_b(\alpha)$.
  • The final prediction is $f_\theta(x; \alpha) = W(\alpha) h + b(\alpha)$.

$\Psi$ may use concatenation (i.e., $\Psi([\alpha; h])$) or FiLM-style conditioning, e.g., applying $\gamma(\alpha) \odot h + \beta(\alpha)$ to modulate representations. This design enables efficient mapping from preferences to specific parameterizations corresponding to different trade-offs along the Pareto front (Lai et al., 30 Mar 2025).
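As a rough illustration of the preference-conditioned head, the sketch below maps a preference vector to the weights and bias of a linear classifier and applies them to a feature vector. It assumes a single affine layer as a stand-in for the hypernetwork; the class name `HyperHead`, the initialization scale, and the toy dimensions are hypothetical choices, and the encoder is omitted entirely.

```python
import random


class HyperHead:
    """Toy preference-conditioned linear head: alpha -> (W(alpha), b(alpha))."""

    def __init__(self, feat_dim, num_classes, seed=0):
        rng = random.Random(seed)
        self.d, self.c = feat_dim, num_classes
        n_out = num_classes * feat_dim + num_classes  # entries of W plus b
        # Assumption: one affine map stands in for the hypernetwork Psi.
        self.A = [[rng.gauss(0.0, 0.1) for _ in range(2)] for _ in range(n_out)]
        self.a0 = [rng.gauss(0.0, 0.1) for _ in range(n_out)]

    def weights(self, alpha):
        """Generate W (num_classes x feat_dim) and b (num_classes) from alpha."""
        flat = [row[0] * alpha[0] + row[1] * alpha[1] + off
                for row, off in zip(self.A, self.a0)]
        W = [flat[i * self.d:(i + 1) * self.d] for i in range(self.c)]
        b = flat[self.c * self.d:]
        return W, b

    def logits(self, h, alpha):
        """Compute W(alpha) h + b(alpha) for one feature vector h."""
        W, b = self.weights(alpha)
        return [sum(w * x for w, x in zip(row, h)) + bi for row, bi in zip(W, b)]


head = HyperHead(feat_dim=4, num_classes=3)
h = [0.5, -1.0, 0.25, 2.0]          # stand-in for encoder features
z_stab = head.logits(h, (1.0, 0.0))  # preference fully on stability
z_plast = head.logits(h, (0.0, 1.0)) # preference fully on plasticity
print(z_stab, z_plast)
```

The features `h` are computed once and reused for every preference; only the small head changes with `alpha`, which is what keeps evaluating several trade-offs cheap.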

3. Learning Procedure and Approximation of the Pareto Front

The training objective is to cover the Pareto front by learning for a distribution of trade-off preferences. The overall loss is:

$$\mathcal{L}(\theta) = \mathbb{E}_{\alpha \sim p(\alpha)} \left[ \alpha_1 L_{\text{stab}}(\theta; \alpha) + \alpha_2 L_{\text{plast}}(\theta; \alpha) \right]$$

where $p(\alpha)$ is typically uniform (e.g., $\text{Dirichlet}(1,1)$ for two-objective settings).

At each training iteration:

  1. $K$ preference vectors $\alpha^1, \dots, \alpha^K$ are sampled from $p(\alpha)$.
  2. Shared features for new and replayed data are extracted once.
  3. For each $\alpha^k$, the hypernetwork computes $W(\alpha^k), b(\alpha^k)$, and corresponding losses on both $\mathcal{D}_t$ and $\mathcal{M}_t$ are computed.
  4. The total loss across preferences is accumulated and the network (encoder and hypernetwork) is updated jointly.

This scheme ensures that, for each $\alpha$, the model approximates a Pareto-optimal solution $\theta^*(\alpha)$ across the trade-off spectrum. Algorithmic details, including batch sharing to minimize redundant computation, support tractable scaling (using $K=5$ during training) (Lai et al., 30 Mar 2025).
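The expectation over preferences can be approximated with a small Monte Carlo sample, as in the sketch below. It only computes the scalar loss value; gradient computation and the parameter update are left out. The function names and the stand-in loss callables are hypothetical, and dividing by K (rather than summing) is an assumption that changes the result only by a constant factor.

```python
import random


def sample_preference(rng):
    """Dirichlet(1, 1) is uniform on the simplex: a1 ~ U[0, 1], a2 = 1 - a1."""
    a1 = rng.random()
    return (a1, 1.0 - a1)


def preference_averaged_loss(stab_loss, plast_loss, k=5, rng=None):
    """Monte Carlo estimate of E_alpha[a1 * L_stab(alpha) + a2 * L_plast(alpha)]."""
    rng = rng or random.Random(0)
    total = 0.0
    for _ in range(k):
        alpha = sample_preference(rng)
        # Each loss is evaluated with the head generated for this alpha.
        total += alpha[0] * stab_loss(alpha) + alpha[1] * plast_loss(alpha)
    return total / k


# Hypothetical stand-ins: each loss shrinks as its own preference weight grows.
stab = lambda alpha: 1.0 - 0.5 * alpha[0]
plast = lambda alpha: 2.0 - 1.0 * alpha[1]
print(preference_averaged_loss(stab, plast, k=5))
```

In a real training loop the two callables would evaluate the replay-buffer loss and the new-task loss on features that were extracted once and shared across all K preferences.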

4. Dynamic Inference and Adaptation

At test time, the optimal stability-plasticity trade-off for each sample is not known a priori. ParetoCL implements a dynamic adaptation mechanism:

  1. For a test input $x$, shared features $h = h_\theta(x)$ are computed.
  2. $K$ preference vectors $\alpha^1, \dots, \alpha^K$ are sampled (with $K=20$ for inference).
  3. The hypernetwork produces $W(\alpha^k), b(\alpha^k)$ for each preference, yielding logits and softmax probabilities $p^k$ per $\alpha^k$.
  4. Entropy $H(p^k)$ is calculated for each, and the prediction corresponding to the least uncertain (minimum entropy) trade-off is selected and output.

This per-sample adaptation balances stability and plasticity dynamically, conditioned on each input, and it outperforms fixed (non-adaptive) or scalarized approaches in empirical evaluations (Lai et al., 30 Mar 2025).
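The minimum-entropy selection step can be sketched as follows. The example takes one logit vector per sampled preference as given (three hand-written vectors here rather than twenty generated ones), and the function names are illustrative assumptions.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def min_entropy_predict(logits_per_pref):
    """Return (index of the least-uncertain preference, its predicted class)."""
    probs = [softmax(z) for z in logits_per_pref]
    k = min(range(len(probs)), key=lambda i: entropy(probs[i]))
    p = probs[k]
    return k, max(range(len(p)), key=p.__getitem__)


# Hypothetical logits for one test input under three sampled preferences.
logits = [[1.0, 1.1, 0.9],   # nearly uniform: high entropy
          [0.2, 3.0, 0.1],   # confident in class 1: low entropy
          [2.0, 1.9, 0.0]]   # torn between classes 0 and 1
print(min_entropy_predict(logits))  # (1, 1)
```

Because every preference reuses the same features, the extra cost of this step is K small linear heads plus K entropy evaluations per input.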

5. Experimental Results and Empirical Analysis

ParetoCL reports state-of-the-art performance on standard sequential continual learning benchmarks, including Seq-CIFAR10, Seq-CIFAR100, and Seq-TinyImageNet, in both online (a single epoch per task) and offline settings. Experiments follow the class-incremental protocol, in which task IDs are not provided at test time.

Key results (online scenario):

| Method | Seq-CIFAR10 (AAA / Acc) | Seq-CIFAR100 (AAA / Acc) | Seq-TinyImageNet (AAA / Acc) |
|---|---|---|---|
| ER | 52.74 / 33.14 | 18.70 / 15.12 | 19.14 / 14.21 |
| DER++ | 60.63 / 50.33 | 24.47 / 16.32 | 19.21 / 13.65 |
| CLSER | 61.88 / 49.03 | 27.83 / 19.45 | 23.47 / 19.72 |
| OCM | 66.22 / 53.89 | 26.03 / 15.88 | 17.55 / 8.33 |
| VR-MCL | 69.57 / 58.43 | 30.46 / 22.31 | 24.41 / 20.55 |
| ParetoCL (O) | 70.89 / 59.95 | 33.04 / 24.45 | 31.72 / 23.09 |

Average Anytime Accuracy (AAA) and final average accuracy (Acc) consistently favor ParetoCL across datasets and memory buffer sizes. Ablations reveal that dynamic inference (vs. fixed $\alpha$) is critical for maximum performance. Plug-in multi-objective optimizers such as MGDA or Tchebycheff underperform relative to the preference-conditioned approach. Training efficiency is also improved: ParetoCL requires significantly less compute than VR-MCL ($\approx 224$ s of training versus $\approx 1074$ s) while achieving higher accuracy (Lai et al., 30 Mar 2025).
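For a quick sense of scale, the snippet below computes the margins of ParetoCL over VR-MCL, the strongest baseline in the online-scenario results, together with the training-time ratio. All numbers are copied from the figures reported above; the variable names are arbitrary.

```python
# (AAA, Acc) in percent for the online scenario, copied from the reported results.
vr_mcl = {"Seq-CIFAR10": (69.57, 58.43),
          "Seq-CIFAR100": (30.46, 22.31),
          "Seq-TinyImageNet": (24.41, 20.55)}
pareto_cl = {"Seq-CIFAR10": (70.89, 59.95),
             "Seq-CIFAR100": (33.04, 24.45),
             "Seq-TinyImageNet": (31.72, 23.09)}

# Absolute improvement in percentage points, rounded to two decimals.
margins = {name: (round(pareto_cl[name][0] - vr_mcl[name][0], 2),
                  round(pareto_cl[name][1] - vr_mcl[name][1], 2))
           for name in vr_mcl}
for name, (d_aaa, d_acc) in margins.items():
    print(f"{name}: AAA +{d_aaa:.2f}, Acc +{d_acc:.2f}")

# Approximate training times in seconds (VR-MCL vs. ParetoCL).
speedup = 1074 / 224
print(f"training-time ratio: {speedup:.2f}x")
```

The largest AAA margin is on Seq-TinyImageNet (7.31 points), and the reported training times correspond to a ratio of roughly 4.8x.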

6. Connections with Broader Pareto Continual Learning Paradigms

The multi-objective optimization perspective advanced by ParetoCL is becoming foundational in both serial and parallel continual learning. The Elastic Multi-Gradient Descent (EMGD) framework (Lyu et al., 2024) treats parallel continual learning as a dynamic multi-objective problem, ensuring that each update direction aligns with a Pareto descent direction by solving a quadratic program that adapts to the individual progress of each task via elastic factors. EMGD and related methods further generalize the notion of finding Pareto-optimal updates, noting that solutions such as MGDA are often too conservative, whereas naive averaging can be reckless with respect to catastrophic forgetting.
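The contrast between MGDA and naive averaging can be seen in a two-gradient toy case. The sketch below uses the standard closed-form solution for the minimum-norm point in the convex hull of two gradients (two-objective MGDA); it does not implement EMGD's elastic quadratic program, and the gradient values are made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mgda_two(g1, g2):
    """Min-norm convex combination gamma*g1 + (1-gamma)*g2 of two gradients."""
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:
        return list(g1), 0.5  # identical gradients: any gamma gives the same vector
    # Unconstrained minimizer of ||g2 + gamma*(g1 - g2)||^2, clipped to [0, 1].
    gamma = dot([b - a for a, b in zip(g1, g2)], g2) / denom
    gamma = min(1.0, max(0.0, gamma))
    return [gamma * a + (1.0 - gamma) * b for a, b in zip(g1, g2)], gamma


# Hypothetical conflicting gradients for an old task (g1) and a new task (g2).
g1, g2 = [1.0, 0.0], [-3.0, 1.0]
d, gamma = mgda_two(g1, g2)
avg = [0.5 * (a + b) for a, b in zip(g1, g2)]

# A step theta <- theta - lr * v decreases loss i (to first order) iff dot(v, g_i) > 0.
print("MGDA:   ", dot(d, g1), dot(d, g2))      # both positive
print("average:", dot(avg, g1), dot(avg, g2))  # first is negative
```

Here the averaged direction increases the old-task loss, while the MGDA direction improves both objectives but has a much smaller norm, which is the sense in which it is conservative.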

A key distinction is ParetoCL's dynamic adaptation at inference via a learned $\alpha \to \theta$ mapping, in contrast to approaches that seek Pareto solutions only during training. Other methods, including IBCL (Lu et al., 2023), highlight the need for scalable approaches to generate preference-conditional models that span the Pareto front efficiently, without incurring linear training overheads per preference.

A plausible implication is the consolidation of continual learning research around dynamic multi-objective formulations, with architectures and algorithms designed to efficiently populate and traverse the Pareto front with respect to multiple objectives, especially stability and plasticity.

7. Practical Considerations and Extensions

Memory efficiency is achieved in ParetoCL by storing only a modestly sized hypernetwork in addition to the experience replay buffer, avoiding the need for separate networks per trade-off. The preference prior is chosen as $\text{Dirichlet}(1,1)$, with five sampled trade-offs per batch sufficient for robust learning; twenty are used at inference for fine-grained adaptation. A ResNet-18 encoder, SGD optimizer, and learning rate $\eta=0.05$ are typical.
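The settings mentioned above can be collected in a small configuration object, sketched below. The class and field names are hypothetical; only the values (ResNet-18, SGD, learning rate 0.05, Dirichlet(1,1), 5 training and 20 inference preferences) come from the text, and anything not stated there, such as buffer size or batch size, is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParetoCLConfig:
    """Hypothetical container for the typical settings described in the text."""
    encoder: str = "resnet18"
    optimizer: str = "sgd"
    lr: float = 0.05
    dirichlet_concentration: tuple = (1.0, 1.0)  # uniform prior over the simplex
    k_train: int = 5    # preferences sampled per training batch
    k_infer: int = 20   # preferences sampled per test input


cfg = ParetoCLConfig()
print(cfg)
```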

Extensions to more than two objectives are straightforward: increasing the dimension of $\alpha$ and the corresponding hypernetwork capacity enables the approach to handle arbitrary numbers of conflicting continual learning objectives. Conditioning mechanisms can also be enriched, for example, by FiLM-modulating intermediate network blocks or integrating $\alpha$ into batch normalization statistics.
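With more than two objectives, the preference vector can still be drawn uniformly from the simplex. A minimal sketch, assuming a flat Dirichlet(1, ..., 1) prior carries over unchanged to the m-objective case (the function name is illustrative):

```python
import random


def sample_simplex(m, rng):
    """Draw alpha ~ Dirichlet(1, ..., 1) by normalizing m Exp(1) variates."""
    e = [rng.expovariate(1.0) for _ in range(m)]
    s = sum(e)
    return [v / s for v in e]


rng = random.Random(0)
alpha = sample_simplex(3, rng)  # e.g. three competing objectives
print(alpha, sum(alpha))
```

Normalizing independent Exp(1) draws is the standard construction of a flat Dirichlet sample; for m = 2 it reduces to the uniform draw of a single weight on [0, 1].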

The dominant resource cost comes from the replay buffer rather than from the preference-conditioning components, positioning ParetoCL as a practically scalable solution for large-scale continual learning scenarios (Lai et al., 30 Mar 2025).
