Pareto Continual Learning: A Multi-Objective Approach

Updated 11 March 2026

Pareto Continual Learning (ParetoCL) is a framework that redefines continual learning via multi-objective optimization to balance stability and plasticity.
It employs a preference-conditioned model architecture with a shared encoder and hypernetwork to dynamically adjust trade-offs at inference.
Empirical results on Seq-CIFAR10, Seq-CIFAR100, and Seq-TinyImageNet demonstrate superior anytime accuracy and computational efficiency compared to prior methods.

Pareto Continual Learning (ParetoCL) is a paradigm and algorithmic framework that addresses the continual learning problem through the lens of multi-objective optimization. The primary concern in continual learning is to strike a dynamic balance between retaining knowledge from previously encountered tasks (stability) and effectively adapting to new tasks (plasticity). ParetoCL operationalizes this stability-plasticity dilemma as a formal multi-objective problem, enabling the learning of a full set of Pareto-optimal solutions and supporting dynamic adaptation at inference through a preference-conditioned approach (Lai et al., 30 Mar 2025).

1. Multi-Objective Formulation of Continual Learning

The classical experience replay framework in continual learning maintains two distinct losses at every time step $t$:

Plasticity loss: $L_{\text{plast}}(\theta) \equiv \mathcal{L}_{\text{new}}(f_\theta; \mathcal{D}_t)$, i.e., the loss on the current batch from the new task data $\mathcal{D}_t$.
Stability loss: $L_{\text{stab}}(\theta) \equiv \mathcal{L}_{\text{replay}}(f_\theta; \mathcal{M}_t)$, i.e., the loss on the memory buffer $\mathcal{M}_t$ holding replay exemplars from past tasks.

ParetoCL formalizes learning as the simultaneous minimization of both objectives: $\min_\theta\, F(\theta) = (f_1(\theta), f_2(\theta))^\top$, where

$f_1(\theta) = L_{\text{stab}}(\theta)$,
$f_2(\theta) = L_{\text{plast}}(\theta)$.

The solutions to this problem form a Pareto front, with each point representing a particular trade-off between stability and plasticity. Optimizing only one objective leads either to catastrophic forgetting (plasticity alone) or to an inability to learn new tasks (stability alone), which motivates a principled multi-objective approach (Lai et al., 30 Mar 2025).
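
To make the notion of a Pareto front concrete, the following minimal sketch filters a handful of candidate loss pairs down to the non-dominated ones. The candidate values and the helper names `dominates` and `pareto_front` are illustrative assumptions, not taken from the paper.

```python
from typing import List, Tuple

# Each point is (stability loss f_1, plasticity loss f_2); lower is better for both.
Point = Tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """True if a is no worse than b in every objective and strictly better in at least one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points: List[Point]) -> List[Point]:
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Made-up candidate solutions, for illustration only.
candidates = [(0.2, 0.9), (0.4, 0.5), (0.9, 0.1), (0.5, 0.6), (0.9, 0.9)]
front = pareto_front(candidates)
print(front)  # [(0.2, 0.9), (0.4, 0.5), (0.9, 0.1)]
```

Here `(0.5, 0.6)` and `(0.9, 0.9)` are discarded because `(0.4, 0.5)` is better on both losses, while the three remaining points each trade one loss against the other.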

2. Preference-Conditioned Model Architecture

To avoid training and storing multiple networks for different stability-plasticity trade-offs, ParetoCL employs a single model $f_\theta(x; \alpha)$, where $\alpha = (\alpha_1, \alpha_2) \in \Delta$ is a preference vector lying on the simplex $\Delta \equiv \{(\alpha_1, \alpha_2) : \alpha_1, \alpha_2 \geq 0,\, \alpha_1 + \alpha_2 = 1 \}$.

The architecture comprises:

A shared encoder $h_\theta(\cdot)$ producing penultimate features $h \in \mathbb{R}^d$.
A hypernetwork $\Psi$ that, given $\alpha$, outputs the weights and bias for the final linear layer: $W(\alpha) = \Psi_W(\alpha)$, $b(\alpha) = \Psi_b(\alpha)$.
The final prediction is $f_\theta(x; \alpha) = W(\alpha) h + b(\alpha)$.

$\Psi$ may use concatenation (i.e., $\Psi([\alpha; h])$) or FiLM-style conditioning, e.g., applying $\gamma(\alpha) \odot h + \beta(\alpha)$ to modulate representations. This design enables efficient mapping from preferences to specific parameterizations corresponding to different trade-offs along the Pareto front (Lai et al., 30 Mar 2025).
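
The preference-conditioned head can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the class name `LinearHyperHead`, the purely linear generator, the dimensions, and the random initialization are all hypothetical, and the hypernetwork used in the paper may well be deeper.

```python
import random
from typing import List, Tuple

Vector = List[float]
Alpha = Tuple[float, float]


class LinearHyperHead:
    """Hypothetical linear hypernetwork mapping alpha to (W(alpha), b(alpha))."""

    def __init__(self, feat_dim: int, num_classes: int, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.feat_dim = feat_dim
        self.num_classes = num_classes
        n_out = num_classes * feat_dim + num_classes  # entries of W plus entries of b
        # One generator row per produced parameter: weights for alpha_1, alpha_2, and an offset.
        self.gen = [[rng.gauss(0.0, 0.1) for _ in range(3)] for _ in range(n_out)]

    def params(self, alpha: Alpha) -> Tuple[List[Vector], Vector]:
        """Produce the final-layer weight matrix and bias for one preference."""
        a1, a2 = alpha
        flat = [g[0] * a1 + g[1] * a2 + g[2] for g in self.gen]
        d, c = self.feat_dim, self.num_classes
        weights = [flat[i * d:(i + 1) * d] for i in range(c)]
        bias = flat[c * d:]
        return weights, bias

    def logits(self, h: Vector, alpha: Alpha) -> Vector:
        """Compute W(alpha) h + b(alpha) for a feature vector h from the shared encoder."""
        weights, bias = self.params(alpha)
        return [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(weights, bias)]


head = LinearHyperHead(feat_dim=4, num_classes=3)
h = [0.5, -1.0, 0.25, 2.0]                  # stand-in for encoder features
logits_stable = head.logits(h, (0.9, 0.1))  # preference leaning towards stability
logits_plastic = head.logits(h, (0.1, 0.9))  # preference leaning towards plasticity
```

The same features `h` give different logits under different preferences, which is the property the architecture relies on: only the small head changes with $\alpha$, while the encoder output is shared.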

3. Learning Procedure and Approximation of the Pareto Front

The training objective is to cover the Pareto front by training over a distribution of trade-off preferences. The overall loss is $\mathcal{L}(\theta) = \mathbb{E}_{\alpha \sim p(\alpha)} \left[ \alpha_1 L_{\text{stab}}(\theta; \alpha) + \alpha_2 L_{\text{plast}}(\theta; \alpha) \right]$, where $p(\alpha)$ is typically uniform on the simplex (e.g., $\text{Dirichlet}(1,1)$ in the two-objective setting).

At each training iteration:

$K$ preference vectors $\alpha^1, \dots, \alpha^K$ are sampled from $p(\alpha)$.
Shared features for new and replayed data are extracted once.
For each $\alpha^k$, the hypernetwork computes $W(\alpha^k), b(\alpha^k)$, and the corresponding losses on both $\mathcal{D}_t$ and $\mathcal{M}_t$ are computed.
The total loss across preferences is accumulated and the network (encoder and hypernetwork) is updated jointly.

This scheme trains the model so that, for each $\alpha$, it approximates a Pareto-optimal solution $\theta^*(\alpha)$ across the trade-off spectrum. Extracting the shared features once per batch avoids redundant computation and keeps training tractable; $K=5$ preferences are used per training iteration (Lai et al., 30 Mar 2025).
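
The loss computation in the iteration above can be sketched as follows. This is a minimal illustration under stated assumptions: `toy_head` is a made-up stand-in for the hypernetwork head, the batches hold precomputed features, the per-preference losses are averaged (a Monte Carlo estimate of the expectation) rather than summed, and the gradient update itself is omitted because it would require an automatic-differentiation framework.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Alpha = Tuple[float, float]
Batch = List[Tuple[List[float], int]]  # (precomputed feature vector h, class label)
Head = Callable[[Sequence[float], Alpha], List[float]]


def sample_preference(rng: random.Random) -> Alpha:
    """Dirichlet(1, 1) is uniform on the 2-simplex: alpha_1 ~ U(0, 1), alpha_2 = 1 - alpha_1."""
    a1 = rng.random()
    return (a1, 1.0 - a1)


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Numerically stable softmax cross-entropy for a single example."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[label]


def mean_loss(head: Head, batch: Batch, alpha: Alpha) -> float:
    return sum(cross_entropy(head(h, alpha), y) for h, y in batch) / len(batch)


def preference_averaged_loss(head: Head, new_batch: Batch, replay_batch: Batch,
                             num_prefs: int, rng: random.Random) -> float:
    """Estimate E_alpha[alpha_1 * L_stab + alpha_2 * L_plast] from num_prefs samples."""
    total = 0.0
    for _ in range(num_prefs):
        alpha = sample_preference(rng)
        l_stab = mean_loss(head, replay_batch, alpha)   # loss on the memory buffer
        l_plast = mean_loss(head, new_batch, alpha)     # loss on the current task data
        total += alpha[0] * l_stab + alpha[1] * l_plast
    return total / num_prefs


def toy_head(h: Sequence[float], alpha: Alpha) -> List[float]:
    """Hypothetical head: blends two fixed weight matrices according to alpha."""
    w_old = [[1.0, 0.0], [0.0, 1.0]]
    w_new = [[0.0, 1.0], [1.0, 0.0]]
    return [sum((alpha[0] * wo + alpha[1] * wn) * x for wo, wn, x in zip(ro, rn, h))
            for ro, rn in zip(w_old, w_new)]


rng = random.Random(0)
new_batch = [([1.0, 0.0], 1), ([0.0, 1.0], 0)]
replay_batch = [([1.0, 0.0], 0), ([0.0, 1.0], 1)]
loss = preference_averaged_loss(toy_head, new_batch, replay_batch, num_prefs=5, rng=rng)
```

In a real training loop the features would come from the shared encoder once per batch, and `loss` would be backpropagated through both the encoder and the hypernetwork.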

4. Dynamic Inference and Adaptation

At test time, the optimal stability-plasticity trade-off for each sample is not known a priori. ParetoCL implements a dynamic adaptation mechanism:

For a test input $x$, shared features $h = h_\theta(x)$ are computed.
$K$ preference vectors $\alpha^1, \dots, \alpha^K$ are sampled (with $K=20$ at inference, versus $K=5$ during training).
The hypernetwork produces $W(\alpha^k), b(\alpha^k)$ for each preference, yielding logits and softmax probabilities $p^k$ per $\alpha^k$.
The entropy $H(p^k)$ is computed for each preference, and the prediction with the lowest entropy (the least uncertain trade-off) is returned.

This per-sample adaptation yields an input-conditional balance between stability and plasticity, and it outperforms fixed-preference (non-adaptive) and scalarized approaches in the reported evaluations (Lai et al., 30 Mar 2025).
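
The minimum-entropy selection rule described above can be sketched as follows. The function names and `toy_head` are hypothetical, and the toy head stands in for the hypernetwork-generated classifier; only the selection logic is meant to mirror the description.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Alpha = Tuple[float, float]
Head = Callable[[Sequence[float], Alpha], List[float]]


def softmax(logits: Sequence[float]) -> List[float]:
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def predict_min_entropy(head: Head, h: Sequence[float],
                        alphas: Sequence[Alpha]) -> Tuple[int, Alpha]:
    """Return (predicted class, chosen preference) for the least uncertain preference."""
    best_entropy = math.inf
    best_class, best_alpha = -1, alphas[0]
    for alpha in alphas:
        probs = softmax(head(h, alpha))
        ent = entropy(probs)
        if ent < best_entropy:
            best_entropy = ent
            best_class = max(range(len(probs)), key=probs.__getitem__)
            best_alpha = alpha
    return best_class, best_alpha


def toy_head(h: Sequence[float], alpha: Alpha) -> List[float]:
    """Hypothetical two-class head whose logits scale with the preference weights."""
    return [5.0 * alpha[0] * h[0], 5.0 * alpha[1] * h[1]]


rng = random.Random(0)
alphas = [(u, 1.0 - u) for u in (rng.random() for _ in range(20))]  # K = 20 at inference
pred, chosen = predict_min_entropy(toy_head, [1.0, 0.2], alphas)
```

Because the features are computed once and only the small head is re-evaluated per preference, trying twenty preferences costs little more than a single forward pass through the encoder.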

5. Experimental Results and Empirical Analysis

ParetoCL demonstrates state-of-the-art performance on standard sequential continual learning benchmarks, including Seq-CIFAR10, Seq-CIFAR100, and Seq-TinyImageNet, in both online (a single epoch per task) and offline settings. Experiments use class-incremental protocols, in which no task IDs are provided at test time.

Key results (online scenario):

Method	Seq-CIFAR10 AAA/Acc	Seq-CIFAR100 AAA/Acc	Seq-TinyImageNet AAA/Acc
ER	52.74 / 33.14	18.70 / 15.12	19.14 / 14.21
DER++	60.63 / 50.33	24.47 / 16.32	19.21 / 13.65
CLSER	61.88 / 49.03	27.83 / 19.45	23.47 / 19.72
OCM	66.22 / 53.89	26.03 / 15.88	17.55 / 8.33
VR-MCL	69.57 / 58.43	30.46 / 22.31	24.41 / 20.55
ParetoCL (O)	70.89 / 59.95	33.04 / 24.45	31.72 / 23.09

Both Average Anytime Accuracy (AAA) and final average accuracy (Acc) consistently favor ParetoCL across datasets and memory buffer sizes. Ablations indicate that dynamic inference (as opposed to a fixed $\alpha$) is critical for maximum performance, and that plug-in multi-objective optimizers such as MGDA or Tchebycheff scalarization underperform the preference-conditioned approach. Training is also cheaper: ParetoCL requires $\approx 224$ s of training versus $\approx 1074$ s for VR-MCL, while reaching higher accuracy (Lai et al., 30 Mar 2025).
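
As a quick arithmetic check on the reported numbers, the snippet below computes ParetoCL's margin over VR-MCL (the strongest baseline in every column of the table) and the ratio of the two training times. All values are copied from the table and text above; nothing else is assumed.

```python
# (AAA, Acc) pairs copied from the online-scenario table.
paretocl = {
    "Seq-CIFAR10": (70.89, 59.95),
    "Seq-CIFAR100": (33.04, 24.45),
    "Seq-TinyImageNet": (31.72, 23.09),
}
vr_mcl = {
    "Seq-CIFAR10": (69.57, 58.43),
    "Seq-CIFAR100": (30.46, 22.31),
    "Seq-TinyImageNet": (24.41, 20.55),
}

# Absolute improvement in points, rounded to two decimals.
margins = {
    name: (round(paretocl[name][0] - vr_mcl[name][0], 2),
           round(paretocl[name][1] - vr_mcl[name][1], 2))
    for name in paretocl
}

# Ratio of reported training times: about 1074 s (VR-MCL) versus about 224 s (ParetoCL).
speedup = round(1074 / 224, 2)

for name, (d_aaa, d_acc) in margins.items():
    print(f"{name}: AAA +{d_aaa}, Acc +{d_acc}")
print(f"training-time ratio: {speedup}x")
```

The margins are 1.32 / 1.52 points on Seq-CIFAR10, 2.58 / 2.14 on Seq-CIFAR100, and 7.31 / 2.54 on Seq-TinyImageNet, with a training-time ratio of roughly 4.79.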

6. Connections with Broader Pareto Continual Learning Paradigms

The multi-objective optimization perspective advanced by ParetoCL is becoming foundational in both serial and parallel continual learning. The Elastic Multi-Gradient Descent (EMGD) framework (Lyu et al., 2024) treats parallel continual learning as a dynamic multi-objective problem, ensuring that each update direction aligns with a Pareto descent direction by solving a quadratic program that adapts to the individual progress of each task via elastic factors. EMGD and related methods further generalize the notion of finding Pareto-optimal updates, noting that solutions such as MGDA are often too conservative, whereas naive averaging of task gradients can be reckless with respect to catastrophic forgetting.
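
For intuition about the contrast between MGDA-style updates and naive averaging, the sketch below uses the well-known closed form of the two-gradient min-norm problem that underlies MGDA. It is not EMGD's elastic quadratic program, and the gradient values and function names are made up for illustration.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def min_norm_direction(g1: Sequence[float], g2: Sequence[float]) -> List[float]:
    """Minimum-norm convex combination gamma * g1 + (1 - gamma) * g2 with gamma in [0, 1]."""
    diff = [a - b for a, b in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:  # identical gradients: any combination gives the same vector
        return list(g1)
    gamma = -dot(diff, g2) / denom       # unconstrained minimizer of the squared norm
    gamma = min(1.0, max(0.0, gamma))    # project onto [0, 1]
    return [gamma * a + (1.0 - gamma) * b for a, b in zip(g1, g2)]


# Made-up conflicting gradients of the stability and plasticity losses.
g_stab = [1.0, 0.0]
g_plast = [-3.0, 1.0]

average = [0.5 * (a + b) for a, b in zip(g_stab, g_plast)]
common = min_norm_direction(g_stab, g_plast)

# A step along -d decreases loss i (to first order) only if dot(d, g_i) > 0.
print(dot(average, g_stab) > 0, dot(average, g_plast) > 0)  # False True
print(dot(common, g_stab) > 0, dot(common, g_plast) > 0)    # True True
```

In this example a step along the averaged gradient would increase the stability loss, whereas the min-norm direction decreases both losses. The price is a much shorter step, which is one way to read the remark that MGDA-style solutions can be conservative.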

A key distinction is ParetoCL's dynamic adaptation at inference via a learned $\alpha \to \theta$ mapping, in contrast to approaches that seek Pareto solutions only during training. Other methods, including IBCL (Lu et al., 2023), highlight the need for scalable approaches to generate preference-conditional models that span the Pareto front efficiently, without incurring linear training overheads per preference.

A plausible implication is the consolidation of continual learning research around dynamic multi-objective formulations, with architectures and algorithms designed to efficiently populate and traverse the Pareto front with respect to multiple objectives, especially stability and plasticity.

7. Practical Considerations and Extensions

Memory efficiency is achieved in ParetoCL by storing only a modestly sized hypernetwork in addition to the experience replay buffer, avoiding the need for separate networks per trade-off. The preference prior is chosen as Dirichlet(1,1), with five sampled trade-offs per batch sufficient for robust learning; twenty are used at inference for fine-grained adaptation. A ResNet-18 encoder, SGD optimizer, and learning rate $\eta=0.05$ are typical.

Extensions to more than two objectives are straightforward: increasing the dimension of $\alpha$ and the corresponding hypernetwork capacity enables the approach to handle arbitrary numbers of conflicting continual learning objectives. Conditioning mechanisms can also be enriched, for example, by FiLM-modulating intermediate network blocks or integrating $\alpha$ into batch normalization statistics.
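
For more than two objectives, the only change on the sampling side is the dimension of the preference vector. A minimal sketch, assuming a Dirichlet prior and using the standard construction from normalized Gamma draws (the function name and the choice of three objectives are illustrative):

```python
import random
from typing import List, Sequence


def sample_dirichlet(concentration: Sequence[float], rng: random.Random) -> List[float]:
    """Draw one preference vector from Dirichlet(concentration) via normalized Gamma draws."""
    draws = [rng.gammavariate(c, 1.0) for c in concentration]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)
# Three hypothetical objectives; Dirichlet(1, 1, 1) is uniform on the 3-simplex.
alpha = sample_dirichlet([1.0, 1.0, 1.0], rng)
print(len(alpha))
```

Each sampled `alpha` is non-negative and sums to one, so it can be fed to a hypernetwork whose input dimension matches the number of objectives.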

Computational cost is driven mainly by the size of the replay buffer rather than by the preference-conditioning components, positioning ParetoCL as a practically scalable solution for large-scale continual learning scenarios (Lai et al., 30 Mar 2025).