Horizontal Continual Learning

Updated 26 June 2026

Horizontal continual learning is a framework where models learn from a sequential stream of tasks or domains without explicit task identifiers while preserving previous knowledge.
It tackles challenges like catastrophic forgetting and the stability–plasticity dilemma using methods such as replay-based strategies, parameter isolation, and context routing.
The approach underpins practical applications in domain-, class-, and task-incremental learning, enhancing LLMs and federated systems for robust, scalable adaptation in dynamic environments.

Horizontal continual learning refers to the paradigm in which a model is exposed to a temporal or domain-wise sequence of tasks, domains, or class increments and must accumulate new competencies over time while retaining performance on previously learned distributions, typically without explicit task-ID signals or access to the full set of past data. This setting, which covers domain-incremental, class-incremental, and task-incremental continual learning, is fundamental for scalable machine learning agents operating under non-stationary, open-world conditions. Core technical challenges include catastrophic forgetting, efficient memory utilization, task detection under latent shifts, and balancing the stability–plasticity trade-off across the sequence.

1. Formal Scope and Definitions

Horizontal continual learning is mathematically framed as learning a parametrized function $h$ mapping an input space $\mathcal X$ to an output space $\mathcal Y$ over a sequence of $T$ tasks or domains. Each task $t$ supplies a dataset $S_t$ drawn from a distribution $D_t$ on $\mathcal X \times \mathcal Y$, and $h$ is trained incrementally to minimize the risk over the union of all observed tasks: $\min_{h}\;R(h) = \sum_{t=1}^T \mathbb{E}_{(x, y)\sim D_t} [ \ell(h(x), y) ]$. Under the horizontal (as opposed to vertical) continual learning assumption, the data distribution $D_t$ shifts over time, through changes in its input marginal on $\mathcal X$ (domain-incremental) or in the support of its labels in $\mathcal Y$ (class-incremental), and no task oracle or explicit segmentation signal is available at inference (Shi et al., 2024, Kim et al., 2023, Mori et al., 2022, Bashivan et al., 2019, Ge et al., 2023, Liu et al., 2025).
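As a rough illustration of the objective, the expectation over each $D_t$ can be replaced by a mean over the sample $S_t$. The sketch below assumes scalar inputs and a squared loss purely to stay small; the names `empirical_risk` and `squared_loss` are illustrative and do not come from any of the cited works.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, float]  # (x, y); scalar inputs keep the sketch small


def squared_loss(prediction: float, target: float) -> float:
    """Illustrative choice for the loss l(h(x), y)."""
    return (prediction - target) ** 2


def empirical_risk(
    h: Callable[[float], float],
    tasks: Sequence[Sequence[Example]],
    loss: Callable[[float, float], float] = squared_loss,
) -> float:
    """Sum over tasks of the mean loss on that task's sample S_t."""
    total = 0.0
    for samples in tasks:
        if not samples:
            continue  # an empty sample contributes nothing
        total += sum(loss(h(x), y) for x, y in samples) / len(samples)
    return total


# Two toy tasks whose input/output relation shifts between them.
tasks: List[List[Example]] = [
    [(0.0, 0.0), (1.0, 1.0)],  # S_1: y = x
    [(0.0, 1.0), (1.0, 2.0)],  # S_2: y = x + 1
]
risk = empirical_risk(lambda x: x + 0.5, tasks)
```

In a continual setting the earlier samples are usually unavailable when later tasks arrive, so this quantity can be evaluated after the fact but not optimized directly.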

Key subtypes:

Class-incremental learning (CIL): At each step, new classes are introduced (label sets $\mathcal Y_t \subseteq \mathcal Y$ disjoint across tasks $t$), but at test time the model must predict over the union $\bigcup_{t} \mathcal Y_t$ of all classes seen so far (Kim et al., 2023, Liu et al., 2025).
Domain-incremental learning: The input distribution (the marginal of $D_t$ on $\mathcal X$) changes but label semantics remain fixed (Kirichenko et al., 2021, Bashivan et al., 2019).
Task-incremental learning: Multiple distinct input/output mappings are learned and, unlike in the two settings above, an explicit task identifier is available at test time (Ge et al., 2023).
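The class-incremental protocol can be sketched as a partition of the label set into disjoint groups, one per task, with the test-time label space growing as tasks are seen. The helper names below are hypothetical and the even split is an assumption; real benchmarks may use other orderings or group sizes.

```python
from typing import List, Sequence


def split_classes(classes: Sequence[int], classes_per_task: int) -> List[List[int]]:
    """Partition the label set into disjoint, consecutive groups (one per task)."""
    return [
        list(classes[i:i + classes_per_task])
        for i in range(0, len(classes), classes_per_task)
    ]


def seen_label_space(task_classes: Sequence[Sequence[int]], t: int) -> List[int]:
    """Labels the model must discriminate between after task index t (0-based)."""
    seen: List[int] = []
    for group in task_classes[: t + 1]:
        seen.extend(group)
    return seen


# Ten classes split into five two-class tasks.
task_classes = split_classes(list(range(10)), classes_per_task=2)
labels_after_third_task = seen_label_space(task_classes, t=2)
```

In the domain-incremental case the label space would stay fixed and only the inputs would change from task to task.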

For LLMs, the same principle applies: continual adaptation over time- or domain-partitioned text corpora without catastrophically forgetting prior knowledge (Shi et al., 2024).

2. Catastrophic Forgetting and Stability–Plasticity

A recurrent obstacle in horizontal continual learning is catastrophic forgetting: the tendency of parametric models, especially deep neural networks, to overwrite previously acquired representations when exposed to new domains or classes. Because the data stream is non-stationary, simply appending new outputs (as in class-incremental learning) or interleaving new data does not mitigate the problem. This gives rise to the central stability–plasticity dilemma: stability demands conservation of weights critical to old tasks, while plasticity requires flexible acquisition of new knowledge (Lavda et al., 2018, Ge et al., 2023, Kim et al., 2023).

Approaches to manage this trade-off include architectural parameter isolation (e.g., immutable backbones plus lightweight task-specific controllers), generative or functional replay (simulated past data), and regularization-based schemes such as Elastic Weight Consolidation (EWC), Synaptic Intelligence (SI), and Learning without Forgetting (LwF), which penalize updates to parameters or outputs that matter for prior tasks (Lavda et al., 2018, Kirichenko et al., 2021, Bashivan et al., 2019, Ge et al., 2023, Liu et al., 2025).
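To make the regularization idea concrete, here is a minimal sketch of the quadratic EWC penalty, $\frac{\lambda}{2}\sum_i F_i(\theta_i - \theta_i^*)^2$, which is added to the new-task loss. It assumes a diagonal Fisher estimate has already been computed elsewhere; parameters are plain Python lists and the function names are illustrative.

```python
from typing import List, Sequence


def ewc_penalty(
    params: Sequence[float],
    anchor_params: Sequence[float],
    fisher_diag: Sequence[float],
    strength: float,
) -> float:
    """Quadratic penalty anchoring params to the values learned on earlier tasks.

    fisher_diag[i] is an importance weight for parameter i (assumed precomputed).
    """
    return 0.5 * strength * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher_diag)
    )


def ewc_penalty_grad(
    params: Sequence[float],
    anchor_params: Sequence[float],
    fisher_diag: Sequence[float],
    strength: float,
) -> List[float]:
    """Gradient of the penalty; it is added to the new-task loss gradient."""
    return [
        strength * f * (p - a)
        for p, a, f in zip(params, anchor_params, fisher_diag)
    ]


penalty = ewc_penalty([1.0, 2.0], [0.0, 2.5], [2.0, 4.0], strength=1.0)
grad = ewc_penalty_grad([1.0, 2.0], [0.0, 2.5], [2.0, 4.0], strength=1.0)
```

A large importance weight makes the corresponding parameter stiff (stability), while a weight near zero leaves it free to move (plasticity).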

3. Algorithmic Methods and Frameworks

Research has yielded multiple algorithmic classes for horizontal continual learning:

Replay-based: Explicit storage (or generative modeling) of past examples, supporting interleaved training. Examples include generative replay in CCL-GM (Lavda et al., 2018), functional replay in HCL (Kirichenko et al., 2021), discriminative replay with OOD marking in MORE (Kim et al., 2023), and semi-parametric memory replay in BrainCL (Liu et al., 2025).
Parameter-isolation/Expansion: Division of model parameters into shared and task-unique sets—e.g., Channel-wise Lightweight Reprogramming (CLR), which freezes a task-agnostic “anchor” backbone and trains compact, per-task reprogramming matrices for new domains, enabling perfect stability at minimal storage growth (Ge et al., 2023). Supermasks in Superposition (SupSup) and Hard Attention to the Task (HAT) similarly isolate parameter subsets (Kim et al., 2023).
Context Routing: Mechanisms such as Self-Organizing Map–guided masking (SOMLP) allocate network capacity adaptively by clustering inputs into latent task contexts and gating updates, so the system needs no replay memory and is robust to unknown task boundaries (Bashivan et al., 2019).
Hybrid Generative–Discriminative Models: HCL employs invertible flows for joint likelihood estimation, enabling both task detection via typicality analysis and replay/functional regularization for retention (Kirichenko et al., 2021).
Federated and Heterogeneous Data Schemes: Continual Horizontal Federated Learning (CHFL) splits client networks into shared (FedAvg-trained) and unique (locally trained) feature columns, integrating lateral connections to maximize cross-client transfer without privacy compromise (Mori et al., 2022).
Principled Decomposition: Recent theory decomposes CIL into within-task prediction (WP) and task prediction (TP, equivalent to closed-world OOD detection), showing that good performance on both is necessary and sufficient for minimal CIL cross-entropy loss (Kim et al., 2023).
LLMs and Modularization: For LLMs, horizontal continual learning is operationalized via replay, regularization, parameter-efficient adaptation (adapters/LoRA experts), and modular expansion (e.g., DEMix, K-Adapter), across continual pretraining, domain-adaptive pretraining, and fine-tuning with or without instructions (Shi et al., 2024).
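For the replay-based family, a fixed-size memory is often the simplest starting point. The sketch below uses reservoir sampling, a generic buffer policy chosen here for illustration and not the mechanism of any method cited above, so that every example in the stream has an equal chance of being retained without knowing task boundaries. `ReservoirBuffer` and `mixed_batch` are hypothetical names.

```python
import random
from typing import Any, List, Sequence


class ReservoirBuffer:
    """Fixed-capacity memory holding a uniform random subset of the stream."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        self.capacity = capacity
        self.items: List[Any] = []
        self.num_seen = 0
        self._rng = random.Random(seed)

    def add(self, example: Any) -> None:
        self.num_seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
            return
        # Keep the new example with probability capacity / num_seen.
        slot = self._rng.randrange(self.num_seen)
        if slot < self.capacity:
            self.items[slot] = example

    def sample(self, k: int) -> List[Any]:
        return self._rng.sample(self.items, min(k, len(self.items)))


def mixed_batch(new_examples: Sequence[Any], buffer: ReservoirBuffer,
                replay_size: int) -> List[Any]:
    """Interleave current-task examples with replayed ones for one update."""
    return list(new_examples) + buffer.sample(replay_size)


buffer = ReservoirBuffer(capacity=5)
for example_id in range(100):  # stand-in for a stream of past examples
    buffer.add(example_id)
batch = mixed_batch(["new_a", "new_b", "new_c"], buffer, replay_size=2)
```

Generative or functional replay would replace the stored items with samples or outputs from a model of past data, while the interleaving step stays the same.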

4. Architectures, Optimization, and Practical Algorithms

Table: Representative Frameworks for Horizontal Continual Learning

Approach	Architecture/Module	Key Principle
Replay	Buffer or generative model	Simulates or regenerates old data
Parameter isolation	Anchor backbone + task modules (CLR)	No overwriting; stable base
Context routing	SOMLP	Competitive allocation
Lateral transfer	CHFL columns (FedAvg-trained shared + locally trained unique)	Efficient cross-client reuse
Modularization	LLM experts/adapters	Parameter-efficient expansion

Replay-based and generative approaches maintain performance on earlier tasks by interleaving reconstructed or buffered data during training. Parameter-isolation and modular reprogramming (e.g., CLR, SupSup) maximize theoretical and empirical task decoupling, preventing old-task performance drift at the expense of parameter growth. Methods such as CHFL exploit distributed, heterogeneous environments by partitioning feature spaces and leveraging continual learning to integrate private information without global interference.
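The decoupling argument for parameter isolation can be illustrated with a toy model: a frozen shared backbone plus a small per-task scale-and-shift module. This is a simplified stand-in and not the CLR reprogramming layer itself; the class name and the scalar-input setup are assumptions made for brevity.

```python
from typing import Dict, List, Sequence


class IsolatedModel:
    """Frozen shared weights plus small per-task (scale, shift) modules."""

    def __init__(self, backbone_weights: Sequence[float]) -> None:
        # Stored as a tuple so the shared weights cannot be modified in place.
        self.backbone = tuple(backbone_weights)
        self.task_modules: Dict[str, Dict[str, List[float]]] = {}

    def add_task(self, task_id: str) -> None:
        n = len(self.backbone)
        # Identity initialisation: a new task starts from the backbone features.
        self.task_modules[task_id] = {"scale": [1.0] * n, "shift": [0.0] * n}

    def features(self, x: float, task_id: str) -> List[float]:
        module = self.task_modules[task_id]
        return [
            s * (w * x) + b
            for w, s, b in zip(self.backbone, module["scale"], module["shift"])
        ]

    def update_task(self, task_id: str, scale_grad: Sequence[float],
                    shift_grad: Sequence[float], lr: float = 0.1) -> None:
        # Only the module of the given task is touched by a gradient step.
        module = self.task_modules[task_id]
        module["scale"] = [s - lr * g for s, g in zip(module["scale"], scale_grad)]
        module["shift"] = [b - lr * g for b, g in zip(module["shift"], shift_grad)]


model = IsolatedModel([0.5, -1.0])
model.add_task("A")
model.add_task("B")
before = model.features(2.0, "A")
model.update_task("B", scale_grad=[1.0, 1.0], shift_grad=[-1.0, 0.5])
after = model.features(2.0, "A")  # unchanged: task B's update cannot reach task A
```

The cost is visible in `task_modules`, which grows by one entry per task, and in the need to know (or infer) which module to use at inference.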

Losses and regularization terms vary: generative continual models optimize a joint ELBO (reconstruction/classification bound) and may include posterior consistency or information gain constraints (Lavda et al., 2018, Kirichenko et al., 2021). CHFL coordinates parallel optimization over shared/federated and private columns, with lateral connection hyperparameters controlling forward knowledge transfer (Mori et al., 2022). BrainCL executes two-stage wake-sleep consolidation leveraging nonparametric cues and parametric decoders (Liu et al., 2025). In CIL theory, the cross-entropy loss decomposes additively into within-task and task-identification terms; keeping both small is shown to be necessary and sufficient for low overall CIL loss (Kim et al., 2023).
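The split between federated and private parameters can be sketched as follows: only the shared column is averaged across clients (FedAvg, weighted by local example counts), while each client's private column never leaves the client. This is an assumption-laden simplification that omits CHFL's lateral connections and local training steps; the dictionary layout and function names are hypothetical.

```python
from typing import Dict, List, Sequence


def fedavg(client_weights: Sequence[Sequence[float]],
           client_sizes: Sequence[int]) -> List[float]:
    """Average weight vectors, weighting each client by its number of examples."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]


def federated_round(clients: List[Dict]) -> List[float]:
    """Aggregate the shared column only; private columns stay on each client."""
    shared = fedavg(
        [c["shared"] for c in clients],
        [c["num_examples"] for c in clients],
    )
    for c in clients:
        c["shared"] = list(shared)  # broadcast the aggregate back
    return shared


clients = [
    {"shared": [1.0, 2.0], "private": [9.0], "num_examples": 1},
    {"shared": [3.0, 4.0], "private": [-9.0], "num_examples": 3},
]
aggregate = federated_round(clients)
```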

5. Benchmark Protocols and Empirical Insights

Horizontal continual learning is assessed on domain- and class-incremental classification benchmarks, including Permuted/Rotated MNIST (Bashivan et al., 2019, Lavda et al., 2018), SVHN→MNIST (Kirichenko et al., 2021), CIFAR-10/100 splits, ImageNet-100 (Liu et al., 2025), and a 53-task object benchmark (Ge et al., 2023), as well as on LLM benchmarks and settings such as TimeLMs, TemporalWiki, DEMix, and CITB (Shi et al., 2024). Metrics include average and per-task accuracy, forgetting, forward transfer, and OOD detection AUC.
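Two of these metrics can be computed from an accuracy matrix in which entry `acc[i][j]` is the accuracy on task `j` measured after training on task `i`. The sketch below uses one common definition of forgetting (best earlier accuracy minus final accuracy, averaged over all but the last task); individual papers may define it differently, and the numbers are made up for illustration.

```python
from typing import Sequence


def average_accuracy(acc: Sequence[Sequence[float]]) -> float:
    """Mean accuracy over all tasks after training on the final task."""
    final = acc[-1]
    return sum(final) / len(final)


def average_forgetting(acc: Sequence[Sequence[float]]) -> float:
    """Mean drop from each task's best earlier accuracy to its final accuracy."""
    num_tasks = len(acc)
    if num_tasks < 2:
        return 0.0
    drops = []
    for j in range(num_tasks - 1):  # the last task cannot have been forgotten yet
        best_earlier = max(acc[i][j] for i in range(j, num_tasks - 1))
        drops.append(best_earlier - acc[-1][j])
    return sum(drops) / len(drops)


# Illustrative values; entries above the diagonal are unused placeholders.
acc = [
    [0.90, 0.00, 0.00],
    [0.80, 0.85, 0.00],
    [0.70, 0.80, 0.90],
]
avg_acc = average_accuracy(acc)
avg_forgetting = average_forgetting(acc)
```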

Empirical highlights:

Generative replay in CCL-GM achieves ~97% average accuracy in 5-task Permuted-MNIST, outperforming EWC and non-replay baselines (Lavda et al., 2018).
CHFL surpasses vanilla HFL and local-per-client training in federated heterogeneous benchmarks by several percentage points; optimal performance requires moderate correlation between shared and unique features (Mori et al., 2022).
CLR achieves ~86% average accuracy across 53 vision tasks with <0.6% additional parameters per task and zero forgetting of initial tasks (Ge et al., 2023).
SOMLP, using no data buffer or explicit task signals, matches or exceeds EWC/GEM in memory-constrained regimes on MNIST-Permutation/Rotation (Bashivan et al., 2019).
In continual learning for LLMs, parameter-efficient modularization (DEMix, K-Adapter), judicious replay, and contrastive replay selection each contribute to minimizing horizontal forgetting while maintaining upstream transfer (Shi et al., 2024).
Wake-sleep BrainCL retains 78.7% average and 74.2% final accuracy on ImageNet-100 with tight memory budgets, outperforming raw-image replay and parameter expansion under the same storage budget (Liu et al., 2025).

6. Theoretical Developments and Open-World Perspectives

Recent theoretical results decompose the horizontal continual learning loss (for closed- and open-world CIL) into the sum of within-task and task-identification (OOD detection) cross-entropies, and show that low loss on both terms is necessary and sufficient for low overall CIL loss (Kim et al., 2023). Algorithms that strengthen both internal discriminative power and boundary detection (e.g., integrating strong OOD detectors with parameter-isolation architectures) are reported to be among the strongest empirically.
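The additive form follows from factoring the class probability into a within-task term and a task term. The derivation below is a sketch in notation chosen here (true class $j$ inside task $k$, per-sample losses $H_{\mathrm{WP}}$, $H_{\mathrm{TP}}$, $H_{\mathrm{CIL}}$), which may differ from the notation of the cited paper.

```latex
% Assumed notation: x has true class j within task k.
\begin{aligned}
P\bigl(y = (k, j) \mid x\bigr)
  &= \underbrace{P\bigl(y = j \mid x,\ \text{task} = k\bigr)}_{\text{within-task prediction (WP)}}
     \cdot
     \underbrace{P\bigl(\text{task} = k \mid x\bigr)}_{\text{task prediction (TP)}} \\[4pt]
H_{\mathrm{CIL}}(x)
  &= -\log P\bigl(y = (k, j) \mid x\bigr) \\
  &= -\log P\bigl(y = j \mid x,\ \text{task} = k\bigr)
     \;-\; \log P\bigl(\text{task} = k \mid x\bigr) \\
  &= H_{\mathrm{WP}}(x) + H_{\mathrm{TP}}(x)
\end{aligned}
```

Under this factorization, bounds $H_{\mathrm{WP}}(x) \le \epsilon$ and $H_{\mathrm{TP}}(x) \le \delta$ give $H_{\mathrm{CIL}}(x) \le \epsilon + \delta$, and since both terms are non-negative, a small $H_{\mathrm{CIL}}$ forces each of them to be small as well.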

The open-world extension generalizes CIL: the model must not only classify all previously encountered classes but simultaneously reject (flag as OOD) all unseen classes or domains. This unifies novelty detection and incremental learning, requiring both precise within-domain accuracy and robust detection of semantically non-overlapping inputs. Techniques such as replay-as-OOD, hard attention, and ensemble distance weighting emerge as key instantiations (Kim et al., 2023, Shi et al., 2024).
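One way to picture the open-world decision rule is to combine per-task class probabilities with a task-membership score and reject the input when no task claims it. The sketch below assumes the task scores come from some OOD detector and lie in [0, 1]; the threshold, the product rule, and the function names are illustrative choices, not the procedure of a specific cited method.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_open_world(
    task_logits: Dict[str, Sequence[float]],
    task_scores: Dict[str, float],
    reject_threshold: float,
) -> Optional[Tuple[str, int]]:
    """Return (task_id, class_index), or None if the input is flagged as unseen.

    task_scores[t] is an assumed in-distribution score for task t.
    """
    if max(task_scores.values()) < reject_threshold:
        return None  # no known task claims this input
    best, best_joint = None, -1.0
    for task_id, logits in task_logits.items():
        for class_index, p in enumerate(softmax(logits)):
            joint = p * task_scores[task_id]  # within-task prob. x task score
            if joint > best_joint:
                best, best_joint = (task_id, class_index), joint
    return best


logits = {"t0": [2.0, 0.0], "t1": [0.0, 3.0]}
known = predict_open_world(logits, {"t0": 0.9, "t1": 0.1}, reject_threshold=0.5)
unseen = predict_open_world(logits, {"t0": 0.2, "t1": 0.1}, reject_threshold=0.5)
```

The rejection branch is what distinguishes the open-world rule from closed-world CIL, where the highest-scoring known class would always be returned.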

7. Future Directions and Challenges

Prominent open problems are:

Long-horizon scalability: Most methods are tested on modest task counts; robust algorithms for settings with tens or hundreds of horizontal shifts (as in real-world LLM usage or 50+ domain vision streams) remain an open need (Ge et al., 2023, Shi et al., 2024).
Efficient memory/computation: Parameter-isolation and replay costs scale linearly with task count or buffer size; methods for selective memory, adaptive module reuse, or compression (e.g., BrainCL cue coding) are under investigation (Liu et al., 2025).
Boundary-free adaptation: Real-world data rarely provides clear task shifts; dynamic, unsupervised, or latent task/context discovery (as in SOMLP and HCL) is an important theme (Bashivan et al., 2019, Kirichenko et al., 2021).
Unified benchmarks and metrics: Standardized evaluation for horizontal continual pretraining, fine-tuning, and multimodal LLMs is still lacking (Shi et al., 2024).
Theory for large models: Theoretical advances for horizontal continual learning in deep nonlinear and LLM regimes lag empirical work; formalization of transfer and retention in high-dimensional, modular architectures is nascent.