Horizontal Continual Learning
- Horizontal continual learning is a paradigm where models adapt to sequential, evolving data distributions without explicit task boundaries.
- It employs replay, regularization, and modular architectures to balance adapting to new data while preserving previously learned knowledge.
- Evaluation metrics like backward and forward transfer reveal tradeoffs between rapid adaptation and long-term retention across tasks.
Horizontal continuity (“horizontal continual learning,” also referred to as domain-incremental learning or horizontal CL) is a paradigm in continual learning where models are exposed to a sequence of data distributions or tasks over time, with a fixed output (label) space and without explicit task boundaries. The learning system must continually adapt to non-stationary input distributions while retaining its performance on previously observed data, thus combating catastrophic forgetting. Unlike “vertical” continual learning, which progresses from general to increasingly specialized tasks, horizontal continuity emphasizes model robustness across a “horizontal” chain of similar-scale domains, temporal increments, or environments.
1. Formal Definitions and Conceptual Foundations
Horizontal continual learning is characterized by the following properties:
- Fixed Label Space: The set of output classes remains constant across time or task increments.
- Dynamics of Distribution Shift: The joint distribution $P_t(x, y)$ evolves over timesteps $t = 1, \dots, T$, i.e., the learner observes a sequence $\{(x_t, y_t)\}_{t=1}^{T}$ with $(x_t, y_t) \sim P_t$ and, in general, $P_t \neq P_{t+1}$.
- Absence of Task Identifiers: No explicit task IDs are provided; the agent must learn to generalize across time and distributional changes without side information indicating task boundaries (Cai et al., 2021).
- Incremental Model Updates: At each step $t$, the model predicts $\hat{y}_t = f_{\theta_t}(x_t)$, incurs a loss $\ell(\hat{y}_t, y_t)$, and updates its parameters $\theta_t \rightarrow \theta_{t+1}$ within strict memory/compute bounds.
Performance metrics (as instantiated in (Cai et al., 2021, Shi et al., 2024)), where $a_{i,j}$ denotes accuracy on the data of step $j$ evaluated after training through step $i$:
- Cumulative average accuracy: $\bar{A}_T = \frac{1}{T}\sum_{t=1}^{T} a_{T,t}$.
- Backward transfer: $\mathrm{BWT} = \frac{1}{T-1}\sum_{i=1}^{T-1}\left(a_{T,i} - a_{i,i}\right)$.
- Forward transfer: $\mathrm{FWT} = \frac{1}{T-1}\sum_{i=2}^{T}\left(a_{i-1,i} - b_i\right)$, where $b_i$ is the accuracy on step $i$ of a reference model trained from random initialization.
- Horizontal forgetting up to step $k$: $F_k = \frac{1}{k-1}\sum_{i=1}^{k-1}\left(\max_{t<k} a_{t,i} - a_{k,i}\right)$ (Shi et al., 2024).
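These metrics can all be read off the step-by-step accuracy matrix. A minimal NumPy sketch, assuming `A[i, j]` holds accuracy on the data of step `j` after training through step `i` (the function name `cl_metrics` is illustrative):

```python
import numpy as np

def cl_metrics(A, b=None):
    """Standard continual-learning metrics from an accuracy matrix.

    A[i, j] = accuracy on the data of step j after training through step i.
    b[i]    = accuracy of a randomly initialized reference model on step i
              (needed only for forward transfer).
    """
    T = A.shape[0]
    avg_acc = A[T - 1].mean()                                        # cumulative average accuracy
    bwt = np.mean([A[T - 1, i] - A[i, i] for i in range(T - 1)])     # backward transfer
    fwt = None
    if b is not None:
        fwt = np.mean([A[i - 1, i] - b[i] for i in range(1, T)])     # forward transfer
    # horizontal forgetting up to the final step: best-ever minus final accuracy
    forgetting = np.mean([A[:T - 1, i].max() - A[T - 1, i] for i in range(T - 1)])
    return avg_acc, bwt, fwt, forgetting
```

Negative backward transfer indicates forgetting; positive forward transfer indicates that earlier steps help before the new data is ever trained on.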
The central challenge is to preserve knowledge retention (low forgetting) while maintaining adaptation to new, potentially unexpected, data distributions over time.
2. Benchmark Datasets and Evaluation Protocols
Horizontal continual learning is evaluated using benchmarks that induce natural or synthetic non-stationarity in the data stream without explicit task demarcation.
- CLOC (Continual LOCalization) (Cai et al., 2021): A large-scale visual benchmark with 39M geolocated images temporally ordered (2004–2014), 712 S2 spatial cells used as classes, highly nonstationary due to real-world temporal/geographical shift. Each time step ingests a small user-album; evaluation involves “test-then-train” cycles, real-world distribution drift, and no explicit task markers.
- Temporal and domain-adaptive LLM benchmarks (Shi et al., 2024): Evolving corpora such as quarterly Twitter streams (TimeLMs), Wikipedia snapshots (TemporalWiki), shifting from one discipline or language to another (news → legal, English → Norwegian → Icelandic).
- Horizontal Continual NLP (Michieli et al., 2024): Sequences of tasks (e.g., aspect or document-level sentiment classification across many domains), where each problem constitutes either a new domain or task.
Evaluation protocol: In all cases, metrics are tracked online and plotted as a function of time or task index; assessment must cover both instantaneous efficacy on incoming data and post-hoc retention on past data (backward transfer).
3. Algorithmic Strategies for Horizontal CL
The algorithmic approaches for horizontal continuity span a diverse spectrum:
- Replay-based methods: Maintain a buffer of past examples for rehearsal (ER, Mix-Review, Dark Experience Replay), often with buffer size dynamically adapted (e.g., ADRep (Cai et al., 2021)) to balance overfitting and underfitting.
- Parameter regularization: Constrain updates to avoid deviating from parameters deemed important for previous data using Fisher information (EWC, MAS), per-parameter penalties, or functional regularization.
- Architectural isolation/expansion: Allocate new modules (adapters, domain experts) for each domain/time increment, freezing or reusing others; examples include LLM adapters (Shi et al., 2024), parameter-efficient adapters in NLP (Michieli et al., 2024), or modular CNN units (Berjaoui, 2020).
- Optimization and learning-rate adaptation: Online learning efficacy is sensitive to optimizer hyperparameters. Adaptive schedules such as PoLRS (Cai et al., 2021), or learning-rate search, support rapid adaptation while slow schedules (annealing, cosine) favor retention (Cai et al., 2021).
- Flatness-based constraints: Build overlapping “flat” regions in weight space per task (C&F), enforcing stepwise optimization (minimax local flattening, Fisher-based regularization) to ensure low-loss regions persist across tasks (Shi et al., 2023).
- Orthogonalization and geometric coding: Project gradient updates into the nullspace of previous representations (HSIC-Bottleneck Orthogonalization (Li et al., 2024)), or separate the decision boundary via fixed equiangular geometries (Equiangular Embedding) to block forgetting without architectural growth (Li et al., 2024).
- Generative modeling for replay: Use generative models (normalizing flows) to sample synthetic replay data, or employ Bayesian anomaly detection for online task assignment (Kirichenko et al., 2021).
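As a concrete instance of the replay-based family above, a fixed-size buffer maintained by reservoir sampling gives every stream example equal probability of being retained; this is a generic sketch, not the adaptive ADRep algorithm of Cai et al. (2021):

```python
import random

class ReservoirBuffer:
    """Fixed-size replay buffer via reservoir sampling: after n_seen examples,
    each has probability capacity/n_seen of residing in the buffer."""
    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.data = []
        self.n_seen = 0
        self.rng = random.Random(seed)

    def add(self, example):
        self.n_seen += 1
        if len(self.data) < self.capacity:
            self.data.append(example)        # buffer not yet full: always keep
        else:
            j = self.rng.randrange(self.n_seen)
            if j < self.capacity:
                self.data[j] = example       # replace a random slot

    def sample(self, k):
        """Draw a replay mini-batch to interleave with the incoming batch."""
        return self.rng.sample(self.data, min(k, len(self.data)))
```

A rehearsal step then trains on the union of the incoming batch and `buffer.sample(k)`, which is the core of ER-style methods.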
4. Theoretical Insights and Empirical Phenomena
Flatness and Overlap: Overlapping flat regions in the loss landscape are essential for horizontal continuity. The C&F framework creates local flat optima and regularizes towards overlapping parameter spaces, reducing both forgetting and intransigence. Empirically, only joint create+find schemes maintain low Hessian spectra and parameter overlap (Shi et al., 2023).
Linear Mode Connectivity: Minima found via multitask and well-regularized continual learning (with shared initialization) can be connected by low-loss linear paths. Mode-Connectivity SGD (MC-SGD) enforces such properties and closes the gap between continual and multitask solutions (Mirzadeh et al., 2020). Without this, continual minima often reside in disconnected regions, resulting in significant loss barriers and forgetting.
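The linear-connectivity diagnostic is easy to implement: evaluate the loss along the straight line between two solutions and measure the worst-case rise above the endpoints. A minimal sketch with a toy double-well loss standing in for a network loss:

```python
import numpy as np

def linear_path_barrier(loss_fn, w_a, w_b, n_points=21):
    """Loss along the linear interpolation between solutions w_a and w_b.

    The 'barrier' is the maximum interpolated loss minus the worse endpoint
    loss; a near-zero barrier indicates linear mode connectivity.
    """
    alphas = np.linspace(0.0, 1.0, n_points)
    losses = np.array([loss_fn((1 - a) * w_a + a * w_b) for a in alphas])
    barrier = losses.max() - max(losses[0], losses[-1])
    return alphas, losses, barrier
```

For the double well $(w^2 - 1)^2$, the two minima at $w = \pm 1$ are separated by a barrier of height 1 at $w = 0$; connected continual/multitask minima would instead yield a barrier near zero.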
Tradeoffs in Optimization:
- Small or annealed learning rates protect knowledge but slow adaptation.
- Large/adaptive rates accelerate fit but degrade retention; hence population-based schemes can balance this (Cai et al., 2021).
- Larger batch sizes, unlike in iid supervised learning, worsen online fit and retention due to increased bias toward past distributions (Cai et al., 2021).
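A population-based learning-rate scheme in the spirit of PoLRS can be sketched as a periodic selection round: several copies train with different rates, the best performer is cloned into the others, and the rates are respaced around the winner's. The details below (clone-and-respace by a fixed factor) are illustrative assumptions, not the exact algorithm of Cai et al. (2021):

```python
import copy

def population_lr_step(learners, eval_fn, factor=2.0):
    """One selection round over a population of (model, lr) pairs.

    The best-scoring model is copied into every slot, and the learning
    rates are respaced geometrically around the winner's rate.
    """
    scores = [eval_fn(model) for model, _ in learners]
    best_model, best_lr = learners[scores.index(max(scores))]
    new_lrs = [best_lr / factor, best_lr, best_lr * factor]
    return [(copy.deepcopy(best_model), lr) for lr in new_lrs[:len(learners)]]
```

Between selection rounds, each copy takes ordinary online updates with its own rate, so the population continually re-centers on whichever rate currently balances adaptation and retention best.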
Limits of Representation Sharing: With linear feature extractors and orthogonal projected gradients, provable guarantees of both adaptation and zero-forgetting are possible (DPGrad) (Peng et al., 2022). For general non-linear representations without replay or expansion, such no-forgetting guarantees are impossible even information-theoretically: some tasks fundamentally interfere—all known methods either restrict capacity, grow model size, or store exemplars.
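The orthogonal-projection idea underlying DPGrad-style guarantees can be sketched in a few lines: project each gradient onto the orthogonal complement of the span of previous tasks' feature directions, so that (for a linear layer) the update cannot change outputs on old inputs. This is a generic sketch of the technique, not the paper's exact procedure:

```python
import numpy as np

def project_out(grad, feature_basis):
    """Remove from `grad` its component in the span of old-task features.

    feature_basis: matrix whose columns span previous tasks' input features.
    The returned gradient is orthogonal to every old feature direction.
    """
    Q, _ = np.linalg.qr(feature_basis)   # orthonormal basis of the old-feature span
    return grad - Q @ (Q.T @ grad)       # subtract the in-span component
```

Because the projected update is orthogonal to all stored feature directions, a linear predictor's outputs on those directions are provably unchanged; for non-linear networks no such guarantee carries over, which is exactly the limitation noted above.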
5. Instantiations Across Modalities and Architectures
LLMs (Shi et al., 2024):
- Horizontal continuity for LLMs is operationalized through episodic replay, regularization targeting prior-domain knowledge, modular parameter allocation (adapters, LoRA, domain experts), and optimization-based sample selection.
- Benchmark tasks include sequential domain adaptation (news → law, English → Norwegian), temporal evolution (Wikipedia), continual fine-tuning (question answering task streams).
- Architectural and hybrid approaches (e.g., RecAdam, DEMix, Orthogonal LoRA) are widely used to balance plasticity and stability.
- Evaluation employs performance matrices $A = [a_{i,j}]$ (accuracy on step $j$ after training through step $i$), horizontal forgetting $F_k$, forward transfer metrics, and LAMA/TRACE probes for factual knowledge retention.
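Modular parameter allocation of the LoRA/adapter kind amounts to keeping the base weight frozen and storing one low-rank delta per domain. A minimal NumPy sketch of this pattern (class and method names are illustrative, not any paper's code):

```python
import numpy as np

class LoRALinear:
    """Frozen base weight with one low-rank (LoRA-style) delta per domain."""
    def __init__(self, W, rank=4, seed=0):
        self.W = W                       # frozen pretrained weight, d_out x d_in
        self.rank = rank
        self.adapters = {}               # domain name -> (A, B) low-rank factors
        self.rng = np.random.default_rng(seed)

    def add_domain(self, name):
        d_out, d_in = self.W.shape
        A = self.rng.normal(scale=0.01, size=(self.rank, d_in))
        B = np.zeros((d_out, self.rank))   # zero-init B: new domain starts at the base model
        self.adapters[name] = (A, B)

    def forward(self, x, domain):
        A, B = self.adapters[domain]
        return (self.W + B @ A) @ x        # effective weight = base + low-rank delta
```

Only the small `(A, B)` pair is trained per domain, so earlier domains' behavior is preserved exactly by routing to their own adapters, which is the stability side of the plasticity/stability balance.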
Vision (e.g., CLOC, HRN) (Cai et al., 2021, Berjaoui, 2020):
- Horizontal CL in vision is instantiated via timestamped and location-shifted dataset streams, modular CNN blocks with orthogonal feature hashing to isolate task representations, and minimal or no data replay.
- Dynamic allocation of sub-networks or hashing-based unit expansion ensures both adaptability and knowledge persistence while avoiding catastrophic forgetting.
NLP and Modular Adapters (Michieli et al., 2024):
- Horizontal CL in NLP leverages frozen backbones with per-domain/task adapters and high-order pooling to encode token-level distributional shift.
- Adaptation occurs by initializing each new task’s adapter from the previous, with only modest parameter growth, and achieves empirically superior transfer/retention on diverse benchmarks.
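The warm-start scheme described above (each new task's adapter initialized from the previous one, backbone frozen) can be sketched as a small driver loop; `make_adapter` and `train` are hypothetical callables standing in for adapter construction and task-specific fine-tuning:

```python
import copy

def warm_start_adapters(task_stream, make_adapter, train):
    """Sequential adapter training with warm-start initialization.

    Each new task's adapter is a deep copy of the previous task's trained
    adapter; the per-task adapters are retained for inference, so earlier
    tasks are never overwritten.
    """
    adapters = {}
    prev = None
    for task in task_stream:
        adapter = copy.deepcopy(prev) if prev is not None else make_adapter()
        train(adapter, task)       # only the adapter is updated; backbone stays frozen
        adapters[task] = adapter   # keep this task's adapter for later inference
        prev = adapter
    return adapters
```

Parameter growth is linear in the number of tasks but small (one adapter each), while the warm start carries forward whatever the previous domain learned.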
6. Open Problems, Challenges, and Best Practices
- Memory and Compute Constraints: Balancing buffer size for replay (too small → overfit, too large → dilute new data), parameter modularity, and real-world resource limits is critical (Cai et al., 2021, Shi et al., 2024).
- Evaluation and Benchmarks: Lack of standardized, large-scale, cross-domain/multimodal benchmarks impedes systematic progress (Shi et al., 2024). Metrics for knowledge acquisition, retention, and update (e.g., FUAR, X-Delta) are still evolving.
- Theoretical Understanding: Most generalization bounds target static or vertical CL; horizontal CL, especially for LLMs or deep non-linear nets, is theoretically underdeveloped (Shi et al., 2024, Peng et al., 2022).
- Catastrophic Forgetting Mitigation: All mechanisms—replay, regularization, modularity, controlled optimization—must be tuned to both task sequence and memory/computation tradeoffs.
- Practical Recipes: Empirically supported strategies for strong horizontal CL include: small batch sizes, population-based learning-rate schedules, adaptive replay buffer sizing, flatness-based regularization, and modular parameter growth or freezing (Cai et al., 2021, Shi et al., 2023, Berjaoui, 2020, Michieli et al., 2024).
In summary, horizontal continuity is both a unifying conceptual direction and a set of algorithmic, architectural, and empirical strategies for continual learning across non-stationary data streams. Central to its success are principled mechanisms for balancing plasticity and stability, modularizing or protecting parameters, adaptively leveraging memory and compute resources, and directly evaluating online learning, backward, and forward transfer over non-iid, ever-evolving input sequences (Cai et al., 2021, Shi et al., 2024, Shi et al., 2023, Li et al., 2024, Berjaoui, 2020, Mirzadeh et al., 2020, Kirichenko et al., 2021, Peng et al., 2022, Michieli et al., 2024).