Vertical Continual Learning

Updated 19 February 2026
  • Vertical continual learning is the process of incrementally adapting general neural networks to increasingly specialized domains while preserving foundational knowledge.
  • Architectural innovations such as adapter modules, progressive expansion, and hierarchical mixtures enable effective specialization and minimize catastrophic forgetting.
  • Regularization, replay techniques, and Bayesian approaches are combined to mitigate vertical forgetting, ensuring model stability across adaptation stages.

Vertical continuity, or vertical continual learning, refers to the preservation and refinement of neural network capabilities across hierarchical levels of task specialization, typically by incrementally adapting a general model to progressively more specific domains, tasks, or data modalities while minimizing catastrophic forgetting. This paradigm contrasts with horizontal continual learning, which deals with the sequential acquisition of knowledge across tasks of similar granularity, whether over time or across non-i.i.d. environments. The objective of vertical continuity is to enable models to absorb new, specialized knowledge without sacrificing their foundational, general-purpose competencies, using mechanisms involving architectural expansion, regularization, rehearsal, and information-theoretic controls.

1. Conceptual Foundations and Formal Definitions

Vertical continuity describes the process by which a model, often starting from a large, generic backbone (e.g., a pre-trained LLM or vision network), is successively adapted through stages to domains of increasing specificity. In LLMs, this manifests as a multi-phase pipeline: Continual Pre-Training (CPT) on vast, heterogeneous corpora; Domain-Adaptive Pre-Training (DAP) using unlabeled, domain-specific text; and Continual Fine-Tuning (CFT) on narrow, task-oriented, supervised datasets. Each stage imposes its own data distribution, scale, and objective function. The central risk is "vertical forgetting," in which the specialization process degrades the model's initial broad competence (Shi et al., 2024).
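Vertical forgetting can be made concrete with a toy experiment (not drawn from the cited works; all names and numbers here are illustrative): fit a linear model to a "general" task, then specialize it to a related "target" task, with and without anchoring the new weights to the old ones, and compare how much general-task performance degrades.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task(w_true, n=200, d=5, noise=0.1):
    """Synthetic linear-regression task with ground-truth weights w_true."""
    X = rng.normal(size=(n, d))
    y = X @ w_true + noise * rng.normal(size=n)
    return X, y

def fit_ridge(X, y, w_anchor=None, lam=0.0):
    """Least squares, optionally anchored: min ||Xw - y||^2 + lam * ||w - w_anchor||^2."""
    d = X.shape[1]
    if w_anchor is None:
        w_anchor = np.zeros(d)
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y + lam * w_anchor)

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

# "General" source task and a related but shifted "specialized" target task
w_general = rng.normal(size=5)
w_special = w_general + 0.3 * rng.normal(size=5)
Xg, yg = make_task(w_general)
Xs, ys = make_task(w_special)

w0      = fit_ridge(Xg, yg)                              # stage 1: general training
w_naive = fit_ridge(Xs, ys)                              # stage 2a: unconstrained specialization
w_anch  = fit_ridge(Xs, ys, w_anchor=w0, lam=200.0)      # stage 2b: anchored to stage-1 weights

print("general-task MSE, naive adaptation:   ", mse(w_naive, Xg, yg))
print("general-task MSE, anchored adaptation:", mse(w_anch, Xg, yg))
```

The anchored solution trades some target-task fit for much lower degradation on the original task, which is the stability/plasticity trade-off that the staged CPT/DAP/CFT pipeline must manage at every level.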

Formally, if $\mathcal{S}$ is the original (source) distribution and $\mathcal{T}$ the downstream target, domain adaptation bounds such as

$\epsilon_{\mathcal{T}}(h) \leq \epsilon_{\mathcal{S}}(h) + \tfrac{1}{2} d_{\Delta}(\mathcal{S}, \mathcal{T}) + \lambda$

express the interplay between source error, divergence, and irreducible joint risk. Vertical continual learning aims to bridge $d_{\Delta}$ via staged adaptation while regularizing to preserve $\epsilon_{\mathcal{S}}(h)$.
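As a plug-in illustration (the numbers are hypothetical, and treating each stage's divergence independently is a heuristic, not a rigorous composition of the bound):

```python
def target_error_bound(eps_source, divergence, joint_risk):
    """Evaluate eps_T(h) <= eps_S(h) + (1/2) * d_Delta(S, T) + lambda."""
    return eps_source + 0.5 * divergence + joint_risk

# Direct adaptation across a large divergence vs. a hop from an
# intermediate stage that has already reduced the remaining divergence
print(target_error_bound(0.05, 0.60, 0.02))  # direct: large d_Delta
print(target_error_bound(0.05, 0.20, 0.02))  # staged: smaller remaining d_Delta
```

The smaller the remaining divergence at each hop, the tighter the guarantee on target error, which is the quantitative motivation for staging adaptation rather than jumping straight to the narrow domain.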

In federated and distributed settings, vertical continual learning extends to scenarios where the feature space, label space, or participant identities evolve over time, necessitating collaborative, privacy-preserving mechanisms as well as retention of both previous and current task knowledge (Wang et al., 13 Feb 2025).

2. Architectural Approaches to Vertical Continuity

Numerous works have formalized and instantiated vertical continuity through architectural innovations:

  • Increasing Network Depth: Progressive addition of new layers for each incoming task, inspired by the Progressive Neural Network (PNN) framework, has been proposed for continual learning (Kozal et al., 2022). The method dynamically expands only relevant portions of the architecture, creating a tree-like parameter-sharing structure where each node corresponds to a set of parameters dedicated to a specific task. This design enables forward transfer and adaptation of prior representations while guaranteeing no forgetting by architectural isolation.
  • Parameter-Efficient Expansion: In LLMs, adapter modules, LoRA factorizations, and mixture-of-experts layers are appended at each adaptation stage. These expansions allow new capabilities to be learned in small parameter subspaces without overwriting the core model weights (Shi et al., 2024).
  • Hierarchically Structured Mixture-of-Experts: The Mixture-of-Variational-Experts (MoVE) layer, as introduced in Hierarchical VCL (Hihn et al., 2022), enforces vertical continuity by constructing a compositional architecture in which newly arrived tasks can leverage new "paths" (distinct expert combinations) through the network. Each path's parameters are selectively updated, mitigating interference across tasks.
  • Evolving Prototypes in Federated Learning: Vertical Federated Continual Learning (V-LETO) maintains compact class prototypes at the server, merging new and old task representations and aligning incoming data from distributed clients to these evolving centroids. Such server-side regularization ensures vertical consistency across parties and time (Wang et al., 13 Feb 2025).
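The parameter-efficient expansion idea can be sketched in a few lines of numpy. This is a minimal LoRA-style adapter, not any specific implementation from the cited works; dimensions and initialization scale are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 32, 32, 4

# Frozen backbone weight, standing in for one pre-trained layer
W = rng.normal(size=(d_out, d_in))

# Low-rank adapter: only A and B are trained at the new vertical stage.
# B starts at zero so the adapted layer initially reproduces the backbone.
A = rng.normal(size=(rank, d_in)) * 0.01
B = np.zeros((d_out, rank))

def adapted_forward(x):
    # Backbone output plus the low-rank update; W itself is never modified
    return W @ x + B @ (A @ x)

x = rng.normal(size=d_in)
print(np.allclose(adapted_forward(x), W @ x))  # True at initialization

full_params = d_in * d_out
adapter_params = rank * (d_in + d_out)
print(adapter_params, "adapter parameters vs", full_params, "frozen backbone parameters")
```

Because the backbone weights are untouched, the general-purpose capability is preserved by construction, and each vertical stage adds only `rank * (d_in + d_out)` trainable parameters.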

3. Regularization, Replay, and Mode Connectivity

Guarding against vertical forgetting at each stage has motivated a spectrum of algorithmic techniques:

  • Regularization-based Methods: Elastic Weight Consolidation (EWC)-style quadratic penalties, Fisher Information weighting, and KL-divergence constraints anchor updated parameters to their previous values. For example, in continual adaptation of LLMs, loss functions of the form

$\mathcal{L}(\theta) = \mathcal{L}_{\text{new}}(\theta) + \lambda \sum_{i} F_i (\theta_i - \theta_{i,0})^2$

stabilize key capacities (Shi et al., 2024, Wang et al., 13 Feb 2025).

  • Replay-based Methods: Episodic buffers retaining representative samples or prototypes from earlier vertical stages are interleaved with new data, providing gradient signals that counteract forgetting even when original corpora are inaccessible. This technique is standard in both CPT/DAP for LLMs and in class/feature-incremental federated learning (Shi et al., 2024, Wang et al., 13 Feb 2025).
  • Linear Mode Connectivity: Vertical continuity can be operationalized by enforcing that solutions after new task adaptations remain connected to previous optima via loss-barrier-free linear paths in parameter space. Mirzadeh et al. introduce a MC-SGD algorithm that regularizes each new iterate to maintain linear connectivity with the multitask solution, empirically mitigating forgetting and closing much of the gap to multitask training (Mirzadeh et al., 2020).
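The quadratic EWC-style penalty above is straightforward to compute directly. A minimal numpy sketch with made-up parameter and Fisher values (purely illustrative):

```python
import numpy as np

def ewc_loss(new_loss, theta, theta_prev, fisher, lam):
    """L(theta) = L_new(theta) + lam * sum_i F_i * (theta_i - theta_prev_i)^2"""
    penalty = lam * np.sum(fisher * (theta - theta_prev) ** 2)
    return float(new_loss + penalty)

theta_prev = np.array([1.0, -0.5, 2.0])   # parameters after the previous stage
theta      = np.array([1.2, -0.5, 1.0])   # candidate parameters for the new stage
fisher     = np.array([10.0, 0.1, 0.2])   # Fisher info: estimated importance per weight

total = ewc_loss(0.3, theta, theta_prev, fisher, lam=1.0)
print(total)
```

Note the effect of the Fisher weighting: moving the high-importance first weight by 0.2 contributes more to the penalty (0.4) than moving the low-importance third weight by a full 1.0 (0.2), which is exactly how the anchor protects capacities critical to earlier stages.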

4. Information-Theoretic and Bayesian Perspectives

The theoretical underpinnings of vertical continual learning are increasingly articulated in Bayesian and information-theoretic terms:

  • Hierarchical Posterior Constraints: Continual learning can be formulated as recursive Bayesian inference, where the posterior after $t-1$ tasks forms the prior for task $t$. Hierarchically Structured Task-Agnostic CL generalizes this through the HVCL objective, imposing KL-divergence penalties on both expert routing distributions and expert parameters at each layer, balancing utility against information stability (Hihn et al., 2022):

$\mathcal{L}_{\text{HVCL}}^t = -\sum_{i=1}^{N_t} \mathbb{E}_{p(\Theta)}\left[\mathbb{U}(x_i^t, f_\Theta(x_i^t))\right] + \beta_1 D_{\mathrm{KL}}\left[p_t(m|x)\,\|\,p_{1:t-1}(m|x)\right] + \beta_2 D_{\mathrm{KL}}\left[p_t(\theta|m)\,\|\,p_{1:t-1}(\theta|m)\right]$

  • Rate–Distortion and Information Bottlenecks: The learning-forgetting trade-off is recast as an information bottleneck, maximizing expected utility less mutual information terms between gating/experts and inputs, enabling scalable, bounded-rationality continual learning (Hihn et al., 2022).
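The routing-distribution KL penalty in the HVCL objective can be evaluated numerically for discrete expert-selection distributions. This is a schematic sketch with hypothetical distributions and coefficients, not the HVCL implementation; the expert-parameter KL term is omitted for brevity:

```python
import numpy as np

def kl(p, q):
    """D_KL[p || q] for discrete distributions with strictly positive entries."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

# Hypothetical expert-routing distributions for one input x
p_t     = [0.7, 0.2, 0.1]   # routing after adapting to task t
p_prior = [0.5, 0.3, 0.2]   # aggregated routing from tasks 1..t-1 (the prior)

utility = 1.25              # stand-in value for the expected-utility term
beta1 = 0.1

penalty = beta1 * kl(p_t, p_prior)
objective = -utility + penalty
print(penalty, objective)
```

A routing distribution that drifts far from the prior pays a growing KL cost, so minimizing the objective pushes new tasks to reuse existing expert paths unless the utility gain from a new path outweighs `beta1` times the divergence.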

5. Evaluation Protocols and Empirical Results

Evaluation of vertical continual learning effectiveness relies on accuracy/forgetting metrics adapted to hierarchical adaptation:

  • CL Benchmarks: Standard protocol involves sequential tasks on Split-MNIST, Permuted-MNIST, Split-CIFAR-10/100, and variants thereof. HVCL achieves high vertical continuity, with only marginal drops in test accuracy relative to multitask upper bounds and strong resistance to catastrophic forgetting. MC-SGD exhibits similar behavior, with forgetting rates (average accuracy drop on past tasks) close to zero at only minimal replay buffer costs (Mirzadeh et al., 2020, Hihn et al., 2022).
  • Federated and Vertical Learning Metrics: In VFL, both class-incremental (CIL) and feature-incremental (FIL) regimes are examined. V-LETO outperforms earlier methods by over 10 percentage points in CIL and 35 points in FIL on benchmarks such as CIFAR-10 and CINIC-10, demonstrating reduced forgetting and superior average accuracy (Wang et al., 13 Feb 2025).
  • LLM Specialization Sequences: Empirical studies catalog the impact of continual pre-training, domain-adaptive pre-training, and fine-tuning, using metrics including overall performance (OP), forgetting (F), and forward transfer (FWT). Mixing general and specialized corpora, or adding parameter-efficient expansions, is shown to stem vertical forgetting (Shi et al., 2024).
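Exact metric definitions vary between papers, but a common formulation computes overall performance and average forgetting from a task-accuracy matrix. A sketch with a made-up accuracy matrix:

```python
import numpy as np

def cl_metrics(acc):
    """Continual-learning metrics from an accuracy matrix.

    acc[i, j] = accuracy on task j after training through task i (shape T x T).
    Returns (average final accuracy, average forgetting on past tasks).
    """
    T = acc.shape[0]
    avg_acc = float(np.mean(acc[-1]))  # overall performance (OP)
    # Forgetting: best accuracy ever achieved on each old task minus its final accuracy
    forgetting = float(np.mean([acc[:-1, j].max() - acc[-1, j] for j in range(T - 1)]))
    return avg_acc, forgetting

# Illustrative 3-task run: each row is the accuracy profile after one more task
acc = np.array([
    [0.95, 0.00, 0.00],
    [0.90, 0.93, 0.00],
    [0.85, 0.88, 0.92],
])
avg, fgt = cl_metrics(acc)
print(avg, fgt)
```

A forgetting value near zero with high average accuracy is the regime that methods like MC-SGD and HVCL report; purely sequential fine-tuning without any mitigation typically shows much larger drops in the early columns.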

6. Open Challenges and Research Directions

Several open questions and limitations persist:

  • Task and Objective Heterogeneity: Differences between pre-training (self-supervised) and fine-tuning (supervised, often instruction-based) objectives introduce alignment challenges across vertical stages (Shi et al., 2024).
  • Data Access and Privacy Constraints: Practical regimes often preclude access to upstream data, necessitating proxies or synthetic rehearsal buffers and prompting investigation into privacy-preserving buffer-free methods (Kozal et al., 2022).
  • Architectural and Regularization Tradeoffs: Balancing memory/parameter growth and knowledge retention remains critical, especially in depth-expanding and federated scenarios.
  • Evaluation Standards: Lack of unified multi-level benchmarks and cross-stage metrics hinders standardized comparative assessment.
  • Theoretical Foundations: Precise generalization guarantees and capacity/utility tradeoffs across vertical adaptation phases await further formalization (Shi et al., 2024, Hihn et al., 2022).

A plausible implication is that advances in vertical continual learning will depend on principled combinations of structural modularity, information-theoretic stability, selective replay, and federated consensus mechanisms, tailored to hierarchically complex and privacy-sensitive machine learning deployments.
