
Video Continual Learning Methods

Updated 8 December 2025
  • Video continual learning methods are frameworks that incrementally learn from streaming video data using both supervised and unsupervised approaches.
  • They employ modular architectures such as adapter-based fusion, prompt-based adaptation, and neuro-inspired mechanisms to mitigate catastrophic forgetting.
  • Memory replay techniques, including compressed buffers and non-parametric clustering, enable sustained performance under high-dimensional, evolving video streams.

Video continual learning methods comprise algorithmic frameworks, architectures, and protocols designed to incrementally acquire and retain knowledge from streaming video data, spanning both supervised and unsupervised regimes. These methods address high-dimensional, temporally correlated inputs, domain shifts, non-IID distributions, sparse or evolving annotation, and catastrophic forgetting. State-of-the-art approaches encompass modular adapter-based fusion, memory-efficient rehearsal, non-parametric clustering, prompt-based adaptation, neuro-inspired mechanisms, and compression-based replay buffers, as reflected in recent literature.

1. Formal Definitions and Problem Settings

Video continual learning aims to build models that learn from streams or sequential batches of videos, adapting to new domains, classes, or queries while minimizing the loss of previously acquired knowledge. Key settings include:

  • Class-incremental: new action or object classes arrive over time and must be recognized alongside earlier ones (e.g., vCLIMB protocols).
  • Domain-incremental: the input distribution shifts across datasets or environments while the task itself is fixed (e.g., sequential VideoQA domains).
  • Task/query-incremental: new task or query types arrive over a shared video corpus (e.g., ViLCo-Bench's egocentric queries).
  • Task-free/online streaming: no task boundaries are provided, and the model adapts clip-by-clip.
  • Unsupervised continual learning: clusters and representations evolve without labels.

Standard metrics include average accuracy over all tasks ($A_T$), forgetting ($F_T$), backward forgetting (BwF), cluster accuracy (unsupervised), recall@k, and efficiency measures (GFLOPs, buffer size, memory footprint).
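
To make these concrete, the sketch below computes $A_T$ and backward forgetting from a task-accuracy matrix; the matrix values and variable names are illustrative, not drawn from any cited paper.

```python
import numpy as np

# acc[i, j] = accuracy on task j after training on task i (illustrative values).
acc = np.array([
    [0.80, 0.00, 0.00],
    [0.74, 0.82, 0.00],
    [0.70, 0.78, 0.85],
])
T = acc.shape[0]

# Average accuracy over all tasks after training on the last task.
A_T = acc[T - 1].mean()

# Backward forgetting: average drop from each task's just-trained accuracy
# to its accuracy after the final task.
BwF = np.mean([acc[j, j] - acc[T - 1, j] for j in range(T - 1)])

print(f"A_T = {A_T:.3f}, BwF = {BwF:.3f}")
```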

2. Architectural Advances and Modularization

Recent continual video learning frameworks leverage frozen or modular backbones to decouple adaptation from large-scale model drift:

  • Dynamic Adapter Merging (DAM): Instantiates dataset-specific adapter modules atop a frozen video-language backbone. Inference merges adapters using non-parametric router weights ($W_M = \sum_t p_t W_t$), promoting cross-domain sharing and robustness under uncertain routing (Cheng et al., 13 Mar 2024); a merging sketch appears after this list.
  • Prompt-Based Adaptation: Methods like PIVOT and Bisecle embed learnable prompts or adapters at selected layers, either spatially, temporally, or cross-modally. This isolates task-specific learning via parameter-efficient modules or contrastive losses, reducing drift on shared weights (Villa et al., 2022, Tan et al., 1 Jul 2025).
  • Affordance-First Decomposition (AFD): Explicitly maps incoming video into "affordance tokens" (time-aligned, interaction-centric representations) that form a stable substrate. Lightweight query-routed schedulers (LoRA adapters with dynamic rank) concentrate all plasticity in task-specific adaptation while the affordance encoding remains stable (Xu et al., 30 Nov 2025).
  • Winning Subnetworks (WSN) and Fourier Subneural Operators (FSO): Selectively prune/activate sparse subnetworks (iteratively learned masks), augmented with spectral-domain FSO for bandwidth-adaptive, reusable video encoding (Kang et al., 2023).
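
The following is a minimal sketch of DAM-style soft adapter merging, assuming per-task adapter tensors of identical shape and a cosine-similarity router over stored task centroids; the router details, temperature, and all names are illustrative assumptions rather than the authors' exact procedure.

```python
import torch
import torch.nn.functional as F

def merge_adapters(query_feat, centroids, adapter_weights, temperature=0.1):
    """Soft-merge task adapters: W_M = sum_t p_t * W_t.

    query_feat:      (d,) embedding of the incoming video/question.
    centroids:       (T, d) one stored feature centroid per dataset/task.
    adapter_weights: list of T tensors, all with identical shapes.
    """
    # Non-parametric router: similarity of the query to each task centroid.
    sims = F.cosine_similarity(query_feat.unsqueeze(0), centroids, dim=-1)
    p = F.softmax(sims / temperature, dim=0)          # router weights p_t

    # Weighted sum of the per-task adapter parameters.
    merged = sum(p_t * W_t for p_t, W_t in zip(p, adapter_weights))
    return merged, p

# Toy usage: 3 tasks, 16-dim features, 16x16 adapter matrices.
d, T = 16, 3
q = torch.randn(d)
cents = torch.randn(T, d)
Ws = [torch.randn(d, d) for _ in range(T)]
W_M, p = merge_adapters(q, cents, Ws)
print(p, W_M.shape)
```

Soft merging degrades gracefully when the router is uncertain: a mis-routed query still receives a mixture dominated by plausible adapters rather than a single wrong one.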

These modular architectures offer scalable adaptation, parameter efficiency (frequently <5% added parameters per task), avoidance of catastrophic forgetting, and robustness to task/domain order.

3. Memory, Replay, and Compression Techniques

Efficient memory management is central to video continual learning, given the dimensionality and duration of input streams:

  • Rehearsal Buffers: Most high-performing methods use small replay buffers, often storing frames or compact codes instead of raw clips (Castagnolo et al., 2023, Villa et al., 2022, Tang et al., 19 Jun 2024). Selection can be driven by diversity (SMILE: one frame per video (Alssum et al., 2023)), confidence (Castagnolo et al., 2023), or motion salience.
  • Compression-Based Memory: CRAM continuously compresses video clips into vector-quantized codes via an online-trained autoencoder. To prevent code drift, the method "refreshes" buffer codes after each compressor update by decoding with the old decoder and re-encoding with the new encoder (Mall et al., 7 Aug 2025).
  • Non-Parametric Memory: Unsupervised methods apply kernel density estimation (KDE) to learn video feature clusters, maintaining a buffer per cluster with FIFO discipline. Novelty in streaming data is detected via the minimum distance to existing clusters or RBF-classifier softmax thresholds (Kurpukdee et al., 29 Aug 2025); a simplified sketch appears after this list.
  • Task-Agnostic and Question-Only Replay: Some frameworks (AFD) opt not to store past video frames, but only past queries for teacher distillation; this is memory- and privacy-efficient (Xu et al., 30 Nov 2025).
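
As referenced above, a simplified sketch of the non-parametric memory idea follows: nearest-cluster assignment with per-cluster FIFO buffers and distance-thresholded novelty detection. It stands in for the full KDE machinery; the class name, threshold, and buffer size are illustrative.

```python
import numpy as np
from collections import deque

class NonParametricMemory:
    """Per-cluster FIFO buffers with distance-based novelty detection."""

    def __init__(self, buffer_size=64, novelty_threshold=1.5):
        self.buffers = []            # one FIFO deque of features per cluster
        self.buffer_size = buffer_size
        self.novelty_threshold = novelty_threshold

    def centers(self):
        return np.stack([np.mean(buf, axis=0) for buf in self.buffers])

    def observe(self, feat):
        if not self.buffers:
            self.buffers.append(deque([feat], maxlen=self.buffer_size))
            return 0, True
        dists = np.linalg.norm(self.centers() - feat, axis=1)
        k = int(np.argmin(dists))
        if dists[k] > self.novelty_threshold:       # novel cluster detected
            self.buffers.append(deque([feat], maxlen=self.buffer_size))
            return len(self.buffers) - 1, True
        self.buffers[k].append(feat)                # FIFO eviction if full
        return k, False

# Toy stream: two well-separated feature modes yield two clusters.
rng = np.random.default_rng(0)
mem = NonParametricMemory()
for x in np.concatenate([rng.normal(0, 0.1, (50, 8)),
                         rng.normal(5, 0.1, (50, 8))]):
    mem.observe(x)
print(len(mem.buffers), "clusters")
```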

Compressed buffers enable storage of up to $10^5\times$ more videos under a fixed memory budget (e.g., on the order of 2 GB for $>10^7$ clips in CRAM), closing performance gaps to joint-training upper bounds (Mall et al., 7 Aug 2025).
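
A minimal sketch of the compression idea behind such buffers, assuming a fixed codebook for clarity (CRAM trains its compressor online and refreshes stored codes, which is omitted here); sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative codebook; in CRAM this would be learned online.
K, d = 256, 32                       # codebook size, code dimension
codebook = rng.normal(size=(K, d)).astype(np.float32)

def quantize(latents):
    """Map each latent vector to its nearest codebook index (uint8)."""
    d2 = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1).astype(np.uint8)

def dequantize(codes):
    """Reconstruct approximate latents for replay."""
    return codebook[codes]

# A "clip" as 16 latent vectors: stored as 16 bytes instead of 16*32 floats,
# a ~128x reduction before any further entropy coding.
clip_latents = rng.normal(size=(16, d)).astype(np.float32)
codes = quantize(clip_latents)
recon = dequantize(codes)
print(codes.nbytes, "bytes stored vs", clip_latents.nbytes, "raw")
```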

4. Continual Learning Protocols and Training Dynamics

Protocols vary across domains but commonly include sequential task arrival, periodic evaluation (both on current and all prior tasks), and explicit mechanisms to curtail catastrophic forgetting:

  • Frozen Backbone/Adapter Training: Most modern methods freeze core parameters (CLIP, ViT, LLM, etc.), training only minimal adapters/prompts (Cheng et al., 13 Mar 2024, Villa et al., 2022); a training-loop sketch appears after this list.
  • Task-Specific Initialization and Isolation: Adapters or subnetworks are initialized from prior tasks or a pre-trained distribution; previous modules are frozen once a new task is learned.
  • Soft Merging and Routing: Inference procedures employ soft assignment/merging of task-specific adapters or dynamic routing over modules, rather than hard selection, sharing knowledge and mitigating router errors (Cheng et al., 13 Mar 2024, Xu et al., 30 Nov 2025).
  • Contrastive or Regularization Losses: Binding losses, contrastive prompt separation, EWC/MAS-style parameter regularizers, and self-supervised narration alignment are deployed to maintain domain invariance and reduce forgetting (Tan et al., 1 Jul 2025, Tang et al., 19 Jun 2024, Nazemi et al., 2023).
  • Optimization Dynamics: Frame-to-frame correlation in streaming video necessitates careful tuning; zero-momentum RMSProp, gradient aggregation, and constant learning rates strike the balance between adaptation and generalization (Carreira et al., 2023).
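
As noted in the first bullet, the dominant training pattern freezes the backbone and optimizes only a small residual adapter and head; a minimal PyTorch sketch under those assumptions (toy dimensions, a generic MLP standing in for the real backbone) follows.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: a frozen backbone plus a small trainable adapter.
backbone = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512))
adapter = nn.Sequential(nn.Linear(512, 32), nn.ReLU(), nn.Linear(32, 512))
head = nn.Linear(512, 10)

# Freeze the backbone so continual training touches only adapter + head.
for p in backbone.parameters():
    p.requires_grad_(False)

params = list(adapter.parameters()) + list(head.parameters())
opt = torch.optim.AdamW(params, lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(8, 512)              # stand-in for pooled video features
y = torch.randint(0, 10, (8,))

with torch.no_grad():                # backbone runs in inference mode
    feats = backbone(x)
logits = head(feats + adapter(feats))   # residual adapter on frozen features
loss = loss_fn(logits, y)
opt.zero_grad()
loss.backward()
opt.step()
```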

Some methods (CPL) employ generative replay by producing synthetic sequence rollouts for past tasks, leveraging mixture-of-Gaussians latent priors and non-parametric inference for task identification (Chen et al., 2022).
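
A toy sketch of that idea, assuming a two-task mixture-of-Gaussians latent prior: replay latents are sampled from a past task's component (and would then be decoded into synthetic rollouts), while task identity is inferred from component responsibilities. Means, variance, and function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative mixture-of-Gaussians latent prior: one component per past task.
means = np.array([[-2.0, 0.0], [2.0, 0.0]])     # task-specific latent means
sigma = 0.5

def sample_replay_latents(task_id, n):
    """Draw latents for generative replay of a given past task."""
    return rng.normal(means[task_id], sigma, size=(n, 2))

def infer_task(z):
    """Non-parametric task inference: pick the most responsible component."""
    log_p = -((z - means) ** 2).sum(axis=1) / (2 * sigma ** 2)
    return int(np.argmax(log_p))

z = sample_replay_latents(task_id=0, n=4)       # would be decoded to frames
print([infer_task(zi) for zi in z])             # mostly 0 for task-0 latents
```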

5. Empirical Results, Benchmarks, and Analytical Insights

Benchmarks are established on large-scale datasets with uniform or domain/task splits, e.g., UCF101, Kinetics-700/400, ActivityNet, Ego4D, CLVOS23, and ViLCo-Bench. Notable results:

| Method | VideoQA Avg Acc (%) | Forgetting (%) | Domain/Protocol |
| --- | --- | --- | --- |
| DAM (Cheng et al., 13 Mar 2024) | 50.2 | 2.3 | 6 VidQA datasets, rehearsal-free |
| AFD (Xu et al., 30 Nov 2025) | 51.6 | 1.8 | 6 VideoQA domains (domain-incremental) |
| Bisecle (Tan et al., 1 Jul 2025) | 49.4 | 2.7 | NExT-QA, DramaQA, STAR (VideoQA) |
| ViLCo-Bench (Tang et al., 19 Jun 2024) | 26.2 (MQ R@1@0.5) | 2.9 (BwF) | Egocentric video, query-incremental |
| CRAM (Mall et al., 7 Aug 2025) | 46.5 | 5.5 | Kinetics-700 (compressed buffer, pretrain) |

Ablations consistently show that modularity (adapters/prompts), memory-efficient replay, dynamic routing, self-supervised alignment, and stability-focused losses are critical for retaining prior knowledge, preserving generalization, and resisting catastrophic forgetting under domain-order variability.

SMILE demonstrates that, under severe memory constraints, storing one frame per video yields higher accuracy than multi-frame or full-clip storage (up to +21.49pp on Kinetics) (Alssum et al., 2023). In unsupervised clustering, uVCL-KDE achieves up to 93% continual cluster accuracy (UCF101) without labels (Kurpukdee et al., 29 Aug 2025).

6. Domain-Specific Applications and Extensions

  • Video Object Segmentation (VOS): Online methods deploy memory-efficient regularizers (gated, reconstruction-based) to mitigate representation drift and catastrophic forgetting across long video sequences (Nazemi et al., 2023).
  • Streaming Object Detection: Label-efficient protocols couple fast/slow learners, EMA consolidation, replay, and pseudo-labeling to perform detection with under 25% annotation, outperforming fully supervised sequence learners (Wu et al., 2022); an EMA-consolidation sketch appears after this list.
  • Action/Class Recognition: Benchmarks (vCLIMB) define protocols for frame-budgeted memory, temporal consistency regularization, and uniform class splits, exposing vulnerabilities to background frames, instance selection, and task length (Villa et al., 2022).
  • Video-Language Tasks: AFD, Bisecle, and ViLCo set new standards for continual VideoQA and multimodal reasoning, integrating affordance-based representations, prompt separation, and self-supervised alignment (Xu et al., 30 Nov 2025, Tan et al., 1 Jul 2025, Tang et al., 19 Jun 2024).
  • Generative Models: VidCLearn introduces a continual learning loop atop diffusion-based text-to-video generation, pairing student-teacher distillation, generative replay, temporal-consistency losses, and retrieval-based structural guidance (Zanchetta et al., 21 Sep 2025).
  • Surveillance and Anomaly Detection: Transfer learning + kNN/cusum detectors are employed for real-time, online anomaly localization with continual expansion of the nominal buffer; these systems avoid weight drift entirely (Doshi et al., 2020).
  • Video Prediction: CPL and other frameworks exploit mixture-of-Gaussians priors, predictive replay, and nonparametric task inference for seamless adaptation in non-stationary physical environments (Chen et al., 2022, Campo et al., 2020).
  • Unsupervised Representation: Stochastic-coherence attention, contrastive spatial/temporal losses, and sparse human supervision produce open-set, class-incremental, pixel-wise labeling agents in long video streams (Tiezzi et al., 2022).
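
As referenced in the streaming-detection bullet above, fast/slow consolidation typically maintains a slow learner as an exponential moving average (EMA) of the fast learner's weights; a minimal sketch with illustrative models and decay follows.

```python
import torch
import torch.nn as nn

# Hypothetical fast/slow learners: the slow model is an EMA of the fast one.
fast = nn.Linear(128, 10)
slow = nn.Linear(128, 10)
slow.load_state_dict(fast.state_dict())

@torch.no_grad()
def ema_consolidate(fast_model, slow_model, decay=0.999):
    """slow <- decay * slow + (1 - decay) * fast, parameter by parameter."""
    for p_s, p_f in zip(slow_model.parameters(), fast_model.parameters()):
        p_s.mul_(decay).add_(p_f, alpha=1 - decay)

# Called after each fast-learner update on the stream; the slow learner
# accumulates a stable consensus that resists transient drift.
ema_consolidate(fast, slow)
```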

7. Limitations, Open Questions, and Future Directions

Current limitations include scalability beyond roughly 10 domains/tasks, routing robustness under subtle domain shifts, integration of generative replay into parameter-efficient architectures, handling truly task-free scenarios, and extension to broader modalities (audio, embodied AI). Substrate stability versus plasticity allocation, more sophisticated merging (e.g., Fisher-weighted), dynamic replay buffers, and compression-aware continual adaptation remain ripe for study. Task-order robustness, interpretability, and minimal-replay tradeoffs continue to challenge the field.

In sum, video continual learning methods have advanced toward highly modular, memory- and compute-efficient adaptation, leveraging frozen backbones, adapters, prompts, or clustered representations, automated selection/replay, and robust regularization—yielding state-of-the-art “forget-free” continual learning in high-dimensional, multi-domain video streams (Cheng et al., 13 Mar 2024, Xu et al., 30 Nov 2025, Castagnolo et al., 2023, Alssum et al., 2023, Mall et al., 7 Aug 2025).
