Audio-Visual Continual Learning
- Audio-visual continual learning is an approach where models incrementally update their knowledge from streaming multimodal data while avoiding catastrophic forgetting.
- It employs techniques like patch-based alignment, class-token distillation, and prompt tuning to preserve cross-modal relationships and manage resource constraints.
- Applications span retrieval, sound separation, segmentation, and multi-task reasoning, with metrics such as recall, mean IoU, and SDR demonstrating performance gains.
Audio-visual continual learning addresses the problem of incrementally acquiring knowledge from streaming, multimodal sources without catastrophic forgetting or reliance on full retraining. It encompasses a spectrum of tasks—including retrieval, classification, segmentation, multi-task reasoning, and sound separation—within frameworks that must preserve cross-modal semantics, efficiently manage memory, and maintain adaptability to new classes, domains, or modalities.
1. Problem Formulation and Core Challenges
Audio-visual continual learning is typically formalized as a sequence of tasks, each containing samples composed of aligned audio–video (and often label) tuples (a, v, y), arriving without joint access to prior data. The model must update its knowledge for new tasks while retaining competence on earlier ones. Key challenges include:
- Sparse spatio-temporal correspondence: Fine-grained correlations between audio and video are localized both spatially and temporally and frequently vary between training steps (Lee et al., 2023).
- Catastrophic forgetting: Overwriting or drift of previously learned cross-modal representations and associations is typical when models adapt to new data or classes (Yin et al., 29 Jul 2025, Mo et al., 2023).
- Multi-modal entanglement and semantic drift: In fine-grained tasks such as segmentation, modal semantic drift can cause sounding objects to be misclassified or relegated to the background in later tasks, and frequently co-occurring classes are especially prone to confusion with one another (Hong et al., 20 Oct 2025).
- Resource constraints: Memory or computational limitations threaten scalability; rehearsal or storage should be minimized wherever possible.
The generic goal is to maximize accuracy or task-relevant metrics (e.g., retrieval recall, mean IoU for segmentation, or SDR/SIR/SAR for sound separation) under strict memory or rehearsal constraints, with minimal performance decay (forgetting) on past tasks.
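To make these evaluation targets concrete, the following minimal sketch (illustrative, not taken from any of the cited works) computes the two quantities most papers report: mean performance after the final task and average forgetting, from a matrix `acc[i, j]` holding the task-`j` metric measured after training on task `i`. All names and the example numbers are hypothetical.

```python
import numpy as np

def average_accuracy(acc: np.ndarray) -> float:
    """Mean metric over all tasks, measured after training on the last task.

    acc[i, j] = metric on task j (recall, mIoU, SDR, ...) right after
    finishing training on task i; shape (T, T), upper triangle unused.
    """
    T = acc.shape[0]
    return float(acc[T - 1].mean())

def average_forgetting(acc: np.ndarray) -> float:
    """Average drop from each old task's best historical score to its final score."""
    T = acc.shape[0]
    drops = [max(acc[:T - 1, j].max() - acc[T - 1, j], 0.0) for j in range(T - 1)]
    return float(np.mean(drops)) if drops else 0.0

# Toy example with three sequential tasks.
acc = np.array([[0.80, 0.00, 0.00],
                [0.72, 0.78, 0.00],
                [0.70, 0.74, 0.81]])
print(average_accuracy(acc))    # 0.75
print(average_forgetting(acc))  # 0.07
```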
2. Architectures and Learning Frameworks
Representative frameworks for audio-visual continual learning can be categorized as follows:
- Patch- or token-based cross-modal alignment: Methods such as STELLA employ spatio-temporally localized alignment using cross-attention between audio and video patches, where patch importance and replay-guided correlation metrics dictate selective memory replay to prevent semantic drift (Lee et al., 2023); a minimal sketch of this style of patch-level cross-attention follows this list.
- Class-token grouping and distillation: CIGN leverages learnable audio-visual class tokens for semantic aggregation and class-token distillation loss to maintain class-wise semantic integrity, combined with continual grouping via transformer self-attention (Mo et al., 2023).
- Prompt/adaptor tuning: Progressive methods like PHP inject adapters and prompts at increasing depths in pre-trained backbones (e.g., frozen CLIP/CLAP), balancing stability and plasticity by selectively freezing shared parameters and dynamically generating prompts for new tasks (three-stage: task-shared, task-specific-modality-shared, and modality-independent) (Yin et al., 29 Jul 2025).
- Collision-based rehearsal: Specialized frameworks for segmentation such as CMR tackle semantic drift and confusion through multi-modal sample selection—selecting cross-modally consistent exemplars—and collision-based rehearsal, where rehearsal sample frequency is biased towards classes at highest risk of confusion (Hong et al., 20 Oct 2025).
- Contrastive multi-modal alignment with semantic regularization: COMM aligns various modalities with text representations, employing cross-modal knowledge aggregation weighted by learned relevance networks and self-regularization within modalities to mitigate drift (Jin et al., 11 Mar 2025).
- Biologically inspired models: AVIM implements multi-compartment Hodgkin–Huxley neurons and calcium-based synaptic tagging and capture mechanisms for integration and concept formation, achieving continual learning sans explicit replay by leveraging localized plasticity (Chen et al., 2020).
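As a concrete illustration of the patch-level alignment idea in the first bullet above, here is a minimal, hypothetical sketch of cross-attention between audio and video patch tokens, where the attention mass each video patch receives is reused as an importance score that a replay buffer could use to decide which patches to keep. This is not the STELLA implementation; tensor shapes, names, and the top-k selection are assumptions.

```python
import torch
import torch.nn.functional as F

def cross_modal_patch_importance(video_tokens: torch.Tensor,
                                 audio_tokens: torch.Tensor,
                                 temperature: float = 0.07):
    """Score video patches by how strongly audio patches attend to them.

    video_tokens: (B, Nv, D) patch embeddings from a video encoder.
    audio_tokens: (B, Na, D) patch embeddings from an audio encoder.
    Returns audio-aligned video context and a (B, Nv) importance score per video patch.
    """
    q = F.normalize(audio_tokens, dim=-1)          # queries: audio patches
    k = F.normalize(video_tokens, dim=-1)          # keys: video patches
    attn = torch.softmax(q @ k.transpose(1, 2) / temperature, dim=-1)  # (B, Na, Nv)
    attended = attn @ video_tokens                 # (B, Na, D) video context per audio patch
    importance = attn.mean(dim=1)                  # (B, Nv) average attention received
    return attended, importance

# Toy usage: keep only the most audio-relevant video patches for replay.
B, Nv, Na, D = 2, 16, 8, 32
video = torch.randn(B, Nv, D)
audio = torch.randn(B, Na, D)
_, scores = cross_modal_patch_importance(video, audio)
topk_idx = scores.topk(k=4, dim=-1).indices        # indices of patches worth storing
```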
3. Algorithms for Knowledge Preservation and Cross-Modal Reasoning
Modern approaches deploy a combination of the following mechanisms for balancing plasticity and stability:
- Rehearsal buffers and selective memory: Storage of selected (often cross-modally salient) past data for replay (as in STELLA's patch selection (Lee et al., 2023) or CIGN's class-wise buffers (Mo et al., 2023)) mitigates catastrophic forgetting.
- Cross-modal distillation constraints: ContAV-Sep introduces a Cross-modal Similarity Distillation Constraint (CrossSDC), a contrastive and distillation-based regularizer maintaining instance-level and class-level audio-visual semantic similarity across incremental tasks (Pian et al., 5 Nov 2024); a generic sketch of this kind of similarity distillation appears after this list.
- Prompt pooling and dynamic generation: PHP's TMDG and TMI stages synthesize task-aware prompts that encode priors from earlier learned knowledge, freezing parameters to preserve homeostasis while adapting new modules for plasticity (Yin et al., 29 Jul 2025).
- Class-token distillation and continual grouping: Class tokens are aligned to their historical distributions, and contrastive grouping loss further safeguards embedding integrity for previously learned categories (Mo et al., 2023).
- Sample selection and collision-based rehearsal: MSS and CSR modules in CMR select cross-modally consistent samples and focus rehearsal frequency on confusable classes, optimizing memory utilization for robust segmentation performance (Hong et al., 20 Oct 2025).
- Intra-modal and cross-modal regularization: COMM incorporates intra-modal prompt aggregation to constrain within-modality drift and relevance-based cross-modal prompt aggregation to exploit joint knowledge across modalities (Jin et al., 11 Mar 2025).
- Biophysical synaptic stabilization: AVIM's concept abstraction and synaptic tagging ensure lifelong stability of multimodal representations via coincidence detection at dendritic compartments (Chen et al., 2020).
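To illustrate the distillation-style constraints above, the following sketch renders the instance-level idea behind a cross-modal similarity distillation loss: the audio-to-video similarity distribution of the current model is pulled toward that of a frozen snapshot of the previous-task model. This is a generic, assumed rendering of the concept described for CrossSDC, not the authors' code; all names and the temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def cross_modal_similarity_distillation(audio_new: torch.Tensor,
                                        video_new: torch.Tensor,
                                        audio_old: torch.Tensor,
                                        video_old: torch.Tensor,
                                        tau: float = 0.1) -> torch.Tensor:
    """KL-match the new model's audio->video similarity distribution to the old model's.

    *_new: (B, D) embeddings from the model being trained on the current task.
    *_old: (B, D) embeddings from a frozen snapshot taken before this task.
    """
    def sim_logits(a, v):
        return F.normalize(a, dim=-1) @ F.normalize(v, dim=-1).t() / tau  # (B, B)

    p_old = F.softmax(sim_logits(audio_old, video_old), dim=-1).detach()
    log_p_new = F.log_softmax(sim_logits(audio_new, video_new), dim=-1)
    return F.kl_div(log_p_new, p_old, reduction="batchmean")
```

In practice such a term would be added to the main task loss (e.g., separation or classification) with a tuned weight.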
4. Task Domains and Evaluation Protocols
Audio-visual continual learning extends to diverse tasks:
- Retrieval and classification: Zero-shot and fine-tuned retrieval of cross-modal pairs, event localization, and mean recall metrics are standard (Lee et al., 2023, Mo et al., 2023).
- Sound separation: The continual audio-visual sound separation task is evaluated via SDR, SIR, and SAR, typically under memory-based rehearsal settings. CrossSDC-based methods in ContAV-Sep outperform all baselines on incremental mixtures and maintain old-class separation fidelity (Pian et al., 5 Nov 2024).
- Segmentation: The CAVS paradigm addresses class-incremental segmentation with evaluation on class-wise, overall, and old/new class mean IoU (Hong et al., 20 Oct 2025); a small example of the old/new mIoU split is sketched after this list.
- Multi-task reasoning: PHP demonstrates robust retention and transfer across event localization, parsing, QA, and segmentation datasets, maintaining high accuracy and low forgetting in varied permutations of task sequences (Yin et al., 29 Jul 2025).
- General multimodal streams: COMM validates on mixed sequences of image, video, audio, depth, and text, reporting average incremental accuracy, per-task retention, and forgetting across arbitrary task orders and modal configurations (Jin et al., 11 Mar 2025).
- Biological simulation: AVIM evaluates continual learning paradigms on object/image datasets via class-incremental, few-shot protocols simulating lifelong exposure (Chen et al., 2020).
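For the class-incremental segmentation protocol above, the headline numbers are per-class IoU averaged separately over old (previously learned) and new (current-task) classes. A minimal sketch under the assumption of integer-valued label maps follows; it is illustrative rather than the CAVS evaluation code, and the helper names are hypothetical.

```python
import numpy as np

def per_class_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> np.ndarray:
    """IoU for each class over integer label maps (NaN for classes absent from both)."""
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious[c] = inter / union
    return ious

def old_new_miou(pred, gt, old_classes, new_classes, num_classes):
    """Mean IoU reported separately for previously learned vs. newly added classes."""
    ious = per_class_iou(pred, gt, num_classes)
    old = np.nanmean(ious[list(old_classes)]) if old_classes else float("nan")
    new = np.nanmean(ious[list(new_classes)]) if new_classes else float("nan")
    return old, new
```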
5. Comparative Results and Ablation Insights
Published results establish substantial gains over single-modality and prior prompt-based continual learning baselines:
- STELLA: +3.69%p relative recall gain in retrieval, and ≈45% reduction in GPU memory versus explicit rehearsal strategies (Lee et al., 2023).
- PHP: Best overall mean accuracy, lowest forgetting, and positive cross-task transfer, outperforming all tested baselines (Yin et al., 29 Jul 2025).
- CIGN: Sets state-of-the-art accuracy and lowest forgetting (e.g., 70.1% AV accuracy, 2.6% forgetting on VGGSound-Instruments), exceeding DualPrompt and token-based approaches (Mo et al., 2023).
- CMR: Delivers up to +14.2 mIoU gain over single-modal segmentation, with both sample selection and collision-based rehearsal contributing additive improvements (Hong et al., 20 Oct 2025).
- COMM: Achieves +10.87% incremental accuracy gain and +3.89% reduction in forgetting versus prompt baselines, robust in both modality-specific and modality-agnostic conditions (Jin et al., 11 Mar 2025).
- AVIM: Matches or exceeds replay-based continual learning in low-sample, high-class incremental settings, enabling biologically plausible replay-free stability (Chen et al., 2020).
- ContAV-Sep: Surpasses all baselines on continual sound separation; +0.2–0.3 dB SDR improvement and strong retention of old class separation regardless of rehearsal buffer size (Pian et al., 5 Nov 2024).
Ablations reveal that combinations of modular mechanisms (e.g., token distillation + grouping, or multi-stage prompt tuning) consistently outperform any single regularization or rehearsal strategy. Increased memory or rehearsal further improves retention but is not strictly necessary for robust anti-forgetting when cross-modal constraints are used. Some methods (PHP, AVIM) dispense with rehearsal entirely via architectural or synaptic regularization.
6. Open Issues, Limitations, and Prospects
Despite advances, several challenges remain:
- Memory constraints: Most frameworks rely on explicit rehearsal buffers; minimizing or eliminating such dependence is a key direction (Pian et al., 5 Nov 2024, Mo et al., 2023).
- Stability–plasticity trade-offs: Longer or deeper prompts can improve stability but may impede transfer, with ablation studies indicating the optimal balance depends both on prompt parametrization and architectural insertion point (Yin et al., 29 Jul 2025).
- Modality and domain gaps: Adaptive relevance-weighted cross-modal aggregation (as in COMM) is necessary to avoid biased alignment, especially for modalities that arrive in disparate temporal blocks (Jin et al., 11 Mar 2025).
- Object detection reliability: Sound separation frameworks depend critically on external detectors (Detic), and missed detections degrade performance (Pian et al., 5 Nov 2024).
- Biological realism: While AVIM demonstrates plausible lifelong integration, its use of engineered codes and restrictive schedules suggests that unsupervised or language-based supervisory streams warrant further investigation (Chen et al., 2020).
- Scalability: CAVS and similar segmentation approaches highlight the difficulty of scaling cross-modal rehearsal as class granularity grows (Hong et al., 20 Oct 2025).
- Generalization: Extension to open-domain mixtures, multi-source compositions, and joint optimization of multiple perception tasks remain active frontiers.
A plausible implication is that future progress may integrate generative replay, modular regularization, adaptive cross-modal attention, and biologically motivated stabilization mechanisms, while reducing explicit reliance on old-class storage or manually annotated object/sound tracks.
7. References to Foundational Works
- STELLA: Continual Audio-Video Pre-training with Spatio-Temporal Localized Alignment (Lee et al., 2023)
- Progressive Homeostatic and Plastic Prompt Tuning for Audio-Visual Multi-Task Incremental Learning (Yin et al., 29 Jul 2025)
- Class-Incremental Grouping Network for Continual Audio-Visual Learning (Mo et al., 2023)
- Taming Modality Entanglement in Continual Audio-Visual Segmentation (Hong et al., 20 Oct 2025)
- Continual Learning for Multiple Modalities (Jin et al., 11 Mar 2025)
- A Biologically Plausible Audio-Visual Integration Model for Continual Learning (Chen et al., 2020)
- Continual Audio-Visual Sound Separation (Pian et al., 5 Nov 2024)