Continual Text-to-Video Retrieval
- CTVR is a continual learning paradigm that incrementally aligns text and video embeddings to handle growing semantic categories and manage catastrophic forgetting.
- State-of-the-art methods such as StructAlign and StableFusion employ geometric priors, temporal attention, and adaptive adapters to preserve cross-modal alignment.
- Benchmark evaluations on MSR-VTT and ActivityNet show that cache-based strategies and tailored loss functions effectively counteract feature drift and maintain retrieval performance.
Continual Text-to-Video Retrieval (CTVR) is the problem of incrementally learning to align and retrieve videos using text queries in a lifelong, task-based learning protocol where new semantic categories are introduced over time. The central challenge is to ensure that the retrieval system remains effective for all previously seen categories while adapting to new ones, thus contending with the phenomena of catastrophic forgetting and feature drift unique to the multimodal and continual setting (Wang et al., 28 Jan 2026, Zhao et al., 13 Mar 2025).
1. Problem Definition and Evaluation Protocols
CTVR formalizes the scenario in which models process a stream of tasks $t = 1, \dots, T$. For each task $t$, only the dataset
$\mathcal{D}_t = \{(q_i, v_i, y_i)\}_{i=1}^{N_t}$
is available, where $q_i$ is a text query, $v_i$ a video, and $y_i$ its semantic class. Datasets from different tasks contain disjoint class labels. The text and video are respectively encoded into $d$-dimensional features intended to lie in a shared cross-modal embedding space.
At any stage $t$, the retrieval objective is: given a text query $q$, retrieve its corresponding video from the union of all videos accumulated across tasks $1, \dots, t$. The principal metric is retrieval quality (Recall@K, Median/Mean Rank); catastrophic forgetting is quantified as Backward Forgetting (BWF), the mean drop in retrieval accuracy on previous tasks after subsequent updates (Wang et al., 28 Jan 2026, Zhao et al., 13 Mar 2025).
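Both metrics are simple to compute from a similarity matrix and a per-task accuracy log. A minimal sketch (function names are illustrative; here positive BWF means forgetting, so negative values indicate improved performance on earlier tasks, matching the sign convention of the results table below):

```python
import numpy as np

def recall_at_k(sim, k):
    """Recall@K for text-to-video retrieval.

    sim: (num_queries, num_videos) similarity matrix where the
    ground-truth video for query i is video i (diagonal convention).
    """
    # Rank of the ground-truth video for each query (0 = best).
    ranks = (sim > np.diag(sim)[:, None]).sum(axis=1)
    return float((ranks < k).mean())

def backward_forgetting(acc):
    """Mean accuracy drop on earlier tasks after all later updates.

    acc[i][j]: accuracy on task j measured right after training task i.
    Returns the mean over tasks j < T of acc[j][j] - acc[T-1][j].
    """
    T = len(acc)
    drops = [acc[j][j] - acc[T - 1][j] for j in range(T - 1)]
    return float(np.mean(drops))
```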
Typical benchmarks for CTVR include MSR-VTT (split into 10 or 20 tasks with 16 shots per class) and ActivityNet (200 categories, similarly partitioned), with systematic evaluation after each task to assess both forward transfer and forgetting (Wang et al., 28 Jan 2026, Zhao et al., 13 Mar 2025).
2. Catastrophic Forgetting and Feature Drift
CTVR exposes specific challenges from classic continual learning compounded by cross-modal alignment:
- Catastrophic Forgetting: New-task learning tends to overwrite parameters, leading to the loss of alignment for previously learned classes.
- Feature Drift: Two types are prominent:
- Intra-modal drift: Feature changes within a single modality (text or video) due to continual updates.
- Non-cooperative drift: Misalignment between modalities—textual and visual features that correspond semantically may diverge across tasks.
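Both drift types can be monitored empirically. A minimal sketch (function names are illustrative, not from the cited papers) measures intra-modal drift as the feature shift within one modality across an update, and cross-modal misalignment as the residual distance between matched text-video pairs:

```python
import numpy as np

def cosine(a, b):
    """Row-wise cosine similarity between two feature matrices."""
    return (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1))

def intra_modal_drift(old_feats, new_feats):
    """Mean cosine distance between one modality's features before/after an update."""
    return float((1.0 - cosine(old_feats, new_feats)).mean())

def cross_modal_misalignment(text_feats, video_feats):
    """Mean deviation from perfect alignment for matched text-video pairs."""
    return float((1.0 - cosine(text_feats, video_feats)).mean())
```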
Conventional Pretrained Model (PTM)-based text-to-video retrieval systems suffer from insufficient plasticity for new tasks, while classic continual learning approaches lack mechanisms to preserve semantic alignment across the modalities and are prone to misalign previously indexed queries and videos (Zhao et al., 13 Mar 2025).
3. Architectural Approaches in CTVR
Two primary state-of-the-art solutions—StructAlign and StableFusion—exemplify contemporary architectural strategies.
StructAlign (Wang et al., 28 Jan 2026):
- Introduces a simplex Equiangular Tight Frame (ETF) geometric prior for category prototypes. In the standard simplex-ETF construction, the $K$ prototypes are the columns of
$\mathbf{P} = \sqrt{\tfrac{K}{K-1}}\, \mathbf{U}\left(\mathbf{I}_K - \tfrac{1}{K}\mathbf{1}_K\mathbf{1}_K^\top\right),$
where $\mathbf{U} \in \mathbb{R}^{d \times K}$ has orthonormal columns; every pair of prototypes then has cosine similarity $-\tfrac{1}{K-1}$, enforcing global, uniform geometric separation.
- For each text token and video frame feature, MLPs project into this prototype space, and prototype-guided attention produces pooled, normalized representations per sample.
- Maintains running category means for old classes. During incremental updates, pseudo-features synthesized from these means are used in alignment to mitigate forgetting in the absence of real data.
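The simplex ETF construction above can be sketched directly; the helper name is illustrative, but the equiangularity property (pairwise cosine exactly $-1/(K-1)$) follows from the algebra:

```python
import numpy as np

def simplex_etf(d, K, seed=0):
    """Simplex ETF prototypes: K maximally separated unit vectors in R^d (d >= K)."""
    rng = np.random.default_rng(seed)
    # Orthonormal basis U (d x K) via reduced QR of a random Gaussian matrix.
    U, _ = np.linalg.qr(rng.standard_normal((d, K)))
    # Centering projector makes all pairwise cosines equal to -1/(K-1).
    M = U @ (np.eye(K) - np.ones((K, K)) / K)
    P = np.sqrt(K / (K - 1)) * M
    return P  # columns are the unit-norm prototypes

P = simplex_etf(d=16, K=5)
G = P.T @ P  # Gram matrix: 1 on the diagonal, -1/(K-1) off-diagonal
```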
StableFusion/FrameFusionMoE (Zhao et al., 13 Mar 2025):
- Embeds a Frame Fusion Adapter (FFA): temporal cross-attention modules are inserted in each layer of the frozen ViT-based CLIP image encoder, fusing the current frame's representation with that of the preceding frame through a learnable weighted combination. Only the adapter weights and fusion scalars are trained; the original encoder weights remain frozen.
- Incorporates a Task-Aware Mixture-of-Experts (TAME) in the CLIP text encoder: LoRA-based expert modules are routed by task prototypes, mitigating drift in textual embeddings across tasks. At test time, gating produces multiple text features (per prototype), with scoring performed for each against cached video embeddings.
Both architectures employ cache-based approaches: after each task, all computed video embeddings are frozen and added to a cumulative database used for retrieval. Historical queries are compared against all previously cached videos, making preservation of alignment imperative.
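The cache-based protocol can be sketched as a cumulative embedding store; this is an illustrative reading of the protocol described above, not code from either paper:

```python
import numpy as np

class VideoCache:
    """Cumulative database of frozen video embeddings, grown task by task."""

    def __init__(self):
        self.embeddings = []   # one (num_videos_t, d) array per task
        self.video_ids = []

    def add_task(self, ids, embs):
        # Embeddings are frozen at the end of the task: stored as-is and
        # never recomputed when the encoder is later updated.
        self.video_ids.extend(ids)
        self.embeddings.append(np.asarray(embs, dtype=float))

    def retrieve(self, text_emb, k=5):
        """Rank ALL cached videos (every task so far) against one text query."""
        bank = np.concatenate(self.embeddings, axis=0)
        bank = bank / np.linalg.norm(bank, axis=1, keepdims=True)
        q = text_emb / np.linalg.norm(text_emb)
        scores = bank @ q
        top = np.argsort(-scores)[:k]
        return [self.video_ids[i] for i in top]
```

Because only latent embeddings are stored, historical raw videos and labels are never needed again, which is the scaling property noted in Section 6.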
4. Learning Objectives and Losses
CTVR models utilize multi-term objectives to counteract forgetting and drift:
1. StructAlign Objectives (Wang et al., 28 Jan 2026):
- Symmetric Contrastive Loss ($\mathcal{L}_{\text{con}}$): standard InfoNCE on text-to-video and video-to-text retrieval.
- Cross-Modal ETF Alignment Loss ($\mathcal{L}_{\text{ETF}}$): enforces alignment of the pooled, normalized features with their class ETF prototypes.
- Cross-Modal Relation Preserving Loss ($\mathcal{L}_{\text{CMRP}}$): penalizes changes in the cross-modal similarity matrix between the current and previous model, anchoring intra-modal features by their cross-modal relations.
- The total objective combines $\mathcal{L}_{\text{con}}$, $\mathcal{L}_{\text{ETF}}$, and (for tasks $t > 1$) $\mathcal{L}_{\text{CMRP}}$, weighted by recommended scalar hyperparameters.
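The symmetric contrastive term is standard bidirectional InfoNCE over an in-batch similarity matrix. A minimal NumPy sketch (the ETF and relation-preserving terms are omitted; the temperature value is illustrative):

```python
import numpy as np

def symmetric_infonce(text_feats, video_feats, temperature=0.05):
    """Symmetric InfoNCE: matched text/video pairs share the same index."""
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    v = video_feats / np.linalg.norm(video_feats, axis=1, keepdims=True)
    logits = (t @ v.T) / temperature          # (B, B) similarity logits
    labels = np.arange(len(t))                # positive pair on the diagonal

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the text-to-video and video-to-text directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```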
2. StableFusion/FrameFusionMoE Objectives (Zhao et al., 13 Mar 2025):
- In-batch cross-modal InfoNCE loss ($\mathcal{L}_{\text{NCE}}$) for retrieval.
- Cross-task loss ($\mathcal{L}_{\text{CT}}$): augments the standard contrastive loss by incorporating cached video embeddings from previous tasks as additional negatives, ensuring new updates do not degrade past retrieval performance.
- The total loss is $\mathcal{L} = \mathcal{L}_{\text{NCE}}$ for the first task and $\mathcal{L} = \mathcal{L}_{\text{NCE}} + \mathcal{L}_{\text{CT}}$ for subsequent tasks, since cached embeddings from earlier tasks only exist once at least one task has completed.
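A minimal sketch of contrastive scoring with cached cross-task negatives; this is an illustrative reading of the cross-task loss, not the paper's exact formulation:

```python
import numpy as np

def cross_task_infonce(text_feats, video_feats, cached_videos, temperature=0.05):
    """InfoNCE where frozen videos from earlier tasks act as extra negatives."""
    def norm(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)

    t, v, c = norm(text_feats), norm(video_feats), norm(cached_videos)
    # Columns: in-batch positives/negatives, then frozen old-task negatives.
    logits = np.concatenate([t @ v.T, t @ c.T], axis=1) / temperature
    labels = np.arange(len(t))    # positive is column i of the in-batch block
    logits = logits - logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[labels, labels].mean()
```

A confusable cached video (one close to a current positive) raises the loss, which is exactly the pressure that keeps new text features discriminative against the old video cache.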
5. Experimental Benchmarks and Results
Both StructAlign and StableFusion models have been evaluated on standard CTVR benchmarks:
| Method | R@1 | R@5 | R@10 | BWF | Params (M) |
|---|---|---|---|---|---|
| StructAlign (MSR-VTT, 10-task) | 25.98 | 46.51 | 57.13 | -0.47 | 33.9 |
| StableFusion (MSR-VTT, 10-task) | 25.87 | N/A | N/A | -0.70 | 46.8 |
| StructAlign (ActNet, 10-task) | 18.26 | 40.31 | 54.68 | -0.68 | 33.9 |
StructAlign consistently displays lower average forgetting, state-of-the-art R@5/R@10, competitive or superior R@1, and reduced trainable parameter counts compared to contemporary baselines including zero-shot CLIP, CLIP4Clip, X-Pool, and CLIP-ViP. StableFusion likewise achieves strong R@1 and near-zero or negative BWF values, indicative of robust retention (Wang et al., 28 Jan 2026, Zhao et al., 13 Mar 2025).
Ablation studies show that, for StructAlign:
- The cross-modal ETF alignment loss is essential for enforcing global prototype separation and cross-modal alignment.
- The cross-modal relation preserving loss reduces intra-modal drift.
- Combined, they yield complementary improvements in performance and stability.
For StableFusion:
- Eliminating FFA leads to major R@1 reductions (8.6 absolute on MSR-VTT-10) and increases BWF.
- Removing TAME, task prototypes, or cross-task loss also degrades retrieval and increases forgetting.
Qualitative analyses for StructAlign reveal that category prototypes after training approximate a uniform equiangular similarity structure, and intra-category diversity remains controlled (Mean Intra-Category Dispersion decreases but avoids collapse).
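Mean Intra-Category Dispersion can be computed as the average cosine distance of each feature to its category centroid; a small sketch under that assumption (the paper's exact definition may differ):

```python
import numpy as np

def mean_intra_category_dispersion(feats, labels):
    """Average cosine distance from each feature to its category centroid."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    total, n = 0.0, 0
    for c in np.unique(labels):
        group = feats[labels == c]
        centroid = group.mean(axis=0)
        centroid = centroid / np.linalg.norm(centroid)
        # 1 - cosine similarity to the (renormalized) centroid.
        total += (1.0 - group @ centroid).sum()
        n += len(group)
    return total / n
```

A value near zero signals collapse (all features of a class identical), so the observation that dispersion decreases without reaching zero indicates controlled but nondegenerate compaction.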
6. Comparisons to Related Paradigms and Practical Considerations
CTVR is distinct from classic continual learning due to cross-modal complications (feature space drift, semantic misalignment) and from static TVR due to the necessity of task-incremental updates. Both StructAlign and StableFusion approaches freeze the majority of large (CLIP) encoder parameters, employing lightweight adapters for plasticity while preserving base alignment (Wang et al., 28 Jan 2026, Zhao et al., 13 Mar 2025).
In both, maintaining a cumulative database of latent video representations (frozen after each task) decouples training from historical raw data or labels, a critical scaling property. Efficient parameterization (33.9M–46.8M trainable parameters—about one-third of full CLIP4Clip) is achieved via adapters (MoE, LoRA) and attention modules.
These frameworks establish core algorithmic principles for future extensible and scalable CTVR systems:
- Geometric priors enforcing cross-modal structure (StructAlign)
- Temporal modeling for video via frame-adaptive attention (StableFusion)
- Modular adapters balancing stability and plasticity
- Cache-based continual evaluation protocols
7. Future Directions and Open Questions
Ongoing directions in CTVR research include the extension to open-vocabulary or open-world settings in which task boundaries and class vocabularies become fluid; adaptation to unsupervised or weakly supervised annotation regimes; the integration of memory-augmented replay for more sophisticated rehearsal; and further investigation of the theoretical properties of geometric priors for lifelong multimodal representation alignment.
A plausible implication is that continued refinement of cross-modal regularization objectives, and modular adaptive architectures, will be central to scalable lifelong retrieval in high-dimensional, multimodal scenarios where both semantics and data distributions shift over time.
References:
- "StructAlign: Structured Cross-Modal Alignment for Continual Text-to-Video Retrieval" (Wang et al., 28 Jan 2026)
- "Continual Text-to-Video Retrieval with Frame Fusion and Task-Aware Routing" (Zhao et al., 13 Mar 2025)