Composed Video Retrieval (CoVR)
- CoVR is a specialized vision-language retrieval task combining reference videos with modification texts to locate matching target videos.
- It leverages methods like frozen encoders, multi-modal fusion, and contrastive learning to achieve precise spatial and temporal alignment.
- Benchmark datasets such as WebVid-CoVR and Dense-WebVid-CoVR demonstrate significant gains with advanced pooling and cross-attention strategies.
Composed Video Retrieval (CoVR) is a specialized vision-language retrieval task that seeks to identify, within a large-scale video database, the target video whose content best matches a multi-modal composition: the visual content of a reference query video and a natural language modification describing the intended change. CoVR has evolved rapidly in both methodology and benchmarking, and now encompasses challenges spanning fine-grained spatial alignment, temporal reasoning, dense captioning, modality fusion, and practical scalability.
1. Formal Task Definition and Problem Setup
CoVR is formally specified as follows: Given a reference video $v_r$ and a modification text $t$, the system searches a gallery $\mathcal{G}$ for a target video $v_t$ that conforms to both the content in $v_r$ and the change specified by $t$. The canonical objective is to learn a pair of encoders—a composed query encoder $f_q$ and a video encoder $f_v$—such that the similarity $s\big(f_q(v_r, t),\, f_v(v)\big)$ is maximized
when $v = v_t$ is the desired video. Retrieval is performed by scoring candidates:
$$\hat{v} = \arg\max_{v \in \mathcal{G}} \; s\big(f_q(v_r, t),\, f_v(v)\big),$$
where $s(\cdot,\cdot)$ is typically cosine similarity.
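The scoring step above can be sketched in a few lines. This is a minimal illustration with random placeholder embeddings standing in for the outputs of the composed-query encoder and the video encoder; the embedding dimension and gallery size are arbitrary choices, not values from the cited papers.

```python
import numpy as np

def cosine_sim(query, candidates):
    """Cosine similarity between one query vector and a matrix of candidates."""
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return candidates @ query

# Placeholder embeddings: in a real system these would come from
# f_q(reference video, modification text) and f_v(v) for each gallery video.
rng = np.random.default_rng(0)
query_emb = rng.normal(size=256)
gallery_embs = rng.normal(size=(1000, 256))

scores = cosine_sim(query_emb, gallery_embs)
ranked = np.argsort(-scores)  # gallery indices, best match first
best = ranked[0]
```

The ranking produced here is exactly what Recall@K-style evaluation (Section 4) is computed over.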
Variants extend the query to include a detailed language description $d$ of $v_r$, yielding an enriched multi-modal composition $(v_r, d, t)$ (Thawakar et al., 2024). Datasets used include WebVid-CoVR, where each triplet $(v_r, t, v_t)$ captures a distinct composition (Ventura et al., 2023), and Dense-WebVid-CoVR, which employs human-verified, lengthy modification texts and descriptions for fine-grained semantic modeling (Thawakar et al., 19 Aug 2025).
2. Model Architectures and Fusion Strategies
Several architectural principles underlie CoVR frameworks:
- Frozen Visual and Text Encoders: Common backbones are ViT-L (visual) and BERT-base transformer (text). Frames are sampled from videos (typically $15$ for WebVid-CoVR).
- Multi-modal Fusion Modules:
- Pairwise Fusion: Earlier work fuses visual embeddings and text via cross-attention (Ventura et al., 2023).
- Three-way Fusion: Recent frameworks sum embeddings from the reference video $v_r$, its language description $d$, and the modification text $t$, with the text encoder shared across the textual inputs (Thawakar et al., 2024).
- Unified Cross-Attention Encoder: A single transformer block fuses the visual, description, and modification embeddings via cross-attention, outperforming pairwise fusion strategies (Thawakar et al., 19 Aug 2025).
- Hierarchical Alignment: Holistic and atomistic components capture global and fine-grained cross-modal interactions; Q-Former and uncertainty modeling resolve pronoun references and small-object ambiguities (Chen et al., 2 Dec 2025).
- LoRA-Augmented MLLM: Shared multimodal LLM backbone with low-rank adaptation supports corpus-level, moment-level, and composed queries in a unified space (Halbe et al., 17 Jan 2026).
- Multi-stage Cross-Attention: X-Aligner progressively fuses caption, visual, and text editing signals, maintaining pretrained VLM representations (Zheng et al., 23 Jan 2026).
- PREGEN Pooling: Extraction and pooling of hidden states across all VLM layers enables compact compositional embedding, surpassing previous state-of-the-art (Serussi et al., 20 Jan 2026).
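The cross-attention fusion these architectures share can be sketched with a single attention head: text-side tokens query the visual frame tokens and come back enriched with visual context. The random projection matrices below are stand-ins for learned weights, and the token counts (8 text tokens, 15 frames) are illustrative; only the 15-frame sampling matches the WebVid-CoVR setting mentioned above.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d):
    """Single-head cross-attention: text tokens attend over visual tokens.
    Projection matrices are random stand-ins for learned weights."""
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.normal(size=(d, d)) * d**-0.5 for _ in range(3))
    Q = queries @ Wq
    K = keys_values @ Wk
    V = keys_values @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d))  # shape: (n_text, n_visual)
    return attn @ V                       # text tokens enriched with visual context

d = 64
text_tokens = np.random.default_rng(1).normal(size=(8, d))    # modification text
frame_tokens = np.random.default_rng(2).normal(size=(15, d))  # 15 sampled frames
fused = cross_attention(text_tokens, frame_tokens, d)
```

A unified encoder in the sense of Thawakar et al. (19 Aug 2025) would concatenate description and modification tokens on the query side rather than fusing each pair separately.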
3. Training Objectives, Embedding Alignment, and Loss Functions
Most CoVR systems employ contrastive learning over triplets.
- Hard-Negative InfoNCE Loss: Batches of size $B$ are used to generate contrastive pairs, with scores $s_{ij}$ computed for each query-target embedding pair. The loss is typically:
$$\mathcal{L} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{e^{s_{ii}/\tau}}{\sum_{j=1}^{B} w_{ij}\, e^{s_{ij}/\tau}},$$
where $w_{ij}$ denotes hard-negative weights (with $w_{ii} = 1$), and $\tau$ is the temperature (Thawakar et al., 2024).
- Multi-target Contrastive Loss: Embeddings aligned to three databases—vision-only, text-only, and vision-text—for enhanced discrimination. Learned loss weights (~0.83, 0.08, 0.07) optimize alignment (Thawakar et al., 2024).
- Generalized Contrastive Learning (GCL): Unified loss formulation across all modality pairs within a batch; image, text, and fused modalities are simultaneously optimized, reducing modality gaps (Lee et al., 30 Sep 2025).
- Hierarchical Alignment and Regularization: Holistic-to-atomistic similarity distributions are regularized by KL divergence to ensure semantic coherence (Chen et al., 2 Dec 2025).
- PREGEN Pooling: Layerwise hidden states from frozen VLM, aggregated using a lightweight transformer encoder and MLP, form highly semantic representations for retrieval (Serussi et al., 20 Jan 2026).
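The hard-negative InfoNCE objective above can be sketched as follows. The similarity-proportional weighting used here is a simplified stand-in for the hard-negative weighting schemes in the cited papers, and the temperature and `alpha` values are illustrative.

```python
import numpy as np

def hn_nce_loss(q, v, tau=0.07, alpha=1.0):
    """Hard-negative InfoNCE over a batch of B (composed query, target video)
    embedding pairs. Off-diagonal weights grow with similarity, so harder
    negatives contribute more to the denominator; positives keep weight 1."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    s = q @ v.T / tau                  # (B, B) scaled similarity matrix
    w = np.exp(alpha * s)              # hard-negative weights (illustrative)
    np.fill_diagonal(w, 1.0)           # w_ii = 1 for the positive pairs
    log_prob = np.diag(s) - np.log((w * np.exp(s)).sum(axis=1))
    return -log_prob.mean()

rng = np.random.default_rng(0)
loss = hn_nce_loss(rng.normal(size=(4, 32)), rng.normal(size=(4, 32)))
```

A multi-target variant in the sense of Thawakar et al. (2024) would compute this loss against the vision-only, text-only, and vision-text databases and combine the three with learned weights.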
4. Benchmark Datasets and Evaluation Protocols
Key CoVR datasets and their protocols include:
| Dataset | # Triplets | Description Type | Modification Length | Target Evaluation |
|---|---|---|---|---|
| WebVid-CoVR | 1.6M | Short captions | 4.8 words | Manual curation, R@K |
| Dense-WebVid-CoVR | 1.6M | 81-word description | 31 words | Human-verified, R@K |
| EgoCVR | 2,295 | Egocentric action videos | 1.2 GT/query | Temporal, Recall@K, Local |
| TF-CoVR | 180K | Gymnastics/diving | 2–19 words | Multi-GT, mAP@K |
Recall@K is the primary metric for rank-based evaluation. TF-CoVR employs mean Average Precision at cutoff K (mAP@50), favoring robust multi-target retrieval (Gupta et al., 5 Jun 2025).
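Both metrics are simple to compute from a ranked list of gallery ids. The toy ranking and ground-truth ids below are made up for illustration; the AP@K normalization by `min(|GT|, k)` is one common convention and may differ in detail from the TF-CoVR evaluation code.

```python
def recall_at_k(ranked_ids, gt_id, k):
    """1.0 if the single ground-truth target appears in the top-k results."""
    return float(gt_id in ranked_ids[:k])

def average_precision_at_k(ranked_ids, gt_ids, k):
    """AP@k for multi-target retrieval (in the spirit of TF-CoVR's mAP@K)."""
    hits, precision_sum = 0, 0.0
    for rank, vid in enumerate(ranked_ids[:k], start=1):
        if vid in gt_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(len(gt_ids), k) if gt_ids else 0.0

# Toy ranking: gallery ids ordered by predicted score, best first.
ranking = [3, 7, 1, 9, 4]
r1 = recall_at_k(ranking, gt_id=7, k=1)              # 0.0: target not at rank 1
r5 = recall_at_k(ranking, gt_id=7, k=5)              # 1.0: target within top 5
ap = average_precision_at_k(ranking, {7, 9}, k=5)    # (1/2 + 2/4) / 2 = 0.5
```

Reported Recall@K and mAP@K values are these quantities averaged over all evaluation queries.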
5. Empirical Results and Ablation Insights
Recent advances yield significant improvements in R@1 across benchmarks:
| Model / Approach | Dataset | R@1 (%) | Reference |
|---|---|---|---|
| Baseline CoVR-BLIP | WebVid-CoVR | 53.13 | (Ventura et al., 2023) |
| Enriched Context + Multi-target | WebVid-CoVR | 60.12 | (Thawakar et al., 2024) |
| Dense Description + Unified CA Fusion | Dense-WebVid-CoVR | 71.26 | (Thawakar et al., 19 Aug 2025) |
| X-Aligner (BLIP-2 variant) | WebVid-CoVR-Test | 63.93 | (Zheng et al., 23 Jan 2026) |
| HUD (Holistic/Atomistic) | WebVid-CoVR | 63.38 | (Chen et al., 2 Dec 2025) |
| PREGEN (Qwen2.5-VL 7B) | WebVid-CoVR | 99.73 | (Serussi et al., 20 Jan 2026) |
| VIRTUE-Embed 7B | WebVid-CoVR | 55.49 (ZS) | (Halbe et al., 17 Jan 2026) |
| GCL (VISTA backbone) | CoVR Benchmark | 37.52 (ZS) | (Lee et al., 30 Sep 2025) |
Ablation studies show that combining visual, text, and description signals raises recall from ~27–41% (single modality) to above 60%, and dense, human-generated descriptions yield further gains. Unified cross-attention delivers +3.4% over pairwise fusion (Thawakar et al., 19 Aug 2025). PREGEN's layerwise pooling over all VLM layers achieves a recall nearly at the theoretical maximum for curated data (Serussi et al., 20 Jan 2026).
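The layerwise pooling idea behind PREGEN can be sketched as follows. PREGEN aggregates hidden states with a lightweight transformer encoder and MLP; the learned-weight average over layers used here is a deliberately simplified stand-in for that aggregator, and the layer/token/dimension sizes are arbitrary.

```python
import numpy as np

def layerwise_pool(hidden_states, layer_weights=None):
    """Pool per-layer hidden states of a frozen VLM into one embedding.

    hidden_states: array of shape (n_layers, n_tokens, d).
    Tokens are mean-pooled within each layer, then layers are combined
    with (possibly learned) weights and the result is L2-normalized.
    """
    per_layer = hidden_states.mean(axis=1)        # (n_layers, d)
    n_layers = per_layer.shape[0]
    if layer_weights is None:
        layer_weights = np.ones(n_layers) / n_layers
    emb = layer_weights @ per_layer               # (d,)
    return emb / np.linalg.norm(emb)

# Example: 32 layers, 10 tokens, 64-dim hidden states (illustrative sizes).
hidden = np.random.default_rng(0).normal(size=(32, 10, 64))
emb = layerwise_pool(hidden)
```

The intuition is that early, middle, and late VLM layers encode complementary information, so pooling across all of them yields a more compositional embedding than the final layer alone.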
6. Challenges, Extensions, and Open Problems
Multiple axes of complexity drive current research:
- Fine-grained Compositionality: Subtle actions, spatial region selection, temporal order, and pronoun reference require hierarchical fusion, uncertainty modeling, and cross-modal interaction modules (Chen et al., 2 Dec 2025).
- Dense Captioning and Modification: Longer, multi-sentence modifications and detailed video descriptions are necessary for fine semantic control (Thawakar et al., 19 Aug 2025).
- Temporal Reasoning: Benchmarks such as EgoCVR and TF-CoVR emphasize retrieving segments based on subtle action, duration, and event changes. Motion-sensitive video encoders and action-class pretraining are critical (Gupta et al., 5 Jun 2025, Hummel et al., 2024).
- Zero-shot and Cross-domain Generalization: Transfer to image retrieval (CoIR), multi-modal retrieval, and textual-only or frame-only queries demonstrates flexibility. Strategies include synthetic triplet generation, pseudo-labeling, and knowledge integration (Zhang et al., 3 Mar 2025).
- Scalability and Training Efficiency: Frameworks using frozen LLM backbones, LoRA, and lightweight adapters substantially reduce training cost without sacrificing accuracy (Halbe et al., 17 Jan 2026, Serussi et al., 20 Jan 2026).
- Annotation and Data Quality: Auto-generated triplets are noisy (~22% discarded in WebVid-CoVR), requiring filtering and high-quality captioning tools. Dense-WebVid-CoVR addresses this via human verification (Thawakar et al., 19 Aug 2025).
7. Outlook and Future Directions
Current trends point to several research frontiers:
- End-to-end video LLMs for compositional editing and retrieval, capturing interaction between audio, text, and vision (Zheng et al., 23 Jan 2026);
- Interactive CoVR systems capable of multi-turn refinement and iterative query enhancement;
- Joint retrieval and moment localization, grounding composed queries in both retrieval ranking and exact segment boundaries;
- Expansion of fine-grained CoVR beyond appearance-centric or egocentric domains into high-motion, multi-agent domains;
- Unified representations spanning images, videos, and text using generalized contrastive objectives and cross-modal learning (Lee et al., 30 Sep 2025).
Recent empirical findings confirm state-of-the-art performance from PREGEN (Serussi et al., 20 Jan 2026), extensive gain from enriched context and discriminative alignment (Thawakar et al., 2024), robust temporal handling via TF-CoVR-Base (Gupta et al., 5 Jun 2025), and new scaling pathways from LoRA-based models (Halbe et al., 17 Jan 2026). CoVR continues to serve as the core methodological bridge uniting structured compositional search with scalable multimedia understanding in modern video retrieval systems.