Composed Video Retrieval (CoVR)
- CoVR is a specialized vision-language retrieval task combining reference videos with modification texts to locate matching target videos.
- It leverages methods like frozen encoders, multi-modal fusion, and contrastive learning to achieve precise spatial and temporal alignment.
- Benchmark datasets such as WebVid-CoVR and Dense-WebVid-CoVR demonstrate significant gains with advanced pooling and cross-attention strategies.
Composed Video Retrieval (CoVR) is a specialized vision-language retrieval task that seeks to identify, within a large-scale video database, the target video whose content best matches a multi-modal composition: the visual content of a reference query video and a natural language modification describing the intended change. CoVR has evolved rapidly in both methodology and benchmarking, and now encompasses challenges spanning fine-grained spatial alignment, temporal reasoning, dense captioning, modality fusion, and practical scalability.
1. Formal Task Definition and Problem Setup
CoVR is formally specified as follows: Given a reference video $v_r$ and a modification text $t$, the system searches a gallery $\mathcal{G}$ for a target video $v_t$ that conforms to both the content in $v_r$ and the change specified by $t$. The canonical objective is to learn a pair of encoders—a composed query encoder $f_q$ and a video encoder $f_v$—such that the similarity $s\big(f_q(v_r, t),\, f_v(v)\big)$ is maximized
when $v = v_t$ is the desired video. Retrieval is performed by scoring candidates:
$$\hat{v} = \arg\max_{v \in \mathcal{G}} \; s\big(f_q(v_r, t),\, f_v(v)\big),$$
where $s(\cdot,\cdot)$ is typically cosine similarity.
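The scoring step above can be sketched in a few lines. This is a minimal illustration with random placeholder embeddings standing in for the outputs of the composed-query encoder and the video encoder; the embedding dimension and gallery size are arbitrary choices, not values from the cited papers.

```python
import numpy as np

def cosine_sim(query, candidates):
    """Cosine similarity between one query vector and a matrix of candidates."""
    query = query / np.linalg.norm(query)
    candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    return candidates @ query

# Placeholder embeddings: in a real system these would come from
# f_q(reference video, modification text) and f_v(v) for each gallery video.
rng = np.random.default_rng(0)
query_emb = rng.normal(size=256)
gallery_embs = rng.normal(size=(1000, 256))

scores = cosine_sim(query_emb, gallery_embs)
ranked = np.argsort(-scores)  # gallery indices, best match first
best = ranked[0]
```

The ranking produced here is exactly what Recall@K-style evaluation (Section 4) is computed over.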
Variants extend the query to include a detailed language description $d$ of $v_r$, yielding an enriched multi-modal composition $(v_r, d, t)$ (Thawakar et al., 2024). Datasets used include WebVid-CoVR, where each triplet $(v_r, t, v_t)$ captures a distinct composition (Ventura et al., 2023), and Dense-WebVid-CoVR, which employs human-verified, lengthy modification texts and descriptions for fine-grained semantic modeling (Thawakar et al., 19 Aug 2025).
2. Model Architectures and Fusion Strategies
Several architectural principles underlie CoVR frameworks:
- Frozen Visual and Text Encoders: Common backbones are ViT-L (visual) and BERT-base transformer (text). Frames are sampled from videos (typically $15$ for WebVid-CoVR).
- Multi-modal Fusion Modules:
- Pairwise Fusion: Earlier work fuses visual embeddings and text via cross-attention (Ventura et al., 2023).
- Three-way Fusion: Recent frameworks sum embeddings from the reference video $v_r$, its language description $d$, and the modification text $t$, with the text encoder shared across the textual inputs (Thawakar et al., 2024).
- Unified Cross-Attention Encoder: A single transformer block fuses the visual, description, and modification embeddings via cross-attention, outperforming pairwise fusion strategies (Thawakar et al., 19 Aug 2025).
- Hierarchical Alignment: Holistic and atomistic components capture global and fine-grained cross-modal interactions; Q-Former and uncertainty modeling resolve pronoun references and small-object ambiguities (Chen et al., 2 Dec 2025).
- LoRA-Augmented MLLM: Shared multimodal LLM backbone with low-rank adaptation supports corpus-level, moment-level, and composed queries in a unified space (Halbe et al., 17 Jan 2026).
- Multi-stage Cross-Attention: X-Aligner progressively fuses caption, visual, and text editing signals, maintaining pretrained VLM representations (Zheng et al., 23 Jan 2026).
- PREGEN Pooling: Extraction and pooling of hidden states across all VLM layers enables compact compositional embedding, surpassing previous state-of-the-art (Serussi et al., 20 Jan 2026).
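The cross-attention fusion these architectures share can be sketched with a single attention head: text-side tokens query the visual frame tokens and come back enriched with visual context. The random projection matrices below are stand-ins for learned weights, and the token counts (8 text tokens, 15 frames) are illustrative; only the 15-frame sampling matches the WebVid-CoVR setting mentioned above.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d):
    """Single-head cross-attention: text tokens attend over visual tokens.
    Projection matrices are random stand-ins for learned weights."""
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.normal(size=(d, d)) * d**-0.5 for _ in range(3))
    Q = queries @ Wq
    K = keys_values @ Wk
    V = keys_values @ Wv
    attn = softmax(Q @ K.T / np.sqrt(d))  # shape: (n_text, n_visual)
    return attn @ V                       # text tokens enriched with visual context

d = 64
text_tokens = np.random.default_rng(1).normal(size=(8, d))    # modification text
frame_tokens = np.random.default_rng(2).normal(size=(15, d))  # 15 sampled frames
fused = cross_attention(text_tokens, frame_tokens, d)
```

A unified encoder in the sense of Thawakar et al. (19 Aug 2025) would concatenate description and modification tokens on the query side rather than fusing each pair separately.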
3. Training Objectives, Embedding Alignment, and Loss Functions
Most CoVR systems employ contrastive learning over triplets.
- Hard-Negative InfoNCE Loss: Batches of size $B$ are used to generate contrastive pairs, with scores $s_{ij}$ computed for each query-target embedding pair. The loss is typically:
$$\mathcal{L} = -\frac{1}{B} \sum_{i=1}^{B} \log \frac{e^{s_{ii}/\tau}}{\sum_{j=1}^{B} w_{ij}\, e^{s_{ij}/\tau}},$$
where $w_{ij}$ denotes hard-negative weights (with $w_{ii} = 1$), and $\tau$ is the temperature (Thawakar et al., 2024).
- Multi-target Contrastive Loss: Embeddings aligned to three databases—vision-only, text-only, and vision-text—for enhanced discrimination. Learned loss weights (~0.83, 0.08, 0.07) optimize alignment (Thawakar et al., 2024).
- Generalized Contrastive Learning (GCL): Unified loss formulation across all modality pairs within a batch; image, text, and fused modalities are simultaneously optimized, reducing modality gaps (Lee et al., 30 Sep 2025).
- Hierarchical Alignment and Regularization: Holistic-to-atomistic similarity distributions are regularized by KL divergence to ensure semantic coherence (Chen et al., 2 Dec 2025).
- PREGEN Pooling: Layerwise hidden states from frozen VLM, aggregated using a lightweight transformer encoder and MLP, form highly semantic representations for retrieval (Serussi et al., 20 Jan 2026).
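The hard-negative InfoNCE objective above can be sketched as follows. The similarity-proportional weighting used here is a simplified stand-in for the hard-negative weighting schemes in the cited papers, and the temperature and `alpha` values are illustrative.

```python
import numpy as np

def hn_nce_loss(q, v, tau=0.07, alpha=1.0):
    """Hard-negative InfoNCE over a batch of B (composed query, target video)
    embedding pairs. Off-diagonal weights grow with similarity, so harder
    negatives contribute more to the denominator; positives keep weight 1."""
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    s = q @ v.T / tau                  # (B, B) scaled similarity matrix
    w = np.exp(alpha * s)              # hard-negative weights (illustrative)
    np.fill_diagonal(w, 1.0)           # w_ii = 1 for the positive pairs
    log_prob = np.diag(s) - np.log((w * np.exp(s)).sum(axis=1))
    return -log_prob.mean()

rng = np.random.default_rng(0)
loss = hn_nce_loss(rng.normal(size=(4, 32)), rng.normal(size=(4, 32)))
```

A multi-target variant in the sense of Thawakar et al. (2024) would compute this loss against the vision-only, text-only, and vision-text databases and combine the three with learned weights.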
4. Benchmark Datasets and Evaluation Protocols
Key CoVR datasets and their protocols include:
| Dataset | # Triplets | Description Type | Modification Length | Target Evaluation |
|---|---|---|---|---|
| WebVid-CoVR | 1.6M | Short captions | 4.8 words | Manual curation, R@K |
| Dense-WebVid-CoVR | 1.6M | 81-word description | 31 words | Human-verified, R@K |
| EgoCVR | 2,295 | Egocentric action videos | 1.2 GT/query | Temporal, Recall@K, Local |
| TF-CoVR | 180K | Gymnastics/diving | 2–19 words | Multi-GT, mAP@K |
Recall@K is the primary metric for rank-based evaluation. TF-CoVR employs mean Average Precision at cutoff K (mAP@50), favoring robust multi-target retrieval (Gupta et al., 5 Jun 2025).
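Both metrics are simple to compute from a ranked list of gallery ids. The toy ranking and ground-truth ids below are made up for illustration; the AP@K normalization by `min(|GT|, k)` is one common convention and may differ in detail from the TF-CoVR evaluation code.

```python
def recall_at_k(ranked_ids, gt_id, k):
    """1.0 if the single ground-truth target appears in the top-k results."""
    return float(gt_id in ranked_ids[:k])

def average_precision_at_k(ranked_ids, gt_ids, k):
    """AP@k for multi-target retrieval (in the spirit of TF-CoVR's mAP@K)."""
    hits, precision_sum = 0, 0.0
    for rank, vid in enumerate(ranked_ids[:k], start=1):
        if vid in gt_ids:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(len(gt_ids), k) if gt_ids else 0.0

# Toy ranking: gallery ids ordered by predicted score, best first.
ranking = [3, 7, 1, 9, 4]
r1 = recall_at_k(ranking, gt_id=7, k=1)              # 0.0: target not at rank 1
r5 = recall_at_k(ranking, gt_id=7, k=5)              # 1.0: target within top 5
ap = average_precision_at_k(ranking, {7, 9}, k=5)    # (1/2 + 2/4) / 2 = 0.5
```

Reported Recall@K and mAP@K values are these quantities averaged over all evaluation queries.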
5. Empirical Results and Ablation Insights
Recent advances yield significant improvements in R@1 across benchmarks:
| Model / Approach | Dataset | R@1 (%) | Reference |
|---|---|---|---|
| Baseline CoVR-BLIP | WebVid-CoVR | 53.13 | (Ventura et al., 2023) |
| Enriched Context + Multi-target | WebVid-CoVR | 60.12 | (Thawakar et al., 2024) |
| Dense Description + Unified CA Fusion | Dense-WebVid-CoVR | 71.26 | (Thawakar et al., 19 Aug 2025) |
| X-Aligner (BLIP-2 variant) | WebVid-CoVR-Test | 63.93 | (Zheng et al., 23 Jan 2026) |
| HUD (Holistic/Atomistic) | WebVid-CoVR | 63.38 | (Chen et al., 2 Dec 2025) |
| PREGEN (Qwen2.5-VL 7B) | WebVid-CoVR | 99.73 | (Serussi et al., 20 Jan 2026) |
| VIRTUE-Embed 7B | WebVid-CoVR | 55.49 (ZS) | (Halbe et al., 17 Jan 2026) |
| GCL (VISTA backbone) | CoVR Benchmark | 37.52 (ZS) | (Lee et al., 30 Sep 2025) |
Ablation studies show that combining visual, text, and description signals raises recall from ~27–41% (single modality) to above 60%, and dense, human-generated descriptions yield further gains. Unified cross-attention delivers +3.4% over pairwise fusion (Thawakar et al., 19 Aug 2025). PREGEN's layerwise pooling over all VLM layers achieves a recall nearly at the theoretical maximum for curated data (Serussi et al., 20 Jan 2026).
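The layerwise pooling idea behind PREGEN can be sketched as follows. PREGEN aggregates hidden states with a lightweight transformer encoder and MLP; the learned-weight average over layers used here is a deliberately simplified stand-in for that aggregator, and the layer/token/dimension sizes are arbitrary.

```python
import numpy as np

def layerwise_pool(hidden_states, layer_weights=None):
    """Pool per-layer hidden states of a frozen VLM into one embedding.

    hidden_states: array of shape (n_layers, n_tokens, d).
    Tokens are mean-pooled within each layer, then layers are combined
    with (possibly learned) weights and the result is L2-normalized.
    """
    per_layer = hidden_states.mean(axis=1)        # (n_layers, d)
    n_layers = per_layer.shape[0]
    if layer_weights is None:
        layer_weights = np.ones(n_layers) / n_layers
    emb = layer_weights @ per_layer               # (d,)
    return emb / np.linalg.norm(emb)

# Example: 32 layers, 10 tokens, 64-dim hidden states (illustrative sizes).
hidden = np.random.default_rng(0).normal(size=(32, 10, 64))
emb = layerwise_pool(hidden)
```

The intuition is that early, middle, and late VLM layers encode complementary information, so pooling across all of them yields a more compositional embedding than the final layer alone.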
6. Challenges, Extensions, and Open Problems
Multiple axes of complexity drive current research:
- Fine-grained Compositionality: Subtle actions, spatial region selection, temporal order, and pronoun reference require hierarchical fusion, uncertainty modeling, and cross-modal interaction modules (Chen et al., 2 Dec 2025).
- Dense Captioning and Modification: Longer, multi-sentence modifications and detailed video descriptions are necessary for fine semantic control (Thawakar et al., 19 Aug 2025).
- Temporal Reasoning: Benchmarks such as EgoCVR and TF-CoVR emphasize retrieving segments based on subtle action, duration, and event changes. Motion-sensitive video encoders and action-class pretraining are critical (Gupta et al., 5 Jun 2025, Hummel et al., 2024).
- Zero-shot and Cross-domain Generalization: Transfer to image retrieval (CoIR), multi-modal retrieval, and textual-only or frame-only queries demonstrates flexibility. Strategies include synthetic triplet generation, pseudo-labeling, and knowledge integration (Zhang et al., 3 Mar 2025).
- Scalability and Training Efficiency: Frameworks using frozen LLM backbones, LoRA, and lightweight adapters substantially reduce training cost without sacrificing accuracy (Halbe et al., 17 Jan 2026, Serussi et al., 20 Jan 2026).
- Annotation and Data Quality: Auto-generated triplets are noisy (~22% discarded in WebVid-CoVR), requiring filtering and high-quality captioning tools. Dense-WebVid-CoVR addresses this via human verification (Thawakar et al., 19 Aug 2025).
7. Outlook and Future Directions
Current trends point to several research frontiers:
- End-to-end video LLMs for compositional editing and retrieval, capturing interaction between audio, text, and vision (Zheng et al., 23 Jan 2026);
- Interactive CoVR systems capable of multi-turn refinement and iterative query enhancement;
- Joint retrieval and moment localization, grounding composed queries in both retrieval ranking and exact segment boundaries;
- Expansion of fine-grained CoVR beyond appearance-centric or egocentric domains into high-motion, multi-agent domains;
- Unified representations spanning images, videos, and text using generalized contrastive objectives and cross-modal learning (Lee et al., 30 Sep 2025).
Recent empirical findings confirm state-of-the-art performance from PREGEN (Serussi et al., 20 Jan 2026), extensive gain from enriched context and discriminative alignment (Thawakar et al., 2024), robust temporal handling via TF-CoVR-Base (Gupta et al., 5 Jun 2025), and new scaling pathways from LoRA-based models (Halbe et al., 17 Jan 2026). CoVR continues to serve as the core methodological bridge uniting structured compositional search with scalable multimedia understanding in modern video retrieval systems.