Dense-WebVid-CoVR: Compositional Video Retrieval
- The paper introduces a large-scale benchmark with 1.6M triplets enriched by dense modifications (avg 31 words) and detailed video descriptions to advance fine-grained retrieval.
- It employs a unified cross-attention fusion model that integrates visual and textual embeddings, achieving a Recall@1 of 71.3%, a 3.4pp improvement over previous methods.
- Dense-WebVid-CoVR offers full-split evaluations and robust ablations, supporting diverse use cases in video editing, surveillance, and educational content selection.
Dense-WebVid-CoVR defines a large-scale benchmark for composed video retrieval, with 1.6 million triplets annotated with dense, fine-grained modification instructions and textual video descriptions. It is accompanied by a unified cross-attention fusion retrieval model that achieves state-of-the-art performance in the fine-grained setting, leveraging advanced grounding between visual, descriptive, and modification modalities. The dataset and methods are introduced in "Beyond Simple Edits: Composed Video Retrieval with Dense Modifications" (Thawakar et al., 19 Aug 2025).
1. Dataset Design and Construction
Dense-WebVid-CoVR builds on the WebVid-CoVR foundation by enriching each triplet of (source video, basic change text, target video) with substantially more detailed annotations. Its central innovation is the incorporation of densely-phrased modification texts, with an average of 31.16 words per modification (compared to 4.6 in WebVid-CoVR), and comprehensive video descriptions averaging 81.32 words (WebVid-CoVR: 6.68 words). The source data are mined from WebVid-2M/8M, resulting in 1.6 million triplets spanning approximately 131,000 unique videos and 467,000 unique change texts.
Automatic annotation proceeds in two stages: Gemini-Pro generates initial dense descriptions (which are filtered via a BLIP-based hallucination checker, cosine threshold ≈ 0.4), and GPT-4o provides paragraph-level dense modification instructions, informed by the source/target captions and original change text as context. Manual verification covers the entirety of the validation and test splits (3,000 triplets each), as well as 100,000 representative training triplets. Remaining training data are estimated to exhibit only 2–3 % minor annotation noise. Quality control emphasizes temporal and object/action alignment, conciseness, and fidelity.
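The hallucination check described above can be illustrated with a minimal sketch. It assumes that a description is kept when the cosine similarity between its embedding and the video embedding reaches the threshold of 0.4; the source states the threshold but not the direction of the comparison or how the embeddings are produced. The function names and toy vectors are hypothetical, and the BLIP encoder itself is not shown.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a degenerate embedding as unrelated
    return dot / (norm_u * norm_v)

def filter_descriptions(pairs, threshold=0.4):
    """Split video ids into kept and rejected by description/video similarity.

    `pairs` is a list of (video_id, video_emb, text_emb) tuples. The
    embeddings are assumed to come from a BLIP-style encoder (not shown).
    """
    kept, rejected = [], []
    for video_id, video_emb, text_emb in pairs:
        score = cosine_similarity(video_emb, text_emb)
        # Assumption: scores at or above the threshold pass the check.
        (kept if score >= threshold else rejected).append(video_id)
    return kept, rejected

# Toy 3-d embeddings standing in for real encoder outputs.
pairs = [
    ("vid_a", [1.0, 0.0, 0.0], [0.9, 0.1, 0.0]),  # well aligned
    ("vid_b", [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),  # orthogonal to the video
]
kept, rejected = filter_descriptions(pairs)
```

In this toy case the first description passes and the second, which is orthogonal to its video embedding, is flagged for regeneration or removal.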
The dataset offers full train/val/test splits (train: 1,594,000; val: 7,000; test: 3,200). Clips average 16.8 seconds in duration and cover a broad array of domains, including nature, lifestyle, sports, education, and professional scenarios.
2. Comparative Analysis with Previous CoVR Datasets
Dense-WebVid-CoVR advances prior Composed Video Retrieval (CoVR) resources in both scale and granularity. Unlike WebVid-CoVR (1.6 million triplets, ~5-word change texts) or Ego-CVR (2,300 triplets, ~4-word test-only change texts, egocentric focus), the new benchmark is distinguished by its annotation density (~31-word modifications), full-split structure, and category breadth. Richer editing language makes retrieval harder: dense descriptions and modifications generate more informative negative distractors, demanding that models resolve subtle distinctions (e.g., "white brick background" with "red flames" versus generic action prompts). This shift emphasizes compositional understanding and precise grounding of fine-grained scene attributes (Thawakar et al., 19 Aug 2025).
3. Text and Visual Representation Pipeline
Dense-WebVid-CoVR employs a modular multi-modal embedding approach:
- Vision Encoder: Based on a pretrained ViT-Large; processes the middle frame of each video for efficiency and outputs a fixed-dimensional visual embedding.
- Text Encoder: Uses a frozen BLIP-2 encoder over the dense video descriptions, followed by a linear projection head that maps its output into the same embedding space as the visual encoder.
- Fusion Operation: The visual and text embeddings are combined via a learnable scalar weight to produce the fused embedding; the weight is tuned on validation (≈ 0.36).
This structure enables explicit alignment of rich textual descriptions with visual content, optimizing for discriminability in high-density compositional retrieval tasks.
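As an illustration of the scalar fusion step, the sketch below assumes the fused embedding is a convex combination of the visual and description embeddings, weighted by a single scalar. The source gives only the scalar's approximate value (0.36). The exact fusion formula, and which modality the scalar multiplies, are assumptions made here for illustration; `fuse` and `alpha` are hypothetical names.

```python
def fuse(visual_emb, text_emb, alpha=0.36):
    """Convex combination of two equal-length embeddings (assumed form).

    Assumption: alpha weights the visual embedding and (1 - alpha) the
    description embedding; the source does not specify this.
    """
    if len(visual_emb) != len(text_emb):
        raise ValueError("embeddings must share the same dimensionality")
    return [alpha * v + (1.0 - alpha) * t for v, t in zip(visual_emb, text_emb)]

# Toy 3-d embeddings already projected into a shared space.
visual = [1.0, 0.0, 2.0]
description = [0.0, 1.0, 2.0]
fused = fuse(visual, description)
```

A single scalar keeps the fusion cheap and interpretable: under this assumed form, a value below 0.5 would mean the description embedding contributes more than the middle-frame embedding.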
4. Unified Cross-Attention Fusion Model
The retrieval model introduces a unified cross-attention (CA) grounding architecture, addressing the context dilution observed in prior pairwise fusion schemes (q–t and d–t fused separately, where q is the query video, d its dense description, and t the modification text). The core is a transformer-based grounding text encoder, which takes the fused (q, d) representation as key/value and the dense modification text t as query. The transformer consists of repeating blocks with:
- Self-attention on the modification-text tokens,
- Cross-attention: $Q$ = modification-text token features; $K$, $V$ = fused embedding,
- Feed-forward layers,
- LayerNorm and residual connections.
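The cross-attention is applied as
$$\mathrm{CA}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$
where $d_k$ is the feature dimension. Stacked for $N$ (e.g., 6) layers, the output is mean-pooled or extracted from the [CLS] token to yield the final query embedding.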
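The cross-attention step can be sketched with standard scaled dot-product attention. This is a minimal illustration rather than the authors' implementation: it omits the learned query/key/value projections, multiple heads, residual connections, and LayerNorm, and it treats the fused embedding as a short sequence of key/value vectors, which is an assumption. All function and variable names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Compute softmax(Q K^T / sqrt(d)) V on list-of-list matrices.

    queries: n_q x d, keys: n_k x d, values: n_k x d_v.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in queries:
        # Scaled similarity of this query token to every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Attention-weighted sum of the value vectors.
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out

# Two modification-text tokens attend over one fused embedding.
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
fused = [[0.5, 0.5]]
attended = cross_attention(text_tokens, fused, fused)
```

With a single key/value vector every query token receives that vector unchanged, so the attention weights only become informative when the fused representation exposes several tokens.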
Training employs a bi-directional contrastive loss with hard-negative Noise Contrastive Estimation (HN-NCE), using a batch size of 1024 together with a temperature parameter and hard-negative weighting coefficients. Backbone encoders are frozen; only the grounding encoder is updated (AdamW optimizer, five epochs, no auxiliary losses).
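A bi-directional hard-negative contrastive objective of this kind can be sketched as follows. The form used here follows the commonly cited HN-NCE formulation, in which negatives are re-weighted by a softmax over their similarities. It is assumed rather than taken from the source, and the default values of `tau`, `alpha`, and `beta` are illustrative placeholders, not the paper's settings. Averaging over the batch is likewise an assumption.

```python
import math

def hn_nce_one_direction(sim, tau, alpha, beta):
    """Hard-negative NCE over the rows of a square similarity matrix.

    sim[i][j] is the similarity of query i and target j; sim[i][i] is the
    positive pair. Harder negatives (higher similarity) get larger weights.
    """
    n = len(sim)
    total = 0.0
    for i in range(n):
        pos = math.exp(sim[i][i] / tau)
        neg_idx = [j for j in range(n) if j != i]
        # Importance weights: softmax over negatives with sharpness beta,
        # rescaled so the weights sum to the number of negatives.
        raw = [math.exp(beta * sim[i][j] / tau) for j in neg_idx]
        norm = sum(raw)
        weights = [len(neg_idx) * r / norm for r in raw]
        neg = sum(w * math.exp(sim[i][j] / tau) for w, j in zip(weights, neg_idx))
        total += -math.log(pos / (alpha * pos + neg))
    return total / n

def hn_nce_loss(sim, tau=0.07, alpha=1.0, beta=0.5):
    """Bi-directional loss: query-to-target plus target-to-query."""
    transposed = [list(col) for col in zip(*sim)]
    return (hn_nce_one_direction(sim, tau, alpha, beta)
            + hn_nce_one_direction(transposed, tau, alpha, beta))

# Cosine similarities for a toy batch of two (query, target) pairs.
aligned = [[1.0, 0.0], [0.0, 1.0]]
misaligned = [[0.0, 1.0], [1.0, 0.0]]
loss_good = hn_nce_loss(aligned)
loss_bad = hn_nce_loss(misaligned)
```

Setting `beta` to 0 makes all negative weights equal to 1, which with `alpha` equal to 1 reduces each direction to the ordinary InfoNCE loss; larger `beta` concentrates the penalty on the most confusable negatives.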
5. Benchmarking, Evaluation, and Ablation
Evaluation employs Recall@k (R@k) and mean rank metrics, querying the entire test set (3,200 videos) in text-only, visual-only, and visual+text (q+d+t) modes. The unified CA fusion model achieves 71.26 % Recall@1 in the visual+text setting, a 3.4 pp improvement over the prior best (Thawakar et al., pairwise CA, 67.86 %). Zero-shot transfer from WebVid-CoVR weights yields R@1 = 48.08 % (vs. 39.20 % prior: +8.9 pp).
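For reference, Recall@k over a full gallery can be computed as in the sketch below. It assumes a square query-by-gallery similarity matrix in which the correct target for query i sits at gallery index i; that layout, and the function name, are conventions chosen for the example rather than details from the source.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose ground-truth target ranks in the top k.

    similarity[i][j] scores query i against gallery item j; the correct
    target for query i is assumed to sit at gallery index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank = 1 + number of gallery items scored strictly higher.
        rank = 1 + sum(1 for score in row if score > row[i])
        if rank <= k:
            hits += 1
    return hits / len(similarity)

scores = [
    [0.9, 0.1, 0.3],  # target ranked 1st
    [0.8, 0.7, 0.2],  # target ranked 2nd
    [0.4, 0.5, 0.3],  # target ranked 3rd
]
r1 = recall_at_k(scores, 1)
r2 = recall_at_k(scores, 2)
r3 = recall_at_k(scores, 3)
```

Here one of three queries retrieves its target at rank 1, two within rank 2, and all three within rank 3. The reported gains are differences in percentage points between such values, for example 71.26 − 67.86 = 3.40 pp.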
Ablations highlight the contributions of dense annotation: replacing dense modifications with shorter alternatives at inference reduces R@1 (71.26 % → 68.88 %); omitting the dense descriptions lowers R@1 to 66.08 %. Fusion strategies are also compared (simple average: 69.7 %, CA-only: 70.1 %, unified CA: 71.3 %).
The model generalizes to auxiliary benchmarks: on Ego-CVR, global zero-shot R@1=14.6 % (vs. 14.1 % TFR-CoVR), local R@1=44.8 % (vs. 44.2 %); on CIRR (image CoIR), fine-tuned R@1=56.3 % (vs. 51.0 %), zero-shot R@1=44.1 % (vs. 40.1 %). Consistent improvements appear on FashionIQ (Dress/Shirt/Top tee) in both fine-tuned and zero-shot settings (Thawakar et al., 19 Aug 2025).
| Dataset | #Triplets | Avg. mod. text length | Verification |
|---|---|---|---|
| WebVid-CoVR | 1.6M | ~5 words | Automatic |
| Ego-CVR | 2.3K (test) | ~4 words | Test set only |
| Dense-WebVid-CoVR | 1.6M | ~31 words | Full val/test + 100K train |
6. Use Cases, Limitations, and Prospective Extensions
Dense-WebVid-CoVR supports compositional video retrieval tasks that require fine-grained differentiation of actions or scenes. Application domains include video editing (locating subtle scene variants), sports highlight identification (“same play with different gesture”), surveillance (“same camera, modified object”), and educational content selection (“repeat demonstration from new perspective”).
Current limitations stem from the use of a single video frame in the vision encoder, which underutilizes temporal information; annotation noise (estimated 2–3 %) in unverified training samples; and a monolingual (English) focus, which limits direct applicability in low-resource language settings.
Proposed directions include full video feature/transformer backbones, multilingual and domain-specific modification text generation, iterative interactive refinement for user-cooperative retrieval, and architectural optimization for longer videos and multi-segment retrieval (Thawakar et al., 19 Aug 2025).
7. Significance and Impact
Dense-WebVid-CoVR establishes a new standard for richly annotated, large-scale video retrieval resources. Its dense modification instructions and unified fusion model set a high bar for fine-grained, compositional video understanding and retrieval tasks. The dataset’s annotation rigor, combined with the step-change in retrieval accuracy (R@1=71.3 %, +3.4 pp over prior best), provides a valuable testbed and methodological foundation for advancing multi-modal video AI, compositional scene understanding, and fine-grained edit localization (Thawakar et al., 19 Aug 2025).