Consistent Video Retrieval (CVR)

Updated 4 July 2026

Consistent Video Retrieval (CVR) is a retrieval paradigm that requires results to be semantically relevant while preserving temporal and contextual consistency.
It encompasses methods such as composed video retrieval with modification texts and context-aware sequential retrieval based on visual history.
Recent approaches leverage benchmark-driven evaluations, query reformulation, and interactive feedback to improve state transition modeling and retrieval accuracy.

Consistent Video Retrieval (CVR) denotes a family of retrieval problems in which the target is required to satisfy not only semantic relevance but also a constraint of consistency under modification, temporal evolution, or interaction. Across papers, the term is used in two closely related senses. In one sense, largely overlapping with Composed Video Retrieval (CoVR), the query is a reference video plus a modification text, and retrieval must preserve the reference content except for the instructed change. In another sense, CVR is formalized as context-aware sequential retrieval, where the current result must remain compatible with prior visual history, latent state evolution, or an interaction trajectory. These formulations converge on a shared technical concern: retrieval quality is no longer determined by isolated text–video similarity alone, but by whether the retrieved clip is the correct continuation, transformation, or refinement of a specific visual context (Hummel et al., 2024, Liu et al., 9 Mar 2026, Zhang et al., 11 May 2026).

1. Formal problem statements and conceptual scope

In the composed-retrieval formulation, the query is a pair consisting of a reference (query) video $q_v \in \mathcal{V}$ and a text instruction $q_t \in \mathcal{T}$ describing how the reference video should be modified. The gallery is $\mathcal{D} = \{v_1,\dots,v_n\}$, and retrieval is defined by a scoring function

$\Phi: \mathcal{V}\times\mathcal{T}\times\mathcal{D}\rightarrow \mathbb{R}.$

In the generic embedding-space formulation, the query video and text are encoded by video/text encoders $\Psi_v$ and $\Psi_t$, fused into a joint query representation $q_{v,t}$ using a fusion function $\Psi_q$, and then compared to candidate video embeddings using cosine similarity. The retrieval objective is to return the gallery element with the highest score under $\Phi$ (Hummel et al., 2024).
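The embedding-space formulation above can be illustrated with a minimal sketch. The fusion function used here (an average of the normalised video and text embeddings) is a hypothetical stand-in for $\Psi_q$, not the fusion used by any cited method, and the three-dimensional vectors are toy values standing in for encoder outputs.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))

def fuse_average(video_emb, text_emb):
    """Hypothetical fusion: average of the two normalised embeddings."""
    v, t = l2_normalize(video_emb), l2_normalize(text_emb)
    return [(x + y) / 2.0 for x, y in zip(v, t)]

def rank_gallery(video_emb, text_emb, gallery):
    """Return gallery ids sorted by descending cosine similarity to the fused query."""
    query = fuse_average(video_emb, text_emb)
    scored = [(cosine(query, emb), vid) for vid, emb in gallery.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [vid for _, vid in scored]

# Toy gallery: clip_b mixes the "video" axis and the "text" axis.
gallery = {
    "clip_a": [1.0, 0.0, 0.0],
    "clip_b": [0.7, 0.7, 0.0],
    "clip_c": [0.0, 0.0, 1.0],
}
ranking = rank_gallery([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], gallery)
print(ranking)
```

In this toy case the clip that reflects both the reference content and the modification ranks first, ahead of the clip that matches the reference video alone.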

In the sequential-consistency formulation, standard retrieval is written as

$v^* = \arg\max_{v \in \mathcal{V}} \mathrm{sim}(f_t(q), f_v(v)),$

where a text query $q$ is matched to a clip $v$ using frozen text and video encoders $f_t$ and $f_v$. CVR replaces this independent-query assumption with

$v^* = \arg\max_{v \in \mathcal{V}} \mathrm{score}(v \mid q_k, \mathcal{H}_k),$

where, in generic notation, $q_k$ is the current instruction or step text at step $k$ and $\mathcal{H}_k$ is the recent visual history. Under this definition, the retrieved clip must be both semantically correct and consistent with the evolving visual state and identity in the prior clips (Liu et al., 9 Mar 2026).

A third formulation emphasizes implicit consequences. CoVR-R defines the task as follows: given a reference video and a modification text (an “edit”), retrieve the target video from a gallery that best reflects the after-effects implied by the edit. The paper explicitly treats object state transitions, temporal phase progression, scene/background changes, camera or shot-scale changes, and tempo or pacing changes as part of the retrieval target rather than incidental byproducts (Thawakar et al., 20 Mar 2026).

Taken together, these formalizations show that “consistency” in CVR can refer to at least three constraints: preservation of unchanged content under an edit, compatibility with prior visual context, and faithfulness to causal or temporal after-effects implied but not literally stated. This suggests that CVR is best understood as a family of constrained retrieval problems rather than a single fixed benchmark task.

2. Benchmarks, datasets, and evaluation protocols

Recent CVR research is strongly benchmark-driven, and the reported datasets differ substantially in what kind of consistency they test.

Benchmark	Task type	Reported characteristics
EgoCVR	Fine-grained egocentric composed retrieval	2,295 queries; 1,250 long-form egocentric videos; clips are 2–8 seconds, average 7.9 seconds; 10,522 distractor clips; 78.9% temporal and 21.1% object-centered
CAST benchmark	Context-aware sequential retrieval	YouCook2: 414 videos, 3,179 step-clips, 2,765 queries; COIN: 2,134 videos, 6,241 step-clips, 4,107 queries; CrossTask: 509 videos, 2,731 step-clips, 2,222 queries
CoVR-Reason	Reasoning-heavy composed retrieval	2,800 high-quality triplets; built from Dense-WebVid-CoVR and Something-Something V2; each triplet has structured reasoning traces

EgoCVR was introduced because the existing CVR benchmark WebVid-CoVR was described as too easy and too biased toward object-centric changes. EgoCVR is built from Ego4D Forecasting Hand and Object clips, and query-target pairs are sampled from the same long video, making the modification subtle and fine-grained. Each query has on average 1.2 ground-truth targets, and the benchmark includes 10,522 distractor clips averaging 4.2 distractors per target. The paper reports that 78.9% of EgoCVR queries are temporal and only 21.1% are object-centered, whereas WebVid-CoVR-Test is about 85% object-centered. This benchmark shift is central to the recent literature because it reorients evaluation from object replacement to temporal video understanding (Hummel et al., 2024).

EgoCVR also defines two search regimes. In global search, the gallery includes all relevant candidate clips, including many distractors; each query has at least 10,661 candidate videos and up to 12,526, with metrics Recall@1, Recall@5, and Recall@10. In local search, the gallery is restricted to clips from the same source video; each query has up to 10 clips, average 6.4, with metrics Recall@1, Recall@2, and Recall@3. If a query has multiple ground-truth videos, retrieving any one of them counts as correct (Hummel et al., 2024).
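The multi-target rule above (retrieving any one ground-truth clip counts as correct) can be sketched as a small Recall@K helper. The query and clip identifiers are hypothetical toy data, and the function is an illustration of the metric as described here rather than the benchmark's official evaluation code.

```python
def recall_at_k(rankings, targets, k):
    """Percentage of queries with at least one ground-truth id among the top k."""
    hits = 0
    for query_id, ranked_ids in rankings.items():
        # A query is a hit if ANY of its ground-truth clips is in the top k.
        if set(ranked_ids[:k]) & targets[query_id]:
            hits += 1
    return 100.0 * hits / len(rankings)

# Toy rankings: q1 has two acceptable targets, q2 has one.
rankings = {
    "q1": ["v3", "v1", "v7"],
    "q2": ["v2", "v9", "v4"],
}
targets = {
    "q1": {"v1", "v5"},
    "q2": {"v4"},
}
r1 = recall_at_k(rankings, targets, 1)
r2 = recall_at_k(rankings, targets, 2)
r3 = recall_at_k(rankings, targets, 3)
print(r1, r2, r3)
```

The same helper covers both regimes: global search would use K in {1, 5, 10} over the full gallery, and local search K in {1, 2, 3} over clips from the same source video.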

The CAST benchmark uses a fixed 1-vs-9 multiple-choice ranking task with 1 ground-truth clip and 9 negatives total. Negatives are explicitly typed: state negatives come from the same video but the wrong step or state; identity negatives come from different videos but are semantically similar; easy negatives are random distractors. The pool is filled with up to 3 state negatives and up to 3 identity negatives, with any remaining slots filled by easy negatives. Evaluation reports Accuracy, Mean Rank, State Acc., and Ident. Acc., thereby separating semantic relevance from procedural and identity consistency (Liu et al., 9 Mar 2026).
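The pool-construction rule above can be sketched as follows. The function name, the sampling strategy (uniform sampling with a fixed seed), and the clip identifiers are assumptions made for illustration; only the caps of 3 state and 3 identity negatives and the total of 9 negatives come from the description.

```python
import random

def build_candidate_pool(ground_truth, state_negs, identity_negs, easy_negs,
                         num_negatives=9, max_state=3, max_identity=3, seed=0):
    """Return a shuffled list of (clip_id, label) pairs: 1 positive plus negatives."""
    rng = random.Random(seed)
    # Take up to 3 state negatives and up to 3 identity negatives.
    state = rng.sample(state_negs, min(max_state, len(state_negs)))
    identity = rng.sample(identity_negs, min(max_identity, len(identity_negs)))
    # Any remaining slots are filled with easy (random) negatives.
    n_easy = num_negatives - len(state) - len(identity)
    easy = rng.sample(easy_negs, n_easy)  # assumes enough easy negatives exist
    pool = [(ground_truth, "positive")]
    pool += [(clip, "state") for clip in state]
    pool += [(clip, "identity") for clip in identity]
    pool += [(clip, "easy") for clip in easy]
    rng.shuffle(pool)
    return pool

# Only two state negatives are available, so four easy negatives fill the gap.
pool = build_candidate_pool(
    "gt",
    state_negs=["s1", "s2"],
    identity_negs=["i1", "i2", "i3", "i4"],
    easy_negs=["e%d" % i for i in range(10)],
)
label_counts = {}
for _, label in pool:
    label_counts[label] = label_counts.get(label, 0) + 1
print(label_counts)
```

Keeping the label with each candidate is what makes typed metrics possible: an error can be attributed to a state negative, an identity negative, or an easy negative.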

The interactive line extends evaluation to multi-turn retrieval. ReCoVR evaluates on WebVid-CoVR-Test with 2,556 manually annotated triplets, Dense-WebVid-CoVR-Test with richer modification texts, FineCVR-Test with 10,043 queries, FashionIQ, and the interactive text-to-video transfer tasks MSRVTT-1kA and AVSD-1k-Test. In addition to Recall@K, it reports the Best log Rank Integral (BRI), a lower-is-better measure of cumulative ranking quality over turns (Zhang et al., 11 May 2026).

A recurring misconception in this literature is that strong performance on WebVid-CoVR implies strong temporal understanding. EgoCVR was constructed specifically to test that assumption and reports the opposite pattern: many queries that can be solved from a single image in object-centric benchmarks become difficult when the distinction lies in action, event order, or fine-grained hand-object interaction (Hummel et al., 2024).

3. Methodological families

One major methodological family is training-free query reformulation. TF-CVR first captions the query video $q_v$ with a video captioning model, producing a query caption

$c_q = \mathrm{Cap}(q_v).$

An LLM then combines the caption and the modification instruction $q_t$ into a target caption

$c_{\mathrm{tgt}} = \mathrm{LLM}(p, c_q, q_t),$

where $p$ is a prompt containing instructions and in-context examples. Retrieval is then reduced to ordinary text-video matching:

$v^* = \arg\max_{v \in \mathcal{D}} \mathrm{sim}(\Psi_t(c_{\mathrm{tgt}}), \Psi_v(v)).$

TFR-CVR adds a visual filtering stage that selects the $k$ most visually similar candidates before applying TF-CVR, yielding a generic re-ranking framework for composed video retrieval (Hummel et al., 2024).
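The two-stage procedure described above (caption reformulation, then visual filtering followed by text-video re-ranking) can be sketched as follows. The captioner, the LLM, and the text embedder are hypothetical stub callables, the prompt wording is invented for illustration, and the two-dimensional embeddings are toy values; none of this reproduces the models or prompts used in the paper.

```python
import math

def cosine(a, b):
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def make_target_caption(query_video, instruction, captioner, llm):
    """Caption the query video, then ask an LLM to apply the edit to the caption."""
    caption = captioner(query_video)
    prompt = (
        "Rewrite the caption so that it describes the video after the edit.\n"
        "Caption: " + caption + "\nEdit: " + instruction + "\nNew caption:"
    )
    return llm(prompt)

def tfr_cvr_rank(query_visual_emb, target_caption_emb, gallery, top_k):
    """Stage 1: keep the top_k visually closest clips. Stage 2: re-rank them by text."""
    by_visual = sorted(
        gallery,
        key=lambda vid: cosine(query_visual_emb, gallery[vid]["visual"]),
        reverse=True,
    )
    shortlist = by_visual[:top_k]
    return sorted(
        shortlist,
        key=lambda vid: cosine(target_caption_emb, gallery[vid]["joint"]),
        reverse=True,
    )

# Hypothetical stubs standing in for real models.
captioner = lambda video: "a person picks up a knife"
llm = lambda prompt: "a person puts down a knife"
text_embeddings = {"a person puts down a knife": [1.0, 0.2]}

target_caption = make_target_caption("query.mp4", "put it down", captioner, llm)
target_emb = text_embeddings[target_caption]

gallery = {
    "g1": {"visual": [1.0, 0.0], "joint": [0.0, 1.0]},
    "g2": {"visual": [0.9, 0.1], "joint": [1.0, 0.0]},
    "g3": {"visual": [0.0, 1.0], "joint": [1.0, 0.0]},  # matches text, wrong scene
}
ranking = tfr_cvr_rank([1.0, 0.0], target_emb, gallery, top_k=2)
print(ranking)
```

In the toy gallery, `g3` matches the target caption but is visually unrelated to the query, so the filtering stage removes it before text matching is applied.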

A second family is reasoning-first retrieval. CoVR-R uses frozen Qwen3-VL-8B in a zero-shot pipeline. Given the reference video and the edit, the model first generates a structured reasoning trace, then conditions on that trace to produce a target description of the hypothetical post-edit video. Gallery videos are also described and embedded offline, and candidates are ranked by cosine similarity between pooled query and gallery embeddings. The stated motivation is that literal keyword matching is insufficient when edits imply after-effects such as “raw $\rightarrow$ browned,” “before $\rightarrow$ after,” or “close-up implies tighter framing and shorter duration” (Thawakar et al., 20 Mar 2026).

A third family models state transition explicitly. CAST predicts a next-state embedding from two branches: an instruction-conditioned branch that uses the current instruction and the anchor clip, and a temporal context branch that applies multi-head cross-attention over the longer visual history. CAST is trained on top of frozen embedding spaces and used as a query-side reranker whose ensemble score combines three terms: text–video similarity, continuity with the last observed clip, and compatibility with the predicted next state (Liu et al., 9 Mar 2026).
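The three-term reranking score described above can be sketched as a weighted sum. The weighted-sum form, the equal default weights, and the similarity values are assumptions made for illustration; the text specifies only which three signals are combined, not how they are weighted.

```python
def ensemble_score(text_sim, continuity_sim, state_sim, weights=(1.0, 1.0, 1.0)):
    """Assumed combination: weighted sum of the three similarity terms."""
    w_text, w_cont, w_state = weights
    return w_text * text_sim + w_cont * continuity_sim + w_state * state_sim

def rerank(candidates, weights=(1.0, 1.0, 1.0)):
    """Sort candidate ids by descending ensemble score."""
    return sorted(
        candidates,
        key=lambda cid: ensemble_score(*candidates[cid], weights=weights),
        reverse=True,
    )

# Each candidate: (text-video similarity, continuity with last clip,
#                  compatibility with the predicted next state). Toy values.
candidates = {
    "right_step": (0.60, 0.70, 0.90),
    "wrong_state": (0.65, 0.75, 0.20),   # same video, wrong step
    "other_video": (0.66, 0.10, 0.30),   # similar semantics, different identity
}
text_only = sorted(candidates, key=lambda cid: candidates[cid][0], reverse=True)
reranked = rerank(candidates)
print(text_only[0], reranked[0])
```

In this toy case, text similarity alone prefers a clip from a different video, while the continuity and next-state terms move the correct step to the top.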

A fourth family focuses on multimodal query calibration under information-density asymmetry. HUD argues that the video modality usually contains much richer semantics than text and identifies two failures in existing systems: modification subject referring ambiguity and limited detailed semantic focus. Its architecture comprises Holistic Pronoun Disambiguation, Atomistic Uncertainty Modeling, and Holistic-to-Atomistic Alignment, combining overlapping semantics through holistic cross-modal interaction with fine-grained semantic alignment via atomistic-level interaction (Chen et al., 2 Dec 2025).

ReTrack treats composed-query bias as a geometric calibration problem. It identifies modal contribution entanglement, explicit optimization of composed features, and retrieval uncertainty as the three core challenges. Its modules—Semantic Contribution Disentanglement, Composition Geometry Calibration, and Reliable Evidence-driven Alignment—construct dual directional anchors and regularize composed-to-target similarity with bidirectional evidence derived from Dempster-Shafer Theory, Subjective Logic, and Evidential Deep Learning (Li et al., 20 Apr 2026).

IMAGINE addresses implicit semantics that are not explicitly visible but are semantically evoked by the scene, distinguishing a concrete space from an imagery space. Its three modules (Schema Imagery Construction, Imagery-guided Multimodal Composition, and Dual Space Alignment) build dynamic multimodal prototype libraries, derive shared imagery vectors, modulate the composed feature with two confidence-weighted modulation parameters, and jointly align explicit query-target correspondence with imagery-level correspondence (Huang et al., 6 Jun 2026).

These families are not mutually exclusive. A plausible implication is that recent CVR systems differ less in the use of a single retrieval backbone than in where they locate the missing signal: in language reformulation, structured reasoning, latent state transition, query disambiguation, geometric calibration, or implicit schema induction.

4. Empirical findings and reported performance

EgoCVR reports a sharp generalization gap for existing methods. In the global setting, CLIP has R@1 = 0.7, BLIP 0.4, LanguageBind 0.9, and EgoVLPv2 1.7. Methods explicitly designed for WebVid-CoVR also transfer poorly: two BLIP-based variants reach R@1 = 5.4 and 6.0, and CIReVL reaches 2.0. The paper states that text-only retrieval is weak, visual-only retrieval is also insufficient, simple fusion is not enough, and methods fine-tuned on WebVid-CoVR do not transfer because EgoCVR requires true temporal understanding. The best reported method is TFR-CVR with global R@1 = 14.1, R@5 = 39.5, and R@10 = 54.4; in the local setting it reaches R@1 = 44.2, R@2 = 61.0, and R@3 = 73.2. The same paper also reports that predicted caption reformulation helps: for TFR-CVR in the global setting, instruction-only yields R@1 = 12.8, predicted caption 14.1, and ground-truth caption 18.5 (Hummel et al., 2024).

CAST reports large gains over context-free baselines on the procedural benchmark. With CLIP-B/32 as the frozen backbone, CAST reaches 44.77 Accuracy on YouCook2 versus 25.03 for the CLIP baseline, 40.47 versus 14.10 on COIN, and 47.39 versus 16.83 on CrossTask. The paper also reports strong transfer across frozen backbones, including InternVideo2 on YouCook2 from 36.75 to 71.68, InternVideo2 on CrossTask from 20.61 to 64.36, and Qwen3-VL-Embedding on YouCook2 from 33.45 to 56.64. The improvement is described as especially strong for state accuracy and identity accuracy, indicating that the gains are not reducible to generic semantic matching alone (Liu et al., 9 Mar 2026).

CoVR-R reports the strongest performance on its reasoning-heavy benchmark with the reasoning-enabled variant. On CoVR-R, “Our Approach + R” achieves R@1 = 49.88, R@5 = 66.99, R@10 = 72.97, R@50 = 85.14, and a reasoning score of 8.31 ± 0.098, compared with 44.32, 61.91, 67.33, 79.90, and 7.46 ± 0.11 for the non-reasoned variant. On Dense-WebVid-CoVR, “Our Approach + R” reaches R@1 = 61.21 and R@50 = 97.61, while the strongest prior baseline BSE-CoVR is reported at R@1 = 48.08 and R@50 = 93.78. The paper repeatedly states that the largest gains occur on implicit-effect subsets, including state transitions, temporal phase changes, camera or shot changes, and causally implied visual consequences (Thawakar et al., 20 Mar 2026).

On the more established WebVid-CoVR benchmark, several recent architectures cluster in a relatively narrow band but all outperform older baselines. HUD reports R@1 = 63.38, R@5 = 86.93, R@10 = 92.29, R@50 = 98.76, and Avg. = 85.34. ReTrack reports R@1 = 63.85, R@5 = 87.05, R@10 = 92.80, R@50 = 99.10, and Avg. = 85.70. IMAGINE reports R@1 = 63.51, R@5 = 87.26, R@10 = 92.72, R@50 = 99.03, and Avg. = 85.63. Each paper presents its result as state of the art on WebVid-CoVR and attributes the gain to a different missing component: pronoun disambiguation and atomistic uncertainty in HUD, directional bias correction and evidence-driven alignment in ReTrack, and imagery-space guidance in IMAGINE (Chen et al., 2 Dec 2025, Li et al., 20 Apr 2026, Huang et al., 6 Jun 2026).

A notable pattern across these results is that benchmark difficulty changes the apparent ranking of methods. This suggests that reported gains on WebVid-CoVR and gains on EgoCVR or CoVR-R should not be treated as interchangeable evidence for the same capability.

5. Interactive and closed-loop retrieval

ReCoVR reframes the problem as Interactive Composed Video Retrieval (ICoVR), a multi-turn extension of CoVR in which a user progressively refines intent through natural-language feedback. Each turn $t$ of the interaction consists of an optional clarifying question from the system, the user’s feedback, and the updated ranked list at that turn. The paper identifies two weaknesses in adapted interactive baselines: reliance on a single retrieval channel and an open-loop retrieval design that consumes user feedback but does not diagnose whether its own retrieval trajectory is drifting or stagnating (Zhang et al., 11 May 2026).

ReCoVR addresses these weaknesses with a dual-pathway, training-free closed-loop framework. The Intent Pathway decomposes the feedback into a target-like description and a relative edit, then retrieves through two channels. The Reflection Pathway produces a satisfaction signal and a constraint delta, followed by a rank-cap policy that demotes explicitly rejected or stagnant candidates. Final ranking uses Time-Weighted Reciprocal Rank Fusion with recency weights.

The system maintains Progress Memory and Result Memory, and the paper explicitly interprets retrieval history as diagnostic evidence rather than passive storage (Zhang et al., 11 May 2026).
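The fusion step can be illustrated with a minimal sketch of reciprocal rank fusion in which each list is scaled by a recency weight. The exponential decay schedule, the smoothing constant `k = 60` (a common default for reciprocal rank fusion), and the choice to fuse one ranked list per turn are all assumptions made for illustration; they are not taken from the paper.

```python
def time_weighted_rrf(ranked_lists, decay=0.5, k=60):
    """Fuse ranked lists (oldest first); newer lists receive larger weights.

    Assumed weight for the list at turn t out of T turns: decay ** (T - 1 - t).
    """
    last = len(ranked_lists) - 1
    scores = {}
    for turn, ranking in enumerate(ranked_lists):
        weight = decay ** (last - turn)
        for rank, item in enumerate(ranking, start=1):
            # Standard reciprocal-rank term, scaled by the recency weight.
            scores[item] = scores.get(item, 0.0) + weight / (k + rank)
    return sorted(scores, key=lambda item: scores[item], reverse=True)

# Toy history: "a" led at the first turn, "b" leads at the most recent turn.
history = [
    ["a", "b", "c"],   # turn 0 (oldest)
    ["b", "c", "a"],   # turn 1 (most recent)
]
fused = time_weighted_rrf(history)
print(fused)
```

With a decay below 1, the most recent list dominates, so a candidate promoted by the latest feedback outranks one that was only favoured early in the interaction.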

On WebVid-CoVR-Test, the reported Turn 0 R@1 is 54.30%. After one interactive round, ReCoVR reaches 74.30% R@1, then 80.32% at Turn 2, 84.39% at Turn 3, 87.72% at Turn 4, and 90.22% at Turn 5. At Turn 1, the adapted baselines IVR, Merlin, and UMIVR obtain 66.55%, 60.80%, and 67.10%, respectively, versus 74.30% for ReCoVR. BRI reaches 0.3643 at Turn 5, compared with 0.4700 for UMIVR. The paper frames these improvements in terms of avoiding error accumulation, query drift, stagnation, and repeated failures (Zhang et al., 11 May 2026).

This multi-turn setting makes the term “consistent” literal: the retrieval system must preserve already satisfied constraints while incorporating new ones. A common misconception is that interactivity merely adds more query tokens. In ReCoVR’s formulation, interactivity instead creates a trajectory-level optimization problem in which memory, diagnosis, and correction become part of retrieval itself.

6. Extensions beyond standard retrieval and broader implications

The consistency principle has begun to migrate beyond standard ranking benchmarks. I3DM treats long-video generation with revisitation as a CVR-style memory problem. The model maintains a memory bank of historical frames and their camera representations and, for a new target view, retrieves $K$ relevant historical frames plus the last frame as an anchor. Retrieval is based on implicit 3D-aware relevance derived from shallow features of a frozen feed-forward novel view synthesis model, specifically LVSM, with patch-level uncertainty and greedy maximum coverage selection. Retrieved frames are then injected through a 3D-aligned memory injection module. On RealEstate10K, the paper reports FID 17.553, FVD 131.657, rotation error 1.991, translation error 0.0505, PSNR 24.732, SSIM 0.828, and LPIPS 0.0756, emphasizing revisit consistency, generation fidelity, and camera control precision (Li et al., 24 Mar 2026).
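Greedy maximum coverage, mentioned above as the selection rule, can be sketched in its textbook form. Representing each historical frame by the set of target-view patches it can explain is a hypothetical simplification: the description above refers to patch-level uncertainty, which this sketch replaces with plain set membership.

```python
def greedy_max_coverage(coverage, budget):
    """Pick up to `budget` frames, each time adding the one covering most new patches.

    coverage: frame_id -> set of target-patch indices that frame can explain.
    Returns (selected frame ids in pick order, set of covered patches).
    """
    remaining = dict(coverage)
    selected, covered = [], set()
    for _ in range(budget):
        best = max(remaining, key=lambda f: len(remaining[f] - covered), default=None)
        # Stop early if nothing is left or no frame adds new coverage.
        if best is None or not (remaining[best] - covered):
            break
        selected.append(best)
        covered |= remaining.pop(best)
    return selected, covered

# Toy memory bank: patch indices 0..5 of the target view.
coverage = {
    "f1": {0, 1, 2, 3},
    "f2": {2, 3, 4},
    "f3": {4, 5},
    "f4": {0, 1},
}
selected, covered = greedy_max_coverage(coverage, budget=2)
print(selected, sorted(covered))
```

The second pick is `f3` rather than `f2`: although `f2` covers more patches in total, most of them are already explained by `f1`, so `f3` contributes more new coverage.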

A related benchmark-design extension appears in ZeroSight, which studies genuine zero-shot composed image retrieval using consistent video-sourced data. Although the task is image retrieval, its construction explicitly imports the CVR intuition that reference and target should come from a single coherent visual source. ZeroSight uses 12,048 videos, 197,313 candidate images in the retrieval pool, and 54,740 queries, all sourced from videos published after March 31, 2022. Each query has multiple positive targets and hard negatives, and the paper argues that prior metrics inflate performance because they do not penalize negatives ranked above positives. Across 27 methods, the average mAP is reported as 19.94 and the average PNR-mAP as 16.22, so mAP overstates performance by 22.93% relative to PNR-mAP, since $(19.94 - 16.22)/16.22 \approx 0.2293$. Its training-free reranker, SC4CIR, uses three symmetric consistency checks (forward retrieval plus two reverse processes) and is described as plug-and-play (Yang et al., 5 Jun 2026).

These extensions indicate that CVR is increasingly treated as a systems property rather than only a query encoder problem. A plausible implication is that future progress will depend as much on benchmark construction, memory access, trajectory diagnosis, and evaluation design as on improvements to single-shot multimodal fusion. The literature already points in that direction: EgoCVR shifts evaluation toward temporal action understanding, CAST introduces state-conditioned sequential retrieval, ReCoVR closes the loop over interaction history, CoVR-R inserts explicit causal and temporal reasoning, and I3DM embeds retrieval into long-horizon generative memory (Hummel et al., 2024, Liu et al., 9 Mar 2026, Zhang et al., 11 May 2026, Thawakar et al., 20 Mar 2026, Li et al., 24 Mar 2026).