Query-Based Frame Selection
- Query-Based Frame Selection is a method that selects the most relevant video frames based on textual queries to enhance accuracy in retrieval, QA, and reasoning tasks.
- It leverages techniques such as text-guided ranking, submodular optimization, and reinforcement learning to ensure diverse, temporally coherent frame selection.
- Empirical studies show these methods can dramatically reduce frame processing while improving accuracy by margins of up to 8.5% over uniform sampling.
Query-Based Frame Selection refers to the principled identification of video frames that are most relevant to a given textual query (e.g., a question or prompt), under the context and computational constraints inherent to Large Multimodal Models (LMMs) and Video-LLMs. Instead of uniformly subsampling frames, query-based methods optimize for maximal task performance (retrieval, question answering, reasoning) by selectively exposing only those frames whose visual content substantively supports answering the query. This paradigm encompasses techniques ranging from simple text-guided ranking and submodular/max-margin combinatorics to deep policy learning and set-level structured selection, with strong empirical evidence for their superiority over naive sampling on benchmarks spanning short- and long-form video understanding.
1. Formal Problem Definition and Motivation
Let a video be represented as an ordered sequence of frames, $V = \{f_1, f_2, \dots, f_N\}$, and let $q$ denote the associated textual query. In most practical scenarios, constraints on visual token budgets (e.g., context window, GPU memory) admit only a small set $S \subseteq V$, $|S| = K \ll N$, of frames for model input. The central objective is to select $S$ such that, when processed by a Video-LLM or retrieval engine (denoted $\mathcal{M}$), the probability of obtaining the correct output (e.g., answer, retrieval result) is maximized, ideally matching the performance given access to the full video. This is formalized as:

$$S^* = \arg\max_{S \subseteq V,\; |S| = K} \; U\big(\mathcal{M}(S, q)\big)$$

where $U(\cdot)$ quantifies accuracy, confidence, or utility relevant to the downstream task (Lee et al., 2 Jun 2025).
2. Taxonomy and Core Principles
Query-based frame selection methods can be systematized along several axes:
- Heuristic vs. Learning-based: Early systems (e.g., Maximal Marginal Relevance (Vasudevan et al., 2017); CLIP cosine matching) rely on static feature-based similarity. Modern architectures utilize submodular optimization, deep learning, or reinforcement learning to adapt selection criteria to both query and downstream task utility (Patil et al., 12 Jan 2026, Lee et al., 2 Jun 2025).
- Text-free vs. Text-guided: Text-free methods select frames agnostic to query semantics, typically via uniform sampling, clustering, or frame quality scoring (Wu et al., 2023). Text-guided methods align frame embeddings with query embedding, employing either pointwise similarity or more structured approaches (e.g., DPP, SMI, RL) (Patil et al., 12 Jan 2026, Sun et al., 6 Jan 2025).
- Independent vs. Set-level Selection: Scoring frames independently risks temporal redundancy; structured approaches (DPP, Gumbel-Softmax set objective, RL policies) enforce diversity, sequentiality, and logical coverage at the set level (Yang et al., 12 Dec 2025, Sun et al., 6 Jan 2025).
Key design principles in state-of-the-art methods include:
- Query Relevance: Frames must be semantically aligned with the query, as measured in shared embedding spaces or via direct reward from reference LMMs (Patil et al., 12 Jan 2026).
- List-wise Diversity and Redundancy Avoidance: Selected frames should not be visually/temporally redundant; diversity is enforced via submodular objectives (e.g., DPP, SMI) (Sun et al., 6 Jan 2025, Patil et al., 12 Jan 2026).
- Temporal Coherence and Sequentiality: Ordering and spread are often constrained to sample both early and late events (Sun et al., 6 Jan 2025).
- Task-driven Supervision or Reward: Selection objectives are increasingly coupled to the reasoning performance of the downstream model, either via margin-based RL rewards (Lee et al., 2 Jun 2025), teacher-student alignment (Yang et al., 12 Dec 2025), or direct model feedback (Yu et al., 2024).
3. Algorithmic Methodologies
Pointwise and Simple Text-Guided Selection
The simplest query-based protocols extract frame embeddings $v_i$ (ViT, CLIP) and a query embedding $q$ (BERT, CLIP), scoring each frame via cosine similarity to the query (Wu et al., 2023, Zhang et al., 27 Jun 2025):

$$s_i = \frac{v_i \cdot q}{\lVert v_i \rVert \, \lVert q \rVert}$$

Top-K frames by $s_i$ are chosen (Wu et al., 2023).
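A minimal sketch of this pointwise protocol, with random vectors standing in for real CLIP embeddings:

```python
import numpy as np

def topk_frames(frame_embs: np.ndarray, query_emb: np.ndarray, k: int) -> list[int]:
    """Score each frame by cosine similarity to the query and return the
    top-k frame indices in temporal order (pointwise text-guided selection)."""
    f = frame_embs / np.linalg.norm(frame_embs, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb)
    scores = f @ q                    # s_i = cos(v_i, q)
    top = np.argsort(-scores)[:k]     # highest-similarity frames
    return sorted(top.tolist())       # restore temporal order

# Toy example: 5 "frames" in a 3-d embedding space; the query is
# constructed to lie near frame 2, so frame 2 should be selected.
rng = np.random.default_rng(0)
frames = rng.normal(size=(5, 3))
query = frames[2] + 0.01 * rng.normal(size=3)
picked = topk_frames(frames, query, k=2)
```

In practice the embeddings come from a shared vision-language space (e.g., CLIP), so cosine scores are directly comparable across frames.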
Submodular and List-wise Set Selection
Submodular methods, such as Facility-Location Mutual Information (FLMI) and Graph-Cut Mutual Information (GCMI) (Patil et al., 12 Jan 2026), combine relevance and diversity. In their standard submodular-mutual-information forms,

$$I_{\mathrm{GC}}(A; Q) = 2\lambda \sum_{i \in A} \sum_{j \in Q} s(i, j), \qquad I_{\mathrm{FL}}(A; Q) = \sum_{j \in Q} \max_{i \in A} s(i, j) + \eta \sum_{i \in A} \max_{j \in Q} s(i, j),$$

where $s(\cdot, \cdot)$ is similarity. The greedy algorithm yields a $(1 - 1/e)$-approximation for monotone submodular objectives.
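A toy greedy maximizer for a graph-cut-style relevance-minus-redundancy objective illustrates the mechanics (the λ weight and the similarity values below are illustrative, not from the paper):

```python
import numpy as np

def greedy_graph_cut(sim_fq: np.ndarray, sim_ff: np.ndarray,
                     k: int, lam: float = 2.0) -> list[int]:
    """Greedy maximization of a graph-cut-style objective
       f(S) = lam * sum_{i in S} sim(i, q) - sum_{i,j in S} sim(i, j),
    trading query relevance against pairwise redundancy.
    sim_fq: (N,) frame-query similarities; sim_ff: (N, N) frame-frame similarities."""
    n = len(sim_fq)
    selected: list[int] = []
    for _ in range(k):
        best, best_gain = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            # marginal gain of adding frame i to the current set
            gain = lam * sim_fq[i] - sum(sim_ff[i, j] for j in selected)
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
    return sorted(selected)

# Frames 0 and 1 are both highly query-relevant but nearly identical;
# the redundancy term makes greedy pick the diverse frame 2 instead of 1.
sim_fq = np.array([0.9, 0.88, 0.5, 0.1])
sim_ff = np.array([[1.0, 0.95, 0.1, 0.0],
                   [0.95, 1.0, 0.1, 0.0],
                   [0.1, 0.1, 1.0, 0.0],
                   [0.0, 0.0, 0.0, 1.0]])
picked = greedy_graph_cut(sim_fq, sim_ff, k=2)
```

Because the objective is submodular, this greedy loop inherits the $(1 - 1/e)$ guarantee when the objective is also monotone.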
Determinantal Point Processes and Sequential Allocation
MDP³ employs an RKHS-based conditional similarity matrix $L^{(q)}$, DPP selection for set-level diversity/relevance, and dynamic programming for segment-wise allocation, offering tractable $(1 - 1/e)$-approximate list-wise selection under sequential constraints (Sun et al., 6 Jan 2025).
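A minimal greedy MAP sketch for DPP selection, assuming a precomputed query-conditioned kernel `L` (the toy kernel below is built from hypothetical relevance scores, not MDP³'s actual RKHS construction):

```python
import numpy as np

def greedy_dpp(L: np.ndarray, k: int) -> list[int]:
    """Greedy MAP inference for a DPP with kernel L: repeatedly add the item
    with the largest log-det marginal gain, which jointly rewards relevance
    (large diagonal entries) and diversity (low off-diagonal similarity)."""
    n = L.shape[0]
    selected: list[int] = []
    for _ in range(k):
        base = np.linalg.slogdet(L[np.ix_(selected, selected)])[1] if selected else 0.0
        best, best_gain = None, -np.inf
        for i in range(n):
            if i in selected:
                continue
            cand = selected + [i]
            gain = np.linalg.slogdet(L[np.ix_(cand, cand)])[1] - base
            if gain > best_gain:
                best, best_gain = i, gain
        selected.append(best)
    return sorted(selected)

# Kernel L = diag(r) @ S @ diag(r): r encodes query relevance, S similarity.
# Frames 0 and 1 are relevant but redundant, so the DPP picks 0 and 2.
r = np.array([1.0, 0.98, 0.8, 0.3])
S = np.array([[1.0, 0.95, 0.1, 0.0],
              [0.95, 1.0, 0.1, 0.0],
              [0.1, 0.1, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
L = np.outer(r, r) * S
picked = greedy_dpp(L, k=2)
```

The determinant shrinks when two selected frames are similar, which is exactly the set-level redundancy penalty that pointwise scoring lacks.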
Reinforcement Learning of Selection Policies
ReFoCUS reframes selection as a sequential policy learning task, maximizing the expected reward of the selected frame set under a learned policy $\pi_\theta$:

$$\max_\theta \; \mathbb{E}_{S \sim \pi_\theta(\cdot \mid V, q)} \big[ R(S) \big],$$

with an action space over frame indices, autoregressive conditional selection enforcing temporal coherence, and reward signals $R$ derived from margin-based LMM outputs. Policy-gradient optimization with entropy regularization is applied, with batch-wise baseline subtraction (Lee et al., 2 Jun 2025).
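A bare-bones REINFORCE sketch of this idea, with a hand-coded set of "relevant" frames standing in for margin-based LMM rewards (the policy parameterization, reward, and hyperparameters are illustrative, not ReFoCUS's):

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, k, lr = 6, 2, 0.5
theta = np.zeros(n_frames)      # policy logits over frame indices
relevant = {1, 4}               # stand-in for an LMM-derived reward signal

def sample_episode(theta):
    """Autoregressively sample k distinct frames from a softmax policy."""
    chosen, steps = [], []
    for _ in range(k):
        logits = theta.copy()
        logits[chosen] = -np.inf            # mask already-chosen frames
        p = np.exp(logits - logits.max()); p /= p.sum()
        a = rng.choice(n_frames, p=p)
        chosen.append(a); steps.append((a, p))
    return chosen, steps

baseline = 0.0
for _ in range(300):
    chosen, steps = sample_episode(theta)
    reward = len(set(chosen) & relevant) / k   # fraction of relevant frames hit
    adv = reward - baseline                    # baseline subtraction
    baseline = 0.9 * baseline + 0.1 * reward   # running-average baseline
    for a, p in steps:                         # REINFORCE: grad log pi = e_a - p
        grad = -p
        grad[a] += 1.0
        theta += lr * adv * grad

picked, _ = sample_episode(theta)              # trained policy favors {1, 4}
```

After training, the logits of the reward-bearing frames rise above the rest; the masking step is a crude proxy for the autoregressive conditioning that ReFoCUS uses for temporal coherence.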
End-to-End Differentiable Selection
VidF4 and HFS leverage Gumbel-Softmax relaxation to enable differentiable selection, allowing frame scoring heads to be trained alongside QA objectives (Liang et al., 2024, Yang et al., 12 Dec 2025). The set-level selection objective aggregates relevance, coverage, and redundancy in a continuous fashion, and teacher reasoning output is aligned with student selector distributions via KL-divergence (Yang et al., 12 Dec 2025).
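A NumPy sketch of the Gumbel-Softmax relaxation itself, independent of any particular selector architecture: at low temperature each sample is near one-hot, and the sample mean approaches the softmax of the selector's logits, which is what makes frame selection trainable end to end:

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(logits: np.ndarray, tau: float) -> np.ndarray:
    """Relaxed categorical sample over frames: perturb logits with Gumbel
    noise, then apply a temperature-tau softmax (differentiable in logits)."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))  # Gumbel(0, 1) noise
    y = (logits + g) / tau
    y = np.exp(y - y.max())
    return y / y.sum()

# Hypothetical frame-relevance logits from a selector head.
logits = np.array([2.0, 0.5, -1.0, 0.0])
samples = np.stack([gumbel_softmax(logits, tau=0.1) for _ in range(2000)])
freq = samples.mean(axis=0)   # at low tau, approximates softmax(logits)
```

In a real pipeline the hard forward sample is paired with the soft distribution for the backward pass (straight-through), so the scoring head receives gradients from the QA loss.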
Clip-Level and Sequential Exploration
FOCUS casts keyframe selection as a combinatorial pure-exploration bandit problem, partitioning videos into clips ("arms"), estimating the empirical mean relevance per arm with Bernstein confidence bounds, and then allocating the selection budget via two-stage exploration-exploitation (Zhu et al., 31 Oct 2025).
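A simplified two-stage sketch in the spirit of this clip-level exploration, using a Hoeffding-style bound in place of FOCUS's Bernstein bound (clip size, probe count, and the toy scores are illustrative):

```python
import numpy as np

def clip_bandit_select(frame_scores: np.ndarray, clip_size: int,
                       probe_per_clip: int, budget: int,
                       delta: float = 0.1) -> list[int]:
    """Two-stage clip-level selection: probe a few frames per clip ("arm"),
    rank clips by an upper confidence bound on mean relevance, then spend
    the frame budget on the best frames of the most promising clips."""
    rng = np.random.default_rng(0)
    n = len(frame_scores)
    clips = [np.arange(s, min(s + clip_size, n)) for s in range(0, n, clip_size)]
    ucb = []
    for clip in clips:   # stage 1: cheap probing of every arm
        probe = rng.choice(clip, size=min(probe_per_clip, len(clip)), replace=False)
        mean = frame_scores[probe].mean()
        width = np.sqrt(np.log(1 / delta) / (2 * len(probe)))  # Hoeffding-style
        ucb.append(mean + width)
    order = np.argsort(-np.array(ucb))   # stage 2: exploit the best arms
    selected: list[int] = []
    for c in order:
        for i in np.argsort(-frame_scores[clips[c]]):
            if len(selected) == budget:
                return sorted(selected)
            selected.append(int(clips[c][i]))
    return sorted(selected)

# 20 frames, with only the clip covering indices 10-14 highly relevant.
scores = np.full(20, 0.1)
scores[10:15] = 0.9
picked = clip_bandit_select(scores, clip_size=5, probe_per_clip=2, budget=4)
```

The point of the bound is that only a handful of frames per clip need scoring before the budget is concentrated on the right region of an hour-long video.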
Adaptive, Iterative, and Reasoning-based Selection
A.I.R. applies iterative refinement: (i) event detection by thresholding CLIP scores, (ii) proportional allocation to detected "events", (iii) ranking intervals of candidate frames by potential scores, and (iv) per-interval reasoning-based relevance confirmation using VLM chain-of-thought scoring (Zou et al., 6 Oct 2025).
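The first two steps, event detection and proportional allocation, can be sketched as follows; the threshold and scores are illustrative, and the VLM chain-of-thought stage is omitted:

```python
import numpy as np

def detect_events(scores: np.ndarray, thresh: float) -> list[tuple[int, int]]:
    """Group consecutive above-threshold frames into candidate 'events'."""
    events, start = [], None
    for i, s in enumerate(scores):
        if s >= thresh and start is None:
            start = i
        elif s < thresh and start is not None:
            events.append((start, i)); start = None
    if start is not None:
        events.append((start, len(scores)))
    return events

def allocate(events, scores, budget: int) -> list[int]:
    """Give each event a share of the frame budget proportional to its total
    score mass, then take the top-scoring frames within each event."""
    masses = np.array([scores[a:b].sum() for a, b in events])
    quotas = np.maximum(1, np.round(budget * masses / masses.sum()).astype(int))
    picked = []
    for (a, b), q in zip(events, quotas):
        idx = np.arange(a, b)
        picked += idx[np.argsort(-scores[a:b])][:q].tolist()
    return sorted(picked)[:budget]

# Toy CLIP-score profile over 8 frames: two events, one stronger than the other.
scores = np.array([0.1, 0.8, 0.9, 0.1, 0.1, 0.7, 0.1, 0.1])
events = detect_events(scores, thresh=0.5)
picked = allocate(events, scores, budget=3)
```

The stronger event receives the larger quota, mirroring the proportional-allocation step before reasoning-based confirmation refines each interval.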
4. Practical Implementations, Computational Efficiency, and Limitations
Selection mechanisms are typically deployed as plug-and-play preprocessing modules ahead of downstream Video-LLM pipelines. Training-free approaches (CLIP, DINOv2, set-based greedy) dominate in scenarios requiring minimal integration effort (Zhang et al., 27 Jun 2025, Li et al., 3 Dec 2025, Zhu et al., 31 Oct 2025). More advanced frameworks support curriculum or end-to-end training, leveraging proxy similarity, leave-one-out loss, and dataset-scale annotations (Li et al., 4 Oct 2025).
Efficiency is a recurring theme: state-of-the-art methods process less than 2% of frames (FOCUS), offering order-of-magnitude reductions in FLOPs and latency compared to uniform or baseline methods (Zhu et al., 31 Oct 2025, Li et al., 4 Oct 2025, Zhang et al., 27 Jun 2025). Adaptive selection (FrameOracle, A.I.R.) flexibly predicts both which frames and how many frames to select based on question complexity and information density (Li et al., 4 Oct 2025, Zou et al., 6 Oct 2025).
Limitations include reliance on frozen backbone encoders (e.g., zero-shot CLIP), potential failure to capture fine temporal dependencies (Q-Frame, FOCUS), and limited robustness to query type and semantic ambiguity (DIG). Teacher-student alignment and reasoning-based scores help mitigate weak pseudo-label supervision, but feature quality and temporal modeling remain bottlenecks for certain QA categories (Yang et al., 12 Dec 2025, Li et al., 3 Dec 2025).
5. Empirical Results and Benchmarks
Query-based frame selection consistently outperforms uniform/random sampling and naïve frame ranking across diverse benchmarks (Video-MME, LongVideoBench, MLVU, NExT-QA, MVBench):
- Accuracy improvements up to +6.9% (TCS (Tan et al., 16 Jan 2026)), +8.5% (Q-Frame (Zhang et al., 27 Jun 2025)), +3.9% (ReFoCUS (Lee et al., 2 Jun 2025)), +4% (Patil et al. (Patil et al., 12 Jan 2026)), +3–8% (MDP³ (Sun et al., 6 Jan 2025)), +2.5 pts (VidF4 (Liang et al., 2024)), +3–4 pts (Frame-Voyager (Yu et al., 2024)).
- Enhanced efficiency: e.g., reducing 16-frame inputs to an average of 10.4 frames with no accuracy loss (Li et al., 4 Oct 2025), <2% frame coverage with >5% accuracy gains on hour-long videos (Zhu et al., 31 Oct 2025), or achieving comparable accuracy at half the inference cost (TCS (Tan et al., 16 Jan 2026)).
- Structured set-level selection (HFS (Yang et al., 12 Dec 2025)) yields highest aggregate accuracy on object/event localization and complex reasoning tasks, surpassing independent scoring.
- In retrieval, query-guided selection supports Recall@1 preservation and up to 50% reduction in FLOPs (Wu et al., 2023).
Table: Selected results from key approaches
| Method | Domain | Accuracy Gain vs Uniform | Frame Coverage |
|---|---|---|---|
| Q-Frame | Video QA | +8.5% | 8/128 (token-eq) |
| FOCUS | Long Video QA | +4–11.9% | <2% |
| MDP³ | Video QA | +3–8% | 8/128 |
| DIG | Long Video QA | +7.7% | Up to 256 frames |
| FrameOracle | Video QA | +1.4% at 78% frame cut | 13.9/64 |
| VidF4 | Video QA | +2.5 pts | 8/32 |
| HFS | Video QA | +3–7 pts | 16/128 |
| TCS | Long Video QA | +6.9% | 8/32 |
6. Extensions, Variants, and Emerging Directions
Recent research expands frame selection methods along several axes:
- Query Typology Adaptation: DIG demonstrates the need to distinguish global from localized queries, activating query-aware selection only where beneficial (Li et al., 3 Dec 2025).
- Multi-query and Clip-level Sampling: TCS generates multiple queries for complementary aspects of the video, combining dense local selection with sparse global coverage (Tan et al., 16 Jan 2026).
- Reasoning and Teacher-Student Alignment: Holistic set-based frameworks employ chain-of-thought generation, Gumbel-Softmax set relaxation, and online distillation to dynamically shape selection (Yang et al., 12 Dec 2025).
- Structured Knowledge Tasks: FRASE introduces frame semantic role labeling as a means of query-based "frame" selection for semantic parsing in SPARQL generation, demonstrating robustness to unseen templates and paraphrases (Diallo et al., 28 Mar 2025).
Contemporary limitations include:
- Incomplete temporal logic modeling in LMMs post-selection (DIG (Li et al., 3 Dec 2025)).
- Dependence on static frozen encoders; active adaptation to more complex cues (audio, fine-grained motion) remains open.
- Label and supervision quality for training selectors, especially pseudo-label reliability.
7. Conclusion and Outlook
Query-Based Frame Selection has emerged as a foundational operation for efficient, accurate video understanding in multimodal LLMs. Techniques have evolved from heuristic and embedding-based ranking to structured, set-aware, and reward-aligned methods encompassing both training-free and end-to-end differentiable architectures. Empirical evidence demonstrates consistent accuracy gains and latency reductions across standard video reasoning benchmarks, affirming query-aware selection as essential to scalable video-LLMs. Ongoing research explores greater adaptation to query typology, richer multi-modal fusion, and integration with temporal logic modules, all toward closing the gap between what models "see" and what they "need to know" for real-world video comprehension.