Learnable Frame Selector (LFS)
- LFS is a trainable mechanism that selects key video frames based on event relevance, diversity, and temporal coherence for efficient downstream processing.
- Its methodologies include direct scoring with top-K selection, cross-modal fusion, sequential decision processes, set-level objectives, and reinforcement learning.
- Empirical evaluations show that LFS improves accuracy by up to 6.6 points while reducing computational cost by processing fewer but more informative frames.
A Learnable Frame Selector (LFS) is a trainable mechanism integrated into various computer vision and video understanding systems to select the most informative subset of temporal frames for downstream tasks. The LFS paradigm has emerged as a response to the inefficiency and redundancy of uniform frame sampling, and now constitutes a core component in video LLMs (Video-LLMs), video recognition, captioning, and manipulation pipelines across robotics and multimedia applications.
1. Core Principles and Motivation
A Learnable Frame Selector is designed to optimize which video frames are processed by a downstream model, with the objective of maximizing task performance under strict computational constraints or model input token budgets. The rationale is that, due to temporal redundancy and non-uniform event salience in real-world videos, only a small subset of frames carries sufficient information for tasks such as video QA, captioning, recognition, and robotic perception (Hu et al., 27 Feb 2025, Chao et al., 21 Jan 2026, Yu et al., 2024, Wu et al., 2018).
LFS models are differentiated by their learning paradigms (supervised, reinforcement learning, mutual learning, or self-supervised), their integration with vision and language backbones, and their explicit inductive biases toward relevance, diversity, and event awareness.
2. Architectures and Selection Mechanisms
2.1. Direct Scoring and Top-K Selection
Early and influential approaches use lightweight neural networks to produce a per-frame importance score, with final Top-K selection based on these scores. In (Chao et al., 21 Jan 2026), a temporal scoring network (TSNet) processes frame embeddings (from a frozen vision encoder) via temporal convolutions and global gating; scores are normalized, and frames are chosen by stratified Top-K—one from each temporal segment—to ensure event awareness and temporal diversity and to prevent selections from clustering around a single salient event.
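As a concrete illustration, the following is a minimal sketch of score-then-select with stratified Top-K, assuming precomputed embeddings from a frozen vision encoder; the `TinyFrameScorer` module is an illustrative stand-in, not the actual TSNet architecture:

```python
import torch
import torch.nn as nn

class TinyFrameScorer(nn.Module):
    """Illustrative per-frame scorer; a simplified stand-in, not the actual TSNet."""
    def __init__(self, dim: int):
        super().__init__()
        self.temporal = nn.Conv1d(dim, dim, kernel_size=3, padding=1)  # local temporal context
        self.head = nn.Linear(dim, 1)                                  # per-frame importance score

    def forward(self, frame_emb: torch.Tensor) -> torch.Tensor:
        # frame_emb: (T, D) embeddings from a frozen vision encoder
        x = self.temporal(frame_emb.t().unsqueeze(0)).squeeze(0).t()   # (T, D)
        return self.head(x).squeeze(-1)                                # (T,) raw scores

def stratified_topk(scores: torch.Tensor, k: int) -> torch.Tensor:
    """Pick the best-scoring frame in each of k temporal segments (stratified Top-K)."""
    T = scores.shape[0]
    bounds = torch.linspace(0, T, steps=k + 1).long()
    picks = []
    for i in range(k):
        lo, hi = bounds[i].item(), max(bounds[i + 1].item(), bounds[i].item() + 1)
        picks.append(lo + scores[lo:hi].argmax().item())
    return torch.tensor(sorted(set(picks)))

scorer = TinyFrameScorer(dim=512)
emb = torch.randn(64, 512)                 # 64 candidate frames, 512-dim features
idx = stratified_topk(scorer(emb), k=8)    # 8 frames, one per temporal segment
```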
2.2. Cross-Modal Fusion and Query-Aware Selection
For video question answering and captioning, LFS often conditions scoring on both video content and a user-specified query. Frame-Voyager (Yu et al., 2024) concatenates frame embeddings with query embeddings, processes them through frozen LLM transformer layers, and projects the fused features to a joint reward space. Each frame’s score is the cosine similarity of the projected query vector and the projected frame representation.
Similarly, (Hu et al., 27 Feb 2025) employs an alignment projector and a learnable score-query token, scoring all frames in a single LLM forward pass and applying greedy non-maximum suppression to enforce selection diversity.
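A minimal sketch of query-aware scoring followed by a greedy temporal suppression step; the cosine-similarity scoring mirrors the description above, while `min_gap` and the simple distance-based suppression rule are illustrative assumptions rather than the exact procedures of Frame-Voyager or (Hu et al., 27 Feb 2025):

```python
import torch
import torch.nn.functional as F

def query_aware_scores(frame_feats: torch.Tensor, query_feat: torch.Tensor) -> torch.Tensor:
    """Score each frame by cosine similarity to the projected query vector.

    frame_feats: (T, D) projected frame representations
    query_feat:  (D,)   projected query representation
    """
    return F.cosine_similarity(frame_feats, query_feat.unsqueeze(0), dim=-1)  # (T,)

def greedy_temporal_nms(scores: torch.Tensor, k: int, min_gap: int = 4) -> list:
    """Greedily keep the highest-scoring frames while suppressing close temporal neighbours."""
    order = scores.argsort(descending=True).tolist()
    kept = []
    for t in order:
        if all(abs(t - s) >= min_gap for s in kept):
            kept.append(t)
        if len(kept) == k:
            break
    return sorted(kept)

frames = torch.randn(128, 256)   # 128 frames projected to the joint space
query = torch.randn(256)         # projected query vector
picked = greedy_temporal_nms(query_aware_scores(frames, query), k=8)
```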
2.3. Sequential and Autoregressive Selection
AdaFrame (Wu et al., 2018) and ReFoCUS (Lee et al., 2 Jun 2025) formulate frame selection as a sequential decision process, using LSTM- or transformer-based policies. AdaFrame’s LSTM “agent,” augmented by a global memory of context, selects frames adaptively based on past hidden states and utility predictions; it is trained with policy gradient methods and uses utility-driven early stopping. ReFoCUS introduces an autoregressive, temporally conditioned policy trained with REINFORCE and margin-based rewards from a frozen teacher model, supporting non-myopic subset selection and temporal coherence.
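The sketch below illustrates the general shape of an autoregressive selection policy, where each pick is conditioned on the frames chosen so far; the GRU-based `AutoregressiveSelector` is a simplified stand-in for the LSTM and transformer policies used by AdaFrame and ReFoCUS:

```python
import torch
import torch.nn as nn

class AutoregressiveSelector(nn.Module):
    """Toy autoregressive policy: picks frames one at a time, conditioned on earlier picks."""
    def __init__(self, dim: int):
        super().__init__()
        self.rnn = nn.GRUCell(dim, dim)    # carries the history of selected frames
        self.score = nn.Linear(dim, 1)

    def forward(self, frame_emb: torch.Tensor, k: int):
        # frame_emb: (T, D) features of candidate frames
        h = frame_emb.mean(dim=0, keepdim=True)              # (1, D) global context as initial state
        mask = torch.zeros(frame_emb.shape[0], dtype=torch.bool)
        picks, log_probs = [], []
        for _ in range(k):
            logits = self.score(frame_emb + h).squeeze(-1)   # history-conditioned per-frame scores
            logits = logits.masked_fill(mask, float("-inf")) # never pick the same frame twice
            dist = torch.distributions.Categorical(logits=logits)
            t = dist.sample()
            log_probs.append(dist.log_prob(t))
            picks.append(int(t))
            mask[t] = True
            h = self.rnn(frame_emb[t].unsqueeze(0), h)       # fold the chosen frame into the state
        return sorted(picks), torch.stack(log_probs)
```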
2.4. Set-Level and Submodular Objectives
HFS (Yang et al., 12 Dec 2025) introduces a set-level selection objective, simultaneously optimizing for relevance, temporal coverage, and redundancy reduction. The approach uses continuous Gumbel-Softmax masks for differentiable subset selection and incorporates a submodular-inspired function of the form

$$\mathcal{F}(m) = \alpha\,\mathrm{Rel}(m) + \beta\,\mathrm{Cov}(m) - \gamma\,\mathrm{Red}(m),$$

where the relevance $\mathrm{Rel}(m)$, coverage $\mathrm{Cov}(m)$, and redundancy $\mathrm{Red}(m)$ terms are computed from the soft mask $m$, and the coefficients $\alpha$, $\beta$, $\gamma$ are tuned for the best accuracy/diversity balance.
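A hedged sketch of such a set-level objective, assuming illustrative definitions of the relevance, coverage, and redundancy terms and default coefficients; the exact terms and Gumbel-Softmax parameterization in HFS may differ:

```python
import torch

def set_level_objective(mask: torch.Tensor, rel: torch.Tensor, sim: torch.Tensor,
                        alpha: float = 1.0, beta: float = 0.5, gamma: float = 0.5) -> torch.Tensor:
    """Illustrative F(m) = alpha*Rel(m) + beta*Cov(m) - gamma*Red(m) over a soft mask m.

    mask: (T,)   soft selection weights in [0, 1] (e.g. from Gumbel-Softmax)
    rel:  (T,)   per-frame query relevance
    sim:  (T, T) pairwise frame similarity
    """
    relevance = (mask * rel).sum()
    # coverage: how well every frame is "explained" by its most similar selected frame
    coverage = (sim * mask.unsqueeze(0)).max(dim=1).values.mean()
    # redundancy: similarity mass between pairs of selected frames
    redundancy = (mask.unsqueeze(0) * mask.unsqueeze(1) * sim).sum() / (mask.sum() ** 2 + 1e-6)
    return alpha * relevance + beta * coverage - gamma * redundancy
```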
2.5. Self-Supervised and Reconstruction-Based Selection
FrameRS (Fu et al., 2023) implements a two-stage pipeline: an MAE-style video frame reconstructor (FrameMAE) is pre-trained in a self-supervised manner, and a subsequent MLP-based selector is trained as a classifier over all frame pairs to minimize reconstruction error. The objective is to retain only those frames sufficient for high-fidelity reconstruction, compressing videos while preserving semantic content.
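A rough sketch of how reconstruction error can be turned into selector supervision, assuming a hypothetical `reconstructor(kept_frames, keep_indices)` interface for the frozen FrameMAE model; the candidate enumeration and labeling details in FrameRS may differ:

```python
import torch

@torch.no_grad()
def label_best_keep_set(frames: torch.Tensor, reconstructor, candidates: list) -> list:
    """Among candidate keep-sets, return the one whose kept frames reconstruct the clip best.

    frames:        (T, C, H, W) video clip
    reconstructor: frozen MAE-style model (hypothetical interface: kept frames + indices -> full clip)
    candidates:    list of keep-index lists to compare (e.g. enumerated frame pairs)
    """
    errors = []
    for keep in candidates:
        recon = reconstructor(frames[keep], keep)            # reconstruct all T frames from the kept ones
        errors.append(torch.mean((recon - frames) ** 2))     # reconstruction MSE
    return candidates[int(torch.stack(errors).argmin())]     # becomes the selector's training target
```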
3. Training Strategies and Supervision
3.1. Static Pseudo Labeling
Several selectors generate pseudo-labels by prompting frozen LLMs or MLLMs. For example, in (Hu et al., 27 Feb 2025) spatial importance is derived by prompting a model to label frames as “useful,” while temporal label sets are constructed by obtaining frame captions and prompting a stronger LLM to select the most relevant frames. Pseudo-labels supervise the selector through binary cross-entropy between predicted and target scores.
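A minimal sketch of this supervision signal, assuming binary pseudo-labels over frames:

```python
import torch
import torch.nn.functional as F

def pseudo_label_loss(pred_logits: torch.Tensor, pseudo_labels: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy between predicted frame scores and MLLM-derived pseudo-labels.

    pred_logits:   (T,) raw selector scores
    pseudo_labels: (T,) 1.0 for frames the prompted model marked as useful/relevant, else 0.0
    """
    return F.binary_cross_entropy_with_logits(pred_logits, pseudo_labels)
```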
3.2. Downstream Task Feedback
Some modern selectors are trained end-to-end with feedback from the frozen downstream Video-LLM. In (Chao et al., 21 Jan 2026), the LFS backpropagates gradients from downstream caption cross-entropy loss (or its relative advantage compared to uniform sampling) into the temporal scoring module. This enables the selector to optimize selection directly for the downstream objective (e.g., caption informativeness).
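The sketch below illustrates one way such gradients can reach the scorer, using a soft selection weighting and a hypothetical `captioner(features, tokens)` interface that returns the caption cross-entropy; the relative-advantage variant would additionally compare this loss against one computed on uniformly sampled frames:

```python
import torch

def downstream_feedback_loss(frame_emb, scores, captioner, caption_tokens, k: int = 8):
    """Soft-select frames with the scorer's weights so the frozen captioner's loss reaches the scorer.

    frame_emb:      (T, D) frozen frame features
    scores:         (T,)   selector logits (requires grad)
    captioner:      frozen module (hypothetical interface: features + target tokens -> cross-entropy)
    caption_tokens: target caption token ids
    """
    weights = torch.softmax(scores, dim=0)
    topk = torch.topk(weights, k).indices
    selected = frame_emb[topk] * weights[topk].unsqueeze(-1)  # keep the gradient path through the weights
    return captioner(selected, caption_tokens)                # gradients update only the scoring module
```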
3.3. Reinforcement Learning
ReFoCUS (Lee et al., 2 Jun 2025) and AdaFrame (Wu et al., 2018) use reinforcement learning, formulating frame selection as a Markov decision process. Rewards are computed as either margin improvements (AdaFrame) or answer-margin rewards from a reference LMM (ReFoCUS). Policies are optimized via REINFORCE with entropy regularization.
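A minimal sketch of the resulting policy-gradient objective with a baseline and entropy regularization; the specific reward and baseline definitions vary between AdaFrame and ReFoCUS:

```python
import torch

def reinforce_loss(log_probs: torch.Tensor, entropy: torch.Tensor,
                   reward: float, baseline: float, ent_coef: float = 0.01) -> torch.Tensor:
    """REINFORCE objective with entropy regularization for a frame-selection policy.

    log_probs: (k,) log-probabilities of the sampled frame picks
    entropy:   scalar entropy of the selection distribution(s)
    reward:    e.g. answer margin of the reference model given the selected frames
    baseline:  reward of a reference selection (e.g. uniform sampling) for variance reduction
    """
    advantage = reward - baseline
    return -(advantage * log_probs.sum()) - ent_coef * entropy
```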
3.4. Mutual Student-Teacher Distillation
HFS (Yang et al., 12 Dec 2025) employs a joint-learning framework: a student selector (SLM) and a teacher video reasoner (MLLM) are trained with KL alignment between their inferred frame-importance distributions, in addition to standard cross-entropy loss and set-level objectives.
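A minimal sketch of the KL alignment term, assuming per-frame importance logits from both models and an illustrative temperature `tau`:

```python
import torch
import torch.nn.functional as F

def mutual_alignment_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
                          tau: float = 1.0) -> torch.Tensor:
    """KL divergence aligning the student's frame-importance distribution with the teacher's.

    student_logits, teacher_logits: (T,) per-frame importance logits
    """
    student_log_p = F.log_softmax(student_logits / tau, dim=0)
    teacher_p = F.softmax(teacher_logits / tau, dim=0)
    return F.kl_div(student_log_p, teacher_p, reduction="sum")
```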
4. Integration with Downstream Models
A defining feature of LFS is compatibility with frozen backbone models, ensuring plug-and-play operation in complex pipelines. The selected frames, or their fused representations, are provided to a downstream captioning/QA module, which remains unchanged (Chao et al., 21 Jan 2026, Yu et al., 2024). This decouples selection from language or vision backbone updates and allows for seamless system upgrades. Some approaches, such as FrameMiners for 3D robotic manipulation, select among coordinate frames rather than time frames, fusing the merits of multiple spatial normalizations (Liu et al., 2022).
5. Empirical Evaluation and Quantitative Impact
LFS achieves consistent improvements in both accuracy and efficiency across a broad spectrum of video benchmarks.
| Framework | Benchmark (metric) | Baseline | +LFS | Gain |
|---|---|---|---|---|
| Frame-Voyager (Yu et al., 2024) | Video-MME (acc. %) | 47.5 | 50.5 | +3.0 |
| LFS (Chao et al., 21 Jan 2026) | ICH-CC, Qwen3 (acc. %) | 71.23 | 75.05 | +3.82 |
| HFS (Yang et al., 12 Dec 2025) | Video-MME w/ subs (acc. %) | 53.1 | 59.7 | +6.6 |
| M-LLM-LFS (Hu et al., 27 Feb 2025) | NExT-QA (acc. %) | 77.6 | 78.4 | +0.8 |
| AdaFrame (Wu et al., 2018) | FCVID (mAP %) | 80.2 | 80.2 (with 1/3 the frames) | ±0.0 |
Frame selection also consistently reduces computational cost (e.g., an average of 8.2 frames processed per video versus 25 for uniform sampling on FCVID, with no mAP loss (Wu et al., 2018)), and leads to higher final returns in tasks such as robotic manipulation (Liu et al., 2022). Ablations confirm the importance of event awareness, stratification, set-level objectives, and teacher-student alignment.
6. Variants, Extensions, and Domain-Specific Adaptations
LFS mechanisms now span multiple domains:
- 3D Robotics: FrameMiners select among coordinate normalization frames (world, base, end-effector, target-part), fusing per-frame action experts with learned, input-dependent weights. Efficiency and success rates are markedly improved for dual-arm and mobile manipulation (Liu et al., 2022).
- Video Captioning: Integrating event- and diversity-aware LFS modules with frozen LLM captioners improves human-aligned reasoning benchmarks like ICH-CC (Chao et al., 21 Jan 2026).
- Video QA: LFS variants—including query-aware, mutual-learning, and set-level designs—substantially boost performance under strict token budgets, especially on long-form, multi-event videos (Yu et al., 2024, Hu et al., 27 Feb 2025, Yang et al., 12 Dec 2025).
7. Limitations and Future Directions
LFS models incur additional upfront computational overhead for per-frame feature extraction and scoring, though this is typically offset by the reduction in downstream processing cost. Reliance on pseudo-labels or reward models can introduce alignment errors or biases, especially for complex temporal queries (Hu et al., 27 Feb 2025, Lee et al., 2 Jun 2025). Recent methods address these challenges via end-to-end mutual learning (Yang et al., 12 Dec 2025) and direct caption-guided loss (Chao et al., 21 Jan 2026).
Potential extensions include joint fine-tuning with backbone models, incorporation of audio and motion cues, exploration of continuous selection (rather than hard Top-K), and dynamic stopping rules. Applying frame-selection principles to spatial, part-based, and multimodal (video+audio+text) selection remains an active area of research.
References:
- HFS: Holistic Query-Aware Frame Selection for Efficient Video Reasoning (Yang et al., 12 Dec 2025)
- Frame-Voyager: Learning to Query Frames for Video LLMs (Yu et al., 2024)
- LFS: Learnable Frame Selector for Event-Aware and Temporally Diverse Video Captioning (Chao et al., 21 Jan 2026)
- M-LLM Based Video Frame Selection for Efficient Video Understanding (Hu et al., 27 Feb 2025)
- AdaFrame: Adaptive Frame Selection for Fast Video Recognition (Wu et al., 2018)
- Frame Mining: a Free Lunch for Learning Robotic Manipulation from 3D Point Clouds (Liu et al., 2022)
- FrameRS: A Video Frame Compression Model Composed by Self supervised Video Frame Reconstructor and Key Frame Selector (Fu et al., 2023)
- ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding (Lee et al., 2 Jun 2025)