Keyframe & Frame Selection Memories
- Keyframe and frame selection memories are algorithmic constructs that extract a small, informative subset of frames from long video streams to support efficient downstream processing.
- They leverage multimodal cues, iterative refinement, and dynamic memory buffers to balance temporal coverage, semantic diversity, and computational efficiency.
- Recent approaches employ submodular optimization, differentiable relaxations, and adaptive token compression to dramatically reduce storage, latency, and redundant data.
Keyframe and frame selection memories are algorithmic constructs fundamental to video and sequential data processing, enabling efficient storage, representation, and reasoning over long temporal streams under computational and memory constraints. These mechanisms underpin systems for video understanding, video compression, robot mapping, and multimodal LLM (MLLM) inference by distilling high-dimensional temporal data into compact, query-relevant "memories," typically realized as sparse banks of keyframes, distributions over frame importances, or dynamically managed memory buffers. Recent advances crucially integrate multimodal, task-aware, and logic-driven selection criteria, temporal and semantic context propagation, and memory-style iterative refinement architectures, significantly enhancing both accuracy and efficiency across application domains (He et al., 9 Aug 2025, Yang et al., 12 Dec 2025, Guo et al., 17 Mar 2025, Dai et al., 22 Jan 2026, Thorne et al., 2024, Hu et al., 2024).
1. Problem Settings and Motivations
Keyframe and frame selection memories address the problem of extracting a small, highly informative subset of frames or representations from long temporal streams, such that downstream computational tasks (e.g. video question-answering, SLAM, editing, or planning) can be performed efficiently and accurately. In high-dimensional domains, such as long video (thousands of frames) or large-scale point cloud mapping, naively processing every frame runs into severe memory and latency bottlenecks. In the context of MLLMs, the number of visual tokens that can be processed is strictly capped, necessitating aggressive information pre-filtering (Tang et al., 28 Feb 2025, Fang et al., 30 May 2025).
The overarching goal is to select, at inference or pre-processing time, a subset of $K$ keyframes from a stream of $N$ frames (with $K \ll N$), or to produce continuous importance scores or pruning rates over all frames/tokens, such that a downstream model achieves task performance close to that obtained using the full sequence. Key challenges involve weak alignment between visual data and textual or downstream objectives, temporal clustering or redundancy among selected frames, loss of context continuity, and the need to preserve both spatiotemporal and semantic diversity (Zhang et al., 8 Feb 2025, Li et al., 7 Dec 2025).
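To make the selection objective concrete, the following minimal sketch (not any single paper's method) greedily picks $K$ of $N$ frames by trading query relevance against redundancy with already-selected frames. The embeddings, the trade-off weight `lam`, and the toy data are illustrative assumptions.

```python
import numpy as np

def select_keyframes(frame_feats, query_feat, k, lam=0.5):
    """Greedily pick k frames, balancing query relevance against redundancy.

    frame_feats: (N, d) L2-normalized frame embeddings.
    query_feat:  (d,) L2-normalized query embedding.
    lam:         weight on the redundancy penalty (hypothetical default).
    """
    relevance = frame_feats @ query_feat              # cosine similarity to query
    selected = []
    for _ in range(k):
        if selected:
            # penalize similarity to frames already in the memory bank
            redundancy = (frame_feats @ frame_feats[selected].T).max(axis=1)
        else:
            redundancy = np.zeros(len(frame_feats))
        scores = relevance - lam * redundancy
        scores[selected] = -np.inf                    # never re-pick a frame
        selected.append(int(np.argmax(scores)))
    return sorted(selected)

# toy usage: 100 frames with random 64-d embeddings
rng = np.random.default_rng(0)
F = rng.normal(size=(100, 64)); F /= np.linalg.norm(F, axis=1, keepdims=True)
q = rng.normal(size=64); q /= np.linalg.norm(q)
print(select_keyframes(F, q, k=8))
```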
2. Methodological Taxonomy
Keyframe/frame selection memories span a broad methodological taxonomy, with representative approaches summarized below.
2.1 Dual-Stream Multimodal Selection
VSI ("Visual Subtitle Integration") (He et al., 9 Aug 2025) fuses object-level visual search and subtitle-based textual search into a dual-stream architecture. The Video Search Stream scores frames via relevance-weighted object detections, while the Subtitle Match Stream propagates cosine similarity between query and subtitle segments temporally (via Gaussian kernels). Scores are fused to form a sampling distribution that iteratively concentrates selection on promising intervals, effectively encoding a memory of multimodal saliency. Spline-based interpolation maintains continuous confidence over all frames, with the selection process functioning as an external memory module guiding efficient exploration.
2.2 Explicit Score Memory and Iterative Refinement
Logic-in-Frames (VSLS) (Guo et al., 17 Mar 2025) models keyframe selection as a memory-driven iterative search, maintaining explicit per-frame score memories $S(f)$ and dynamic sampling distributions $P(f)$. Semantic-logical relations (spatial co-occurrence, temporal proximity, attribute and causal dependencies) update these memories, while score diffusion temporally propagates local context. This iterative, context-aware refinement leads to high recall and coverage, particularly for rare or causally dependent events.
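The iterative loop can be sketched as follows, with an assumed scoring oracle `evaluate` standing in for the semantic-logical relevance model; the round and batch counts are hypothetical.

```python
import numpy as np

def diffuse(scores, sigma=1.5):
    """Score diffusion: propagate per-frame evidence to temporal neighbors."""
    idx = np.arange(len(scores))
    k = np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / sigma) ** 2)
    return (k @ scores) / k.sum(axis=1)

def iterative_search(n_frames, evaluate, rounds=4, batch=8, seed=0):
    """Memory-driven iterative keyframe search (generic sketch, not the exact
    VSLS algorithm): sample from P(f), write evidence into S(f), diffuse."""
    rng = np.random.default_rng(seed)
    S = np.zeros(n_frames)                        # explicit per-frame score memory
    for _ in range(rounds):
        P = np.exp(S - S.max()); P /= P.sum()     # dynamic sampling distribution
        picks = rng.choice(n_frames, size=batch, replace=False, p=P)
        for f in picks:
            S[f] = max(S[f], evaluate(f))         # update memory with new evidence
        S = diffuse(S)                            # propagate local temporal context
    return np.argsort(-S)[:batch]

# toy oracle: frames near index 40 are relevant
print(iterative_search(100, lambda f: float(np.exp(-((f - 40) / 5.0) ** 2))))
```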
2.3 Set-Level and Token-Level Objectives
Holistic approaches, such as HFS ("Holistic Query-Aware Frame Selection") (Yang et al., 12 Dec 2025), optimize frame selection at the set level by differentiable surrogate objectives balancing query relevance, temporal coverage, and redundancy, enforced via Gumbel-Softmax sampling. A chain-of-thought SLM generates distributed query vectors, mutual learning aligns frame importances with an MLLM teacher, and a soft selection mask maintains continuity and expressiveness in the memory bank.
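A sketch of the set-level relaxation, using the common iterated Gumbel-Softmax top-k trick; the logits, temperature `tau`, and `k` are illustrative, and the relevance, coverage, and redundancy terms of the actual HFS objective are abstracted into the input logits.

```python
import numpy as np

def gumbel_topk_mask(logits, k, tau=0.5, seed=0):
    """Soft top-k frame selection via iterated Gumbel-Softmax: each round takes
    one soft pick, then hard-masks its argmax so the next round picks a new
    frame. A common relaxation pattern, not the exact HFS objective."""
    rng = np.random.default_rng(seed)
    y = (logits + rng.gumbel(size=logits.shape)) / tau
    mask = np.zeros_like(logits, dtype=float)
    avail = np.ones_like(logits, dtype=bool)
    for _ in range(k):
        z = np.where(avail, y, -np.inf)
        soft = np.exp(z - z.max()); soft /= soft.sum()   # one soft selection
        mask += soft
        avail[int(np.argmax(soft))] = False              # exclude it next round
    return np.clip(mask, 0.0, 1.0)

# toy usage: 20 frames, higher logits near the middle of the clip
logits = -0.1 * (np.arange(20) - 10.0) ** 2
print(gumbel_topk_mask(logits, k=4).round(2))
```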
2.4 Adaptive and Dynamic Token Compression
Recent methods, including KVTP (Liu et al., 13 Mar 2025) and DyToK (Li et al., 7 Dec 2025), generalize beyond binary keyframe selection by dynamically allocating per-frame token budgets based on query-conditioned importance scores. KVTP computes continuous relevance distributions using cross-attention fusion, maintaining both a rich keyframe memory and a pruned-frame memory to balance sparse event capture and contextual continuity. DyToK leverages deep Transformer attention weights in VLLMs to extract normalized keyframe priors, translating these into per-frame retention ratios for downstream token-pruning backbones, attaining state-of-the-art efficiency-accuracy trade-offs without retraining.
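The budget-allocation step shared by both methods can be sketched as below; the retention floor and the names are hypothetical, and the priors would in practice come from cross-attention fusion (KVTP) or VLLM attention weights (DyToK).

```python
import numpy as np

def allocate_token_budget(frame_priors, total_tokens, tokens_per_frame,
                          floor=0.1):
    """Map per-frame importance priors to per-frame token retention counts,
    keeping a minimum floor so low-priority frames retain some context."""
    p = np.asarray(frame_priors, dtype=float)
    p = p / p.sum()                                # normalize the priors
    ratios = floor + (1 - floor) * p / p.max()     # retention ratios in [floor, 1]
    keep = np.round(ratios * tokens_per_frame).astype(int)
    if keep.sum() > total_tokens:                  # rescale to fit the budget
        keep = np.floor(keep * total_tokens / keep.sum()).astype(int)
    return keep

# toy usage: five frames, attention-derived priors peaking at frame 2
print(allocate_token_budget([0.05, 0.30, 0.50, 0.10, 0.05],
                            total_tokens=400, tokens_per_frame=196))
```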
2.5 Statistical and Submodular Selection in Mapping
In LiDAR SLAM and 3D mapping, keyframe memories are managed at the point cloud level. Submodular optimization (Thorne et al., 2024) and Wasserstein-based distance metrics (Hu et al., 2024) ensure that selected scan/keyframe banks maximize spatial diversity and information for pose graph optimization, enabling aggressive memory reduction and real-time incremental updates.
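A generic sketch of greedy submodular keyframe banking with a facility-location objective (not the exact objective of Thorne et al., 2024), for which the greedy rule carries the classic (1 - 1/e) approximation guarantee for monotone submodular objectives.

```python
import numpy as np

def greedy_keyframe_bank(feats, k):
    """Greedy facility-location selection: pick k scans whose features best
    'cover' the whole trajectory in feature space."""
    sim = feats @ feats.T                     # pairwise similarity between scans
    cover = np.zeros(len(feats))              # current coverage of each scan
    selected = []
    for _ in range(k):
        gains = np.maximum(sim - cover[None, :], 0.0).sum(axis=1)
        gains[selected] = -np.inf             # no repeats
        best = int(np.argmax(gains))
        selected.append(best)
        cover = np.maximum(cover, sim[best])  # update coverage with new keyframe
    return selected

# toy usage: 200 unit-norm scan descriptors
rng = np.random.default_rng(1)
scans = rng.normal(size=(200, 32))
scans /= np.linalg.norm(scans, axis=1, keepdims=True)
print(greedy_keyframe_bank(scans, k=5))
```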
2.6 Memory Engines for Tracking and Prediction
Dynamic memory prediction approaches (Zhou et al., 30 Apr 2025) maintain explicitly managed memory banks (short-term and long-term, with selection based on feature similarity and reconstruction IoU), using these as reference sets for fine-grained video object tracking and segmentation.
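The bank-management pattern can be sketched as follows; the sizes and thresholds are hypothetical, and the bidirectional refinement network of the actual method is out of scope.

```python
import numpy as np

class DualMemoryBank:
    """Short-term/long-term memory banks for tracking (generic sketch of the
    dynamic-memory pattern; sizes and thresholds are hypothetical)."""

    def __init__(self, stm_size=5, ltm_size=20, sim_thresh=0.7, iou_thresh=0.8):
        self.stm, self.ltm = [], []
        self.stm_size, self.ltm_size = stm_size, ltm_size
        self.sim_thresh, self.iou_thresh = sim_thresh, iou_thresh

    def update(self, feat, iou):
        """Always push into the STM sliding window; promote reliable (high
        reconstruction IoU) and non-redundant frames into the LTM."""
        self.stm.append(feat)
        if len(self.stm) > self.stm_size:
            self.stm.pop(0)                          # evict oldest STM entry
        novel = all(float(feat @ m) < self.sim_thresh for m in self.ltm)
        if iou >= self.iou_thresh and novel:
            self.ltm.append(feat)
            if len(self.ltm) > self.ltm_size:
                self.ltm.pop(0)                      # evict oldest LTM entry

    def references(self):
        """Reference set used by the tracker at the current step."""
        return self.stm + self.ltm

# toy usage: stream 30 random unit features with random IoUs
bank, rng = DualMemoryBank(), np.random.default_rng(2)
for _ in range(30):
    f = rng.normal(size=16); f /= np.linalg.norm(f)
    bank.update(f, iou=float(rng.uniform(0.5, 1.0)))
print(len(bank.stm), len(bank.ltm))
```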
3. Mathematical and Algorithmic Foundations
The field leverages a variety of formalizations:
- Similarity and Relevance Scoring: Cosine similarity in shared visual-textual embedding spaces (e.g. CLIP/ViT) (Liang et al., 2024), object detection confidence aggregation (He et al., 9 Aug 2025), or matching via learned predictors (Liu et al., 13 Mar 2025).
- Combinatorial and Submodular Optimization: Integer quadratic programming for joint relevance-diversity maximization (Fang et al., 30 May 2025), greedy and streaming algorithms for submodular objectives (diversity, observability) (Thorne et al., 2024).
- Differentiable Relaxations: Gumbel-Softmax/TopK for set-level selection (Yang et al., 12 Dec 2025), fully differentiable keyframe selection via soft temporal distributions (Pertsch et al., 2019).
- Dynamic Memory Buffers: Sliding-window, slot-based, or hierarchical buffers, with explicit or score-driven eviction and update rules (Dai et al., 22 Jan 2026, Zhou et al., 30 Apr 2025).
- Temporal Smoothing and Context Propagation: Gaussian kernel propagation of textual scores (He et al., 9 Aug 2025), diffusion of per-frame confidence (Guo et al., 17 Mar 2025), momentum-based dynamic thresholds (Jha et al., 27 Oct 2025); the thresholding pattern is sketched after this list.
- Compression and Resource Allocation: Compression ratios are analytically derived (e.g., a ratio of $N/K$ when $N$ frames are reduced to $K$ keyframes (Liang et al., 2024)), with empirical trade-offs assessed in task performance and computational usage (Li et al., 7 Dec 2025, Liu et al., 13 Mar 2025).
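As an example of the thresholding item above, here is a minimal sketch of momentum-based dynamic thresholding in the spirit of (Jha et al., 27 Oct 2025); `beta`, `sensitivity`, and the warm start are assumptions.

```python
import numpy as np

def momentum_threshold_select(scores, beta=0.9, sensitivity=1.2):
    """Momentum-based dynamic thresholding (sketch of the pattern): keep a
    running mean of recent frame scores and select frames that exceed it."""
    thresh = float(scores[0])                     # warm-start on the first frame
    selected = []
    for i, s in enumerate(scores):
        if s > sensitivity * thresh:
            selected.append(i)
        thresh = beta * thresh + (1 - beta) * s   # momentum update of threshold
    return selected

# toy run: a burst of activity at frames 10-14 crosses the adaptive bar
scores = np.concatenate([np.full(10, 0.2), np.full(5, 0.9), np.full(10, 0.2)])
print(momentum_threshold_select(scores))          # -> [10, 11, 12, 13, 14]
```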
4. System Architectures and Memory Integration
Keyframe/frame selection memories are integrated into diverse system architectures:
- Plug-and-Play Pre-filtering: As in Adaptive Keyframe Sampling (Tang et al., 28 Feb 2025), wherein keyframe selection modules operate as front-ends, restricting visual token budgets for MLLMs; a minimal sketch of this pattern follows the list.
- Iterative and Multi-Stage Pipelines: VSI and VSLS maintain iterative loops, where memory arrays directly steer which portions of long videos are further processed in successive rounds.
- Hybrid Visual-Textual Narratives: Nar-KFC (Fang et al., 30 May 2025) interleaves sparse visual tokens (keyframes) with textual narratives (lightweight captions from non-keyframes), enhancing memory with temporal and semantic continuity for downstream reasoning.
- Token-Pruning Coupled with Keyframe Memory: KVTP's architecture, a "hybrid memory" (editor's term), augments token-level pruning with persistent full-resolution keyframes, ensuring that sparse event detection does not compromise global context (Liu et al., 13 Mar 2025).
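A minimal sketch of the plug-and-play pre-filtering pattern from the first item, assuming CLIP-style normalized embeddings; `tokens_per_frame` and the toy data are illustrative.

```python
import numpy as np

def prefilter_for_mllm(frames, frame_feats, query_feat, token_budget,
                       tokens_per_frame=196):
    """Plug-and-play pre-filter: keep only the most query-relevant frames that
    fit the MLLM's visual token budget, preserving temporal order."""
    k = max(1, token_budget // tokens_per_frame)   # how many frames fit
    relevance = frame_feats @ query_feat           # cosine similarity scores
    keep = np.sort(np.argsort(-relevance)[:k])     # top-k, restored to time order
    return [frames[i] for i in keep]

# toy usage: 120 frames; the query embedding happens to match frame 60
rng = np.random.default_rng(3)
feats = rng.normal(size=(120, 64))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)
sel = prefilter_for_mllm(list(range(120)), feats, feats[60], token_budget=1960)
print(sel)  # 10 frame indices, including 60
```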
Table: Summary of Memory Types Across Domains
| Approach | Memory Type | Key Mechanisms |
|---|---|---|
| VSI, VSLS | Score memories | Dual-stream scoring, iterative P(f) |
| HFS, AKS | Soft selection mask | Set-level objectives, Gumbel-TopK |
| KVTP, DyToK | Hybrid token memory | Dynamic retention, fusion heads |
| SLAM (Thorne et al., 2024) | Keyframe banks | Submodular/max-diversity selection |
| DMP (Zhou et al., 30 Apr 2025) | STM/LTM banks | Feature/IoU-based updates, bidirectional network |
5. Impact on Task Performance and Efficiency
Extensive experiments across video QA (Video-MME, LongVideoBench, MLVU, NExT-QA), 3D mapping, and object tracking consistently show that memory-driven, adaptive keyframe/frame selection yields:
- Dramatic memory and storage reduction through high compression ratios for text-driven selection (Liang et al., 2024).
- Reduced compute and latency (e.g., faster inference with DyToK while retaining 98.5% of baseline accuracy (Li et al., 7 Dec 2025), and substantially fewer sampling iterations during frame selection (He et al., 9 Aug 2025)).
- Robust, hyperparameter-free operation across datasets (Liang et al., 2024).
- Clear accuracy gains in long-video QA (4–10% absolute over uniform baselines; up to +20pp in keyframe localization (He et al., 9 Aug 2025)).
- Enhanced coverage and recall for rare or logic-dependent events (Guo et al., 17 Mar 2025, Fang et al., 30 May 2025).
- Superior performance in dynamic scenes when adaptive thresholds replace static policies (Jha et al., 27 Oct 2025).
6. Practical Recommendations and Open Problems
Empirical studies consistently highlight the importance of:
- Allocating memory dynamically in both the frame and token domains, tuned to semantic relevance and temporal coverage.
- Incorporating multimodal and logical cues, especially in MLLM-based QA and event-driven video understanding (He et al., 9 Aug 2025, Yang et al., 12 Dec 2025).
- Preserving context continuity, either by persisting low-rank information from non-keyframes or by interleaving textual narratives (Liu et al., 13 Mar 2025, Fang et al., 30 May 2025); the interleaving pattern is sketched after this list.
- Leveraging fully differentiable or plug-and-play modules to ease integration with downstream architectures and support end-to-end learning or train-free deployment.
- Tuning hyperparameters and thresholds (e.g., threshold decay, window size, and sensitivity parameters (Jha et al., 27 Oct 2025)) for dynamic environments.
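A minimal sketch of the narrative-interleaving pattern from the context-continuity item above (Nar-KFC-style); the caption source is assumed to be a cheap offline captioner, and the token representation is schematic.

```python
def build_interleaved_context(keyframe_tokens, captions):
    """Interleave sparse visual keyframes with lightweight captions from the
    frames in between (sketch of a hybrid visual-textual narrative).

    keyframe_tokens: dict mapping frame index -> visual tokens for keyframes.
    captions:        list of per-frame captions covering the whole clip."""
    context = []
    for i, cap in enumerate(captions):
        if i in keyframe_tokens:
            context.append(("frame", keyframe_tokens[i]))  # full visual tokens
        else:
            context.append(("text", cap))                  # textual stand-in
    return context

# toy usage: keyframes at indices 0 and 3, captions elsewhere
ctx = build_interleaved_context({0: "<vis0>", 3: "<vis3>"},
                                ["c0", "a dog runs", "it jumps", "c3"])
print(ctx)
```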
Remaining challenges include designing adaptive stopping criteria for variable memory budgets (Pertsch et al., 2019), richer perceptual losses for visual quality, and higher-level abstraction for long-horizon planning (Zhou et al., 30 Apr 2025). The increasing modularity and generality of recent approaches (e.g. token-prior based allocation via LLM attention (Li et al., 7 Dec 2025)) suggest further convergence between video, language, and mapping domains in principled memory management strategies.