Submodular Keyframe Selection Methods
- Submodular keyframe selection is a framework based on diminishing returns principles to ensure representative and diverse frame selection.
- It integrates relevance, coverage, and diversity objectives using greedy and streaming algorithms for efficient, resource-constrained optimization.
- Applications in video QA, LiDAR SLAM, and map summarization demonstrate reduced redundancy and improved performance over heuristic methods.
Submodular keyframe selection is a family of principled, algorithmically efficient approaches for selecting representative, informative, or diverse subsets of video frames (keyframes) or point cloud scans under resource constraints. The core idea is to cast the selection task as the maximization of a submodular set function, a class of objectives that exhibit diminishing returns and admit strong approximation guarantees under simple greedy search. Submodular formulations now structure state-of-the-art pipelines for long-video understanding with vision–language models, LiDAR-based SLAM systems, and large-scale map summarization, where they are reported to reduce redundancy and improve performance relative to heuristic sampling.
1. Formal Definition and Submodularity Principles
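A set function $f: 2^V \to \mathbb{R}$ over a ground set $V$ is submodular if for all $A \subseteq B \subseteq V$ and $v \in V \setminus B$, the marginal gain from adding $v$ is diminishing:
$$f(A \cup \{v\}) - f(A) \;\ge\; f(B \cup \{v\}) - f(B).$$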
Many objectives for keyframe selection, especially those based on coverage, facility location, representativeness, or diversity, are submodular. The properties that matter in practice are monotonicity (a superset never scores lower than its subset), non-negativity, and normalization ($f(\emptyset) = 0$). When they hold, greedy algorithms achieve a $(1 - 1/e)$ approximation ratio and sieve-streaming algorithms a $(1/2 - \epsilon)$ ratio for cardinality-constrained problems (Tang et al., 28 Feb 2025, Huang et al., 20 Mar 2026, Thorne et al., 2024).
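The diminishing-returns condition can be checked exhaustively on a tiny ground set. The following is a minimal sketch, not taken from the cited works: the toy `covers` mapping, the helper names, and the counterexample `f_sq` are all illustrative assumptions, and the brute-force check is only feasible for a handful of elements.

```python
from itertools import combinations

def coverage(selected, covers):
    """Count the distinct scene ids covered by the selected frames."""
    covered = set()
    for v in selected:
        covered |= covers[v]
    return len(covered)

def subsets(items):
    """Yield every subset of items as a frozenset."""
    items = list(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def is_submodular(f, ground):
    """Test f(A+v) - f(A) >= f(B+v) - f(B) for all A <= B and v not in B."""
    ground = frozenset(ground)
    for B in subsets(ground):
        for A in subsets(B):
            for v in ground - B:
                gain_a = f(A | {v}) - f(A)
                gain_b = f(B | {v}) - f(B)
                if gain_a < gain_b - 1e-12:
                    return False
    return True

# Hypothetical toy data: frame id -> set of scene ids visible in that frame.
covers = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c"}, 3: {"d", "a"}}

def f_cov(S):
    return coverage(S, covers)

def f_sq(S):
    # Deliberately non-submodular: gains grow with the size of the set.
    return len(S) ** 2

cov_ok = is_submodular(f_cov, covers)
sq_ok = is_submodular(f_sq, covers)
```

Here the coverage function passes the check while the squared-cardinality function fails it, since its marginal gains increase as the set grows.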
2. Canonical Objective Functions
The literature distinguishes between modular (additive) objectives for relevance and genuinely submodular objectives for coverage, representativeness, or information gain:
- Relevance: $f_{\text{rel}}(S) = \sum_{i \in S} r_i$, where $r_i$ quantifies frame–query similarity in embedding space (e.g., cosine similarity or an image–text matching (ITM) score). This term is modular, hence both submodular and supermodular (Tang et al., 28 Feb 2025, Huang et al., 20 Mar 2026).
- Coverage/Facility Location: $f_{\text{cov}}(S) = \sum_{i \in V} \max_{j \in S} s_{ij}$, where $s_{ij}$ denotes semantic similarity (e.g., via DINOv2 embeddings), assigning each candidate to its closest representative in $S$ (Huang et al., 20 Mar 2026).
- Temporal/Descriptor Diversity: Penalizing overpopulated temporal bins, or maximizing the minimum pairwise distance in descriptor space, enforces temporal spread and novelty among the selected frames (Tang et al., 28 Feb 2025, Thorne et al., 2024).
- Localization Sensitivity: In SLAM, scalar metrics of the scan-matching Hessian function as submodular surrogates for localization robustness (Thorne et al., 2024).
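Typical objectives are non-negative linear combinations of a relevance term $f_{\text{rel}}$ and a coverage term $f_{\text{cov}}$, e.g.,
$$F(S) = \lambda\, f_{\text{rel}}(S) + (1 - \lambda)\, f_{\text{cov}}(S),$$
exhibiting monotonicity and submodularity for $\lambda \in [0, 1]$ (Huang et al., 20 Mar 2026, Tang et al., 28 Feb 2025).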
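To make the trade-off between relevance and coverage concrete, here is a minimal sketch that evaluates a weighted objective on toy data. It assumes a convex combination with a single weight `lam`, non-negative scores, and hand-written `rel` and `sim` values; none of these names or numbers come from the cited papers.

```python
def relevance(selected, rel):
    """Modular term: sum of per-frame query relevance scores."""
    return sum(rel[i] for i in selected)

def facility_location(selected, sim):
    """Coverage term: each candidate is credited with its best representative."""
    if not selected:
        return 0.0
    return sum(max(sim[i][j] for j in selected) for i in range(len(sim)))

def combined(selected, rel, sim, lam):
    """Weighted objective lam * relevance + (1 - lam) * coverage."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return lam * relevance(selected, rel) + (1.0 - lam) * facility_location(selected, sim)

# Hypothetical toy data: frames 0 and 1 are near-duplicates that match the
# query well; frame 2 shows different content and matches the query poorly.
rel = [0.9, 0.8, 0.1]
sim = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

redundant_pair = {0, 1}
diverse_pair = {0, 2}
```

With `lam = 1.0` the redundant pair scores higher because only relevance counts; with `lam = 0.0` the diverse pair scores higher because it represents all three frames.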
3. Algorithms and Theoretical Guarantees
Greedy Maximization
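The classical greedy algorithm iteratively selects the element with the largest marginal gain until the cardinality or resource constraint is met. For a non-negative, monotone, submodular $f$ under the cardinality constraint $|S| \le k$, greedy attains a $(1 - 1/e)$ approximation, which is the best ratio achievable in polynomial time in general (Huang et al., 20 Mar 2026, Tang et al., 28 Feb 2025, Thorne et al., 2024):
$$f(S_{\text{greedy}}) \;\ge\; \left(1 - \frac{1}{e}\right) \max_{|S| \le k} f(S).$$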
Specialized Heuristics
- Adaptive Sampling (ADA): A tree-structured recursive approach that partitions the timeline, picking top-scoring frames or, when the relevance gap is below a threshold, enforcing coverage, at lower computational cost than greedy selection (Tang et al., 28 Feb 2025).
- Streaming and Sieve-Streaming: For massive or online settings, streaming approximations process each candidate in a single pass, achieving $(1/2 - \epsilon)$ guarantees for submodular summarization (Thorne et al., 2024).
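The single-pass idea can be illustrated with a compact sieve-streaming sketch. This follows the generic textbook scheme (one candidate set per guess of the optimal value) rather than any implementation from the cited works; the function name, the `SIM` matrix, and the `coverage` objective are assumptions for illustration, and the objective is assumed monotone with `f([]) == 0`.

```python
import math

def sieve_streaming(stream, f, k, eps=0.1):
    """One-pass sieve streaming for a monotone submodular f with f([]) == 0.

    Keeps one candidate set per guess (1 + eps)**i of the optimal value and
    returns the best candidate set once the stream ends.
    """
    sieves = {}        # exponent i -> list of selected elements
    max_single = 0.0   # largest singleton value seen so far
    for e in stream:
        max_single = max(max_single, f([e]))
        if max_single <= 0.0:
            continue
        # Guesses of the optimum live in [max_single, 2 * k * max_single].
        lo = math.ceil(math.log(max_single, 1 + eps))
        hi = math.floor(math.log(2 * k * max_single, 1 + eps))
        sieves = {i: sieves.get(i, []) for i in range(lo, hi + 1)}
        for i, chosen in sieves.items():
            if len(chosen) >= k:
                continue
            value = f(chosen)
            gain = f(chosen + [e]) - value
            # Accept e if its gain keeps this sieve on track to reach guess / 2.
            needed = ((1 + eps) ** i / 2 - value) / (k - len(chosen))
            if gain >= needed:
                chosen.append(e)
    return max(sieves.values(), key=f, default=[])

# Hypothetical similarity matrix: frames {0, 1} and {2, 3} form two clusters.
SIM = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.0],
    [0.1, 0.2, 1.0, 0.6],
    [0.0, 0.0, 0.6, 1.0],
]

def coverage(selected):
    """Facility-location coverage of all frames by the selected ones."""
    return sum(max((SIM[i][j] for j in selected), default=0.0) for i in range(len(SIM)))

summary = sieve_streaming(range(4), coverage, k=2, eps=0.1)
```

Each element is examined once and never revisited, so memory grows with the number of guesses and `k` rather than with the length of the stream.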
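Pseudocode (Greedy, (Huang et al., 20 Mar 2026)): initialize $S \leftarrow \emptyset$; while $|S| < k$, add $v^{*} = \arg\max_{v \in V \setminus S} \left[ f(S \cup \{v\}) - f(S) \right]$ to $S$; return $S$.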
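The greedy procedure described in this section can be sketched in a few lines of Python. This is a generic illustration under stated assumptions, not the cited authors' code: `greedy_select`, the `SIM` matrix, and the facility-location objective are hypothetical, and the naive loop re-evaluates the objective for every remaining candidate in every round.

```python
def greedy_select(f, ground, k):
    """Pick up to k elements, each time adding the largest marginal gain."""
    selected = []
    remaining = set(ground)
    while remaining and len(selected) < k:
        current = f(selected)
        # Ties are broken toward the smallest element id for determinism.
        best = max(sorted(remaining), key=lambda v: f(selected + [v]) - current)
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical similarity matrix: frames {0, 1} and {2, 3} form two clusters.
SIM = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.2, 0.0],
    [0.1, 0.2, 1.0, 0.6],
    [0.0, 0.0, 0.6, 1.0],
]

def facility_location(selected):
    """Sum over all frames of the similarity to the closest selected frame."""
    return sum(max((SIM[i][j] for j in selected), default=0.0) for i in range(len(SIM)))

picked = greedy_select(facility_location, range(len(SIM)), k=2)
```

On this toy input the first pick is the frame that best represents the whole set, and the second pick comes from the other cluster because a near-duplicate adds almost no coverage.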
4. Embedding Spaces, Feature Extraction, and Query Adaptation
Keyframe selection efficacy hinges on semantically meaningful frame- or scan-level embeddings:
- Vision–Language Embeddings: SigLIP and BLIP for visual–text similarity, with DINOv2 for semantic coverage (Huang et al., 20 Mar 2026).
- 3D Descriptor Spaces: LiDAR scans represented by compact, learnt descriptors (e.g., 256-D, trained for 3D-Jaccard) (Thorne et al., 2024).
Query adaptation is increasingly explicit. In long-video VLMs, relevance scores are defined with respect to the current question or prompt embedding, and the relevance/coverage balance $\lambda$ is set dynamically via question-type classification (e.g., using a transformer classifier that routes with 97.7% accuracy (Huang et al., 20 Mar 2026)).
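A rough sketch of query adaptation follows: relevance is scored as cosine similarity between frame and query embeddings, and the relevance weight is looked up from a preset table keyed by question type. The preset names and values, the default weight, and the function names are hypothetical placeholders, not values from the cited work, and the question-type classifier itself is not modeled.

```python
import math

# Hypothetical presets: question type -> relevance weight in [0, 1].
# These numbers are illustrative only and are not taken from any paper.
LAMBDA_PRESETS = {"single_fact": 0.8, "multi_fact": 0.4, "summary": 0.2}
DEFAULT_LAMBDA = 0.5

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def relevance_scores(frame_embeddings, query_embedding):
    """Score every frame against the current question embedding."""
    return [cosine(e, query_embedding) for e in frame_embeddings]

def pick_lambda(question_type):
    """Route a predicted question type to its preset balance."""
    return LAMBDA_PRESETS.get(question_type, DEFAULT_LAMBDA)

scores = relevance_scores([(1.0, 0.0), (0.0, 1.0)], (1.0, 0.0))
```

Falling back to a neutral default for unrecognized question types keeps the selector usable when the router is uncertain; whether that is the right fallback depends on the application.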
5. Empirical Results and Benchmarks
Reported results are consistent with the theoretical guarantees and show practical benefits across domains.
| Domain | Dataset/Task | Keyframe Budget or Reduction | Performance Impact | Reference |
|---|---|---|---|---|
| Video QA, MLLM | LVB, V-MME | 64 keyframes per video | +3–5 pp accuracy vs. uniform | (Tang et al., 28 Feb 2025) |
| VLM QA | MLVU | 10–100 keyframes (varied) | +3–8 pp vs. baselines | (Huang et al., 20 Mar 2026) |
| LiDAR SLAM | DLIOM, Mout-Water | −80% keyframes | RMSE Δ ≤ 0.02 m, −64% memory | (Thorne et al., 2024) |
Trends:
- Submodular approaches consistently outperform uniform or purely relevance-based sampling, especially under tight resource budgets.
- In VLM scenarios, accuracy gains are magnified on multi-fact or evidence-dispersed queries.
- SLAM pipelines achieve large reductions in storage and submap size with negligible localization accuracy loss.
6. Variants, Extensions, and Domain-Specific Challenges
Domain-specific variants alter the ground set (video frames, LiDAR scans, submaps), constraints (cardinality, memory, byte-budget), and submodular criteria (coverage, diversity, Hessian-based localization). Notable developments include:
- Online Keyframe Selection: Accepting frames if their embedding is sufficiently novel or their addition lifts a degeneracy metric (Thorne et al., 2024).
- Task-Adaptive Presets: Routing queries to preset relevance/coverage balances via question-type classifiers, shown to improve performance over fixed settings (Huang et al., 20 Mar 2026).
- Map Summarization: Streaming $k$-medoid summarizers for global map compaction in SLAM, guaranteeing a $(1/2 - \epsilon)$ approximation (Thorne et al., 2024).
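The novelty test used in online keyframe selection can be sketched as follows. This is a simplified illustration that assumes cosine distance between descriptors and a fixed threshold `min_novelty`; both are assumptions, and the degeneracy-based acceptance criterion mentioned above is not modeled.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; treats a zero vector as maximally distant."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0

def online_keyframes(descriptors, min_novelty=0.2):
    """Accept a scan when it is far enough from every keyframe kept so far."""
    keyframes = []
    for idx, desc in enumerate(descriptors):
        nearest = min(
            (cosine_distance(desc, descriptors[j]) for j in keyframes),
            default=float("inf"),
        )
        if nearest >= min_novelty:
            keyframes.append(idx)
    return keyframes

# Hypothetical 2-D descriptors: scans 0 and 1 look alike, as do scans 2 and 3.
toy_descriptors = [(1.0, 0.0), (0.99, 0.1), (0.0, 1.0), (0.1, 0.99)]
kept = online_keyframes(toy_descriptors, min_novelty=0.2)
```

Each incoming scan is compared only against the keyframes already accepted, so the decision is made immediately and earlier choices are never revised.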
A plausible implication is that further improvements in embedding quality, query adaptation, and streaming optimization may continue to raise effectiveness in memory- or compute-constrained environments.
7. Limitations and Open Challenges
Current approaches are constrained by computational and memory overheads for very large candidate pools, limitations in embedding discriminability, and the suboptimality margins inherent to greedy or streaming approximation. In dynamic or lifelong learning settings, handling concept drift and evolving criteria remains a challenge. Finally, while submodular objectives are robust, their expressivity may be limited for highly structured or temporally dependent phenomena, suggesting further research into structured or conditional submodularity.