
Submodular Keyframe Selection Methods

Updated 16 May 2026
  • Submodular keyframe selection is a framework based on diminishing returns principles to ensure representative and diverse frame selection.
  • It integrates relevance, coverage, and diversity objectives using greedy and streaming algorithms for efficient, resource-constrained optimization.
  • Applications in video QA, LiDAR SLAM, and map summarization demonstrate reduced redundancy and improved performance over heuristic methods.

Submodular keyframe selection is a family of principled, algorithmically efficient approaches for selecting representative, informative, or diverse subsets of video frames (keyframes) or point cloud scans under resource constraints. The core idea is to cast the selection task as the maximization of a submodular set function, a class of objectives that exhibit diminishing returns and admit strong theoretical guarantees under simple greedy search. Submodular formulations now structure state-of-the-art pipelines for long video understanding in vision-language models, LiDAR-based SLAM systems, and large-scale map summarization, consistently reducing redundancy and improving performance relative to heuristic sampling.

1. Formal Definition and Submodularity Principles

A set function $f: 2^V \to \mathbb{R}$ is submodular if for all $A \subseteq B \subseteq V$ and $x \notin B$, the marginal gain from adding $x$ is diminishing:

$$\Delta_f(x \mid A) = f(A\cup\{x\}) - f(A) \geq f(B\cup\{x\}) - f(B) = \Delta_f(x \mid B).$$

Many objectives for keyframe selection, especially those based on coverage, facility location, representativeness, or diversity, are submodular. Key practical properties include monotonicity, non-negativity, and normalization ($f(\emptyset)=0$), which greedy and streaming algorithms exploit to achieve $(1-1/e)$ and $(1/2-\epsilon)$ approximation ratios, respectively, for cardinality-constrained problems (Tang et al., 28 Feb 2025, Huang et al., 20 Mar 2026, Thorne et al., 2024).
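The diminishing-returns inequality can be checked by brute force on a small ground set. The following is a minimal sketch, assuming a toy coverage objective in which each frame "covers" a set of labels; the names `coverage`, `marginal_gain`, `is_submodular`, and the example data are illustrative and do not come from the cited papers.

```python
from itertools import combinations

def coverage(selected, covers):
    """Number of distinct labels covered by the selected frames."""
    covered = set()
    for i in selected:
        covered |= covers[i]
    return len(covered)

def marginal_gain(f, x, A):
    """Gain from adding element x to set A."""
    return f(A | {x}) - f(A)

def is_submodular(f, ground):
    """Exhaustively test gain(x | A) >= gain(x | B) for all A <= B, x not in B.

    Exponential in the ground-set size, so only suitable for toy examples.
    """
    ground = list(ground)
    subsets = [frozenset(c) for r in range(len(ground) + 1)
               for c in combinations(ground, r)]
    for B in subsets:
        for A in subsets:
            if not A <= B:
                continue
            for x in ground:
                if x in B:
                    continue
                if marginal_gain(f, x, A) < marginal_gain(f, x, B):
                    return False
    return True

# Hypothetical toy data: frame index -> labels visible in that frame.
covers = {0: {"a", "b"}, 1: {"b", "c"}, 2: {"c", "d"}, 3: {"a", "d"}}
f = lambda S: coverage(S, covers)
g = lambda S: len(S) ** 2  # a set function with increasing returns

print(is_submodular(f, covers.keys()))  # coverage passes the check
print(is_submodular(g, range(3)))       # the squared-size function fails it
```

In this toy data, frame 1 adds two new labels to the empty set but none once frames 0 and 2 are already selected, which is the diminishing-returns behaviour the definition describes.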

2. Canonical Objective Functions

The literature distinguishes between modular (additive) objectives for relevance and genuinely submodular objectives for coverage, representativeness, or information gain:

  • Relevance: $R(S)=\sum_{i\in S} s(Q,F_i)$, where $s(Q,F_i)$ quantifies frame–query similarity in embedding space (e.g., cosine similarity, ITM score). This term is modular, hence both submodular and supermodular (Tang et al., 28 Feb 2025, Huang et al., 20 Mar 2026).
  • Coverage/Facility Location: $C(S)=\sum_{j\in V}\max_{i\in S} \text{sim}(e_i,e_j)$, where $\text{sim}(e_i,e_j)$ denotes semantic similarity (e.g., via DINOv2 embeddings), assigning each candidate to its closest representative in $S$ (Huang et al., 20 Mar 2026).
  • Temporal/Descriptor Diversity: Penalizing overpopulating temporal bins or maximizing minimal pairwise distances in descriptor space is used to enforce distribution and novelty (Tang et al., 28 Feb 2025, Thorne et al., 2024).
  • Localization Sensitivity: In SLAM, metrics computed from the scan-matching Hessian serve as submodular surrogates for localization robustness (Thorne et al., 2024).

Typical objectives are non-negative linear combinations, e.g.,

$$F(S) = \lambda_R\, R(S) + \lambda_C\, C(S),$$

exhibiting monotonicity and submodularity for $\lambda_R, \lambda_C \geq 0$ (Huang et al., 20 Mar 2026, Tang et al., 28 Feb 2025).
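The relevance and facility-location terms, and a weighted sum of the two, can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than any paper's implementation: embeddings are short lists of floats, similarity is cosine similarity clamped at zero (so that both terms stay non-negative and monotone), and the weight names `w_rel` and `w_cov` are hypothetical.

```python
import math

def sim(u, v):
    """Cosine similarity clamped to [0, 1]; the clamp is an assumption of this sketch."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return max(0.0, dot / (norm_u * norm_v))

def relevance(S, query, frames):
    """Modular term: sum of frame-query similarities over the selected set."""
    return sum(sim(query, frames[i]) for i in S)

def facility_location(S, frames):
    """Coverage term: every frame j is credited with its best match inside S."""
    if not S:
        return 0.0
    return sum(max(sim(frames[i], frames[j]) for i in S)
               for j in range(len(frames)))

def combined(S, query, frames, w_rel=1.0, w_cov=1.0):
    """Non-negative weighted sum of the two terms (weights are hypothetical)."""
    return w_rel * relevance(S, query, frames) + w_cov * facility_location(S, frames)

# Hypothetical 2-D embeddings: two near-duplicate pairs of frames.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
query = [1.0, 0.0]

print(facility_location({0, 1}, frames))  # two near-duplicates cover poorly
print(facility_location({0, 2}, frames))  # one frame per cluster covers well
```

The comparison illustrates why the coverage term matters: frames 0 and 1 are both highly relevant to the query, but selecting both leaves the second cluster unrepresented.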

3. Algorithms and Theoretical Guarantees

Greedy Maximization

The classical greedy algorithm iteratively selects the element with the largest marginal gain until the cardinality or resource constraint is met. For a nonnegative, monotone, submodular $f$ under a cardinality constraint $|S| \leq k$, greedy offers the optimal $(1-1/e)$ approximation (Huang et al., 20 Mar 2026, Tang et al., 28 Feb 2025, Thorne et al., 2024):

$$f(S_{\text{greedy}}) \geq \left(1 - \frac{1}{e}\right) \max_{|S| \leq k} f(S).$$
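A minimal sketch of greedy maximization under a cardinality constraint follows. It assumes the objective is passed in as a Python callable on frozensets; the toy coverage objective and the function name `greedy_select` are illustrative, not taken from the cited work.

```python
def greedy_select(f, ground, k):
    """Pick up to k elements, each time adding the one with the largest marginal gain."""
    selected = []
    remaining = set(ground)
    current = f(frozenset())
    for _ in range(min(k, len(remaining))):
        best, best_gain = None, float("-inf")
        for x in sorted(remaining):  # sorted only for deterministic tie-breaking
            gain = f(frozenset(selected) | {x}) - current
            if gain > best_gain:
                best, best_gain = x, gain
        if best_gain <= 0:
            break  # for a monotone objective, no remaining element helps
        selected.append(best)
        remaining.remove(best)
        current += best_gain
    return selected

# Hypothetical toy data: frame index -> labels visible in that frame.
covers = {0: {"a", "b", "c"}, 1: {"c", "d"}, 2: {"d", "e", "f", "g"}, 3: {"a"}}

def f(S):
    """Coverage objective: number of distinct labels covered by S."""
    return len(set().union(*(covers[i] for i in S)))

picked = greedy_select(f, covers, k=2)
print(picked, f(frozenset(picked)))
```

Each round evaluates the objective once per remaining candidate, so this naive version makes on the order of $nk$ objective evaluations for $n$ candidates; lazy evaluation of marginal gains is a common refinement that is not shown here.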

Specialized Heuristics

  • Adaptive Sampling (ADA): A tree-structured recursive approach that partitions the timeline, picking top-scoring frames or enforcing coverage if the relevance gap is below a threshold, at a lower computational cost than greedy selection (Tang et al., 28 Feb 2025).
  • Streaming and Sieve-Streaming: For massive or online settings, streaming approximations process each candidate in a single pass, achieving $(1/2-\epsilon)$ guarantees for submodular summarization (Thorne et al., 2024).
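The single-pass idea can be illustrated with a sketch of the standard sieve-streaming scheme: it keeps one candidate set per guess of the optimal value, spaced geometrically by a factor of $(1+\epsilon)$, and admits an element into a set only if its marginal gain clears that set's threshold. The code below is a sketch under the assumption of a non-negative, monotone, submodular objective given as a callable; it is not the implementation used in the cited papers, and the toy data is hypothetical.

```python
import math

def sieve_streaming(stream, f, k, eps=0.1):
    """One pass over the stream, keeping one candidate set per threshold guess."""
    sieves = {}  # exponent i -> elements kept for the guess v = (1 + eps) ** i
    m = 0.0      # largest singleton value seen so far
    for e in stream:
        m = max(m, f(frozenset([e])))
        if m <= 0:
            continue
        lo = math.ceil(math.log(m, 1 + eps))
        hi = math.floor(math.log(2 * k * m, 1 + eps))
        # Keep sieves whose guess lies in [m, 2km]; create missing ones empty.
        sieves = {i: sieves.get(i, []) for i in range(lo, hi + 1)}
        for i, S in sieves.items():
            if len(S) >= k:
                continue
            v = (1 + eps) ** i
            cur = frozenset(S)
            gain = f(cur | {e}) - f(cur)
            # Admit e if it closes its share of the gap to v / 2.
            if gain >= (v / 2 - f(cur)) / (k - len(S)):
                S.append(e)
    return max(sieves.values(), key=lambda S: f(frozenset(S)), default=[])

# Hypothetical toy data: frame index -> labels visible in that frame.
covers = {0: {"a", "b", "c"}, 1: {"c", "d"}, 2: {"d", "e", "f", "g"}, 3: {"a"}}

def f(S):
    """Coverage objective: number of distinct labels covered by S."""
    return len(set().union(*(covers[i] for i in S)))

summary = sieve_streaming([0, 1, 2, 3], f, k=2)
print(summary, f(frozenset(summary)))
```

Memory grows with the number of active thresholds, roughly $\log(2k)/\log(1+\epsilon)$ sets of at most $k$ elements each, rather than with the length of the stream.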


4. Embedding Spaces, Feature Extraction, and Query Adaptation

Keyframe selection efficacy hinges on semantically meaningful frame- or scan-level embeddings.

Query adaptation is increasingly explicit. In long-video VLMs, relevance scores are computed against the current question or prompt embedding, and the relevance/coverage balance is set dynamically via question-type classification (e.g., using a transformer with 97.7% accuracy in routing (Huang et al., 20 Mar 2026)).
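Routing a question to a preset relevance/coverage balance can be pictured with the sketch below. Everything in it is hypothetical: the question types, the preset weights, and the keyword rules are placeholders, and the keyword function merely stands in for the learned question-type classifier described above.

```python
# Hypothetical presets: weight on relevance; coverage receives (1 - weight).
PRESETS = {"localized": 0.8, "dispersed": 0.3}
DEFAULT_WEIGHT = 0.5

def classify_question(question):
    """Toy keyword stand-in for a learned question-type classifier."""
    q = question.lower()
    if any(w in q for w in ("summarize", "overall", "how many", "throughout")):
        return "dispersed"
    if any(w in q for w in ("when", "what color", "who")):
        return "localized"
    return "unknown"

def relevance_weight(question):
    """Look up the preset balance for the predicted question type."""
    return PRESETS.get(classify_question(question), DEFAULT_WEIGHT)

def blended_gain(rel_gain, cov_gain, question):
    """Marginal gain of a candidate frame under the routed balance."""
    w = relevance_weight(question)
    return w * rel_gain + (1 - w) * cov_gain

print(relevance_weight("Summarize the video"))
print(relevance_weight("Who opens the door?"))
```

Because the blend is a non-negative combination of the two gains, swapping the preset changes which frames a greedy or streaming selector prefers without affecting the submodularity of the objective.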

5. Empirical Results and Benchmarks

Empirical studies corroborate theoretical guarantees and practical benefits of submodular keyframe selection.

| Domain | Dataset/Task | Keyframe Budget / Reduction | Performance Impact | Reference |
|---|---|---|---|---|
| Video QA, MLLM | LVB, V-MME | 64 / video | +3–5 pp accuracy vs. uniform | (Tang et al., 28 Feb 2025) |
| VLM QA | MLVU | 10–100 (varied) | +3–8 pp vs. baselines | (Huang et al., 20 Mar 2026) |
| LiDAR SLAM | DLIOM, Mout-Water | -80% keyframes | RMSE Δ ≤ 0.02 m, -64% memory | (Thorne et al., 2024) |

Trends:

  • Submodular approaches consistently outperform uniform or purely relevance-based sampling, especially under tight resource budgets.
  • In VLM scenarios, accuracy gains are magnified on multi-fact or evidence-dispersed queries.
  • SLAM pipelines achieve large reductions in storage and submap size with negligible localization accuracy loss.

6. Variants, Extensions, and Domain-Specific Challenges

Domain-specific variants alter the ground set (video frames, LiDAR scans, submaps), constraints (cardinality, memory, byte-budget), and submodular criteria (coverage, diversity, Hessian-based localization). Notable developments include:

  • Online Keyframe Selection: Accepting frames if their embedding is sufficiently novel or their addition lifts a degeneracy metric (Thorne et al., 2024).
  • Task-Adaptive Presets: Routing queries to preset relevance/coverage balances via question-type classifiers, shown to improve performance over fixed settings (Huang et al., 20 Mar 2026).
  • Map Summarization: Streaming $k$-medoid summarizers for global map compaction in SLAM, guaranteeing a $(1/2-\epsilon)$ approximation (Thorne et al., 2024).
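The novelty test behind online keyframe selection can be sketched as a simple distance threshold. This is a minimal illustration that assumes Euclidean distance between embeddings and a hypothetical threshold `tau`; the degeneracy-metric criterion mentioned above is not modelled here.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two embeddings of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def select_online(stream, tau):
    """Accept a frame when it lies farther than tau from every stored keyframe."""
    keyframes = []  # list of (index, embedding) pairs accepted so far
    for idx, emb in enumerate(stream):
        if not keyframes or min(euclidean(emb, kf) for _, kf in keyframes) > tau:
            keyframes.append((idx, emb))
    return [idx for idx, _ in keyframes]

# Hypothetical 2-D embeddings arriving in order.
stream = [[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [1.05, 0.0], [0.0, 1.0]]
print(select_online(stream, tau=0.5))  # → [0, 2, 4]
```

A rule of this kind decides on each frame as it arrives and never revisits earlier choices, which keeps it cheap but means it carries no approximation guarantee of its own.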

A plausible implication is that further improvements in embedding quality, query adaptation, and streaming optimization would extend these gains in memory- or compute-constrained environments.

7. Limitations and Open Challenges

Current approaches are constrained by computational and memory overheads for very large candidate pools, limitations in embedding discriminability, and the suboptimality margins inherent to greedy or streaming approximation. In dynamic or lifelong learning settings, handling concept drift and evolving criteria remains a challenge. Finally, while submodular objectives are robust, their expressivity may be limited for highly structured or temporally dependent phenomena, suggesting further research into structured or conditional submodularity.
