Submodular Keyframe Selection Methods

Updated 16 May 2026

Submodular keyframe selection is a framework based on diminishing returns principles to ensure representative and diverse frame selection.
It integrates relevance, coverage, and diversity objectives using greedy and streaming algorithms for efficient, resource-constrained optimization.
Applications in video QA, LiDAR SLAM, and map summarization demonstrate reduced redundancy and improved performance over heuristic methods.

Submodular keyframe selection is a family of principled, algorithmically efficient approaches for selecting representative, informative, or diverse subsets of video frames (keyframes) or point cloud scans under resource constraints. The core idea is to cast the selection task as the maximization of a submodular set function, a class of objectives that exhibit diminishing returns and admit strong approximation guarantees under simple greedy search. Submodular formulations now structure state-of-the-art pipelines for long-video understanding in vision-language models, LiDAR-based SLAM systems, and large-scale map summarization, consistently reducing redundancy and improving performance relative to heuristic sampling.

1. Formal Definition and Submodularity Principles

A set function $f:2^V\to \mathbb{R}$ is submodular if for all $A\subseteq B\subseteq V$ and $x\notin B$, the marginal gain from adding $x$ is diminishing:

$\Delta_f(x|A) = f(A\cup\{x\}) - f(A) \geq f(B\cup\{x\}) - f(B) = \Delta_f(x|B).$

Many objectives for keyframe selection, especially those based on coverage, facility location, representativeness, or diversity, are submodular. When such an objective is also normalized ($f(\emptyset)=0$), non-negative, and monotone, greedy algorithms achieve a $(1-1/e)$ approximation ratio and single-pass streaming algorithms a $(1/2-\epsilon)$ ratio for cardinality-constrained maximization (Tang et al., 28 Feb 2025, Huang et al., 20 Mar 2026, Thorne et al., 2024).
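The diminishing-returns inequality can be checked by brute force on a small ground set. The sketch below is illustrative only: the toy `covers` mapping (which scene concepts each frame shows) and all function names are assumptions, not taken from the cited papers.

```python
from itertools import combinations

def coverage(selected, covers):
    """Number of distinct concepts covered by the selected frames."""
    covered = set()
    for frame in selected:
        covered |= covers[frame]
    return len(covered)

def marginal_gain(f, x, subset):
    """Delta_f(x | subset) = f(subset + {x}) - f(subset)."""
    return f(subset | {x}) - f(subset)

def is_submodular(f, ground_set):
    """Exhaustively test the inequality for all A <= B <= V and x not in B."""
    items = sorted(ground_set)
    subsets = [frozenset(c) for r in range(len(items) + 1)
               for c in combinations(items, r)]
    for b in subsets:
        for a in subsets:
            if not a <= b:
                continue
            for x in ground_set - b:
                if marginal_gain(f, x, a) < marginal_gain(f, x, b):
                    return False
    return True

# Hypothetical toy data: each frame "covers" a set of scene concepts.
covers = {0: {"car", "road"}, 1: {"road", "tree"}, 2: {"tree"}, 3: {"person"}}
V = frozenset(covers)
f = lambda s: coverage(s, covers)   # a coverage function: submodular
g = lambda s: len(s) ** 2           # a convex function of |S|: not submodular
```

Here `is_submodular(f, V)` holds, while `g` fails the test because its marginal gains grow with the size of the set. The exhaustive check is exponential in the ground-set size and is only meant for building intuition on tiny examples.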

2. Canonical Objective Functions

The literature distinguishes between modular (additive) objectives for relevance and genuinely submodular objectives for coverage, representativeness, or information gain:

Relevance: $R(S)=\sum_{i\in S} s(Q,F_i)$ , where $s(Q,F_i)$ quantifies frame–query similarity in embedding space (e.g., cosine similarity, ITM score). This term is modular, hence both submodular and supermodular (Tang et al., 28 Feb 2025, Huang et al., 20 Mar 2026).
Coverage/Facility Location: $C(S)=\sum_{j\in V}\max_{i\in S} \text{sim}(e_i,e_j)$, where $\text{sim}(e_i,e_j)$ denotes semantic similarity (e.g., via DINOv2 embeddings), assigning each candidate to its closest representative in $S$ (Huang et al., 20 Mar 2026).
Temporal/Descriptor Diversity: Penalizing overpopulating temporal bins or maximizing minimal pairwise distances in descriptor space is used to enforce distribution and novelty (Tang et al., 28 Feb 2025, Thorne et al., 2024).
Localization Sensitivity: In SLAM, scalar metrics of the scan-matching Hessian function as submodular surrogates for localization robustness (Thorne et al., 2024).

Typical objectives are non-negative linear combinations, e.g.,

$f(S) = R(S) + \lambda\, C(S),$

exhibiting monotonicity and submodularity for $\lambda \geq 0$ (Huang et al., 20 Mar 2026, Tang et al., 28 Feb 2025).
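A minimal sketch of the relevance and facility-location terms, and of a non-negative weighted combination of the two, might look as follows. The weight name `lam`, the clipped cosine similarity `sim`, and the two-dimensional toy embeddings are assumptions made for illustration; clipping keeps every term non-negative so that the combined objective stays monotone.

```python
import math

def sim(u, v):
    """Cosine similarity clipped to [0, 1] so that every term is non-negative."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return max(0.0, dot / norm) if norm else 0.0

def relevance(selected, query, frames):
    """Modular term R(S): sum of frame-query similarities."""
    return sum(sim(query, frames[i]) for i in selected)

def coverage(selected, frames):
    """Facility-location term C(S): each frame is credited to its most
    similar selected representative."""
    if not selected:
        return 0.0
    return sum(max(sim(frames[i], e_j) for i in selected) for e_j in frames)

def objective(selected, query, frames, lam=1.0):
    """Relevance plus lam times coverage, with lam >= 0 (assumed weight)."""
    return relevance(selected, query, frames) + lam * coverage(selected, frames)

# Hypothetical 2-D embeddings: frames 0 and 1 are near-duplicates.
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query = [1.0, 0.0]
```

On this toy data, `coverage([0, 2], frames)` exceeds `coverage([0, 1], frames)` because frames 0 and 1 are near-duplicates, which is the redundancy penalty the facility-location term is meant to provide.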

3. Algorithms and Theoretical Guarantees

Greedy Maximization

The classical greedy algorithm iteratively selects the element with the largest marginal gain until the cardinality or resource constraint is met. For a non-negative, monotone, submodular $f$ under the cardinality constraint $|S|\leq k$, greedy offers the optimal $(1-1/e)$ approximation (Huang et al., 20 Mar 2026, Tang et al., 28 Feb 2025, Thorne et al., 2024):

$f(S_{\text{greedy}}) \geq (1-1/e)\,\max_{|S|\leq k} f(S).$
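The greedy procedure described above can be sketched in a few lines. This is a generic illustration under a cardinality budget `k`, not the pseudocode of any cited paper; the function name `greedy_select` and the toy coverage data are assumptions.

```python
def greedy_select(candidates, f, k):
    """Pick up to k elements, each time adding the one with the largest
    marginal gain f(S + [x]) - f(S). Ties go to the smallest candidate."""
    selected = []
    remaining = sorted(candidates)
    current = f(selected)
    while remaining and len(selected) < k:
        gains = [(f(selected + [x]) - current, x) for x in remaining]
        best_gain, best = max(gains, key=lambda g: g[0])
        selected.append(best)
        remaining.remove(best)
        current += best_gain
    return selected

# Hypothetical toy data: each frame covers a set of concepts.
covers = {0: {"a", "b", "c"}, 1: {"c", "d"}, 2: {"d", "e", "f", "g"}, 3: {"a"}}

def f(selected):
    """Coverage objective: number of distinct concepts covered."""
    covered = set()
    for frame in selected:
        covered |= covers[frame]
    return len(covered)

chosen = greedy_select(covers, f, k=2)
```

Each round evaluates the objective once per remaining candidate, so the sketch makes on the order of `k` times the number of candidates objective evaluations. Lazy evaluation of marginal gains is a common speed-up but is omitted here for clarity.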

Specialized Heuristics

Adaptive Sampling (ADA): A tree-structured recursive approach that partitions the timeline, picking top-scoring frames or enforcing coverage if the relevance gap is below a threshold, with lower computational complexity than greedy selection (Tang et al., 28 Feb 2025).
Streaming and Sieve-Streaming: For massive or online settings, streaming approximations process each candidate in a single pass, achieving $(1/2-\epsilon)$ guarantees for submodular summarization (Thorne et al., 2024).
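A single-pass selection in the style of sieve-streaming can be sketched as follows. It keeps one candidate set per geometrically spaced guess `v` of the optimal value and admits an element into a set only if its marginal gain is large enough to reach `v / 2` within the remaining budget. This is a simplified illustration, not the cited authors' implementation; the function name, the `eps` default, and the toy data are assumptions.

```python
import math

def sieve_streaming(stream, f, k, eps=0.1):
    """One-pass sketch for a monotone submodular f under |S| <= k."""
    max_single = 0.0   # largest singleton value seen so far
    sieves = {}        # exponent i -> set kept for threshold v = (1 + eps) ** i
    for x in stream:
        max_single = max(max_single, f([x]))
        if max_single <= 0:
            continue
        lo = math.ceil(math.log(max_single, 1 + eps))
        hi = math.floor(math.log(2 * k * max_single, 1 + eps))
        # Keep sieves whose threshold is still in range; start new ones empty.
        sieves = {i: sieves.get(i, []) for i in range(lo, hi + 1)}
        for i, chosen in sieves.items():
            if len(chosen) >= k:
                continue
            v = (1 + eps) ** i
            current = f(chosen)
            gain = f(chosen + [x]) - current
            if gain >= (v / 2 - current) / (k - len(chosen)):
                chosen.append(x)
    return max(sieves.values(), key=f, default=[])

# Hypothetical toy data: each frame covers a set of concepts.
covers = {0: {"a", "b", "c"}, 1: {"c", "d"}, 2: {"d", "e", "f", "g"}, 3: {"a"}}

def f(selected):
    """Coverage objective: number of distinct concepts covered."""
    covered = set()
    for frame in selected:
        covered |= covers[frame]
    return len(covered)

summary = sieve_streaming([0, 1, 2, 3], f, k=2)
```

Memory grows with the number of active thresholds, roughly `log(2k) / eps` sets of at most `k` elements each, rather than with the length of the stream.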


4. Embedding Spaces, Feature Extraction, and Query Adaptation

Keyframe selection efficacy hinges on semantically meaningful frame- or scan-level embeddings:

Vision–Language Embeddings: SigLIP and BLIP for visual–text similarity, with DINOv2 for semantic coverage (Huang et al., 20 Mar 2026).
3D Descriptor Spaces: LiDAR scans represented by compact, learnt descriptors (e.g., 256-D, trained for 3D-Jaccard) (Thorne et al., 2024).

Query adaptation is increasingly explicit. In long-video VLMs, relevance scores are computed with respect to the current question or prompt embedding, and the relevance/coverage balance $\lambda$ is set dynamically via question-type classification (e.g., using a transformer classifier with 97.7% routing accuracy (Huang et al., 20 Mar 2026)).
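The routing idea can be illustrated with a toy stand-in. In the sketch below the learned question-type classifier is replaced by keyword rules, and the preset names and weight values are entirely hypothetical; only the overall pattern (classify the question, then look up a relevance/coverage weight) reflects the description above.

```python
# Hypothetical presets: a larger weight favours coverage over relevance.
PRESETS = {"localized": 0.2, "global": 2.0}
DEFAULT_WEIGHT = 1.0

GLOBAL_CUES = ("summarize", "overall", "throughout", "how many")
LOCAL_CUES = ("when", "where", "what color", "who")

def classify_question(question):
    """Toy keyword rules standing in for a learned question-type classifier."""
    text = question.lower()
    if any(cue in text for cue in GLOBAL_CUES):
        return "global"
    if any(cue in text for cue in LOCAL_CUES):
        return "localized"
    return "unknown"

def coverage_weight(question):
    """Map a question to a preset weight, falling back to a default."""
    return PRESETS.get(classify_question(question), DEFAULT_WEIGHT)
```

For example, `coverage_weight("Summarize the video")` returns the coverage-heavy preset, while `coverage_weight("Where is the cat?")` returns the relevance-heavy one. The returned weight would then be passed to whatever combined objective the selector maximizes.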

5. Empirical Results and Benchmarks

Empirical studies corroborate theoretical guarantees and practical benefits of submodular keyframe selection.

Domain	Dataset/Task	Keyframe Budget or Reduction	Performance Impact	Reference
Video QA, MLLM	LVB, V-MME	64 frames per video	+3–5 pp accuracy vs. uniform	(Tang et al., 28 Feb 2025)
VLM QA	MLVU	10–100 frames (varied)	+3–8 pp vs. baselines	(Huang et al., 20 Mar 2026)
LiDAR SLAM	DLIOM, Mout-Water	-80% keyframes	RMSE Δ ≤ 0.02 m, -64% memory	(Thorne et al., 2024)

Trends:

Submodular approaches consistently outperform uniform or purely relevance-based sampling, especially under tight resource budgets.
In VLM scenarios, accuracy gains are magnified on multi-fact or evidence-dispersed queries.
SLAM pipelines achieve large reductions in storage and submap size with negligible localization accuracy loss.

6. Variants, Extensions, and Domain-Specific Challenges

Domain-specific variants alter the ground set (video frames, LiDAR scans, submaps), constraints (cardinality, memory, byte-budget), and submodular criteria (coverage, diversity, Hessian-based localization). Notable developments include:

Online Keyframe Selection: Accepting frames if their embedding is sufficiently novel or their addition lifts a degeneracy metric (Thorne et al., 2024).
Task-Adaptive Presets: Routing queries to preset relevance/coverage balances via question-type classifiers, shown to improve performance over fixed settings (Huang et al., 20 Mar 2026).
Map Summarization: Streaming $k$-medoid summarizers for global map compaction in SLAM, guaranteeing a $(1/2-\epsilon)$ approximation (Thorne et al., 2024).
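The novelty-based acceptance rule for online keyframe selection can be sketched as a simple distance test in descriptor space. This is an illustration only: the Euclidean metric, the `min_distance` threshold, and the toy descriptors are assumptions, and the degeneracy-metric criterion mentioned above is not modelled.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two descriptors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def select_online(descriptors, min_distance):
    """Process scans in arrival order and keep scan i only if its descriptor
    is at least min_distance away from every keyframe kept so far."""
    kept = []
    for i, d in enumerate(descriptors):
        if all(euclidean(d, descriptors[j]) >= min_distance for j in kept):
            kept.append(i)
    return kept

# Hypothetical 2-D descriptors along a straight trajectory.
scans = [[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [1.05, 0.0], [2.0, 0.0]]
keyframes = select_online(scans, min_distance=0.5)
```

Scans 1 and 3 are rejected because each lies within `0.5` of an already accepted keyframe, so the kept indices are 0, 2, and 4. The decision for each scan depends only on previously kept keyframes, which is what makes the rule usable online.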

A plausible implication is that better embeddings, finer query adaptation, and more efficient streaming optimization may further improve selection quality in memory- or compute-constrained environments.

7. Limitations and Open Challenges

Current approaches are constrained by computational and memory overheads for very large candidate pools, limitations in embedding discriminability, and the suboptimality margins inherent to greedy or streaming approximation. In dynamic or lifelong learning settings, handling concept drift and evolving criteria remains a challenge. Finally, while submodular objectives are robust, their expressivity may be limited for highly structured or temporally dependent phenomena, suggesting further research into structured or conditional submodularity.


References

Tang et al. (2025). Adaptive Keyframe Sampling for Long Video Understanding.

Huang et al. (2026). Adaptive Greedy Frame Selection for Long Video Understanding.

Thorne et al. (2024). Submodular Optimization for Keyframe Selection & Usage in SLAM.
