Large Model Sequential Keyframe Extraction
- Large Model-based Sequential Keyframe Extraction (LMSKE) is an approach that selects representative video frames while preserving semantic content and temporal order through deep feature clustering and redundancy elimination.
- It integrates models like CLIP and TransNetV2 to perform adaptive clustering, optimizing keyframe relevance, diversity, and computational efficiency under strict token budgets.
- Empirical evaluations show that LMSKE and its variants enhance keyframe selection accuracy and narrative recovery, leading to improved video summarization and multimodal analysis.
Large Model-based Sequential Keyframe Extraction (LMSKE) refers to a class of algorithms that use large pre-trained deep neural models to select a minimal yet representative ordered subset of video frames (keyframes), preserving both semantic and temporal structure for summarization or downstream multimodal LLM analysis. LMSKE is motivated by the need to compress videos (often containing tens of thousands of frames) into tractable frame sets compatible with the token budgets of LLMs and retrieval systems, without losing critical information or narrative coherence.
1. Foundational Principles and Motivation
Keyframe extraction seeks a mapping from a video sequence $V = \{f_1, \dots, f_N\}$ to a reduced ordered set $K = \{f_{t_1}, \dots, f_{t_m}\}$ with $m \ll N$, maintaining maximal coverage of visual and semantic content. Sequential extraction, a defining requirement in LMSKE, ensures that the order of $K$ matches the original video chronology, supporting temporal downstream tasks such as video browsing, scene retrieval, and narrative-based understanding (Tan et al., 10 Jan 2024).
Traditional keyframe selection methods—uniform subsampling, histogram comparison, or simple clustering—do not incorporate large-scale visual representation learning, semantic context, or joint relevance/diversity optimization. The proliferation of multimodal LLMs (MLLMs) and vision-LLMs has shifted the field toward sequential extraction leveraging deep feature embeddings, adaptive selection criteria, and optimization under strict token or compute budgets (Zhu et al., 31 Oct 2025, Tang et al., 28 Feb 2025, Fang et al., 30 May 2025).
2. Large Model Integration: Workflow and Algorithms
LMSKE (Canonical Workflow, (Tan et al., 10 Jan 2024))
LMSKE comprises three main stages:
Stage I. Shot Segmentation & Feature Extraction
- TransNetV2 (a temporal convolutional network with residual connections) segments the input video into shots $S_1, \dots, S_M$; each shot $S_i$ consists of $n_i$ consecutive frames.
- The CLIP visual encoder maps every frame to a 768-dimensional embedding after standard ImageNet-style normalization (see the embedding sketch below).
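A minimal sketch of the per-frame embedding step, assuming the Hugging Face `transformers` CLIP ViT-L/14 checkpoint (whose image features are 768-dimensional); shot boundaries from TransNetV2 are taken as given, and batching and device placement are simplified.

```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint: ViT-L/14 yields 768-dim image embeddings.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")
model.eval()

@torch.no_grad()
def embed_frames(frames: list[Image.Image], batch_size: int = 32) -> np.ndarray:
    """Encode a list of PIL frames into L2-normalized CLIP image embeddings."""
    feats = []
    for i in range(0, len(frames), batch_size):
        inputs = processor(images=frames[i:i + batch_size], return_tensors="pt")
        emb = model.get_image_features(**inputs)        # (B, 768)
        emb = emb / emb.norm(dim=-1, keepdim=True)      # normalize for cosine similarity
        feats.append(emb.cpu().numpy())
    return np.concatenate(feats, axis=0)
```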
Stage II. Adaptive Clustering for Candidate Keyframe Generation
- For each shot $S_i$ with $n_i$ frames, an upper bound on the number of clusters is fixed per shot.
- Iterative SSE minimization: at each iteration, the frame feature that is not already a center and whose addition as a new center minimizes $\mathrm{SSE} = \sum_{k} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2$ (with $\mu_k$ the centroid of cluster $C_k$) is selected.
- The clustering partition is refined using the mean silhouette coefficient (SC), maximizing $\mathrm{SC} = \frac{1}{n_i} \sum_{x} \frac{b(x) - a(x)}{\max\{a(x), b(x)\}}$, where $a(x)$ is the mean intra-cluster distance of $x$ and $b(x)$ is its mean distance to the nearest neighboring cluster.
- Within each finalized cluster, the candidate keyframe is the frame closest to the centroid (a simplified sketch of this stage follows this list).
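The sketch below is a simplified stand-in for Stage II, assuming scikit-learn's KMeans and silhouette_score in place of the paper's iterative SSE procedure; the function and variable names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def candidate_keyframes(features: np.ndarray, max_clusters: int) -> list[int]:
    """Cluster one shot's frame embeddings and return the indices of the frames
    nearest to each centroid, in temporal order.

    Simplified stand-in for LMSKE Stage II: sweep the cluster count and keep
    the partition with the highest mean silhouette coefficient.
    """
    n = len(features)
    best_k, best_sc, best_km = 1, -1.0, None
    for k in range(2, min(max_clusters, n - 1) + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)
        sc = silhouette_score(features, km.labels_)      # mean silhouette coefficient
        if sc > best_sc:
            best_k, best_sc, best_km = k, sc, km
    if best_km is None:                                  # degenerate shot (too few frames)
        return [0]
    picks = []
    for c in range(best_k):
        members = np.where(best_km.labels_ == c)[0]
        dists = np.linalg.norm(features[members] - best_km.cluster_centers_[c], axis=1)
        picks.append(int(members[np.argmin(dists)]))     # frame closest to centroid
    return sorted(picks)
```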
Stage III. Redundancy Elimination and Sequential Concatenation
- Candidate frames first undergo HSV histogram-based blank-frame detection: a frame whose histogram has fewer than 10 nonzero bins is treated as blank and discarded.
- Pairwise redundancy elimination then uses histogram or cosine similarity, iteratively pruning until no remaining pair exceeds the similarity threshold (0.8).
- The final per-shot keyframes are concatenated in shot order, $K = K_1 \oplus K_2 \oplus \cdots \oplus K_M$, preserving the original chronology (see the pruning sketch after this list).
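A minimal sketch of the pairwise pruning step on CLIP embeddings, using the 0.8 cosine-similarity threshold stated above; the HSV blank-frame check is omitted, and the policy of dropping the later frame of the most similar pair is a simplifying assumption.

```python
import numpy as np

def prune_redundant(indices: list[int], feats: np.ndarray, thresh: float = 0.8) -> list[int]:
    """Iteratively drop the later frame of the most similar pair until every
    remaining pair has cosine similarity below `thresh`.

    `indices` are candidate keyframe positions and `feats` their L2-normalized
    embeddings (same order). Returns the surviving indices in temporal order.
    """
    keep = list(range(len(indices)))
    while len(keep) > 1:
        sub = feats[keep]                      # (m, d), already normalized
        sim = sub @ sub.T                      # cosine similarity matrix
        np.fill_diagonal(sim, -np.inf)         # ignore self-similarity
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        if sim[i, j] < thresh:                 # no redundant pair remains
            break
        keep.pop(max(i, j))                    # discard the later of the two frames
    return [indices[k] for k in keep]
```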
Extensions in the MLLM Era
- FOCUS (Zhu et al., 31 Oct 2025): Casts selection as a pure-exploration combinatorial multi-armed bandit, identifying query-relevant temporal regions without trainable parameters and using empirical means with Bernstein-style confidence bounds for PAC identification guarantees (see the sketch after the table below).
- AKS (Tang et al., 28 Feb 2025): Jointly optimizes relevance (frame-query matching) and coverage (temporal dispersal of selected frames) via recursive "judge-and-split" adaptive selection.
- Nar-KFC (Fang et al., 30 May 2025): Formulates selection as integer quadratic programming that maximizes query relevance and frame diversity over a $k$-node subgraph, solved with efficient greedy heuristics, and augments the selected frames with temporally ordered, interleaved textual narratives generated by lightweight captioners.
| LMSKE Algorithm | Large Models Utilized | Optimization Focus |
|---|---|---|
| LMSKE (Tan et al., 10 Jan 2024) | TransNetV2, CLIP | SSE+SC clustering, redundancy pruning |
| FOCUS (Zhu et al., 31 Oct 2025) | BLIP-ITM, MLLMs | Bandit exploration, PAC guarantees |
| AKS (Tang et al., 28 Feb 2025) | BLIP/CLIP, MLLMs | Relevance + coverage via recursive splitting |
| Nar-KFC (Fang et al., 30 May 2025) | CLIP, Qwen2-VL, captioners | IQP+diversity+caption threading |
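FOCUS's full routine is more involved; the sketch below only illustrates the empirical-Bernstein upper confidence bound that pure-exploration bandits of this kind use to decide which temporal regions merit further frame scoring. The function name, the `delta` default, and the exploration comment are illustrative assumptions, not the paper's exact rule.

```python
import math
import numpy as np

def empirical_bernstein_ucb(scores: np.ndarray, delta: float = 0.05,
                            value_range: float = 1.0) -> float:
    """Upper confidence bound on the mean relevance score of a temporal region.

    Uses one common form of the empirical Bernstein inequality (Maurer-Pontil):
    with probability at least 1 - delta, the true mean lies below this value.
    `scores` are the query-frame relevance scores observed so far for the
    region, assumed to lie in [0, value_range].
    """
    n = len(scores)
    if n < 2:
        return float("inf")                    # unexplored regions rank first
    mean = float(np.mean(scores))
    var = float(np.var(scores, ddof=1))        # sample variance
    log_term = math.log(2.0 / delta)
    bound = (math.sqrt(2.0 * var * log_term / n)
             + 7.0 * value_range * log_term / (3.0 * (n - 1)))
    return mean + bound

# Exploration sketch: repeatedly score one more frame from the region whose UCB
# is largest, so the limited scoring budget concentrates on plausibly
# query-relevant spans of the video.
```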
3. Adaptive Clustering and Greedy Optimization
Adaptive selection under large models is characterized by dynamic cluster sizing, iterative merging/splitting, and joint objectives:
- LMSKE’s SSE minimization combined with silhouette merges allows clusters to self-organize around semantic or scene boundaries; selection is always local to the nearest centroid, preserving semantic compactness.
- AKS and Nar-KFC extend selection to relevance-aware and diversity-aware combinatorial optimization. AKS introduces recursive binary splitting based on relevance-gain thresholds, guaranteeing temporal coverage. Nar-KFC combines pairwise frame diversity and query relevance in a quadratic objective, using a low-rank greedy method that keeps selection tractable even when the candidate frame pool is large.
A key insight is that combining semantic, temporal, and diversity signals yields superior coverage compared with pure relevance or uniform baselines: AKS and Nar-KFC empirically outperform both on long-video QA tasks (Tang et al., 28 Feb 2025, Fang et al., 30 May 2025).
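A minimal greedy sketch of such a joint relevance-diversity objective follows. Nar-KFC itself solves an integer quadratic program with a low-rank greedy heuristic, so the marginal-gain surrogate, the weight `lam`, and all names here are illustrative assumptions.

```python
import numpy as np

def greedy_relevance_diversity(frame_feats: np.ndarray, query_feat: np.ndarray,
                               k: int, lam: float = 0.5) -> list[int]:
    """Greedily pick k frames maximizing query relevance plus pairwise diversity.

    Per-step gain: relevance(f) + lam * distance to the closest already-picked
    frame (a marginal-gain surrogate for a quadratic diversity term).
    All features are assumed L2-normalized; `lam` is an illustrative weight.
    """
    assert 1 <= k <= len(frame_feats)
    relevance = frame_feats @ query_feat                # cosine relevance, shape (N,)
    selected: list[int] = [int(np.argmax(relevance))]   # seed with the most relevant frame
    for _ in range(k - 1):
        sims = frame_feats @ frame_feats[selected].T    # (N, |selected|)
        diversity = 1.0 - sims.max(axis=1)              # distance to closest picked frame
        gain = relevance + lam * diversity
        gain[selected] = -np.inf                        # never re-pick a frame
        selected.append(int(np.argmax(gain)))
    return sorted(selected)                             # restore temporal order
```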
4. Keyframe Redundancy Management and Narrative Recovery
Sequential LMSKE pipelines aggressively prune redundant or non-informative frames to maximize compression:
- LMSKE relies on histogram checks and iterative similarity pruning: frames with insufficient HSV content, or frames exceeding the similarity threshold (0.8) with respect to an already-kept frame, are discarded.
- Nar-KFC compensates for the lost temporal continuity by inserting succinct captions between keyframes, generated by lightweight captioners. This threading restores narrative information lost during visual token reduction, yielding temporally aligned multimodal content (see the sketch after the next paragraph).
The effectiveness of narrative threading is evidenced by ablations showing +2.9 to +3 percentage point gains in QA accuracy upon interleaving captions with selected frames (Fang et al., 30 May 2025).
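A minimal sketch of narrative threading in the Nar-KFC style: selected keyframes and captions of the skipped spans are merged into one temporally ordered sequence. The data structure and `caption_fn` are placeholders, not the paper's interface.

```python
from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class NarrativeItem:
    time_s: float                      # timestamp in seconds
    kind: str                          # "frame" or "caption"
    payload: Union[int, str]           # keyframe index or caption text

def thread_narrative(keyframe_times: list[float],
                     caption_fn: Callable[[float, float], str]) -> list[NarrativeItem]:
    """Interleave selected keyframes with short captions describing the spans
    between consecutive keyframes, preserving chronological order.

    `caption_fn(t0, t1)` stands in for a lightweight captioner that summarizes
    the skipped span (t0, t1); any captioning model could be plugged in.
    """
    times = sorted(keyframe_times)
    items: list[NarrativeItem] = []
    for i, t in enumerate(times):
        items.append(NarrativeItem(time_s=t, kind="frame", payload=i))
        if i + 1 < len(times):          # caption the gap to the next keyframe
            items.append(NarrativeItem(time_s=(t + times[i + 1]) / 2,
                                       kind="caption",
                                       payload=caption_fn(t, times[i + 1])))
    return items
```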
5. Integration with Multimodal LLMs
Modern MLLM architectures impose strict token budgets, often limiting visual input to 32–64 frames. LMSKE algorithms serve as front-end selectors, ensuring that only the requisite visual tokens are presented to downstream transformers or VL adapters:
- FOCUS and AKS provide plug-and-play modules, decoupling selection from MLLM inference (Zhu et al., 31 Oct 2025, Tang et al., 28 Feb 2025).
- LMSKE outputs (the selected keyframes) are encoded by visual transformers (e.g., SigLIP, ViT-Qformer) and concatenated with text prompts for input to the MLLM.
Empirical results confirm that all major LMSKE variants (LMSKE, FOCUS, AKS, Nar-KFC) deliver 2–12 percentage point accuracy improvements over uniform and top-$k$ scoring baselines on QA benchmarks (LongVideoBench, Video-MME, MLVU), with modest run-time overhead (<2% of frames scored, total wall time <6 GPU-hours for hour-long videos) (Zhu et al., 31 Oct 2025, Fang et al., 30 May 2025, Tan et al., 10 Jan 2024).
6. Evaluation Protocols and Performance Metrics
Evaluation involves ground-truth shot-aware keyframe annotation:
- TVSum20 (Tan et al., 10 Jan 2024) dataset: Experts annotate importance scores every 2 s, providing reference keyframes.
- Metrics:
- Precision, recall, and F1 score versus ground-truth keyframes: $P = \frac{|K \cap G|}{|K|}$, $R = \frac{|K \cap G|}{|G|}$, $F_1 = \frac{2PR}{P + R}$, where $K$ is the selected keyframe set and $G$ the ground-truth set.
- Fidelity (average maximum embedding similarity): $\mathrm{Fid} = \frac{1}{|G|} \sum_{g \in G} \max_{k \in K} \operatorname{sim}(g, k)$.
- Compression ratio: $\mathrm{CR} = 1 - \frac{|K|}{N}$, where $N$ is the total number of frames (a sketch computing these metrics follows this list).
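Under the definitions above, a minimal sketch of the three metrics; exact-index matching and L2-normalized features are simplifying assumptions (benchmark protocols may instead allow a temporal matching tolerance).

```python
import numpy as np

def keyframe_metrics(pred: set[int], gt: set[int],
                     pred_feats: np.ndarray, gt_feats: np.ndarray,
                     n_frames: int) -> dict[str, float]:
    """Precision/recall/F1 over matched keyframes, fidelity as the mean of the
    maximum embedding similarity between each ground-truth keyframe and the
    selected set, and compression ratio CR = 1 - |pred| / n_frames.

    Matching here is exact frame-index intersection, a simplifying assumption.
    """
    matched = len(pred & gt)
    precision = matched / max(len(pred), 1)
    recall = matched / max(len(gt), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    # Fidelity: for each ground-truth keyframe, the best cosine similarity to
    # any selected keyframe (features assumed L2-normalized), averaged over G.
    fidelity = float((gt_feats @ pred_feats.T).max(axis=1).mean())
    cr = 1.0 - len(pred) / n_frames
    return {"precision": precision, "recall": recall, "f1": f1,
            "fidelity": fidelity, "cr": cr}
```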
LMSKE (Tan et al., 10 Jan 2024) achieves an F1 of 0.5311 (+2.77% over INCEPTION), fidelity of 0.8141 (+2.97%), and CR of 0.9922 (+0.14%), all statistically significant under paired t-tests. FOCUS (Zhu et al., 31 Oct 2025) demonstrates up to +11.9% QA accuracy gains on videos longer than 20 minutes, while Nar-KFC reports up to +10.1 pp improvement over uniform sampling, especially when diversity and narrative features are combined.
7. Implementation Details, Complexity, and Practical Guidance
LMSKE:
- Hyperparameters: the per-shot upper bound on cluster count, the number of HSV histogram bins, and a similarity threshold of 0.8.
- Complexity: shot segmentation (TransNetV2) and CLIP feature extraction scale with the total number of frames, while adaptive clustering and redundancy elimination scale with the per-shot frame and candidate-keyframe counts, respectively.
- Run-time: 2 minutes for 18K frames (10 min video) on NVIDIA V100, dominated by feature extraction (Tan et al., 10 Jan 2024).
- FOCUS:
- Bandit hyperparameters (confidence and sampling-budget settings) are fixed per benchmark; in total, under 2% of all frames are scored (Zhu et al., 31 Oct 2025).
- GPU time: 5.5 hours (LongVideoBench).
- AKS, Nar-KFC:
- AKS: the recursive split depth and relevance threshold are tailored per task, with candidate frames downsampled to $0.25$–$1$ fps (Tang et al., 28 Feb 2025).
- Nar-KFC: greedy selection is computationally lightweight, and narrative captioning and refinement add about 1 s of overhead per movie-length video (Fang et al., 30 May 2025).
All reported variants are run as training-free, plug-and-play selection modules, requiring no retraining of downstream MLLMs, and are compatible with zero-shot or frozen-parameter inference protocols.
LMSKE defines a principled, computationally efficient approach to sequential video keyframe selection, combining deep feature extraction, adaptive clustering/optimization, redundancy control, and narrative recovery. It yields empirically validated improvements for both video summarization and long video understanding in the LLM setting, laying the groundwork for future scalable, semantically-rich video analysis workflows (Tan et al., 10 Jan 2024, Zhu et al., 31 Oct 2025, Tang et al., 28 Feb 2025, Fang et al., 30 May 2025).