KFS-Bench: Keyframe Sampling Benchmark
- KFS-Bench is a benchmark that assesses keyframe sampling methods for long video QA by using explicit scene-level ground-truth annotations.
- It compares diverse sampling techniques—uniform, clustering-based, similarity-driven, and adaptive methods—using metrics like KFR, SHR, and UKSS.
- The benchmark introduces an adaptive sampling strategy that optimizes frame selection based on semantic alignment and coverage, boosting QA performance.
KFS-Bench is a benchmark specifically devised to enable direct, systematic evaluation of key frame sampling strategies within the context of long video question answering (QA). Unlike prior approaches, which assessed frame selection quality only indirectly through downstream QA accuracy, KFS-Bench introduces explicit ground-truth annotations at the scene level, supporting robust measurement of how well sampling methods capture the essential, answer-relevant content distributed across temporally and semantically disjoint video regions. This focus addresses a critical bottleneck for efficient and accurate multimodal LLMs (MLLMs), where judicious frame selection can dramatically improve both the accuracy and the computational economy of long video understanding (Li et al., 16 Dec 2025).
1. Dataset Design and Multi-Scene Annotation Protocol
KFS-Bench integrates two principal source datasets: LongVideoBench (validation set, originally 1,337 video–question pairs) and VideoMME (900 pairs sampled from 2,700, balanced across 11 task types with “Information Synopsis” excluded). After discarding pairs with annotation difficulties or insufficient quality, KFS-Bench contains 1,291 LongVideoBench pairs and 888 VideoMME pairs, totalling 2,179 video–question pairs.
Each video is uniformly decoded at 1 frame per second (fps), with durations spanning a few seconds up to approximately one hour. For each question, expert annotators associate one or more temporally disjoint segments (1 s granularity) necessary to answer the question, grouped by scene ID. Every ground-truth scene $s$ is annotated with a set of segments

$$\mathcal{G}_s = \{(t_j^{\mathrm{start}},\, t_j^{\mathrm{end}})\}_{j=1}^{n_s},$$

where the non-overlapping segments collectively define the content required for that scene; scene IDs signal that the union of these segments is needed for completeness.
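For concreteness, one annotated pair might be represented as follows; the field names and values are illustrative assumptions, not the benchmark's released schema.

```python
# Hypothetical annotation record for one video-question pair.
# Field names and values are illustrative; the released schema may differ.
annotation = {
    "video_id": "lvb_000123",                 # source video identifier
    "question": "What does the chef add after the onions?",
    "fps": 1,                                 # frames decoded at 1 fps
    "scenes": {
        # scene_id -> disjoint (start_s, end_s) segments at 1 s granularity;
        # the union of a scene's segments is required to cover that scene.
        0: [(12, 18), (44, 47)],
        1: [(130, 141)],
    },
}
```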
2. Sampling Methods and Comparative Baselines
KFS-Bench supports comprehensive evaluation across diverse key frame sampling methods:
- Uniform Sampling: Selects frames at equidistant intervals throughout the video.
- Random Sampling: Selects frames uniformly at random; at 1 fps decoding this is often functionally close to uniform sampling.
- Clustering-Based Sampling (K-means): Clusters frame features into $k$ clusters and samples one frame per cluster to maximize visual diversity, with no temporal constraints.
- Top-k Similarity (Goldfish): Ranks frames by semantic similarity to the input question and selects the top-$k$ most similar.
- Adaptive Keyframe Sampling (AKS, Tang et al. 2025): Applies a heuristic threshold to balance content coverage and relevance.
- Inverse Transform Sampling (ITS, Liu et al. 2025): Constructs a question-guided cumulative distribution function (CDF) over frames and samples via inverse transform, parameterized by an exponent (Li et al., 16 Dec 2025); a minimal sketch of this sampling style follows the list.
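To make the inverse-transform approach concrete, here is a minimal sketch assuming precomputed question–frame similarity scores; the exponent `gamma` and the equal-quantile selection are assumptions about the general technique, not the paper's exact formulation.

```python
import numpy as np

def inverse_transform_sample(similarities: np.ndarray, budget: int,
                             gamma: float = 1.0, eps: float = 1e-8) -> np.ndarray:
    """Sample `budget` frame indices from a question-guided CDF.

    `similarities` holds one question-frame similarity per decoded frame.
    `gamma` sharpens (>1) or flattens (<1) the distribution; its exact
    role in the paper's formulation is an assumption.
    """
    weights = np.clip(similarities, 0, None) ** gamma + eps
    cdf = np.cumsum(weights) / weights.sum()
    # Evaluate the inverse CDF at evenly spaced quantiles for coverage.
    quantiles = (np.arange(budget) + 0.5) / budget
    indices = np.searchsorted(cdf, quantiles)
    return np.unique(indices)  # duplicates collapse when mass concentrates
```

Evaluating the inverse CDF at evenly spaced quantiles, rather than drawing i.i.d. samples, keeps selection deterministic and spreads frames across the distribution's mass.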
Evaluation metrics include Key Frame Rate (KFR; the fraction of sampled frames that land inside annotated ground-truth segments, i.e., precision), Scene Hit Rate (SHR; the fraction of annotated scenes covered by at least one sampled frame), and controlled experiments that vary the target frame-distribution vector via a Dirichlet process with duration-weighted interpolation. A sketch of the two frame-level metrics follows.
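Under these definitions, KFR and SHR reduce to simple set computations over the scene annotations; the sketch below assumes 1 fps timestamps and inclusive segment bounds.

```python
def kfr_and_shr(sampled: list[int], scenes: dict[int, list[tuple[int, int]]]):
    """Compute KFR and SHR as sketched above.

    sampled: sampled frame timestamps in seconds (1 fps decoding).
    scenes:  scene_id -> disjoint (start, end) segments, inclusive bounds.
    """
    def in_segments(t: int, segs: list[tuple[int, int]]) -> bool:
        return any(start <= t <= end for start, end in segs)

    # KFR: fraction of sampled frames landing in any annotated segment.
    in_gt = [t for t in sampled
             if any(in_segments(t, segs) for segs in scenes.values())]
    kfr = len(in_gt) / max(len(sampled), 1)

    # SHR: fraction of scenes covered by at least one sampled frame.
    covered = sum(any(in_segments(t, segs) for t in sampled)
                  for segs in scenes.values())
    shr = covered / max(len(scenes), 1)
    return kfr, shr
```

With the hypothetical record above, `kfr_and_shr([14, 100, 135], annotation["scenes"])` returns (≈0.67, 1.0): two of three frames fall inside annotated segments, and both scenes are hit.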
3. Unified Keyframe Sampling Score (UKSS)
Recognizing the need to jointly evaluate precision, coverage, and distribution balance, KFS-Bench introduces a composite metric built from three parts:
- Balanced Scene Recall (BSR): imposes a minimum quota of sampled frames per annotated scene and computes the fraction of scene quotas that are met, penalizing samplers that concentrate frames on a single scene.
- Balanced Distribution Similarity (BDS): compares the empirical per-scene frame distribution against a set of ideal target distributions and keeps the best match under cosine similarity.
- Unified Keyframe Sampling Score (UKSS): combines BSR and BDS with precision (KFR) into a single score.
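The paper's exact formulas are not reproduced in this summary, so the following is only a minimal sketch of how the three components could fit together, assuming a fixed per-scene quota, a single target distribution, and an unweighted geometric mean:

```python
import numpy as np

def ukss(sampled_per_scene: np.ndarray, target_dist: np.ndarray,
         kfr: float, quota: int = 1) -> float:
    """Illustrative UKSS combining balanced recall, distribution
    similarity, and precision. The quota, the single target
    distribution, and the geometric-mean combination are assumptions.
    """
    # Balanced Scene Recall: how fully each scene's quota is met.
    bsr = float(np.mean(np.minimum(sampled_per_scene, quota) / quota))
    # Balanced Distribution Similarity: cosine similarity between the
    # empirical per-scene allocation and an ideal target vector (the
    # paper selects the best match among several targets).
    empirical = sampled_per_scene / max(sampled_per_scene.sum(), 1)
    bds = float(empirical @ target_dist /
                (np.linalg.norm(empirical) * np.linalg.norm(target_dist) + 1e-8))
    # Unweighted geometric mean (assumption; the paper may weight terms).
    return (bsr * bds * kfr) ** (1 / 3)
```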
Correlation analysis across 400 configurations (varying sampling methods, datasets, budgets, and encoders) found that UKSS's Spearman's $\rho$ with QA accuracy typically lies between $0.53$ and $0.89$, establishing UKSS as a reliable and differentiable offline proxy for downstream QA performance (Li et al., 16 Dec 2025).
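Validating a new sampler against this protocol reduces to a rank-correlation check between offline UKSS and downstream accuracy; a minimal sketch with purely illustrative numbers:

```python
from scipy.stats import spearmanr

# Hypothetical per-configuration scores (one entry per method/dataset/
# budget/encoder combination); real values come from benchmark runs.
ukss_scores = [0.41, 0.48, 0.55, 0.57, 0.63]
qa_accuracy = [0.58, 0.60, 0.62, 0.62, 0.66]

rho, p_value = spearmanr(ukss_scores, qa_accuracy)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3g})")
```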
4. Adaptively Balanced Sampling (ASCS)
The benchmark introduces an adaptively balanced sampling method (ASCS), synthesizing features of clustering- and similarity-driven approaches. Key elements:
- CDF Construction: Two CDFs are formed: one from clustering-derived weights (ICF) and one from normalized question–frame similarities (ITS).
- Question–Video Relevance Score (QVRS): From median absolute deviation (MAD)-normalized similarities, a softmax-weighted distribution over frames is derived. Three signals are combined into a single score: temporal-bin entropy, mass-based span entropy, and the shortest coverage window.
QVRS adaptively shifts the weighting: a high QVRS signals a visually localized (“focused”) query, while a low QVRS signals a dispersed (“global”) information need.
- Blending and Sampling: The combined CDF, a QVRS-weighted interpolation between the similarity-driven and clustering-driven CDFs, is used for inverse transform sampling. The full pipeline runs feature extraction, similarity computation, CDF construction, QVRS calculation, CDF interpolation, and iterative frame selection at quantile thresholds; a minimal end-to-end sketch appears at the end of this section.
This approach maximizes coverage and semantic alignment dynamically per question, improving both the UKSS and the resulting QA metrics (Li et al., 16 Dec 2025).
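To tie the steps together, here is a minimal end-to-end sketch. The MAD normalization and softmax weighting follow the description above, but the ICF construction, the combination of the three QVRS signals, and the linear CDF blend are assumptions rather than the paper's exact formulation.

```python
import numpy as np
from sklearn.cluster import KMeans

def qvrs(similarities: np.ndarray, n_bins: int = 16, mass: float = 0.9) -> float:
    """Question-Video Relevance Score from MAD-normalized similarities."""
    med = np.median(similarities)
    mad = np.median(np.abs(similarities - med)) + 1e-8
    z = (similarities - med) / mad                      # MAD normalization
    p = np.exp(z - z.max()); p /= p.sum()               # softmax weights

    # Temporal-bin entropy: pool mass into coarse bins, normalize to [0, 1].
    starts = np.linspace(0, len(p), n_bins, endpoint=False).astype(int)
    bins = np.add.reduceat(p, starts)
    h_bin = -np.sum(bins * np.log(bins + 1e-12)) / np.log(n_bins)

    # Mass-based span entropy over the full per-frame distribution.
    h_span = -np.sum(p * np.log(p + 1e-12)) / np.log(len(p))

    # Shortest contiguous window covering `mass` of the probability.
    csum = np.concatenate([[0.0], np.cumsum(p)])
    width = next(w for w in range(1, len(p) + 1)
                 if np.max(csum[w:] - csum[:-w]) >= mass)
    w_frac = width / len(p)

    # Concentrated relevance -> high QVRS (combination rule is assumed).
    return float(1.0 - (h_bin + h_span + w_frac) / 3.0)

def ascs_sample(frame_feats: np.ndarray, similarities: np.ndarray,
                budget: int) -> np.ndarray:
    """Blend clustering- and similarity-driven CDFs by QVRS, then sample."""
    n = len(frame_feats)

    # Clustering-driven weights: up-weight frames from rare clusters
    # (an assumed stand-in for the paper's ICF construction).
    labels = KMeans(n_clusters=min(budget, n), n_init=10).fit_predict(frame_feats)
    w_icf = 1.0 / np.bincount(labels)[labels]
    cdf_icf = np.cumsum(w_icf) / w_icf.sum()

    # Similarity-driven weights (ITS-style).
    w_its = np.clip(similarities, 0, None) + 1e-8
    cdf_its = np.cumsum(w_its) / w_its.sum()

    # Focused queries (high QVRS) lean on similarity; global ones on diversity.
    alpha = qvrs(similarities)
    cdf = alpha * cdf_its + (1 - alpha) * cdf_icf

    # Inverse transform sampling at evenly spaced quantile thresholds.
    quantiles = (np.arange(budget) + 0.5) / budget
    return np.unique(np.searchsorted(cdf, quantiles))
```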
5. Experimental Evaluation
Evaluation leverages two MLLMs (Qwen2.5-VL-7B and InternVL3-8B) and frame budgets ($32$, $64$). The top results are summarized below:
| Model / Frame Budget | Uniform | K-means | AKS | ITS | ASCS |
|---|---|---|---|---|---|
| Qwen2.5-VL-7B / 32 | 60.5% / 57.2% | 60.9% / 59.8% | 62.4% / 61.5% | 62.0% / 61.9% | 63.1% / 63.6% |
| Qwen2.5-VL-7B / 64 | 64.0% / 59.5% | ... | ... | ... | 66.0% / 64.1% |
| InternVL3-8B / 64 | 65.4% / 60.8% | ... | ... | ... | 67.6% / 65.4% |
ASCS consistently achieves the highest QA accuracy; gains are most pronounced at smaller frame budgets and attenuate as budgets increase, owing to diminishing marginal returns from additional frames. K-means surpasses uniform sampling through greater scene diversity, while similarity-driven sampling (ITS/AKS) may neglect rare scenes, an imbalance that ASCS's adaptive strategy corrects. ASCS also achieves the top UKSS, and every method suffers a significant UKSS drop when the budget is cut from $64$ to $32$ frames, predominantly because coverage decreases (Li et al., 16 Dec 2025).
6. Key Findings and Trajectories
- Three interlocking factors jointly determine key frame sampling quality and thus QA performance: precision (Key Frame Rate), scene coverage (Scene Hit Rate), and distribution balance.
- UKSS enables differentiable, unified, and rapid offline assessment/tuning of sampling algorithms, correlating robustly with QA accuracy.
- Adaptive blending, governed by the question–video relevance signal (QVRS), is empirically validated as effective: it selectively emphasizes semantic alignment or diversity based on question requirements.
- Open research challenges include extending annotation and sampling to incorporate multimodal cues (audio, subtitles), end-to-end learnable sampling strategies within MLLMs to address hallucination biases, and dynamically allocating frame budgets in response to question complexity or computational constraints (Li et al., 16 Dec 2025).
A plausible implication is that KFS-Bench and the UKSS metric will serve as foundational resources for principled, performance-aware research on frame sampling in long-form video understanding.