SpecTemp-80K Dataset: Dual-Level Video QA
- SpecTemp-80K is a large-scale, dual-level annotated dataset comprising 80,142 video–QA triplets from nine diverse corpora, providing both segment- and frame-level supervision.
- Its annotation protocol synchronizes coarse evidence spans with fine-grained frame-level cues, enabling precise temporal evidence localization for multi-modal language models.
- The dataset enhances video QA by improving segment-localization IoU (by 3–5%) and cutting inference latency (19–23% faster), offering measurable gains over traditional frame-centric "thinking-with-frames" methods.
SpecTemp-80K is a large-scale, dual-level annotated dataset for speculative temporal reasoning in long video understanding. Designed to support the training and evaluation of cooperative dual-model frameworks, notably the SpecTemp architecture, the dataset contains 80,142 video–question–answer (QA) triplets spanning a broad temporal spectrum and nine distinct video corpora. Each sample provides synchronized supervision at both coarse (segment-level) and fine (frame-level) temporal resolutions, enabling efficient and accurate temporal evidence localization and reasoning for video multi-modal LLMs (MLLMs) (Hu et al., 30 Nov 2025).
1. Construction and Scope
SpecTemp-80K was constructed to address the limitations of the “thinking-with-frames” paradigm in long video reasoning by providing rich, multi-granular supervision for RL and other learning paradigms that require both segment- and frame-level information. The dataset aggregates N=80,142 video–QA triplets from nine video sources across three temporal categories:
- Short-form (<1 min): CLEVRER, PerceptionTest, STAR, NeXT-GQA (32.4% of videos)
- Medium-length (1–10 min): LLaVA-Video, ActivityNet, YouCook2 (51.8% of videos)
- Long-form (>10 min): MovieChat, Ego4D (15.8% of videos)
The diversity of source domains and temporal regimes (sub-minute to tens of minutes) ensures coverage of a wide range of real-world video understanding scenarios. Precise total dataset duration in hours is not reported, but the proportional splits emphasize robust representation of both brief and extended contexts.
2. Annotation Protocol
Each QA sample is annotated at two synchronized levels:
(a) Coarse Evidence Spans ($\mathcal{S}$):
Contiguous temporal segments relevant to answering the query, denoted

$$\mathcal{S} = \{(s_i, e_i)\}_{i=1}^{K}, \qquad 0 \le s_i < e_i \le T,$$

where $T$ is the video duration. Each $(s_i, e_i)$ corresponds to a “<segment>” token in the model's reasoning trajectory.
(b) Fine-Grained Frame-Level Evidence ($\mathcal{F}$):
Within each coarse span, a subset of representative frames is selected as most informative:

$$\mathcal{F} = \{f_j\}_{j=1}^{M}, \qquad f_j \in [s_i, e_i] \ \text{for some } i,$$

with $f_j$ the frame timestamp. These frames are surfaced as “<frame>” tokens.
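As a purely illustrative example of this notation (the numbers are hypothetical, not drawn from the dataset), a 120-second video ($T = 120$) whose answer evidence is a single mid-video event could be annotated as

$$\mathcal{S} = \{(42.0,\ 57.5)\}, \qquad \mathcal{F} = \{44,\ 51\}, \qquad 44,\ 51 \in [42.0,\ 57.5] \subseteq [0, T].$$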
Annotation Procedure:
- Preliminary Annotations via GPT-4o generate scene- and frame-level captions for events and visual cues.
- Trajectory Generation: GPT-4o simulates multi-round speculation–verification traces marked by <think>, <segment>, <frame>, and <answer> tokens.
- Manual Validation: A random subset is manually checked for temporal and frame correctness; errors are culled.
Empirical Statistics:
- Average spans per video ≈ 1.8 (each question typically triggers one initial and up to two additional segment predictions; see the split table in Section 4)
- Median span length ≈ 15 seconds (long-tailed toward much longer spans)
- Average frames per video ≈ 4.2 (the draft model selects 2 frames per iteration, for up to 3 iterations)
3. Data Format and Mathematical Schema
SpecTemp-80K's organization supports streamlined loading and evaluation:
Directory Structure:
- train/, val/, test/ (80% / 10% / 10% splits)
- videos/: video files or links
- annotations.jsonl: one sample per line
JSONL Schema:
{ "video_id": "string", "duration": float, "question": "string", "answer": "string", "spans": [ {"start": float, "end": float}, ... ], "frames": [ int, ... ], "trajectory": [ { "round": int, "think": "string", "segment": [start, end] or null, "draft_frames": [f1, f2] }, ... ] }Mathematical Formalization:
- Each sample: video $V$ with duration $T$, question $q$, answer $a$
- For each sample: coarse spans $\mathcal{S}$ and frame evidence $\mathcal{F}$, defined as above
- Full dataset: $\mathcal{D} = \{(V^{(n)}, q^{(n)}, a^{(n)}, \mathcal{S}^{(n)}, \mathcal{F}^{(n)})\}_{n=1}^{N}$ with $N = 80{,}142$
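Given the schema above, a minimal loading sketch might look as follows. The directory layout and field names follow this section; the function name and example path are illustrative, not an official utility shipped with the dataset.

```python
import json
from pathlib import Path

def load_annotations(split_dir):
    """Yield one SpecTemp-80K sample per line of annotations.jsonl.

    Assumes the layout described above: <split_dir>/annotations.jsonl
    with one JSON object per line following the schema in this section.
    """
    path = Path(split_dir) / "annotations.jsonl"
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            sample = json.loads(line)
            # Sanity check against the formalization: every coarse span
            # must satisfy 0 <= start < end <= duration.
            T = sample["duration"]
            assert all(0.0 <= s["start"] < s["end"] <= T for s in sample["spans"])
            yield sample

# Hypothetical usage:
# for sample in load_annotations("SpecTemp-80K/train"):
#     print(sample["video_id"], len(sample["spans"]), len(sample["frames"]))
```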
4. Splits, Annotation Statistics, and Distributions
The dataset is partitioned as follows:
| Split | Samples | ⟨#Spans⟩ | ⟨#Frames⟩ |
|-------|---------|----------|-----------|
| Train | 64,113  | 1.79     | 4.25      |
| Val   | 8,014   | 1.82     | 4.18      |
| Test  | 8,015   | 1.81     | 4.22      |

Segment Length and Frame Density:
- Coarse-span length $(e_i - s_i)$ has a right-skewed distribution: ~50% of spans are shorter than 12 seconds, while ~10% exceed 60 seconds.
- Within spans, the frame density is ≈ 0.1 frames/sec, consistent with roughly 2 frames per ~20 seconds.
This suggests the annotation process closely mirrors real-world sparsity in answer-relevant evidence, supporting efficient learning of temporal localization.
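Assuming the loader sketched in Section 3, the span-length and frame-count statistics reported here can be approximated directly from the annotations; the snippet below is an illustration of that computation, not the authors' analysis script.

```python
import statistics

def span_and_frame_stats(samples):
    """Rough per-split statistics: span lengths and frames per video."""
    span_lengths, frames_per_video = [], []
    for sample in samples:
        span_lengths.extend(s["end"] - s["start"] for s in sample["spans"])
        frames_per_video.append(len(sample["frames"]))
    return {
        "median_span_length_s": statistics.median(span_lengths),
        "mean_frames_per_video": statistics.mean(frames_per_video),
    }

# e.g. span_and_frame_stats(load_annotations("SpecTemp-80K/train"))
# should land near the reported ~15 s median span and ~4.2 frames per video.
```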
5. Applications and Benchmarking
Recommended Use Cases:
- Temporal Evidence Retrieval: Training models to localize relevant segments.
- Frame-Level Summarization: Learning to select key frames that condense essential information.
- Long-Video Question Answering: Joint training with coarse and fine temporal cues.
- Reinforcement Learning for Frame Sampling: Policy optimization using dual-level annotated supervision.
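For the RL use case, one natural (purely illustrative) way to shape the reward couples the two annotation levels: the best temporal IoU of a predicted segment against $\mathcal{S}$, plus a bonus for drafted frames that land inside any ground-truth span. The weighting, helper names, and the assumption that frame entries are timestamps in seconds are all assumptions of this sketch, not the SpecTemp training objective.

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) spans given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0.0 else 0.0

def dual_level_reward(pred_span, draft_frames, gt_spans, alpha=0.5):
    """Illustrative reward: best span IoU plus a frame-hit bonus.

    gt_spans is the "spans" list of an annotation; draft_frames are
    assumed to be timestamps in seconds (convert frame indices via FPS
    if the "frames" field stores indices instead).
    """
    iou = 0.0
    if pred_span is not None:
        iou = max((temporal_iou(pred_span, (g["start"], g["end"]))
                   for g in gt_spans), default=0.0)
    hits = sum(any(g["start"] <= f <= g["end"] for g in gt_spans)
               for f in draft_frames)
    frame_hit_rate = hits / len(draft_frames) if draft_frames else 0.0
    return alpha * iou + (1.0 - alpha) * frame_hit_rate
```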
Benchmark Results:
- Training SpecTemp on SpecTemp-80K yields up to a +12% absolute gain (on the Video-Holmes benchmark) and 19–23% faster inference compared with prior “thinking-with-frames” paradigms.
- IoU for segment localization increases by 3–5%; frame processing latency drops correspondingly.
A plausible implication is that models leveraging SpecTemp-80K’s structured dual supervision are intrinsically better at balancing accuracy and efficiency in QA over long video contexts.
Integration Strategies:
- Pre-train or fine-tune multi-modal LLMs using synchronized span/frame labels.
- Use IoU and information gain measured on $\mathcal{S}$ and $\mathcal{F}$ to evaluate novel selection policies (a minimal evaluation sketch follows this list).
- Adopt standardized splits for reporting accuracy, localization IoU, and frame selection precision.
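Putting the pieces together, a minimal evaluation sketch for a candidate selection policy might report mean localization IoU and frame selection precision over a split, reusing the temporal_iou helper from the reward sketch above; the policy interface shown here is an assumption.

```python
def evaluate_selection_policy(samples, policy):
    """Mean localization IoU and frame selection precision over a split.

    policy(sample) -> (pred_span, selected_frames), where pred_span is a
    (start, end) pair or None and selected_frames is a list of timestamps.
    """
    ious, precisions = [], []
    for sample in samples:
        pred_span, sel_frames = policy(sample)
        gt_spans = sample["spans"]
        best_iou = 0.0
        if pred_span is not None:
            best_iou = max((temporal_iou(pred_span, (g["start"], g["end"]))
                            for g in gt_spans), default=0.0)
        ious.append(best_iou)
        hits = sum(any(g["start"] <= f <= g["end"] for g in gt_spans)
                   for f in sel_frames)
        precisions.append(hits / len(sel_frames) if sel_frames else 0.0)
    return {
        "mean_localization_iou": sum(ious) / len(ious),
        "frame_selection_precision": sum(precisions) / len(precisions),
    }
```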
6. Significance and Impact
SpecTemp-80K provides a critical resource for advancing research on efficient, scalable, speculation-driven video understanding. Its explicit, dual-level annotations and balanced coverage across short, medium, and long video regimes make it well-suited for developing RL and LLM-based approaches that require both global temporal coherence and fine-grained perceptual grounding. The dataset’s format and rigorously curated splits facilitate reproducibility and fair benchmarking across the video QA and temporal evidence retrieval research communities (Hu et al., 30 Nov 2025).