SpecTemp-80K Dataset: Dual-Level Video QA
- SpecTemp-80K is a large-scale, dual-level annotated dataset comprising 80,142 video–QA triplets from nine diverse corpora, providing both segment- and frame-level supervision.
- Its annotation protocol synchronizes coarse evidence spans with fine-grained frame-level cues, enabling precise temporal evidence localization for multi-modal language models.
- The dataset enhances video QA by improving segment-localization IoU (by 3–5%) and cutting inference latency (19–23% faster), offering measurable gains over traditional frame-centric "thinking-with-frames" methods.
SpecTemp-80K is a large-scale, dual-level annotated dataset for speculative temporal reasoning in long video understanding. Designed to support the training and evaluation of cooperative dual-model frameworks, notably the SpecTemp architecture, the dataset contains 80,142 video–question–answer (QA) triplets spanning a broad temporal spectrum and nine distinct video corpora. Each sample provides synchronized supervision at both coarse (segment-level) and fine (frame-level) temporal resolutions, enabling efficient and accurate temporal evidence localization and reasoning for video multi-modal LLMs (MLLMs) (Hu et al., 30 Nov 2025).
1. Construction and Scope
SpecTemp-80K was constructed to address the limitations of the “thinking-with-frames” paradigm in long video reasoning by providing rich, multi-granular supervision for RL and other learning paradigms that require both segment- and frame-level information. The dataset aggregates N=80,142 video–QA triplets from nine video sources across three temporal categories:
- Short-form (<1 min): CLEVRER, PerceptionTest, STAR, NeXT-GQA (32.4% of videos)
- Medium-length (1–10 min): LLaVA-Video, ActivityNet, YouCook2 (51.8% of videos)
- Long-form (>10 min): MovieChat, Ego4D (15.8% of videos)
The diversity of source domains and temporal regimes (sub-minute to tens of minutes) ensures coverage of a wide range of real-world video understanding scenarios. Precise total dataset duration in hours is not reported, but the proportional splits emphasize robust representation of both brief and extended contexts.
2. Annotation Protocol
Each QA sample is annotated at two synchronized levels:
(a) Coarse Evidence Spans ($\mathcal{S}$):
Contiguous temporal segments relevant to answering the query, denoted

$$\mathcal{S} = \{(s_i, e_i)\}_{i=1}^{K}, \qquad 0 \le s_i < e_i \le T,$$

where $T$ is the video duration. Each $(s_i, e_i)$ corresponds to a “<segment>” token in the model's reasoning trajectory.
(b) Fine-Grained Frame-Level Evidence ($\mathcal{F}$):
Within each coarse span, a subset of representative frames is selected as most informative:

$$\mathcal{F} = \{f_j\}_{j=1}^{M}, \qquad f_j \in [s_i, e_i] \ \text{for some } i,$$

with $f_j$ the frame timestamp. These frames are surfaced as “<frame>” tokens.
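As a purely illustrative example of this notation (the numbers are hypothetical, not drawn from the dataset), a 120-second video ($T = 120$) whose answer evidence is a single mid-video event could be annotated as

$$\mathcal{S} = \{(42.0,\ 57.5)\}, \qquad \mathcal{F} = \{44,\ 51\}, \qquad 44,\ 51 \in [42.0,\ 57.5] \subseteq [0, T].$$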
Annotation Procedure:
- Preliminary Annotations via GPT-4o generate scene- and frame-level captions for events and visual cues.
- Trajectory Generation: GPT-4o simulates multi-round speculation–verification traces marked by <think>, <segment>, <frame>, and <answer> tokens.
- Manual Validation: A random subset is manually checked for temporal and frame correctness; errors are culled.
Empirical Statistics:
- Average spans per video ≈ 1.8 (each question typically triggers one initial and up to two additional segment predictions; see the split table in Section 4)
- Median span length ≈ 15 seconds (long-tailed toward much longer spans)
- Average frames per video ≈ 4.2 (the draft model selects 2 frames per iteration, for up to 3 iterations)
3. Data Format and Mathematical Schema
SpecTemp-80K's organization supports streamlined loading and evaluation:
Directory Structure:
- train/, val/, test/ (80% / 10% / 10% splits)
- videos/: video files or links
- annotations.jsonl: one sample per line
JSONL Schema:
{ "video_id": "string", "duration": float, "question": "string", "answer": "string", "spans": [ {"start": float, "end": float}, ... ], "frames": [ int, ... ], "trajectory": [ { "round": int, "think": "string", "segment": [start, end] or null, "draft_frames": [f1, f2] }, ... ] }Mathematical Formalization:
- Each sample: video $V$ with duration $T$, question $q$, answer $a$
- For each sample: coarse spans $\mathcal{S}$ and frame evidence $\mathcal{F}$, defined as above
- Full dataset: $\mathcal{D} = \{(V^{(n)}, q^{(n)}, a^{(n)}, \mathcal{S}^{(n)}, \mathcal{F}^{(n)})\}_{n=1}^{N}$ with $N = 80{,}142$
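Given the schema above, a minimal loading sketch might look as follows. The directory layout and field names follow this section; the function name and example path are illustrative, not an official utility shipped with the dataset.

```python
import json
from pathlib import Path

def load_annotations(split_dir):
    """Yield one SpecTemp-80K sample per line of annotations.jsonl.

    Assumes the layout described above: <split_dir>/annotations.jsonl
    with one JSON object per line following the schema in this section.
    """
    path = Path(split_dir) / "annotations.jsonl"
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            sample = json.loads(line)
            # Sanity check against the formalization: every coarse span
            # must satisfy 0 <= start < end <= duration.
            T = sample["duration"]
            assert all(0.0 <= s["start"] < s["end"] <= T for s in sample["spans"])
            yield sample

# Hypothetical usage:
# for sample in load_annotations("SpecTemp-80K/train"):
#     print(sample["video_id"], len(sample["spans"]), len(sample["frames"]))
```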
4. Splits, Annotation Statistics, and Distributions
The dataset is partitioned as follows:
| Split | Samples | ⟨#Spans⟩ | ⟨#Frames⟩ |
|-------|---------|----------|-----------|
| Train | 64,113  | 1.79     | 4.25      |
| Val   | 8,014   | 1.82     | 4.18      |
| Test  | 8,015   | 1.81     | 4.22      |

Segment Length and Frame Density:
- Coarse-span length $(e_i - s_i)$ has a right-skewed distribution: ~50% of spans are shorter than 12 seconds, while ~10% exceed 60 seconds.
- Within spans, the frame density is ≈ 0.1 frames/sec, consistent with roughly 2 frames per ~20 seconds.
This suggests the annotation process closely mirrors real-world sparsity in answer-relevant evidence, supporting efficient learning of temporal localization.
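Assuming the loader sketched in Section 3, the span-length and frame-count statistics reported here can be approximated directly from the annotations; the snippet below is an illustration of that computation, not the authors' analysis script.

```python
import statistics

def span_and_frame_stats(samples):
    """Rough per-split statistics: span lengths and frames per video."""
    span_lengths, frames_per_video = [], []
    for sample in samples:
        span_lengths.extend(s["end"] - s["start"] for s in sample["spans"])
        frames_per_video.append(len(sample["frames"]))
    return {
        "median_span_length_s": statistics.median(span_lengths),
        "mean_frames_per_video": statistics.mean(frames_per_video),
    }

# e.g. span_and_frame_stats(load_annotations("SpecTemp-80K/train"))
# should land near the reported ~15 s median span and ~4.2 frames per video.
```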
5. Applications and Benchmarking
Recommended Use Cases:
- Temporal Evidence Retrieval: Training models to localize relevant segments.
- Frame-Level Summarization: Learning to select key frames that condense essential information.
- Long-Video Question Answering: Joint training with coarse and fine temporal cues.
- Reinforcement Learning for Frame Sampling: Policy optimization using dual-level annotated supervision.
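For the RL use case, one natural (purely illustrative) way to shape the reward couples the two annotation levels: the best temporal IoU of a predicted segment against $\mathcal{S}$, plus a bonus for drafted frames that land inside any ground-truth span. The weighting, helper names, and the assumption that frame entries are timestamps in seconds are all assumptions of this sketch, not the SpecTemp training objective.

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) spans given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0.0 else 0.0

def dual_level_reward(pred_span, draft_frames, gt_spans, alpha=0.5):
    """Illustrative reward: best span IoU plus a frame-hit bonus.

    gt_spans is the "spans" list of an annotation; draft_frames are
    assumed to be timestamps in seconds (convert frame indices via FPS
    if the "frames" field stores indices instead).
    """
    iou = 0.0
    if pred_span is not None:
        iou = max((temporal_iou(pred_span, (g["start"], g["end"]))
                   for g in gt_spans), default=0.0)
    hits = sum(any(g["start"] <= f <= g["end"] for g in gt_spans)
               for f in draft_frames)
    frame_hit_rate = hits / len(draft_frames) if draft_frames else 0.0
    return alpha * iou + (1.0 - alpha) * frame_hit_rate
```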
Benchmark Results:
- Training SpecTemp on SpecTemp-80K yields up to a +12% absolute gain (on the Video-Holmes benchmark) and 19–23% faster inference compared with prior “thinking-with-frames” paradigms.
- IoU for segment localization increases by 3–5%; frame processing latency drops correspondingly.
A plausible implication is that models leveraging SpecTemp-80K’s structured dual supervision are intrinsically better at balancing accuracy and efficiency in QA over long video contexts.
Integration Strategies:
- Pre-train or fine-tune multi-modal LLMs using synchronized span/frame labels.
- Use IoU and information gain measured on $\mathcal{S}$ and $\mathcal{F}$ to evaluate novel selection policies (a minimal evaluation sketch follows this list).
- Adopt standardized splits for reporting accuracy, localization IoU, and frame selection precision.
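Putting the pieces together, a minimal evaluation sketch for a candidate selection policy might report mean localization IoU and frame selection precision over a split, reusing the temporal_iou helper from the reward sketch above; the policy interface shown here is an assumption.

```python
def evaluate_selection_policy(samples, policy):
    """Mean localization IoU and frame selection precision over a split.

    policy(sample) -> (pred_span, selected_frames), where pred_span is a
    (start, end) pair or None and selected_frames is a list of timestamps.
    """
    ious, precisions = [], []
    for sample in samples:
        pred_span, sel_frames = policy(sample)
        gt_spans = sample["spans"]
        best_iou = 0.0
        if pred_span is not None:
            best_iou = max((temporal_iou(pred_span, (g["start"], g["end"]))
                            for g in gt_spans), default=0.0)
        ious.append(best_iou)
        hits = sum(any(g["start"] <= f <= g["end"] for g in gt_spans)
                   for f in sel_frames)
        precisions.append(hits / len(sel_frames) if sel_frames else 0.0)
    return {
        "mean_localization_iou": sum(ious) / len(ious),
        "frame_selection_precision": sum(precisions) / len(precisions),
    }
```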
6. Significance and Impact
SpecTemp-80K provides a critical resource for advancing research on efficient, scalable, speculation-driven video understanding. Its explicit, dual-level annotations and balanced coverage across short, medium, and long video regimes make it well-suited for developing RL and LLM-based approaches that require both global temporal coherence and fine-grained perceptual grounding. The dataset’s format and rigorously curated splits facilitate reproducibility and fair benchmarking across the video QA and temporal evidence retrieval research communities (Hu et al., 30 Nov 2025).