SAGE-Bench: Long-Video Reasoning Benchmark

Updated 18 December 2025
  • SAGE-Bench is a long-video reasoning benchmark that evaluates multimodal AI systems on real-world videos using both multiple-choice and open-ended QnA.
  • It features 1,744 verified QnA pairs across varied video durations and domains, emphasizing multi-turn reasoning and cross-modal integration.
  • Empirical results highlight that reinforcement-trained agents and iterative reasoning strategies notably improve accuracy on long-duration videos.

SAGE-Bench is a long-video reasoning benchmark designed to evaluate the capabilities of multimodal AI systems in handling real-world, any-horizon video understanding tasks. As introduced in "SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning" (Jain et al., 15 Dec 2025), SAGE-Bench provides both multiple-choice and open-ended question-answer (QnA) evaluation for videos with durations averaging over 12 minutes, encompassing entertainment and educational domains with complex temporal and multimodal dependencies. The benchmark is central to measuring progress in video agents capable of both single- and multi-turn reasoning strategies, reflecting the demands of practical video applications.

1. Dataset Structure and Coverage

SAGE-Bench comprises 1,744 QnA pairs distributed across diverse video segments. The dataset is characterized by coverage across task format, input modality, domain, and temporal span:

  • Question Types: 802 multiple-choice questions and 942 open-ended questions.
  • Modalities: 1,216 visual-only, 134 verbal-only (using speech transcripts), and 394 combined visual+verbal questions.
  • Temporal Distribution: The average video duration is 727 seconds (approximately 12 minutes, 7 seconds), with samples distributed as follows:

    Duration (seconds) | # Videos
    0–60               |      261
    60–180             |      390
    180–300            |      116
    300–600            |      186
    600–1,200          |      484
    1,200–2,400        |      147
    2,400+             |      180
  • Domains: Five core genres from popular YouTube content: sports (e.g., Formula 1 highlights), food/cooking, comedy/entertainment (e.g., The Daily Show, Mr Bean), science/education (e.g., Veritasium, Kurzgesagt), and travel/walking tours.
  • Coverage: Questions target visual, verbal, and cross-modal reasoning, involving fine-grained event localization, semantic grounding, and implicit temporal reasoning.

The median sample falls within the 300–600 second bucket, ensuring a strong emphasis on mid- to long-horizon video understanding.
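
As a concrete illustration of the duration bins above, the following minimal Python sketch tallies a list of per-video durations (in seconds) into the same buckets. It assumes durations are available as plain numbers and simply mirrors the table's bin edges; it is not official benchmark tooling.

```python
from bisect import bisect_right
from collections import Counter

# Bin edges (seconds) and labels matching the distribution table above.
EDGES = [60, 180, 300, 600, 1200, 2400]
LABELS = ["0-60", "60-180", "180-300", "300-600", "600-1200", "1200-2400", "2400+"]

def bucket(duration_s: float) -> str:
    """Map a video duration in seconds to its bucket label."""
    return LABELS[bisect_right(EDGES, duration_s)]

def duration_histogram(durations_s) -> dict:
    """Count how many videos fall into each duration bucket."""
    counts = Counter(bucket(d) for d in durations_s)
    return {label: counts.get(label, 0) for label in LABELS}

# Example with a handful of hypothetical durations (seconds):
print(duration_histogram([45, 130, 250, 750, 3000]))
# -> {'0-60': 1, '60-180': 1, '180-300': 1, '300-600': 0, '600-1200': 1, '1200-2400': 0, '2400+': 1}
```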

2. Data Collection and Generation Protocol

SAGE-Bench was constructed from publicly available YouTube videos and YouTube Shorts drawn from 13 prominent channels spanning the benchmark's five domains. All content adheres to standard YouTube licensing; no proprietary or premium-licensed material is included.

QnA pairs were synthetically generated using Google’s Gemini-2.5-Flash LLM, which is capable of processing long-context video and transcript data. For each video, the model was prompted to generate 10–20 QnA pairs that span the video’s entire temporal extent, with each QnA annotated for:

  • Task type (open-ended or MCQ)
  • Input modality (visual, verbal, or both)
  • Difficulty (easy, medium, hard)
  • Start and end timestamps (HH:MM:SS)
  • Percent_video_parsed, used to ensure uniform temporal coverage across the dataset

Manual verification of the 1,744 benchmark QnA pairs yielded a low error rate (approximately 5%). Compared to manual or subclip-based annotation, the automated generation pipeline reduced cost by roughly 100× and generation latency by roughly 10×.
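
For concreteness, a single annotated QnA record might look roughly like the Python sketch below. The field names mirror the annotation list above, but the exact schema, key names, and example values are illustrative assumptions rather than the dataset's released format.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class SageBenchQA:
    """One annotated QnA record (illustrative schema, not the official release format)."""
    video_id: str                 # source YouTube video identifier
    question: str
    answer: str                   # verified ground-truth answer (or correct MCQ option)
    task_type: str                # "open_ended" or "mcq"
    options: Optional[List[str]]  # candidate answers for MCQs; None for open-ended
    modality: str                 # "visual", "verbal", or "visual+verbal"
    difficulty: str               # "easy", "medium", or "hard"
    start_timestamp: str          # "HH:MM:SS" start of the grounding span
    end_timestamp: str            # "HH:MM:SS" end of the grounding span
    percent_video_parsed: float   # fraction of the video covered, used to balance temporal coverage

example = SageBenchQA(
    video_id="<youtube-video-id>",
    question="How does the Ferrari livery look this year?",  # representative query from the benchmark
    answer="<human-verified ground-truth answer>",
    task_type="open_ended",
    options=None,
    modality="visual",
    difficulty="medium",
    start_timestamp="00:02:15",   # hypothetical span
    end_timestamp="00:03:40",
    percent_video_parsed=0.25,
)
```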

A curated tool-call trajectory dataset was generated by running the SAGE-MM orchestrator (initialized as Gemini-2.5-Flash) through the SAGE system, collecting unique (input → action) pairs for cold-start supervised finetuning prior to reinforcement learning.

3. Evaluation Splits and Protocol

SAGE-Bench operates exclusively as an evaluation set, with 1,744 verified QnA pairs strictly disjoint from the 99,100 synthetic QnA pairs used for training. Videos may overlap between the two sets, but no additional hold-out or cross-validation splits are applied.

The evaluation metric is accuracy, scored by an LLM-as-judge panel (GPT-4o) that determines whether a model's answer matches the human-verified ground truth (binary verdict: "True" or "False"):

\mathrm{Accuracy} = \frac{\#\{\text{verdict} = \text{"True"}\}}{\text{Total Samples}}

No partial credit or gradations are reported; open-ended and multiple-choice responses are judged identically by the LLM panel.
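
A minimal sketch of this scoring rule is shown below, with the judge abstracted as a callable; the real judge is a GPT-4o prompt returning a binary verdict, and the data-structure shapes here are assumptions for illustration.

```python
from typing import Callable, Dict, List

def llm_judge_accuracy(
    samples: List[Dict[str, str]],           # each with "question" and "answer" (ground truth)
    predictions: List[str],                  # model answers, aligned with samples
    judge: Callable[[str, str, str], bool],  # e.g. a GPT-4o wrapper returning True/False
) -> float:
    """Binary LLM-as-judge accuracy: fraction of samples where the judge says 'True'.
    No partial credit; open-ended and MCQ answers are scored identically."""
    verdicts = [
        judge(s["question"], s["answer"], pred)
        for s, pred in zip(samples, predictions)
    ]
    return sum(verdicts) / len(verdicts)

# A trivial stand-in judge for local testing (the benchmark uses an LLM judge, not exact match):
exact_match_judge = lambda q, gt, ans: gt.strip().lower() == ans.strip().lower()
```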

4. Model Performance and Analytical Results

Empirical evaluation demonstrates that SAGE-Bench discriminates sharply between agent architectures and reasoning strategies, with strongest differences on long-horizon and open-ended tasks.

Overall and Long-Video Accuracies

Model/System         | All Samples | Long Videos (600–1,200 s)
Qwen3-VL-8B-Instruct |       64.9% |                     55.0%
SAGE (Qwen3, RL)     |       68.0% |                     63.2%
SAGE-Flash           |       71.8% |                     69.6%
  • For videos longer than 1,200 seconds, absolute gains are 6–14.6 percentage points over the single-turn baseline.
  • Multi-turn, any-horizon behavior correlates with improved performance as duration increases; e.g., SAGE averages ∼2 reasoning turns on videos under 60 s and ∼3.5 turns on videos over 2,400 s.

Representative queries from SAGE-Bench include:

  1. Visual (open-ended, 15-minute F1 highlight): “How does the Ferrari livery look this year?” (requires fine-grained visual parsing)
  2. Verbal (MCQ, 2-minute political speech): “Which phrase did the speaker use to close his announcement?” (answerable via ASR on a transcript span)
  3. Mixed (cooking video): “When the chef says ‘simmer for 20 minutes’, what appliance is he visibly using?” (requires cross-modal fusion)

Tool Use and Ablation

Ablation of ASR (automatic speech recognition) results in the largest performance drops on verbal-only queries, while frame extraction removal most degrades visual-only performance.

This suggests that robust multimodal tool integration is required to address SAGE-Bench's full scope.

5. Benchmarking Challenges and Diagnostic Insights

SAGE-Bench reveals that:

  • Single-turn or video-only models degrade rapidly on videos longer than a few minutes, primarily due to poor temporal grounding and inability to efficiently navigate or abstract long frame sequences.
  • SAGE-MM (the orchestrator), using reinforcement learning, learns to balance single-turn efficiency for short/easy clips and iterative multi-turn looping for challenging long-duration samples.
  • Average number of reasoning turns scales sublinearly with video length, demonstrating adaptive strategy learning.

Performance gaps highlight the need for:

  • Any-horizon decision modules (to switch between fast, shallow skimming and deep, iterative reasoning)
  • Integration of multi-source evidence (visual frames, transcripts, timestamp-aware cues)
  • Efficient tool-calling, especially for high-latency ASR and frame pipelines
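
To make the any-horizon behavior concrete, the sketch below shows one plausible shape for such a decision loop: the orchestrating model either answers immediately or keeps issuing tool calls (frame extraction, ASR, and so on) until it is confident or a turn budget is exhausted. The interfaces, tool names, and action format are assumptions for illustration, not the paper's SAGE-MM implementation.

```python
def answer_any_horizon(question, video, orchestrator, tools, max_turns=8):
    """Hypothetical any-horizon reasoning loop in the spirit of the SAGE orchestrator.

    `orchestrator.decide(context)` is assumed to return either
    ("answer", text) or ("tool", tool_name, kwargs); `tools` maps tool
    names (e.g. "extract_frames", "run_asr") to callables over the video.
    """
    context = [f"Question: {question}", f"Video length: {video.duration_s:.0f}s"]
    for _ in range(max_turns):
        action = orchestrator.decide(context)
        if action[0] == "answer":
            # Short/easy clips can exit in a single turn; long videos loop longer.
            return action[1]
        _, tool_name, kwargs = action
        observation = tools[tool_name](video, **kwargs)
        context.append(f"{tool_name}({kwargs}) -> {observation}")
    # Turn budget exhausted: force a best-effort answer from accumulated evidence.
    final = orchestrator.decide(context + ["Answer now with the best-supported response."])
    return final[1]
```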

6. Applications and Broader Implications

SAGE-Bench targets the evaluation and development of real-world multimedia assistants, especially in domains requiring long-range event linking, cross-modal integration, and user experiences that combine conversational interaction with factual question answering. Its design reflects end-user demand for robust query answering over highly variable entertainment and educational content, pushing beyond traditional benchmarks confined to short, richly annotated clips.

A plausible implication is that any-horizon agent architectures are necessary, coupling rapid summarization with focused exploration and mirroring how humans adaptively skim and scrutinize long videos.

7. Conclusions and Prospective Directions

SAGE-Bench establishes a strongly validated testbed for next-generation video reasoning systems, with manually verified, synthetically generated QnA coverage across five entertainment and informational domains and a wide spectrum of video durations up to 40+ minutes. The benchmark challenges both single- and multi-turn agent architectures and reveals substantial gains from reinforcement-training strategies and multi-tool integration.

Extensions to SAGE-Bench may include expansion into additional domains, enriched conversational interaction types (multi-user, real-time feedback), and benchmarking under more diverse language and visual corruptions. Its role in driving any-horizon, multi-modal agent research for real-world use cases is underscored by persistent performance gaps and the statistically validated improvements yielded by architectural advances in agents evaluated on this benchmark (Jain et al., 15 Dec 2025).

References

  1. Jain et al. “SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning.” 15 December 2025.
