UFVideo-Bench: Unified Video Benchmark
- UFVideo-Bench is a task suite that evaluates unified multi-grained video understanding by integrating global reasoning, pixel-level segmentation, and temporal localization tasks.
- It features three collaborative evaluation tasks—PixRQA, PixHQA, and PixTRQA—designed to rigorously assess spatial, hybrid, and temporal reasoning capabilities.
- The benchmark leverages diverse datasets and standardized metrics to enable direct comparisons between multi-modal LLMs and domain-specific specialists, driving innovation in video-language research.
UFVideo-Bench is a task suite and benchmark designed to evaluate the unified multi-grained cooperative understanding abilities of large video-LLMs. Developed as part of the UFVideo framework, it addresses key limitations in video understanding evaluation by structuring tasks across global, pixel, and temporal dimensions, and providing standardized, challenging collaborative protocols. UFVideo-Bench enables rigorous comparison between multi-modal LLMs and domain-specific specialists across diverse facets of video reasoning, segmentation, and grounding (Pan et al., 12 Dec 2025).
1. Motivation for Multi-Grained Video Benchmarking
The advancement of Video LLMs has exposed significant gaps in comprehensive video perception, as prior benchmarks primarily target specialized or isolated tasks (e.g., global QA, pixel-level segmentation, temporal localization) without integrated evaluation across semantic scales. UFVideo-Bench was introduced to provide unified cooperative benchmarks covering:
- Global Reasoning: Holistic video question-answering.
- Pixel-Level Understanding: Fine-grained referring and segmentation tasks.
- Temporal Localization: Reasoning about actions and events within explicit time intervals.
This structure underlines the necessity of joint evaluation protocols for models that purport to understand both high-level narratives and low-level spatio-temporal details simultaneously (Pan et al., 12 Dec 2025).
2. UFVideo-Bench: Task Composition and Design
UFVideo-Bench is subdivided into three collaborative evaluation tasks:
| Task Name | Primary Focus | Output Type |
|---|---|---|
| PixRQA | Referring Reasoning + Segmentation | Phrase + Pixel Mask |
| PixHQA | Hybrid QA w/ Temporal Interval | Score, Reasoning, Mask |
| PixTRQA | Temporal Reasoning + Segmentation | Temporal Interval + Mask |
- PixRQA (Pixel Referring Question Answering): Requires models to generate both descriptive answers and per-object segmentation masks for spatially grounded objects.
- PixHQA (Hybrid Question Answering): Combines phrase reasoning, interval specificity, and mask prediction.
- PixTRQA (Pixel Temporal Reasoning Question Answering): Targets temporal localization jointly with spatial reasoning, requiring models to output the relevant temporal interval (scored by temporal IoU) together with segmentation masks for the grounded frames.
Each protocol integrates prompt templates using <Ref>, [SEG], and <Temp-x> tokens to coordinate object references and temporal spans (Pan et al., 12 Dec 2025).
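For illustration, the following Python sketch shows how such a prompt template might be assembled. The token strings follow the notation above; the surrounding wording, the builder function, and its parameters are assumptions rather than the benchmark's released templates.

```python
# Hypothetical sketch of a UFVideo-Bench-style prompt template.
# Token strings (<Ref>, [SEG], <Temp-x>) follow the paper's notation;
# the wording and the builder interface are illustrative assumptions.

def build_pixtrqa_prompt(question: str, num_objects: int, num_intervals: int) -> str:
    """Assemble a prompt asking for referred-object masks and temporal spans."""
    ref_slots = " ".join("<Ref>" for _ in range(num_objects))            # object reference slots
    seg_slots = " ".join("[SEG]" for _ in range(num_objects))            # mask-output slots
    temp_slots = " ".join(f"<Temp-{i}>" for i in range(num_intervals))   # temporal-span slots
    return (
        f"{question}\n"
        f"Refer to the target object(s): {ref_slots}\n"
        f"Answer with segmentation token(s): {seg_slots}\n"
        f"Localize the relevant moment(s): {temp_slots}"
    )

print(build_pixtrqa_prompt("When does the dog jump over the fence?", 1, 1))
```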
3. Data Sources and Evaluation Protocols
UFVideo-Bench leverages samples constructed or aggregated from public benchmarks and custom annotation schemes:
- Datasets Used: VideoRefer-700K for pixel-level tasks; MeViS, Ref-DAVIS17, and YouTube-VOS for segmentation; Charades-STA and DiDeMo for temporal reasoning.
- Sample Construction: Each sample pairs ground-truth segmentation masks with temporal interval labels, requiring genuinely multi-modal reasoning; a hypothetical sample layout is sketched below.
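The source does not reproduce the released annotation schema; as a rough illustration, a unified sample might bundle the three granularities as follows (all field names are hypothetical).

```python
# Illustrative layout of a unified benchmark sample combining global QA,
# pixel-level mask annotations, and a temporal interval label.
# Field names are hypothetical; the actual annotation schema may differ.
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

@dataclass
class UnifiedVideoSample:
    video_id: str
    task: str                                   # "PixRQA", "PixHQA", or "PixTRQA"
    question: str
    answer: str                                 # reference textual answer
    mask_rles: Dict[int, str] = field(default_factory=dict)   # frame index -> RLE-encoded mask
    interval: Optional[Tuple[float, float]] = None             # (start_sec, end_sec) for temporal tasks

sample = UnifiedVideoSample(
    video_id="charades_0001",
    task="PixTRQA",
    question="Segment the cup the person picks up and localize that moment.",
    answer="The person lifts the red cup from the table.",
    mask_rles={12: "<rle>", 13: "<rle>"},
    interval=(4.2, 7.8),
)
```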
Evaluation metrics include:
- SC, AD, TD, HD: Subject Correspondence, Appearance Description, Temporal Description, and Hallucination Detection scores, averaged for pixel-level tasks.
- 𝓙ℱ: Segmentation quality, the mean of region similarity (Jaccard index 𝓙) and contour accuracy (ℱ).
- tIoU@0.5: Temporal Intersection over Union at a 0.5 threshold for interval localization.
- SAvg: Average reasoning score from human evaluators.
All results are reported against standardized baselines such as UniPixel, RGA3, Qwen2-VL-7B, and GPT-4o (Pan et al., 12 Dec 2025).
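The paper's evaluation scripts are not reproduced here; the sketch below shows the standard formulas behind two of these metrics, temporal IoU at a 0.5 threshold and the region-similarity (Jaccard, 𝓙) component of 𝓙ℱ. The contour-accuracy term ℱ and the judged SAvg score are omitted, and the function names are illustrative.

```python
import numpy as np

def temporal_iou(pred, gt):
    """Temporal IoU between predicted and ground-truth (start_sec, end_sec) intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def mask_jaccard(pred_mask, gt_mask):
    """Region similarity 𝓙 (Jaccard index) between binary masks; contour accuracy ℱ is omitted."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(pred, gt).sum()) / float(union)

def tiou_at_threshold(pred_intervals, gt_intervals, thr=0.5):
    """Fraction of predictions whose temporal IoU with the ground truth meets the threshold."""
    hits = [temporal_iou(p, g) >= thr for p, g in zip(pred_intervals, gt_intervals)]
    return float(np.mean(hits)) if hits else 0.0
```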
4. Benchmarked Model Performance and Comparative Analysis
In empirical studies, UFVideo (7B) demonstrates superior multi-grained video understanding on UFVideo-Bench:
| Task | UFVideo (7B) | Closest Baseline |
|---|---|---|
| PixRQA | 𝓙ℱ=53.39, SAvg=3.35 | GPT-4o (SAvg 2.58) |
| PixHQA w/T | SAvg_wT=4.22 | Qwen3-32B (SAvg_wT 4.28) |
| PixTRQA | tIoU@0.5=51.61, 𝓙ℱ=32.25, SAvg=4.13 | Qwen3-32B (SAvg 3.94) |
UFVideo models outperform both multi-modal and specialist baselines across most segmentation, reasoning, and temporal localization metrics. Qualitative cases highlight accurate joint phrase/mask answers and temporally specific segmentation, underscoring the unified decoding mechanism's impact (Pan et al., 12 Dec 2025).
5. Architectural and Training Protocols Supporting UFVideo-Bench
UFVideo integrates multi-stage training across global, pixel, and temporal objectives:
- Visual Encoder: SigLIP-ViT-L/14 for video frames and mask prompts.
- LLM Backbone: Frozen VideoRefer-7B, extended with aligned special tokens (<Temp-τ>, <Ref>, [SEG]).
- Segmentation Decoder: SAM2-based module for direct pixel mask output.
- Multi-Task Loss: Combines next-token prediction for text/temporal outputs with BCE + Dice losses for mask prediction.
Prompt templates orchestrate interaction between video, text, and token embeddings, facilitating multi-granular supervision within a single network (Pan et al., 12 Dec 2025).
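As a rough illustration of how such an objective can be combined, the sketch below mixes next-token cross-entropy with BCE + Dice mask supervision. The weighting term lambda_mask and the exact formulation are assumptions, not the paper's reported configuration.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred_logits: torch.Tensor, target: torch.Tensor, eps: float = 1.0) -> torch.Tensor:
    """Soft Dice loss between predicted mask logits and binary mask targets."""
    pred = torch.sigmoid(pred_logits).flatten(1)
    target = target.float().flatten(1)
    inter = (pred * target).sum(-1)
    return (1 - (2 * inter + eps) / (pred.sum(-1) + target.sum(-1) + eps)).mean()

def multitask_loss(token_logits, token_targets, mask_logits, mask_targets, lambda_mask: float = 1.0):
    """Combine next-token prediction (text/temporal tokens) with BCE + Dice mask supervision."""
    # Next-token cross-entropy over the LLM vocabulary (covers text and temporal tokens).
    l_text = F.cross_entropy(token_logits.flatten(0, 1), token_targets.flatten())
    # Pixel-level supervision on the segmentation decoder output.
    l_bce = F.binary_cross_entropy_with_logits(mask_logits, mask_targets.float())
    l_dice = dice_loss(mask_logits, mask_targets)
    return l_text + lambda_mask * (l_bce + l_dice)
```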
6. Limitations, Insights, and Future Extension
UFVideo-Bench’s findings indicate several strengths and ongoing challenges:
- Mutual Multi-Granularity Gains: Training across global, pixel, and temporal tasks improves model robustness and semantic transfer.
- Token Design: Relative temporal tokens (<Temp-τ>) are critical for effective moment localization.
- Generalization Limitations: Current focus is on video-level tasks; richer dense captioning and longer videos (>100 s) present challenges for future implementations.
Planned extensions involve enhancing mask-token reasoning, scaling temporal architectures, and integrating interactive multi-round dialogue schemes (Pan et al., 12 Dec 2025).
7. Significance in Video LLM Benchmarking Landscape
UFVideo-Bench establishes a definitive evaluation framework for unified video-LLMs, filling the gap left by task-specific benchmarks. Its collaborative, multi-grained protocols catalyze developmental and comparative research into holistic video understanding, contributing foundational tools for subsequent innovations in general-purpose video LLMs (Pan et al., 12 Dec 2025).
Editor’s term: UFVideo-Bench refers strictly to the above-described composite suite, as presented in (Pan et al., 12 Dec 2025). All statistics, protocols, and results are referenced verbatim from the cited source.