UFVideo-Bench: Unified Video Benchmark
- UFVideo-Bench is a task suite that evaluates unified multi-grained video understanding by integrating global reasoning, pixel-level segmentation, and temporal localization tasks.
- It features three collaborative evaluation tasks—PixRQA, PixHQA, and PixTRQA—designed to rigorously assess spatial, hybrid, and temporal reasoning capabilities.
- The benchmark leverages diverse datasets and standardized metrics to enable direct comparisons between multi-modal LLMs and domain-specific specialists, driving innovation in video-language research.
UFVideo-Bench is a task suite and benchmark designed to evaluate the unified multi-grained cooperative understanding abilities of large video-LLMs. Developed as part of the UFVideo framework, it addresses key limitations in video understanding evaluation by structuring tasks across global, pixel, and temporal dimensions, and providing standardized, challenging collaborative protocols. UFVideo-Bench enables rigorous comparison between multi-modal LLMs and domain-specific specialists across diverse facets of video reasoning, segmentation, and grounding (Pan et al., 12 Dec 2025).
1. Motivation for Multi-Grained Video Benchmarking
The advancement of Video LLMs has exposed significant gaps in comprehensive video perception, as prior benchmarks primarily target specialized or isolated tasks (e.g., global QA, pixel-level segmentation, temporal localization) without integrated evaluation across semantic scales. UFVideo-Bench was introduced to provide unified cooperative benchmarks covering:
- Global Reasoning: Holistic video question-answering.
- Pixel-Level Understanding: Fine-grained referring and segmentation tasks.
- Temporal Localization: Reasoning about actions and events within explicit time intervals.
This structure underlines the necessity of joint evaluation protocols for models that purport to understand both high-level narratives and low-level spatio-temporal details simultaneously (Pan et al., 12 Dec 2025).
2. UFVideo-Bench: Task Composition and Design
UFVideo-Bench is subdivided into three collaborative evaluation tasks:
| Task Name | Primary Focus | Output Type |
|---|---|---|
| PixRQA | Referring Reasoning + Segmentation | Phrase + Pixel Mask |
| PixHQA | Hybrid QA w/ Temporal Interval | Score, Reasoning, Mask |
| PixTRQA | Temporal Reasoning + Segmentation | Temporal Interval + Mask |
- PixRQA (Pixel Referring Question Answering): Requires models to generate both descriptive answers and per-object segmentation masks for spatially grounded objects.
- PixHQA (Hybrid Question Answering): Combines phrase reasoning, interval specificity, and mask prediction.
- PixTRQA (Pixel Temporal Reasoning Question Answering): Targets temporal localization jointly with spatial reasoning, requiring models to output the relevant temporal interval (scored by temporal IoU) together with segmentation masks for the grounded frames.
Each protocol integrates prompt templates using <Ref>, [SEG], and <Temp-x> tokens to coordinate object references and temporal spans (Pan et al., 12 Dec 2025).
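For illustration, the following Python sketch shows how such a prompt template might be assembled. The token strings follow the notation above; the surrounding wording, the builder function, and its parameters are assumptions rather than the benchmark's released templates.

```python
# Hypothetical sketch of a UFVideo-Bench-style prompt template.
# Token strings (<Ref>, [SEG], <Temp-x>) follow the paper's notation;
# the wording and the builder interface are illustrative assumptions.

def build_pixtrqa_prompt(question: str, num_objects: int, num_intervals: int) -> str:
    """Assemble a prompt asking for referred-object masks and temporal spans."""
    ref_slots = " ".join("<Ref>" for _ in range(num_objects))            # object reference slots
    seg_slots = " ".join("[SEG]" for _ in range(num_objects))            # mask-output slots
    temp_slots = " ".join(f"<Temp-{i}>" for i in range(num_intervals))   # temporal-span slots
    return (
        f"{question}\n"
        f"Refer to the target object(s): {ref_slots}\n"
        f"Answer with segmentation token(s): {seg_slots}\n"
        f"Localize the relevant moment(s): {temp_slots}"
    )

print(build_pixtrqa_prompt("When does the dog jump over the fence?", 1, 1))
```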
3. Data Sources and Evaluation Protocols
UFVideo-Bench leverages samples constructed or aggregated from public benchmarks and custom annotation schemes:
- Datasets Used: VideoRefer-700K for pixel-level tasks; MeViS, Ref-DAVIS17, and YouTube-VOS for segmentation; Charades-STA and DiDeMo for temporal reasoning.
- Sample Construction: Each sample pairs ground-truth segmentation masks with temporal interval labels, requiring genuinely multi-modal reasoning; a hypothetical sample layout is sketched below.
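The source does not reproduce the released annotation schema; as a rough illustration, a unified sample might bundle the three granularities as follows (all field names are hypothetical).

```python
# Illustrative layout of a unified benchmark sample combining global QA,
# pixel-level mask annotations, and a temporal interval label.
# Field names are hypothetical; the actual annotation schema may differ.
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

@dataclass
class UnifiedVideoSample:
    video_id: str
    task: str                                   # "PixRQA", "PixHQA", or "PixTRQA"
    question: str
    answer: str                                 # reference textual answer
    mask_rles: Dict[int, str] = field(default_factory=dict)   # frame index -> RLE-encoded mask
    interval: Optional[Tuple[float, float]] = None             # (start_sec, end_sec) for temporal tasks

sample = UnifiedVideoSample(
    video_id="charades_0001",
    task="PixTRQA",
    question="Segment the cup the person picks up and localize that moment.",
    answer="The person lifts the red cup from the table.",
    mask_rles={12: "<rle>", 13: "<rle>"},
    interval=(4.2, 7.8),
)
```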
Evaluation metrics include:
- SC, AD, TD, HD: Subject Correspondence, Appearance Description, Temporal Description, and Hallucination Detection scores, averaged for pixel-level tasks.
- 𝓙ℱ: Segmentation quality, the mean of region similarity (Jaccard index 𝓙) and contour accuracy (ℱ).
- tIoU@0.5: Temporal Intersection over Union at a 0.5 threshold for interval localization.
- SAvg: Average reasoning score from human evaluators.
All results are reported against standardized baselines such as UniPixel, RGA3, Qwen2-VL-7B, and GPT-4o (Pan et al., 12 Dec 2025).
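The paper's evaluation scripts are not reproduced here; the sketch below shows the standard formulas behind two of these metrics, temporal IoU at a 0.5 threshold and the region-similarity (Jaccard, 𝓙) component of 𝓙ℱ. The contour-accuracy term ℱ and the judged SAvg score are omitted, and the function names are illustrative.

```python
import numpy as np

def temporal_iou(pred, gt):
    """Temporal IoU between predicted and ground-truth (start_sec, end_sec) intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def mask_jaccard(pred_mask, gt_mask):
    """Region similarity 𝓙 (Jaccard index) between binary masks; contour accuracy ℱ is omitted."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as a perfect match
    return float(np.logical_and(pred, gt).sum()) / float(union)

def tiou_at_threshold(pred_intervals, gt_intervals, thr=0.5):
    """Fraction of predictions whose temporal IoU with the ground truth meets the threshold."""
    hits = [temporal_iou(p, g) >= thr for p, g in zip(pred_intervals, gt_intervals)]
    return float(np.mean(hits)) if hits else 0.0
```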
4. Benchmarked Model Performance and Comparative Analysis
In empirical studies, UFVideo (7B) demonstrates superior multi-grained video understanding on UFVideo-Bench:
| Task | UFVideo (7B) | Closest Baseline |
|---|---|---|
| PixRQA | 𝓙ℱ=53.39, SAvg=3.35 | GPT-4o (SAvg 2.58) |
| PixHQA w/T | SAvg_wT=4.22 | Qwen3-32B (SAvg_wT 4.28) |
| PixTRQA | tIoU@0.5=51.61, 𝓙ℱ=32.25, SAvg=4.13 | Qwen3-32B (SAvg 3.94) |
UFVideo models outperform both multi-modal and specialist baselines across most segmentation, reasoning, and temporal localization metrics. Qualitative cases highlight accurate joint phrase/mask answers and temporally specific segmentation, underscoring the unified decoding mechanism's impact (Pan et al., 12 Dec 2025).
5. Architectural and Training Protocols Supporting UFVideo-Bench
UFVideo integrates multi-stage training across global, pixel, and temporal objectives:
- Visual Encoder: SigLIP-ViT-L/14 for video frames and mask prompts.
- LLM Backbone: Frozen VideoRefer-7B, extended with aligned special tokens (<Temp-τ>, <Ref>, [SEG]).
- Segmentation Decoder: SAM2-based module for direct pixel mask output.
- Multi-Task Loss: Combines next-token prediction for text/temporal outputs with BCE + Dice losses for mask prediction.
Prompt templates orchestrate interaction between video, text, and token embeddings, facilitating multi-granular supervision within a single network (Pan et al., 12 Dec 2025).
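As a rough illustration of how such an objective can be combined, the sketch below mixes next-token cross-entropy with BCE + Dice mask supervision. The weighting term lambda_mask and the exact formulation are assumptions, not the paper's reported configuration.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred_logits: torch.Tensor, target: torch.Tensor, eps: float = 1.0) -> torch.Tensor:
    """Soft Dice loss between predicted mask logits and binary mask targets."""
    pred = torch.sigmoid(pred_logits).flatten(1)
    target = target.float().flatten(1)
    inter = (pred * target).sum(-1)
    return (1 - (2 * inter + eps) / (pred.sum(-1) + target.sum(-1) + eps)).mean()

def multitask_loss(token_logits, token_targets, mask_logits, mask_targets, lambda_mask: float = 1.0):
    """Combine next-token prediction (text/temporal tokens) with BCE + Dice mask supervision."""
    # Next-token cross-entropy over the LLM vocabulary (covers text and temporal tokens).
    l_text = F.cross_entropy(token_logits.flatten(0, 1), token_targets.flatten())
    # Pixel-level supervision on the segmentation decoder output.
    l_bce = F.binary_cross_entropy_with_logits(mask_logits, mask_targets.float())
    l_dice = dice_loss(mask_logits, mask_targets)
    return l_text + lambda_mask * (l_bce + l_dice)
```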
6. Limitations, Insights, and Future Extension
UFVideo-Bench’s findings indicate several strengths and ongoing challenges:
- Mutual Multi-Granularity Gains: Training across global, pixel, and temporal tasks improves model robustness and semantic transfer.
- Token Design: Relative temporal tokens (<Temp-τ>) are critical for effective moment localization.
- Generalization Limitations: Current focus is on video-level tasks; richer dense captioning and longer videos (>100 s) present challenges for future implementations.
Planned extensions involve enhancing mask-token reasoning, scaling temporal architectures, and integrating interactive multi-round dialogue schemes (Pan et al., 12 Dec 2025).
7. Significance in Video LLM Benchmarking Landscape
UFVideo-Bench establishes a definitive evaluation framework for unified video-LLMs, filling the gap left by task-specific benchmarks. Its collaborative, multi-grained protocols catalyze developmental and comparative research into holistic video understanding, contributing foundational tools for subsequent innovations in general-purpose video LLMs (Pan et al., 12 Dec 2025).
Editor’s term: UFVideo-Bench refers strictly to the above-described composite suite, as presented in (Pan et al., 12 Dec 2025). All statistics, protocols, and results are referenced verbatim from the cited source.