UrBench: Multi-Source & Multi-View Benchmark

Updated 2 June 2026
  • UrBench is a dual-purpose benchmark suite that assesses MLLMs on cross-source news synthesis and on multi-view semantic and geometric reasoning.
  • The news understanding benchmark (VNU-Bench) challenges models with integrating diverse multimedia sources, while the multi-view scene benchmark tests spatial alignment and geometric accuracy.
  • Baseline results reveal significant performance gaps compared to human accuracy, underscoring the need for models with specialized multi-modal integration and geometric awareness.

UrBench is a term that references two unrelated benchmarks: VNU-Bench (also known as All-Angles Bench) for multi-source news video understanding, and another All-Angles Bench for multi-view geometric and semantic reasoning in real-world 3D scenes. Both quantify reasoning performance in multimodal LLMs (MLLMs) by systematically evaluating information synthesis and cross-view or cross-source integration, but they differ fundamentally in task domain and construction (Liu et al., 6 Jan 2026, Yeh et al., 21 Apr 2025).

1. Conceptual Overview and Scope

All-Angles Bench (VNU-Bench) is the first benchmark expressly targeting a model’s ability to synthesize and reason over the same news event as reported, depicted, and narrated by multiple outlets in multiple modalities (video, transcript, audio). This stands in contrast to prior video-QA and image-QA benchmarks, which treat reports in isolation and do not require cross-source or cross-modal reconciliation. Robust news understanding, such as verifying conflicting claims or constructing a coherent event timeline from dispersed reports, requires integrating and aligning partial and potentially contradictory fragments from different multimedia sources. VNU-Bench addresses this gap.

Another All-Angles Bench targets multi-view geometric reasoning, evaluating whether MLLMs can align semantic and geometric information across multiple spatial viewpoints, with applications to 3D perception in embodied or robotic agents (Yeh et al., 21 Apr 2025).

2. Construction Pipeline and Quality Control

VNU-Bench: News Understanding

  1. News Group Construction: Events are selected across diverse domains (World, US & Canada, Business & Economy, Science & Technology, Climate & Environment, Health, Culture & Arts). Volunteers aggregate YouTube-sourced news videos by formulating search queries containing a named entity, action, and context keyword, then grouping 3–5 distinct outlet videos per event.
  2. Frame Selection: Each video yields 8 candidate frames (sampled at 1 fps). Using GPT-5, frames are captioned and scored for multi-source reasoning utility; the top-scoring frame per temporal segment is retained.
  3. QA Drafting (Hybrid Human–Model): An initial set of 150 QAs is hand-constructed, informing a question taxonomy (10 types) and style constraints. The taxonomy—with four representative frames and associated transcripts—is provided to GPT-5 to generate thousands of draft QAs in a standardized JSON schema.
  4. Automated Filtering:
    • Single-Source Solvability: QAs that any MLLM answers with ≥0.8 confidence from a single video are discarded.
    • Ambiguity Analysis: Alternate options are scored by MLLMs for near-correctness; QAs with ambiguity (severity ≥0.6) are removed.
    • Difficulty Filtering: QAs trivially answered (>0.9 confidence) by all models with full input are culled.
  5. Human Verification: Volunteers rate the remaining QAs on correctness, naturalness, and task compliance. The final set retains 2,501 high-quality QAs spanning 429 news groups and 1,405 videos, with inter-annotator agreement of Fleiss' κ = 0.62 across all dimensions.
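
The three automated filters in step 4 can be sketched as a single predicate. This is a minimal illustration rather than the authors' pipeline: the `DraftQA` record, its field names, and the function `passes_filters` are hypothetical, and only the thresholds (0.8, 0.6, 0.9) come from the description above.

```python
from dataclasses import dataclass
from typing import List

# Thresholds taken from the filtering description; everything else is assumed.
SINGLE_SOURCE_MAX = 0.8  # discard if any model reaches >= 0.8 from one video
AMBIGUITY_MAX = 0.6      # discard if any distractor has severity >= 0.6
TRIVIAL_MIN = 0.9        # discard if all models exceed 0.9 with full input


@dataclass
class DraftQA:
    """Hypothetical record holding per-model scores for one draft question."""
    question: str
    single_source_conf: List[float]   # best single-video confidence per model
    distractor_severity: List[float]  # near-correctness score per wrong option
    full_input_conf: List[float]      # confidence per model with all sources


def passes_filters(qa: DraftQA) -> bool:
    """Return True if the draft survives all three automated filters."""
    # Single-source solvability: one model solving it from one video is enough.
    if any(c >= SINGLE_SOURCE_MAX for c in qa.single_source_conf):
        return False
    # Ambiguity: any nearly correct distractor makes the item ambiguous.
    if any(s >= AMBIGUITY_MAX for s in qa.distractor_severity):
        return False
    # Difficulty: discard only when every model finds the item trivial.
    if qa.full_input_conf and all(c > TRIVIAL_MIN for c in qa.full_input_conf):
        return False
    return True
```

Note the asymmetry in this sketch: the first two filters trigger on any single model or distractor, while the difficulty filter triggers only when all models agree the item is trivial.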

All-Angles Bench: Multi-View Scene Reasoning

  1. Source Data: 90 multi-view real scenes, drawing on datasets like Ego-Exo4D and EgoHumans, leveraging 4–5 spatially dispersed RGB frames per scene.
  2. QA Generation and Validation: GPT-4o drafts task-specific QAs, refined by PhD-annotators to remove ambiguity and hallucinated distractors, with independent double annotation and random audits.

3. Question Taxonomy and Task Design

VNU-Bench

The taxonomy covers two conceptual categories, each with five types:

  • Multi-source Comparison (T1–T5): Main claim, event details, visuals, narrative angle, and multi-modal presentation comparison.
  • Cross-source Integration (T6–T10): Evidence integration spanning modalities and sources, conflict detection, temporal ordering, narrative reconstruction, and multi-source summary.

Each QA is constructed so that it cannot be answered without cross-source integration. Two representative types:

Type (ID)   Category    Sample Reasoning Axis
T1          Compare     Headline/main claim alignment
T7          Integrate   Detect visual-claim conflicts across videos

Multi-View Scene Bench

Task suite (all in three-choice multiple-choice format):

  • Counting unique object instances across views.
  • Attribute identification of scene objects between views.
  • Relative distance between object and camera centers.
  • Relative direction analysis.
  • Trajectory prediction after object manipulation.
  • Camera pose/top-down layout estimation.

Each task is coupled with a precise mathematical formalization, e.g., the set cardinality $\bigl|\bigcup_{i=1}^{N} O_i\bigr|$ for counting, or $\Delta_2 = R_{1\to2}\,\Delta$ for a trajectory expressed in new coordinates (Yeh et al., 21 Apr 2025).
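
The two formalizations above can be illustrated with a short sketch. It assumes that object identities have already been matched across views (so each O_i is a set of shared IDs) and that R_{1→2} is given as a 3×3 rotation matrix; the function names `count_unique` and `rotate` are hypothetical and not taken from the benchmark.

```python
from typing import List, Sequence, Set


def count_unique(objects_per_view: Sequence[Set[str]]) -> int:
    """Cardinality of O_1 ∪ ... ∪ O_N: an object seen in several views counts once."""
    return len(set().union(*objects_per_view))


def rotate(R: List[List[float]], delta: List[float]) -> List[float]:
    """Delta_2 = R_{1->2} Delta: re-express a displacement in view 2's axes."""
    return [sum(r_ij * d_j for r_ij, d_j in zip(row, delta)) for row in R]


# Two views share the object "cup"; the union holds three distinct objects.
views = [{"cup", "chair"}, {"cup", "lamp"}]
n_objects = count_unique(views)

# Assumed example: view 2 is rotated 90 degrees about the z-axis relative to view 1.
R_1_to_2 = [[0.0, -1.0, 0.0],
            [1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0]]
delta_2 = rotate(R_1_to_2, [1.0, 0.0, 0.0])
```

Summing per-view counts would give four objects in this example, whereas the union gives three; this is the difference between aggregating per-view predictions and reconciling them across views.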

4. Dataset Statistics and Input Formatting

VNU-Bench

  • Total questions: 2,501
  • News groups: 429
  • Videos: 1,405 (avg. 3.65 min)
  • Per-type QAs: ∼250 each (T1–T10)
  • Domain spread: Business/Economy (566), Climate/Environment (339), Culture/Arts (465), Health (306), Science/Technology (384), US/Canada (254), World (187)
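
As a quick consistency check on the figures above, the per-domain counts can be summed and a few averages derived. The variable names are illustrative only; all numbers are copied from the list.

```python
# Per-domain question counts as listed above.
domain_counts = {
    "Business/Economy": 566,
    "Climate/Environment": 339,
    "Culture/Arts": 465,
    "Health": 306,
    "Science/Technology": 384,
    "US/Canada": 254,
    "World": 187,
}
TOTAL_QUESTIONS = 2501
NEWS_GROUPS = 429
VIDEOS = 1405
NUM_TYPES = 10  # T1-T10

domain_total = sum(domain_counts.values())
videos_per_group = VIDEOS / NEWS_GROUPS
questions_per_type = TOTAL_QUESTIONS / NUM_TYPES
questions_per_group = TOTAL_QUESTIONS / NEWS_GROUPS

print(domain_total == TOTAL_QUESTIONS)
print(round(videos_per_group, 2), round(questions_per_type, 1),
      round(questions_per_group, 2))
```

The domain counts sum to 2,501, matching the stated total, and the average of about 3.3 videos per news group is consistent with grouping 3–5 outlet videos per event.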

A frame ablation shows that accuracy peaks at roughly 6 frames per video and declines beyond that point, which is attributed to context overload. Accuracy increases monotonically with input resolution.

Multi-View Scene Bench

  • Scenes: 90
  • Questions: 2,132
  • Annotation pairs: ~85% of non-counting QAs have paired variants, supporting cross-view consistency evaluation.
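
One way paired variants could support a consistency evaluation is sketched below. This is an assumption about how such a score might be computed, not the benchmark's published protocol: each pair records whether the model answered the original question and its paired variant correctly, and the function name `paired_consistency` and the three category labels are hypothetical.

```python
from typing import Dict, List, Tuple


def paired_consistency(pairs: List[Tuple[bool, bool]]) -> Dict[str, float]:
    """Split paired outcomes into both-correct, both-wrong, and inconsistent rates."""
    n = len(pairs)
    if n == 0:
        raise ValueError("at least one question pair is required")
    both_correct = sum(1 for a, b in pairs if a and b)
    both_wrong = sum(1 for a, b in pairs if not a and not b)
    return {
        "both_correct": both_correct / n,
        "both_wrong": both_wrong / n,
        # Exactly one of the two variants answered correctly.
        "inconsistent": (n - both_correct - both_wrong) / n,
    }


# Each tuple: (original answered correctly, paired variant answered correctly).
rates = paired_consistency([(True, True), (True, False), (False, False), (False, True)])
```

A high inconsistent rate would indicate that a correct answer on one variant does not reflect a stable understanding of the scene.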

5. Evaluation Protocol and Baseline Results

For both benchmarks, zero-shot multiple-choice accuracy is the principal metric, computed over Q questions with predicted answers ŷ_i and ground-truth answers y_i:

$$\mathrm{Acc} = \frac{1}{Q} \sum_{i=1}^{Q} \mathbf{1}[\hat{y}_i = y_i] \times 100\%$$
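
The metric translates directly into code. The following is a minimal sketch in which answers are represented as option letters; that representation and the function name `accuracy` are assumptions made for illustration.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Percentage of questions where the predicted option equals the ground truth."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("at least one question is required")
    # The indicator 1[y_hat_i = y_i] is the boolean comparison, summed as 0/1.
    correct = sum(p == y for p, y in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Three of four predictions match the ground truth.
score = accuracy(["A", "B", "C", "A"], ["A", "B", "A", "A"])  # 75.0
```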

VNU-Bench

  • Closed-source MLLMs: Gemini-2.5-Pro (60.17%), Claude-4.5-Sonnet (58.89%), Gemini-2.5-Flash (54.44%)
  • Open-source: Qwen3-VL-30B (56.14%), MiniCPM-V-4.5-9B (54.97%), Qwen3-VL-8B (54.06%)
  • Question category split: Models score roughly 5–10 points higher on the T1–T5 comparison tasks than on the T6–T10 integration tasks, indicating that integration is harder.
  • Domain analysis: Culture/Arts is the easiest domain; World and Health are more challenging and show more variable performance.

The most pronounced bottleneck is T7 (conflict detection), where even Gemini-2.5-Pro reaches at most 48.3%.

Multi-View Scene Bench

  • Human average: ≈82%
  • Best MLLMs: ≈60%
  • By task: Relative distance and attribute identification reach roughly 80–81%; camera pose estimation remains a major unsolved challenge (models ≤44%, humans ≈89%).

6. Key Insights and Research Implications

VNU-Bench demonstrates substantial gaps between current MLLMs and human-level news synthesis, most acute in cross-source integration, conflict detection, and narrative reconstruction. The news understanding domain highlights the absence of architectures able to perform cross-document and cross-modal alignment. A moderate number of frames boosts model accuracy, but excessive context degrades performance due to overload.

The multi-view geometric reasoning variant of All-Angles Bench reveals similar gaps, particularly in cross-view correspondence under partial occlusion and coarse camera-pose estimation. Current models tend to aggregate per-view predictions rather than reconcile them globally; even advanced prompting (Zero-Shot CoT) yields only incremental improvements, indicating the need for architectures with explicit geometric awareness (Yeh et al., 21 Apr 2025).

7. Future Directions

Proposed avenues from both benchmarks include:

  • Architectures specialized for cross-document/video alignment and contradiction resolution, potentially integrating geometric reasoning blocks or 3D scene-graphs.
  • Retrieval-augmented pipelines to fill narrative or visual gaps.
  • Fine-tuning strategies explicitly targeting cross-source or cross-view integration.
  • Extensions beyond English-language and mainstream outlets, and toward non-news or non-standard content (e.g., local news, user-generated media).
  • Moving beyond multiple-choice toward open-ended summarization and evidence citation to better approximate human-level synthesis.
  • For geometric reasoning, augmenting training with view-synthesis, 3D pre-training, or hybrid implicit/explicit pose estimation modules.

Both benchmarks serve as rigorous probes into the present limits of multimodal LLM architectures, presenting robust challenges that incentivize new methodological innovations and dataset extensions in the pursuit of human-level multi-source and multi-view understanding (Liu et al., 6 Jan 2026, Yeh et al., 21 Apr 2025).
