UrBench: Multi-Source & Multi-View Benchmark
- UrBench is a dual-purpose benchmark suite that assesses MLLMs on cross-source news synthesis and on multi-view semantic and geometric reasoning.
- The news understanding benchmark (VNU-Bench) challenges models with integrating diverse multimedia sources, while the multi-view scene benchmark tests spatial alignment and geometric accuracy.
- Baseline results reveal significant performance gaps compared to human accuracy, underscoring the need for models with specialized multi-modal integration and geometric awareness.
UrBench is a term that refers to two unrelated benchmarks: VNU-Bench (also referred to as All-Angles Bench) for multi-source news video understanding, and a separate All-Angles Bench for multi-view geometric and semantic reasoning in real-world 3D scenes. Both are designed to quantify reasoning performance in multi-modal LLMs (MLLMs) by systematically evaluating their capacity for information synthesis and for cross-view or cross-source integration, but they differ fundamentally in task domain and construction (Liu et al., 6 Jan 2026, Yeh et al., 21 Apr 2025).
1. Conceptual Overview and Scope
VNU-Bench is presented as the first benchmark expressly targeting a model's ability to synthesize and reason over the same news event as reported, depicted, and narrated by multiple outlets in multiple modalities (video, transcript, audio). Prior video-QA and image-QA benchmarks, by contrast, treat each report in isolation and do not require cross-source or cross-modal reconciliation. Robust news understanding, such as verifying conflicting claims or constructing a coherent event timeline from dispersed reports, requires integrating and aligning partial and potentially contradictory fragments from different multimedia sources. VNU-Bench addresses this gap.
The second benchmark, All-Angles Bench, targets multi-view geometric reasoning. It evaluates whether MLLMs can align semantic and geometric information across multiple spatial viewpoints, with applications to 3D perception in embodied or robotic agents (Yeh et al., 21 Apr 2025).
2. Construction Pipeline and Quality Control
VNU-Bench: News Understanding
- News Group Construction: Events are selected across diverse domains (World, US & Canada, Business & Economy, Science & Technology, Climate & Environment, Health, Culture & Arts). Volunteers aggregate YouTube-sourced news videos by formulating search queries containing a named entity, action, and context keyword, then grouping 3–5 distinct outlet videos per event.
- Frame Selection: Each video yields 8 candidate frames (sampled at 1 fps). Using GPT-5, frames are captioned and scored for multi-source reasoning utility; the top-scoring frame per temporal segment is retained.
- QA Drafting (Hybrid Human–Model): An initial set of 150 QAs is hand-constructed, informing a question taxonomy (10 types) and style constraints. The taxonomy—with four representative frames and associated transcripts—is provided to GPT-5 to generate thousands of draft QAs in a standardized JSON schema.
- Automated Filtering:
  - Single-Source Solvability: QAs that any MLLM can answer with ≥0.8 confidence from a single video are discarded.
  - Ambiguity Analysis: Alternate options are scored by MLLMs for near-correctness; QAs with ambiguity severity ≥0.6 are removed.
  - Difficulty Filtering: QAs answered trivially (>0.9 confidence) by all models given the full input are culled.
- Human Verification: Remaining QAs are rated by volunteers on correctness, naturalness, and task compliance. The final set retains 2,501 high-quality QAs spanning 429 news groups and 1,405 videos, with a Fleiss' κ of 0.62 across all dimensions.
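The three automated filters above can be sketched as a single pass over the draft QAs. This is a minimal illustration that uses only the thresholds quoted in the list; the `DraftQA` record, its field names, and the way confidences are obtained are assumptions, not the benchmark's actual schema or code.

```python
from dataclasses import dataclass
from typing import List, Optional

# Thresholds quoted in the pipeline description.
SINGLE_SOURCE_MAX_CONF = 0.8  # discard if any model reaches this from one video
AMBIGUITY_MAX_SEVERITY = 0.6  # discard if a distractor is this close to correct
TRIVIAL_MIN_CONF = 0.9        # discard if every model exceeds this on full input


@dataclass
class DraftQA:
    """Hypothetical record; field names are illustrative only."""
    qa_id: str
    single_source_conf: List[float]  # best single-video confidence, per model
    ambiguity_severity: float        # worst near-correctness score of a distractor
    full_input_conf: List[float]     # confidence per model given all sources


def rejection_reason(qa: DraftQA) -> Optional[str]:
    """Return the first filter that rejects `qa`, or None if it survives."""
    if any(c >= SINGLE_SOURCE_MAX_CONF for c in qa.single_source_conf):
        return "single_source_solvable"
    if qa.ambiguity_severity >= AMBIGUITY_MAX_SEVERITY:
        return "ambiguous"
    if qa.full_input_conf and all(c > TRIVIAL_MIN_CONF for c in qa.full_input_conf):
        return "trivial"
    return None


# Made-up drafts, one per outcome.
drafts = [
    DraftQA("q1", [0.85, 0.40], 0.10, [0.70, 0.60]),
    DraftQA("q2", [0.30, 0.20], 0.65, [0.70, 0.60]),
    DraftQA("q3", [0.30, 0.20], 0.10, [0.95, 0.97]),
    DraftQA("q4", [0.30, 0.20], 0.10, [0.95, 0.60]),
]
kept = [qa.qa_id for qa in drafts if rejection_reason(qa) is None]
print(kept)  # ['q4']
```

Only drafts that pass all three checks would move on to human verification.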
All-Angles Bench: Multi-View Scene Reasoning
- Source Data: 90 multi-view real scenes drawn from datasets such as Ego-Exo4D and EgoHumans, with 4–5 spatially dispersed RGB frames per scene.
- QA Generation and Validation: GPT-4o drafts task-specific QAs, which PhD annotators refine to remove ambiguity and hallucinated distractors, with independent double annotation and random audits.
3. Question Taxonomy and Task Design
VNU-Bench
The taxonomy covers two conceptual categories, each with five types:
- Multi-source Comparison (T1–T5): Main claim, event details, visuals, narrative angle, and multi-modal presentation comparison.
- Cross-source Integration (T6–T10): Evidence integration spanning modalities and sources, conflict detection, temporal ordering, narrative reconstruction, and multi-source summary.
Each QA is constructed so that it cannot be answered without cross-source integration. Two representative types:
| Type (ID) | Category | Sample Reasoning Axis |
|---|---|---|
| T1 | Compare | Headline/main claim alignment |
| T7 | Integrate | Detect visual-claim conflicts across videos |
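For scripting against the taxonomy, the ten types can be held in a small lookup table. The sketch below assumes that the order in which the types are listed in the text matches their IDs (T1 through T10), which agrees with the two rows of the table; the short labels and helper names are illustrative.

```python
# Type ID -> (category, short label). The ID-to-label order is an assumption
# based on the order in which the types are listed in the text.
TAXONOMY = {
    "T1": ("comparison", "main claim"),
    "T2": ("comparison", "event details"),
    "T3": ("comparison", "visuals"),
    "T4": ("comparison", "narrative angle"),
    "T5": ("comparison", "multi-modal presentation"),
    "T6": ("integration", "evidence integration"),
    "T7": ("integration", "conflict detection"),
    "T8": ("integration", "temporal ordering"),
    "T9": ("integration", "narrative reconstruction"),
    "T10": ("integration", "multi-source summary"),
}


def category_of(type_id: str) -> str:
    """Return 'comparison' or 'integration' for a type ID such as 'T7'."""
    return TAXONOMY[type_id][0]


def types_in(category: str) -> list:
    """Return the type IDs that belong to a category, in ID order."""
    return [tid for tid, (cat, _) in TAXONOMY.items() if cat == category]


print(category_of("T7"))        # integration
print(types_in("comparison"))   # ['T1', 'T2', 'T3', 'T4', 'T5']
```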
Multi-View Scene Bench
Task suite (all tasks use a three-option multiple-choice format):
- Counting unique object instances across views.
- Attribute identification of scene objects between views.
- Relative distance between object and camera centers.
- Relative direction analysis.
- Trajectory prediction after object manipulation.
- Camera pose/top-down layout estimation.
Each task is coupled with a precise mathematical formalization, e.g., set cardinality for counting, or a trajectory expressed in new coordinates (Yeh et al., 21 Apr 2025).
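The set-cardinality view of counting can be illustrated with a short sketch. It assumes that object identities have already been matched across views (the instance labels below are hypothetical), which is the hard part that the benchmark actually probes. The count is then the size of the union of the per-view sets, not the sum or the maximum of per-view counts.

```python
from typing import List, Set


def count_unique(views: List[Set[str]]) -> int:
    """Cardinality of the union of the per-view instance sets."""
    return len(set().union(*views))


# Hypothetical instance IDs visible in three views of one scene.
views = [
    {"person_1", "person_2"},
    {"person_2", "person_3"},
    {"person_1", "person_3", "person_4"},
]

print(count_unique(views))          # 4  (union across views)
print(sum(len(v) for v in views))   # 7  (double-counts shared instances)
print(max(len(v) for v in views))   # 3  (misses instances seen elsewhere)
```

The gap between the three numbers shows why aggregating per-view predictions independently does not solve the task.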
4. Dataset Statistics and Input Formatting
VNU-Bench
- Total questions: 2,501
- News groups: 429
- Videos: 1,405 (avg. 3.65 min)
- Per-type QAs: ∼250 each (T1–T10)
- Domain spread: Business/Economy (566), Climate/Environment (339), Culture/Arts (465), Health (306), Science/Technology (384), US/Canada (254), World (187)
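As a quick consistency check, the per-domain counts above sum to the reported total, and the per-group averages follow from the listed figures. The snippet uses only numbers from the list; the variable names are illustrative.

```python
DOMAIN_COUNTS = {
    "Business/Economy": 566,
    "Climate/Environment": 339,
    "Culture/Arts": 465,
    "Health": 306,
    "Science/Technology": 384,
    "US/Canada": 254,
    "World": 187,
}
TOTAL_QUESTIONS = 2501
NEWS_GROUPS = 429
VIDEOS = 1405

# The domain counts should account for every question.
total = sum(DOMAIN_COUNTS.values())
assert total == TOTAL_QUESTIONS

# Share of questions per domain, largest first.
shares = {d: 100 * n / total for d, n in DOMAIN_COUNTS.items()}
for domain, pct in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{domain:<22}{pct:5.1f}%")

print(f"QAs per news group:    {TOTAL_QUESTIONS / NEWS_GROUPS:.2f}")  # 5.83
print(f"Videos per news group: {VIDEOS / NEWS_GROUPS:.2f}")           # 3.28
```

The average of about 3.3 videos per group is consistent with the 3–5 videos per event described in the construction pipeline.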
A frame ablation shows that accuracy peaks at ∼6 frames per video and declines beyond that, which is attributed to context overload. Higher input resolution improves accuracy monotonically.
Multi-View Scene Bench
- Scenes: 90
- Questions: 2,132
- Annotation pairs: ~85% of non-counting QAs have paired variants, supporting cross-view consistency evaluation.
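Paired variants make it possible to check whether a model answers both versions of a question consistently. The sketch below is one plausible way to summarize such pairs, not necessarily the benchmark's exact definition; the input format (one pair of correctness flags per question pair) and the outcome names are assumptions.

```python
from collections import Counter
from typing import Dict, List, Tuple


def pair_outcomes(pairs: List[Tuple[bool, bool]]) -> Dict[str, float]:
    """Fraction of pairs that are both correct, both wrong, or inconsistent.

    Each pair holds (original_correct, variant_correct) for one question pair.
    """
    counts = Counter()
    for original_ok, variant_ok in pairs:
        if original_ok and variant_ok:
            counts["both_correct"] += 1
        elif not original_ok and not variant_ok:
            counts["both_wrong"] += 1
        else:
            counts["inconsistent"] += 1
    n = len(pairs)
    keys = ("both_correct", "both_wrong", "inconsistent")
    return {k: (counts[k] / n if n else 0.0) for k in keys}


# Made-up results for four question pairs.
results = [(True, True), (True, False), (False, False), (True, True)]
print(pair_outcomes(results))
# {'both_correct': 0.5, 'both_wrong': 0.25, 'inconsistent': 0.25}
```

A high inconsistent fraction would suggest that correct answers on individual questions do not reflect a stable cross-view understanding.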
5. Evaluation Protocol and Baseline Results
For both benchmarks, zero-shot multiple-choice accuracy is the principal metric:
VNU-Bench
- Closed-source MLLMs: Gemini-2.5-Pro (60.17%), Claude-4.5-Sonnet (58.89%), Gemini-2.5-Flash (54.44%)
- Open-source: Qwen3-VL-30B (56.14%), MiniCPM-V-4.5-9B (54.97%), Qwen3-VL-8B (54.06%)
- Question category split: Models score higher on the T1–T5 comparison tasks than on the T6–T10 integration tasks, which are harder; the drop is ∼5–10 points.
- Domain analysis: Culture/Arts is easier; the World and Health domains are more challenging, with more variable accuracy.
Performance bottlenecks are pronounced in T7 (conflict detection, max. 48.3% even for Gemini-2.5-Pro).
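The comparison-versus-integration split can be computed from per-question predictions as follows. This is a minimal sketch: the record format `(type_id, predicted, gold)` and the function names are assumptions, and the example answers are made up rather than taken from the benchmark.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def category_for(type_id: str) -> str:
    """Map T1-T5 to 'comparison' and T6-T10 to 'integration'."""
    number = int(type_id.lstrip("T"))
    if not 1 <= number <= 10:
        raise ValueError(f"unknown question type: {type_id}")
    return "comparison" if number <= 5 else "integration"


def accuracy_report(records: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Multiple-choice accuracy overall and per category."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for type_id, predicted, gold in records:
        for key in ("overall", category_for(type_id)):
            totals[key] += 1
            hits[key] += int(predicted == gold)
    return {key: hits[key] / totals[key] for key in totals}


# Made-up predictions: (question type, model answer, correct answer).
records = [
    ("T1", "A", "A"), ("T3", "B", "B"), ("T5", "C", "A"),
    ("T7", "D", "B"), ("T8", "A", "A"), ("T10", "C", "D"),
]
report = accuracy_report(records)
gap_points = 100 * (report["comparison"] - report["integration"])
print(report["overall"])  # 0.5
```

Here `gap_points` is the comparison-minus-integration difference in percentage points, the quantity the category split above refers to.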
Multi-View Scene Bench
- Human average: ≈82%
- Best MLLMs: ≈60%
- By task: Relative distance and attribute identification reach ~80–81%; camera pose estimation remains a major unsolved challenge (models ≤44%, humans ≈89%).
6. Key Insights and Research Implications
VNU-Bench demonstrates substantial gaps between current MLLMs and human-level news synthesis, most acute in cross-source integration, conflict detection, and narrative reconstruction. The news understanding domain highlights the absence of architectures able to perform cross-document and cross-modal alignment. A moderate number of frames boosts model accuracy, but excessive context degrades performance due to overload.
The multi-view geometric reasoning variant of All-Angles Bench reveals similar gaps, particularly in cross-view correspondence under partial occlusion and coarse camera-pose estimation. Current models tend to aggregate per-view predictions rather than reconcile them globally; even advanced prompting (Zero-Shot CoT) yields only incremental improvements, indicating the need for architectures with explicit geometric awareness (Yeh et al., 21 Apr 2025).
7. Future Directions
Proposed avenues from both benchmarks include:
- Architectures specialized for cross-document/video alignment and contradiction resolution, potentially integrating geometric reasoning blocks or 3D scene-graphs.
- Retrieval-augmented pipelines to fill narrative or visual gaps.
- Fine-tuning strategies explicitly targeting cross-source or cross-view integration.
- Extensions beyond English-language and mainstream outlets, and toward non-news or non-standard content (e.g., local news, user-generated media).
- Moving beyond multiple-choice toward open-ended summarization and evidence citation to better approximate human-level synthesis.
- For geometric reasoning, augmenting training with view-synthesis, 3D pre-training, or hybrid implicit/explicit pose estimation modules.
Both benchmarks probe the present limits of multimodal LLM architectures and motivate new methods and dataset extensions in pursuit of human-level multi-source and multi-view understanding (Liu et al., 6 Jan 2026, Yeh et al., 21 Apr 2025).