All-Angles Bench: Multi-View & QCD Benchmark

Updated 2 June 2026

All-Angles Bench names two unrelated benchmarks, one assessing multi-view reasoning in MLLMs and one covering the QCD cusp anomalous dimension, that share the goal of testing behavior across the full range of angles.
In MLLMs, it evaluates spatial tasks such as counting, attribute identification, relative distance, and camera pose using paired questions to detect multi-view inconsistencies.
For QCD, it benchmarks analytic calculations of the fermionic cusp anomalous dimension across all Euclidean and Minkowskian angles to validate resummation models.

The term “All-Angles Bench” refers to two rigorous benchmarks from unrelated domains of contemporary technical research: one in multi-view visual grounding and geometric reasoning for Multi-Modal LLMs (MLLMs), the other in the analytic study of the QCD cusp anomalous dimension for arbitrary Euclidean and Minkowskian angles. What unifies them is the ambition to evaluate the fidelity, accuracy, and generalization of systems or models across the full range of relevant angular or viewpoint configurations, rather than at select special cases. The following sections outline both meanings in their disciplinary contexts and their impact.

1. All-Angles Bench for Multi-View Understanding in MLLMs

All-Angles Bench, introduced in "Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs" (Yeh et al., 21 Apr 2025), is a curated, publicly released benchmark for systematically assessing geometric and referential consistency of MLLMs across diverse camera viewpoints in real-world scenes. Developed in response to the observation that modern MLLMs (e.g., GPT-4o, Gemini-2.0-Flash, Claude-3.7-Sonnet) excel at isolated image reasoning but fail at multi-view geometric consistency, this benchmark exposes and quantifies key failure modes in multi-view correspondence and spatial reasoning.

2. Benchmark Construction: Dataset and Annotation Protocol

All-Angles Bench consists of 90 diverse real-world scenes (predominantly from Ego-Exo4D and EgoHumans) with 4–5 spatially dispersed camera views (796×448 px). The dataset comprises 2,132 expert-annotated multiple-choice question–answer pairs, each with precisely three answer candidates, spanning six task types: counting, attribute identification, relative distance, relative direction, object manipulation, and camera pose estimation. Draft questions were proposed by an MLLM (GPT-4o), then refined by human experts through collaborative annotation (~300 person-hours), including ambiguity removal, distractor quality control, and systematic creation of paired questions (view-swapped or rephrased analogues) for five of the six tasks (excluding counting). This paired protocol enables direct measurement of consistency and invariance under viewpoint changes—85.3% of non-counting questions possess paired forms.
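As a rough illustration of the annotation format described above, a single question could be stored as a record like the one below. This is a minimal sketch: the class name `Question` and every field name (`question_id`, `scene_id`, `pair_id`, and so on) are hypothetical and are not taken from the released dataset. Only the constraints it checks (six task types, exactly three options, no paired form for counting) come from the description above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# The six task types named in the benchmark description.
TASKS = {
    "counting",
    "attribute_identification",
    "relative_distance",
    "relative_direction",
    "object_manipulation",
    "camera_pose_estimation",
}


@dataclass(frozen=True)
class Question:
    """Hypothetical record layout for one multiple-choice question."""
    question_id: str
    scene_id: str
    task: str
    text: str
    options: Tuple[str, str, str]   # exactly three answer candidates
    answer_index: int               # index of the correct option (0, 1, or 2)
    pair_id: Optional[str] = None   # shared by a question and its paired form

    def __post_init__(self):
        if self.task not in TASKS:
            raise ValueError(f"unknown task: {self.task}")
        if len(self.options) != 3:
            raise ValueError("each question needs exactly three options")
        if not 0 <= self.answer_index < 3:
            raise ValueError("answer_index must be 0, 1, or 2")
        if self.task == "counting" and self.pair_id is not None:
            raise ValueError("counting questions have no paired form")


# Illustrative values only; not an actual benchmark item.
example = Question(
    question_id="q-0001",
    scene_id="scene-01",
    task="relative_distance",
    text="Which object is closer to the person in View 1?",
    options=("The chair", "The table", "The door"),
    answer_index=1,
    pair_id="pair-0001",
)
```

Keeping `pair_id` on the record makes the later consistency scoring a simple group-by over questions.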

3. Task Taxonomy and Geometric Probing

The six tasks instantiate foundational multi-view reasoning demands:

Task Type	Input Structure	Geometric Competence Probed
Counting	All views	Cardinality union; no double-counting/occlusion
Attribute Identification	Two views	Cross-view object matching under perspective
Relative Distance	Two views	Depth/3D proximity estimation across cameras
Relative Direction	Two views, orientation	Transfer and alignment of object/camera axes
Object Manipulation	Two views	Spatial trajectory mapping across frames
Camera Pose Estimation	Four views	3D camera layout and topological ordering

Each is operationalized as a three-option multiple-choice question, which allows robust automatic evaluation and reduces label ambiguity. Notably, the benchmark's paired-question design for all tasks except counting systematically interrogates model invariance and correspondence beyond rote or single-view pattern matching.

4. Evaluation Metrics and Consistency Scoring

Performance is measured by simple accuracy:

$$\mathrm{Accuracy} = \frac{\#\text{correct predictions}}{\#\text{total questions}}$$

Camera pose tasks may also employ mean angular error:

$$\theta_{\mathrm{err}} = \arccos\left(\langle R_{\mathrm{pred}}, R_{\mathrm{gt}} \rangle\right)$$

The paired-question protocol introduces an inconsistency rate:

$$\mathrm{Inconsistency} = \frac{\#\text{IC}}{\#\text{paired questions}},$$

with consistency score $1 - \mathrm{Inconsistency}$. Here, IC denotes pairs with a correct answer in only one form, exposing failures of true multi-view reasoning even when single answers appear correct.
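The metrics above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the benchmark's official scoring script. The result format (dicts with `correct` and `pair_id` keys) is hypothetical. The inconsistency denominator is assumed to count complete pairs. For the angular error, $R_{\mathrm{pred}}$ and $R_{\mathrm{gt}}$ are assumed to be direction vectors that are normalized before taking the inner product.

```python
import math
from collections import defaultdict


def accuracy(results):
    """Fraction of questions answered correctly."""
    return sum(1 for r in results if r["correct"]) / len(results)


def inconsistency_rate(results):
    """Fraction of complete pairs where exactly one form is answered correctly (IC)."""
    pairs = defaultdict(list)
    for r in results:
        if r.get("pair_id") is not None:
            pairs[r["pair_id"]].append(r["correct"])
    # Assumption: only pairs with both forms present are scored.
    complete = [v for v in pairs.values() if len(v) == 2]
    ic = sum(1 for a, b in complete if a != b)
    return ic / len(complete)


def angular_error(pred, gt):
    """Angle in radians between two direction vectors (normalized inner product)."""
    dot = sum(a * b for a, b in zip(pred, gt))
    norm_pred = math.sqrt(sum(a * a for a in pred))
    norm_gt = math.sqrt(sum(b * b for b in gt))
    # Clamp to [-1, 1] so rounding noise cannot push acos out of its domain.
    cosine = max(-1.0, min(1.0, dot / (norm_pred * norm_gt)))
    return math.acos(cosine)


# Toy results: two pairs plus one unpaired counting question.
results = [
    {"correct": True, "pair_id": "p1"},
    {"correct": True, "pair_id": "p1"},
    {"correct": True, "pair_id": "p2"},
    {"correct": False, "pair_id": "p2"},
    {"correct": True, "pair_id": None},
]

acc = accuracy(results)                 # 4 of 5 correct
inconsistency = inconsistency_rate(results)  # 1 of 2 pairs is inconsistent
consistency = 1.0 - inconsistency
```

In the toy data, pair `p2` contributes one correct answer to accuracy yet counts as inconsistent, which is the case the paired protocol is designed to expose.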

5. Comparative Results and Identified Failure Modes

Experimental evaluation on 27 MLLMs (including leading commercial and open-source systems) against human baselines reveals persistent, wide gaps:

Method	Avg Accuracy (%)	Attribute	Camera Pose	Counting	Manipulation	Rel. Direction	Rel. Distance
Human	82.0	93.3	88.9	86.3	72.0	79.5	95.7
GPT-4o	52.4	66.7	16.7	52.9	40.0	53.8	63.8
Gemini-2.0-Flash	58.4	62.2	38.9	64.7	48.0	56.4	68.1
Claude-3.7-Sonnet	52.8	60.0	38.9	37.3	38.0	56.4	80.9
InternVL2.5-38B	60.8	73.3	27.8	70.6	42.0	64.1	68.1
Qwen2.5-VL-72B	58.4	73.3	22.2	52.9	44.0	61.5	76.6

Camera pose estimation remains the most challenging category (models at roughly 15–40% vs. 88.9% for humans). Some open-source models with video-oriented pretraining surpass closed models on certain orientation-sensitive tasks, suggesting benefits from spatiotemporal representations. High inconsistency rates (often 40–70% for relative distance and direction) reveal that correct single-answer performance does not reliably indicate genuine multi-view understanding.
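To make the camera pose gap concrete, the short sketch below recomputes the human-model difference from the camera pose column of the table above and compares each score with a chance baseline. The chance level of one in three assumes uniform random guessing over the three options, which is an assumption of this sketch rather than a figure reported by the benchmark.

```python
# Camera pose accuracies (%) copied from the results table above.
pose_accuracy = {
    "Human": 88.9,
    "GPT-4o": 16.7,
    "Gemini-2.0-Flash": 38.9,
    "Claude-3.7-Sonnet": 38.9,
    "InternVL2.5-38B": 27.8,
    "Qwen2.5-VL-72B": 22.2,
}

# Assumption: uniform guessing over three options.
CHANCE = 100.0 / 3.0


def pose_report(scores):
    """Return (model, gap to human in points, below-chance flag) per model."""
    human = scores["Human"]
    rows = []
    for name, acc in scores.items():
        if name == "Human":
            continue
        rows.append((name, round(human - acc, 1), acc < CHANCE))
    return rows


report = pose_report(pose_accuracy)
for name, gap, below_chance in report:
    print(f"{name}: gap {gap} points, below chance: {below_chance}")
```

Under this assumption, several of the listed models score below the uniform-guessing level on camera pose, which is consistent with the text's description of this category as the hardest.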

Core failure patterns include:

Counting under partial visibility, where MLLMs treat each view independently or take max counts, failing to reconcile sets.
Camera pose estimation, where MLLMs produce plausible-sounding justifications yet severely misorder views and misalign spatial anchors, propagating errors to downstream direction and manipulation tasks.

"Identification CoT" (chain-of-thought with explicit listing/cross-referencing of objects) increases partial-count accuracy by ≈20% for GPT-4o but is ineffective for already stronger models, indicating the limited efficacy of prompting-only solutions.
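The counting failure mode and the idea behind cross-referencing objects can be illustrated with a toy example. In the sketch below, each view is a set of hypothetical object identities (`"A"`, `"B"`, ...). It assumes identities have already been matched across views, which is precisely the hard part for a model. The function names are illustrative and do not come from the paper.

```python
def max_view_count(views):
    """Take the largest per-view count (misses objects visible only elsewhere)."""
    return max(len(v) for v in views)


def sum_view_count(views):
    """Treat views independently and add them up (double-counts shared objects)."""
    return sum(len(v) for v in views)


def union_count(views):
    """Reconcile identities across views, then count each object once."""
    return len(set().union(*views))


# Three partially overlapping views of the same scene (toy data).
views = [
    {"A", "B", "C"},
    {"B", "C", "D"},
    {"D", "E"},
]

naive_max = max_view_count(views)   # 3
naive_sum = sum_view_count(views)   # 8
reconciled = union_count(views)     # 5
```

Neither shortcut recovers the true count of five: taking the maximum undercounts because no single view sees every object, while summing overcounts the objects that appear in more than one view.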

6. Implications for Model Architecture and Training

Critical recommendations to bridge the identified gaps include:

Incorporating explicit modules or pretraining objectives for cross-view geometric consistency, e.g., multi-view contrastive learning or epipolar constraint enforcement.
Augmenting training corpora with richly paired multi-view scenes and explicit 3D layout annotations.
Integrating vision-to-3D backbones (depth and pose regressors) with the MLLM core to improve spatial generalization.
Developing prompting strategies or lightweight adapter layers that enforce cross-view feature matching, beyond language-only reasoning pipelines.
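One of the recommendations above mentions epipolar constraint enforcement. As a hedged illustration of what such a constraint measures, and not as the training objective of any specific model, the sketch below builds the essential matrix $E = [t]_\times R$ for two cameras related by $X_2 = R X_1 + t$ and evaluates the residual $x_2^\top E x_1$, which vanishes for a true correspondence in normalized image coordinates. All names and numeric values are illustrative assumptions.

```python
import math


def cross_matrix(t):
    """Skew-symmetric matrix [t]_x such that [t]_x v equals the cross product t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]


def matmul(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(m, v):
    """Product of a 3x3 matrix and a 3-vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def rot_y(angle):
    """Rotation about the y-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def project(point):
    """Normalized image coordinates (x/z, y/z, 1) of a 3D point in camera frame."""
    return [point[0] / point[2], point[1] / point[2], 1.0]


def epipolar_residual(x1, x2, R, t):
    """Compute x2^T E x1 with E = [t]_x R; zero for a geometrically valid match."""
    E = matmul(cross_matrix(t), R)
    Ex1 = matvec(E, x1)
    return sum(a * b for a, b in zip(x2, Ex1))


# Toy two-camera setup: camera 2 sees points as X2 = R X1 + t.
R = rot_y(math.radians(20.0))
t = [1.0, 0.0, 0.2]

X1 = [0.5, 0.2, 4.0]                                  # 3D point in camera-1 frame
X2 = [a + b for a, b in zip(matvec(R, X1), t)]        # same point in camera-2 frame
x1, x2 = project(X1), project(X2)

residual_true = epipolar_residual(x1, x2, R, t)       # approximately zero
x2_wrong = [x2[0], x2[1] + 0.1, 1.0]                  # deliberately shifted match
residual_wrong = epipolar_residual(x1, x2_wrong, R, t)  # clearly nonzero
```

A penalty on the magnitude of this residual is one possible way to turn the constraint into a training signal for cross-view feature matching; whether it helps MLLMs on this benchmark is not established by the source.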

These interventions are motivated by All-Angles Bench’s demonstration that architectural or data-centric improvements, rather than solely prompting, are critical for achieving human-level multi-view proficiency (Yeh et al., 21 Apr 2025).

7. "All-Angles Bench" in Four-Loop QCD Cusp Anomalous Dimension

In high-energy theory, "all-angles bench" refers to the analytic calculation and benchmarking of the QCD fermionic cusp anomalous dimension across all values of the Euclidean angle $\phi$, from infinitesimal to light-like and antiparallel limits (Brüser et al., 2019). This all-angles evaluation tests conjectured universal formulae for angular dependence in the four-loop regime over the complete kinematic range, not just isolated expansions. The resulting benchmark tabulates small-$\phi$ expansions (to order $\phi^6$), large-angle (light-like) limits, and antiparallel cases, for all seven fermionic color structures. Concordance and discrepancies with universal conjectures are fully catalogued, providing stringent validation for QCD resummation applications and exposing the boundaries of current analytic control (Brüser et al., 2019).

A plausible implication is that "All-Angles Bench"—whether in machine perception or quantum field theory—serves as an acid test for models’ ability to generalize under nontrivial angular, viewpoint, or geometric transformations. In both disciplines, success on such a benchmark is a prerequisite to claims of invariance, correspondence, or full-scene understanding, and the practice of benchmarking across all physically relevant angles is essential for rigorous progress.


References

Yeh et al. (2025). Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs.

Brüser et al. (2019). Matter dependence of the four-loop QCD cusp anomalous dimension: from small angles to all angles.
