All-Angles Bench: Multi-View & QCD Benchmark
- All-Angles Bench names two unrelated benchmarks: one assesses multi-view reasoning in MLLMs, the other the QCD cusp anomalous dimension. Both test behavior across the full range of angles rather than at selected special cases.
- In MLLMs, it evaluates spatial tasks such as counting, attribute identification, relative distance, and camera pose using paired questions to detect multi-view inconsistencies.
- For QCD, it benchmarks analytic calculations of the fermionic cusp anomalous dimension across all Euclidean and Minkowskian angles to validate resummation models.
The term “All-Angles Bench” refers to two prominent, yet domain-divergent, rigorous benchmarks in contemporary technical research—one in multi-view visual grounding and geometric reasoning for Multi-Modal LLMs (MLLMs), the other in the analytic study of the QCD cusp anomalous dimension for arbitrary Euclidean and Minkowskian angles. The concept is unified by its ambition to evaluate systems’ or models’ fidelity, accuracy, and generalization across the full range of relevant angular or viewpoint configurations, rather than at select special cases. The following outlines both meanings in their disciplinary contexts and impact.
1. All-Angles Bench for Multi-View Understanding in MLLMs
All-Angles Bench, introduced in "Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs" (Yeh et al., 21 Apr 2025), is a curated, publicly released benchmark for systematically assessing geometric and referential consistency of MLLMs across diverse camera viewpoints in real-world scenes. Developed in response to the observation that modern MLLMs (e.g., GPT-4o, Gemini-2.0-Flash, Claude-3.7-Sonnet) excel at isolated image reasoning but fail at multi-view geometric consistency, this benchmark exposes and quantifies key failure modes in multi-view correspondence and spatial reasoning.
2. Benchmark Construction: Dataset and Annotation Protocol
All-Angles Bench consists of 90 diverse real-world scenes (predominantly from Ego-Exo4D and EgoHumans) with 4–5 spatially dispersed camera views (796×448 px). The dataset comprises 2,132 expert-annotated multiple-choice question–answer pairs, each with precisely three answer candidates, spanning six task types: counting, attribute identification, relative distance, relative direction, object manipulation, and camera pose estimation. Draft questions were proposed by an MLLM (GPT-4o), then refined by human experts through collaborative annotation (~300 person-hours), including ambiguity removal, distractor quality control, and systematic creation of paired questions (view-swapped or rephrased analogues) for five of the six tasks (excluding counting). This paired protocol enables direct measurement of consistency and invariance under viewpoint changes—85.3% of non-counting questions possess paired forms.
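The structure described above can be pictured as a simple record type. The sketch below is illustrative only: the field names (`qid`, `scene_id`, `pair_id`) and the task identifiers are assumptions made for this example, not the schema of the released dataset. It encodes the three stated constraints: six task types, exactly three answer candidates, and no paired forms for counting.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

# Hypothetical task identifiers for the six task types named in the text.
TASKS = (
    "counting",
    "attribute_identification",
    "relative_distance",
    "relative_direction",
    "object_manipulation",
    "camera_pose_estimation",
)


@dataclass(frozen=True)
class Question:
    """One multiple-choice item (field names are assumptions)."""
    qid: str
    scene_id: str
    task: str
    prompt: str
    options: Tuple[str, str, str]   # exactly three answer candidates
    answer: int                     # index of the correct option
    pair_id: Optional[str] = None   # shared by a question and its paired form

    def __post_init__(self) -> None:
        if self.task not in TASKS:
            raise ValueError(f"unknown task: {self.task}")
        if len(self.options) != 3:
            raise ValueError("each question needs exactly three options")
        if not 0 <= self.answer < 3:
            raise ValueError("answer must index one of the three options")
        if self.task == "counting" and self.pair_id is not None:
            raise ValueError("counting questions have no paired form")


def paired_fraction(questions: Sequence[Question]) -> float:
    """Fraction of non-counting questions that belong to a pair."""
    non_counting = [q for q in questions if q.task != "counting"]
    if not non_counting:
        return 0.0
    return sum(q.pair_id is not None for q in non_counting) / len(non_counting)


# Toy data, invented for illustration.
toy = [
    Question("q1", "s1", "counting", "How many people?", ("2", "3", "4"), 1),
    Question("q2", "s1", "relative_distance", "Nearer in view 1?", ("A", "B", "C"), 0, "p1"),
    Question("q3", "s1", "relative_distance", "Nearer in view 2?", ("A", "B", "C"), 2, "p1"),
    Question("q4", "s1", "camera_pose_estimation", "Order the views.", ("A", "B", "C"), 0),
]
print(paired_fraction(toy))
```

On the real benchmark, the analogous statistic is the 85.3% figure quoted above.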
3. Task Taxonomy and Geometric Probing
The six tasks instantiate foundational multi-view reasoning demands:
| Task Type | Input Structure | Geometric Competence Probed |
|---|---|---|
| Counting | All views | Cardinality union; no double-counting/occlusion |
| Attribute Identification | Two views | Cross-view object matching under perspective |
| Relative Distance | Two views | Depth/3D proximity estimation across cameras |
| Relative Direction | Two views, orientation | Transfer and alignment of object/camera axes |
| Object Manipulation | Two views | Spatial trajectory mapping across frames |
| Camera Pose Estimation | Four views | 3D camera layout and topological ordering |
Each task is operationalized as a three-option multiple-choice question, which permits automatic evaluation and reduces label ambiguity. The paired-question design, applied to every task except counting, tests model invariance and cross-view correspondence rather than rote or single-view pattern matching.
4. Evaluation Metrics and Consistency Scoring
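Performance is measured by simple accuracy:

$$\mathrm{Acc} = \frac{N_{\text{correct}}}{N_{\text{total}}}$$

Camera pose tasks may also employ mean angular error, where $\hat{\theta}_i$ and $\theta_i$ are the predicted and ground-truth angles for item $i$:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{\theta}_i - \theta_i\right|$$

Paired-question protocols introduce an inconsistency rate:

$$R_{\mathrm{IC}} = \frac{N_{\mathrm{IC}}}{N_{\text{pairs}}}$$

with consistency score $1 - R_{\mathrm{IC}}$. Here, IC denotes pairs with a correct answer in only one form, exposing failures of true multi-view reasoning even when single answers appear correct.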
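A minimal sketch of these metrics follows, assuming each paired question is summarized as a tuple of two booleans (was the original form answered correctly, was the paired form answered correctly). The function names and the CC/WW labels for both-correct and both-wrong pairs are choices made for this example. The consistency score is taken as one minus the inconsistency rate, and the wrap-around of angular differences at 360 degrees is an added assumption.

```python
from typing import Dict, Sequence, Tuple


def accuracy(correct_flags: Sequence[bool]) -> float:
    """Fraction of questions answered correctly."""
    return sum(correct_flags) / len(correct_flags)


def pair_outcomes(pairs: Sequence[Tuple[bool, bool]]) -> Dict[str, int]:
    """Count pairs that are both correct (CC), both wrong (WW),
    or correct in only one form (IC)."""
    counts = {"CC": 0, "WW": 0, "IC": 0}
    for first, second in pairs:
        if first and second:
            counts["CC"] += 1
        elif not first and not second:
            counts["WW"] += 1
        else:
            counts["IC"] += 1
    return counts


def inconsistency_rate(pairs: Sequence[Tuple[bool, bool]]) -> float:
    """Share of pairs answered correctly in exactly one form."""
    return pair_outcomes(pairs)["IC"] / len(pairs)


def consistency_score(pairs: Sequence[Tuple[bool, bool]]) -> float:
    """Assumed here to be one minus the inconsistency rate."""
    return 1.0 - inconsistency_rate(pairs)


def mean_angular_error(pred_deg: Sequence[float], true_deg: Sequence[float]) -> float:
    """Mean absolute angular difference in degrees, wrapped to [0, 180]."""
    diffs = [abs((p - t + 180.0) % 360.0 - 180.0) for p, t in zip(pred_deg, true_deg)]
    return sum(diffs) / len(diffs)


# Toy results, invented for illustration: four pairs, eight answers in total.
toy_pairs = [(True, True), (True, False), (False, False), (False, True)]
all_answers = [flag for pair in toy_pairs for flag in pair]

print(accuracy(all_answers))          # 0.5
print(inconsistency_rate(toy_pairs))  # 0.5
```

In the toy data, half of the individual answers are correct, yet only one pair in four is answered correctly in both forms. That gap is what the paired protocol is designed to expose.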
5. Comparative Results and Identified Failure Modes
Experimental evaluation on 27 MLLMs (including leading commercial and open-source systems) against human baselines reveals persistent, wide gaps:
| Method | Avg Accuracy | Attr | Pose | Count | Manip | RelDir | RelDist |
|---|---|---|---|---|---|---|---|
| Human | 82.0 | 93.3 | 88.9 | 86.3 | 72.0 | 79.5 | 95.7 |
| GPT-4o | 52.4 | 66.7 | 16.7 | 52.9 | 40.0 | 53.8 | 63.8 |
| Gemini-2.0-Flash | 58.4 | 62.2 | 38.9 | 64.7 | 48.0 | 56.4 | 68.1 |
| Claude-3.7-Sonnet | 52.8 | 60.0 | 38.9 | 37.3 | 38.0 | 56.4 | 80.9 |
| InternVL2.5-38B | 60.8 | 73.3 | 27.8 | 70.6 | 42.0 | 64.1 | 68.1 |
| Qwen2.5-VL-72B | 58.4 | 73.3 | 22.2 | 52.9 | 44.0 | 61.5 | 76.6 |
Camera pose estimation remains the most challenging category (models at roughly 15–40% versus 89% for humans). Some open-source models with video-oriented pretraining surpass closed models on certain orientation-sensitive tasks, suggesting benefits from spatiotemporal representations. High inconsistency rates (often 40–70% for relative distance and direction) reveal that correct single-answer performance does not reliably indicate genuine multi-view understanding.
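The size of the human-model gap per category can be read directly off the table. The short script below copies the per-category scores from the rows shown above and, for each model, finds the category with the largest shortfall relative to the human baseline. The variable and function names are chosen for this example.

```python
CATEGORIES = ["Attr", "Pose", "Count", "Manip", "RelDir", "RelDist"]

# Per-category accuracies (%) copied from the table above.
HUMAN = [93.3, 88.9, 86.3, 72.0, 79.5, 95.7]
MODELS = {
    "GPT-4o":            [66.7, 16.7, 52.9, 40.0, 53.8, 63.8],
    "Gemini-2.0-Flash":  [62.2, 38.9, 64.7, 48.0, 56.4, 68.1],
    "Claude-3.7-Sonnet": [60.0, 38.9, 37.3, 38.0, 56.4, 80.9],
    "InternVL2.5-38B":   [73.3, 27.8, 70.6, 42.0, 64.1, 68.1],
    "Qwen2.5-VL-72B":    [73.3, 22.2, 52.9, 44.0, 61.5, 76.6],
}


def gaps(model_scores):
    """Human minus model accuracy, in percentage points, per category."""
    return {c: round(h - m, 1) for c, h, m in zip(CATEGORIES, HUMAN, model_scores)}


def hardest_category(model_scores):
    """Category with the largest shortfall relative to humans."""
    g = gaps(model_scores)
    return max(g, key=g.get)


for name, scores in MODELS.items():
    worst = hardest_category(scores)
    print(f"{name}: largest gap in {worst} ({gaps(scores)[worst]} points)")
```

For all five listed models the largest gap falls in the camera pose column.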
Core failure patterns include:
- Counting under partial visibility, where MLLMs treat each view independently or take max counts, failing to reconcile sets.
- Camera pose estimation, where MLLMs justify but severely misorder views and misalign spatial anchors, propagating errors to downstream direction and manipulation tasks.
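The counting failure can be made concrete with a toy example. The sketch below assumes an idealized setting in which every object visible in a view already carries a cross-view identity label (the labels `p1` to `p5` are invented). In practice, establishing those identities is the hard part. Given such labels, the correct count is the size of the union of the per-view sets. Taking the per-view maximum undercounts, and summing the views double-counts.

```python
from typing import Sequence, Set


def count_by_union(views: Sequence[Set[str]]) -> int:
    """Reconcile views: count each distinct identity once."""
    return len(set().union(*views))


def count_by_max(views: Sequence[Set[str]]) -> int:
    """Failure mode: trust the single view that shows the most objects."""
    return max((len(v) for v in views), default=0)


def count_by_sum(views: Sequence[Set[str]]) -> int:
    """Failure mode: treat views independently and add them up."""
    return sum(len(v) for v in views)


# Three partially overlapping views of a scene with five people
# (identity labels are hypothetical).
views = [
    {"p1", "p2", "p3"},
    {"p2", "p3", "p4"},
    {"p4", "p5"},
]

print(count_by_union(views))  # 5
print(count_by_max(views))    # 3
print(count_by_sum(views))    # 8
```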
"Identification CoT" (chain-of-thought with explicit listing/cross-referencing of objects) increases partial-count accuracy by ≈20% for GPT-4o but is ineffective for already stronger models, indicating the limited efficacy of prompting-only solutions.
6. Implications for Model Architecture and Training
Critical recommendations to bridge the identified gaps include:
- Incorporating explicit modules or pretraining objectives for cross-view geometric consistency, e.g., multi-view contrastive learning or epipolar constraint enforcement.
- Augmenting training corpora with richly paired multi-view scenes and explicit 3D layout annotations.
- Integrating vision-to-3D backbones (depth and pose regressors) with the MLLM core to improve spatial generalization.
- Developing prompting strategies or lightweight adapter layers that enforce cross-view feature matching, beyond language-only reasoning pipelines.
These interventions are motivated by All-Angles Bench’s demonstration that architectural or data-centric improvements, rather than solely prompting, are critical for achieving human-level multi-view proficiency (Yeh et al., 21 Apr 2025).
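As one illustration of what "epipolar constraint enforcement" could mean in practice, the sketch below computes the standard two-view epipolar residual and a mean-squared penalty over point correspondences. It is a generic geometric check, not a method taken from the benchmark paper. It assumes a known relative pose between two cameras (here a rotation about the y-axis plus a translation, with the convention X2 = R X1 + t) and normalized image coordinates. All function names are invented for the example.

```python
import math
import random


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def rotate_y(angle, p):
    """Rotate point p about the y-axis by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = p
    return (c * x + s * z, y, -s * x + c * z)


def to_camera2(p, angle, t):
    """Map a 3D point from camera-1 to camera-2 coordinates: X2 = R X1 + t."""
    r = rotate_y(angle, p)
    return (r[0] + t[0], r[1] + t[1], r[2] + t[2])


def normalize(p):
    """Project a 3D point to normalized homogeneous image coordinates."""
    return (p[0] / p[2], p[1] / p[2], 1.0)


def epipolar_residual(x1, x2, angle, t):
    """x2 . (t x R x1); zero for a true correspondence."""
    return dot(x2, cross(t, rotate_y(angle, x1)))


def epipolar_loss(pairs, angle, t):
    """Mean squared epipolar residual over (x1, x2) correspondences."""
    return sum(epipolar_residual(a, b, angle, t) ** 2 for a, b in pairs) / len(pairs)


# Synthetic scene: random 3D points in front of both cameras.
rng = random.Random(0)
ANGLE, T = 0.3, (0.5, 0.0, 0.1)
points = [(rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(4, 8)) for _ in range(50)]
x1s = [normalize(p) for p in points]
x2s = [normalize(to_camera2(p, ANGLE, T)) for p in points]

matched = list(zip(x1s, x2s))
mismatched = list(zip(x1s, x2s[1:] + x2s[:1]))  # deliberately wrong pairing

print(epipolar_loss(matched, ANGLE, T))     # numerically close to zero
print(epipolar_loss(mismatched, ANGLE, T))  # clearly non-zero
```

A training objective along these lines would penalize cross-view matches that violate the scene geometry. How such a term would be attached to an MLLM is left open here.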
7. "All-Angles Bench" in Four-Loop QCD Cusp Anomalous Dimension
In high-energy theory, "all-angles bench" refers to the analytic calculation and benchmarking of the QCD fermionic cusp anomalous dimension across all values of the Euclidean angle $\phi$, from infinitesimal to light-like and antiparallel limits (Brüser et al., 2019). This all-angles evaluation tests conjectured universal formulae for angular dependence in the four-loop regime over the complete kinematic range, not just isolated expansions. The resulting benchmark tabulates small-$\phi$ expansions (to a fixed order in $\phi$), large-angle (light-like) limits, and antiparallel cases, for all seven fermionic color structures. Agreement and discrepancies with the universal conjectures are fully catalogued, providing stringent validation for QCD resummation applications and exposing the boundaries of current analytic control (Brüser et al., 2019).
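The value of checking a full angular range, rather than only an expansion, can be illustrated with the much simpler one-loop case. The sketch below uses the well-known one-loop angle dependence of the cusp anomalous dimension, proportional to $\phi\cot\phi - 1$ for Euclidean angle $\phi$ and to $\gamma\coth\gamma - 1$ after continuation to a Minkowskian angle $\gamma$. The overall prefactor is omitted and sign conventions vary between references. This is not the four-loop result discussed above, only a stand-in to show how a small-angle series is compared against an exact function. Function names are invented for the example.

```python
import math


def cusp_angle_function(phi: float) -> float:
    """One-loop Euclidean angle dependence phi*cot(phi) - 1, for 0 <= phi < pi."""
    if phi == 0.0:
        return 0.0  # limiting value for a straight line
    return phi / math.tan(phi) - 1.0


def small_angle_series(phi: float) -> float:
    """Taylor expansion of phi*cot(phi) - 1 through order phi**6."""
    return -phi**2 / 3.0 - phi**4 / 45.0 - 2.0 * phi**6 / 945.0


def minkowski_angle_function(gamma: float) -> float:
    """Continuation phi = i*gamma: gamma*coth(gamma) - 1."""
    if gamma == 0.0:
        return 0.0
    return gamma / math.tanh(gamma) - 1.0


# Compare the truncated series with the exact function across the angular range.
for phi in (0.1, 1.0, 2.0, 3.0):
    exact = cusp_angle_function(phi)
    approx = small_angle_series(phi)
    print(f"phi={phi:.1f}  exact={exact:.6f}  series={approx:.6f}  diff={exact - approx:.2e}")

# Large Minkowskian angle: the function grows linearly in gamma.
print(minkowski_angle_function(20.0))
```

The series is accurate near $\phi = 0$ but fails badly as $\phi$ approaches $\pi$ (the antiparallel limit), where the exact function diverges. This is why agreement at small angles alone cannot validate a conjectured all-angle formula.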
A plausible implication is that "All-Angles Bench"—whether in machine perception or quantum field theory—serves as an acid test for models’ ability to generalize under nontrivial angular, viewpoint, or geometric transformations. In both disciplines, success on such a benchmark is a prerequisite to claims of invariance, correspondence, or full-scene understanding, and the practice of benchmarking across all physically relevant angles is essential for rigorous progress.