InfiniBench-V: Automated VLM Benchmarking

Updated 2 July 2026
  • InfiniBench-V is an automated benchmark generator that translates natural language scene descriptions into photo-realistic 3D videos to evaluate spatial reasoning in vision–language models.
  • It employs LLM-based constraint refinement, cluster-based layout optimization, and task-aware camera trajectory planning to create dense and challenging scenes.
  • The benchmark offers fine-grained evaluation metrics such as prompt fidelity, CLIPScore, and physical plausibility to diagnose model failures in measurement, perspective-taking, and tracking.

InfiniBench is a fully automated, customizable benchmark generator for evaluating vision–language models (VLMs) on visual spatial reasoning tasks under diverse and parameterizable scene complexities. By automatically translating natural language scene descriptions into photo-realistic 3D video benchmarks, InfiniBench enables systematic and granular probing of VLM spatial reasoning abilities, including measurement, perspective-taking, and spatiotemporal tracking. Its innovations include an LLM-based agentic framework for symbolic constraint refinement, a cluster-based layout optimizer for dense and cluttered scenes, and a task-aware camera trajectory optimizer for rendering high-coverage, low-occlusion scene videos. InfiniBench is designed for fine-grained analysis of VLM failures under varying spatial conditions, allowing for the generation of a theoretically infinite set of diverse benchmark scenes at arbitrary complexity levels (Wang et al., 22 Nov 2025).

1. Architecture and System Pipeline

InfiniBench comprises an end-to-end pipeline that sequentially translates natural language scene descriptions into 3D videos, encompassing three major stages:

  1. LLM-based Agentic Constraint Generation and Refinement: Employs an LLM agent to convert textual descriptions into symbolic scene constraints and refines these iteratively using structured optimization feedback. The agent operates using Chain-of-Thought (CoT) prompting to diagnose and resolve conflicting or infeasible constraints based on the layout realization outcomes.
  2. Cluster-Based Layout Optimization: Translates the refined symbolic constraints into concrete, physically plausible 3D arrangements. Objects are grouped into clusters with enforced relative transforms, supporting the creation of dense, cluttered, and highly structured scenes that present significant challenges for existing procedural methods.
  3. Task-Aware Camera Trajectory Planning: Computes a camera path through the assembled 3D scene such that task-relevant objects are covered with minimal occlusion. The module plans a smooth, collision-free trajectory, sampling candidate camera poses to maximize object visibility.

The entire process is encapsulated in a deterministic system pipeline that outputs a video rendering, suitable as a VLM input for benchmarking a specified spatial reasoning task.

2. LLM-Based Agentic Framework for Constraint Refinement

The constraint refinement engine is built around an LLM-driven agent which models the sequence of constraint sets $(C^t)$ and their iterative update via an operator $\mathcal{A}$ conditioned on both in-context examples and the original prompt. At each iteration, the framework:

  • Proposes an updated set of symbolic constraints from the prompt and prior feedback.
  • Submits constraints to the layout engine, which attempts 3D realization and returns either a success flag or, on failure, a structured error report identifying violated constraints and a BEV (bird’s-eye view) collision map.
  • Utilizes CoT reasoning to isolate conflict sources, amend constraints, and iterate.

Constraint satisfaction is evaluated using soft violation costs $v(c;X)$ and a global threshold $\delta$. Success is declared if the total cost for optimal layout realization $X^*(C)$ satisfies $\sum_{c \in C} v(c;X^*(C)) \leq \delta$. Otherwise, the agent receives a list of offending constraints for targeted refinement (Wang et al., 22 Nov 2025).
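
The propose/realize/amend loop and the success test against the threshold can be sketched as follows. This is only an illustration under stated assumptions: `MinDistance`, `toy_solve`, and `toy_amend` are hypothetical stand-ins for a symbolic constraint, the layout engine, and the LLM agent, and none of these names come from InfiniBench itself.

```python
import math
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MinDistance:
    """Hypothetical constraint: objects a and b must be at least d apart."""
    a: str
    b: str
    d: float

    def cost(self, layout):
        # Soft violation cost v(c; X): zero when satisfied, shortfall otherwise.
        (xa, ya), (xb, yb) = layout[self.a], layout[self.b]
        return max(0.0, self.d - math.hypot(xa - xb, ya - yb))


def check(constraints, layout, delta):
    """Success iff the summed violation cost is <= delta; also list offenders."""
    costs = [(c, c.cost(layout)) for c in constraints]
    total = sum(v for _, v in costs)
    offending = [c for c, v in costs if v > 0.0]
    return total <= delta, offending


def refine(constraints, solve, amend, delta=0.05, max_iters=10):
    """Realize the layout, test it, and amend offending constraints until success."""
    for t in range(max_iters):
        layout = solve(constraints)
        ok, offending = check(constraints, layout, delta)
        if ok:
            return constraints, layout, t
        constraints = amend(constraints, offending)
    raise RuntimeError("no feasible constraint set within max_iters")


def toy_solve(constraints):
    # Stand-in layout engine: a 3 x 4 room allows at most a 5-unit separation.
    return {"table": (0.0, 0.0), "chair": (3.0, 4.0)}


def toy_amend(constraints, offending):
    # Stand-in for the agent: relax each offending distance by one unit.
    return [replace(c, d=c.d - 1.0) if c in offending else c for c in constraints]


final, layout, iters = refine(
    [MinDistance("table", "chair", 7.0)], toy_solve, toy_amend
)
print(final[0].d, iters)
```

In this toy run the infeasible 7-unit requirement is relaxed twice before the layout passes. A real agent would use the structured error report and collision map to decide what to amend rather than applying a fixed relaxation.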

3. Cluster-Based Layout Optimization

Scene realization is formalized as a constrained optimization over the set of object poses and orientations $X = \{(p_i, \theta_i)\}_{i=1}^{N}$, with object clusters $\mathcal{C}$ governing local kinematic relations:

$$\min_{X} \mathcal{L}(X;C) = \alpha \sum_{k=1}^{K} \mathrm{clutter}(C_k,X) + \beta\, \mathrm{occupancy}(X) + \gamma \sum_{(i,j)} \mathrm{collision\_penalty}(p_i,p_j)$$

Constraints include strict non-overlap (no mesh collisions) and mandated relative transforms within clusters. The optimizer alternates greedy cluster translation/rotation moves, followed by local refinement for individual objects, converging when no further improvement in objective $\mathcal{L}$ is detectable.
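
A minimal 2D sketch of the greedy cluster-translation step is given below. It rests on several assumptions that are not in the source: objects are approximated by discs, the clutter term is taken to be the mean distance of members to their cluster centroid, the collision penalty is the disc overlap depth, and the weights are arbitrary. Rotation moves and per-object refinement are omitted.

```python
import math
from itertools import combinations


def total_collisions(pos, radii):
    """Sum of disc overlap depths over all object pairs (assumed penalty form)."""
    return sum(
        max(0.0, radii[a] + radii[b] - math.dist(pos[a], pos[b]))
        for a, b in combinations(sorted(pos), 2)
    )


def loss(pos, radii, clusters, room, alpha=1.0, beta=1.0, gamma=10.0):
    """alpha * clutter + beta * occupancy + gamma * collisions (toy definitions)."""
    clutter = 0.0
    for members in clusters:
        cx = sum(pos[m][0] for m in members) / len(members)
        cy = sum(pos[m][1] for m in members) / len(members)
        clutter += sum(math.dist(pos[m], (cx, cy)) for m in members) / len(members)
    occupancy = sum(math.pi * radii[n] ** 2 for n in pos) / (room[0] * room[1])
    return alpha * clutter + beta * occupancy + gamma * total_collisions(pos, radii)


def translate(pos, members, dx, dy):
    """Move a whole cluster rigidly so its internal relative offsets are kept."""
    moved = dict(pos)
    for m in members:
        moved[m] = (pos[m][0] + dx, pos[m][1] + dy)
    return moved


def greedy_optimize(pos, radii, clusters, room, step=0.25, max_rounds=100):
    """Accept any cluster translation that lowers the loss; stop when none does."""
    best = loss(pos, radii, clusters, room)
    for _ in range(max_rounds):
        improved = False
        for members in clusters:
            for dx, dy in ((step, 0), (-step, 0), (0, step), (0, -step)):
                cand = translate(pos, members, dx, dy)
                val = loss(cand, radii, clusters, room)
                if val < best - 1e-9:
                    pos, best, improved = cand, val, True
        if not improved:
            break
    return pos, best


start = {"table": (1.0, 1.0), "chair": (1.75, 1.0), "sofa": (1.4, 1.0)}
radii = {"table": 0.3, "chair": 0.3, "sofa": 0.3}
clusters = [["table", "chair"], ["sofa"]]
room = (4.0, 3.0)

final_pos, final_loss = greedy_optimize(start, radii, clusters, room)
print(total_collisions(start, radii), total_collisions(final_pos, radii))
```

Because each move shifts a cluster as a unit, the table-to-chair offset is unchanged while the overlapping sofa is pushed clear. Room-boundary checks are left out to keep the sketch short.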

4. Parametric Scene Complexity Control

InfiniBench exposes scene complexity as a three-dimensional vector with the following components:

  • Object count.
  • Occupancy ratio.
  • Viewpoint difficulty, measured by the mean occlusion rate.

Optionally, an arrangement entropy term, defined over the permissible area fraction for each object class, further parameterizes spatial diversity. Scene constraints are seeded via explicit mappings from user language (e.g., "set_object_count(Chair, 10)", "set_occupancy(0.30)", "set_occlusion_threshold(0.6)").
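
To illustrate how such directive strings might be turned into a complexity configuration, here is a small parsing sketch. The three directive names are taken from the examples above; the `Complexity` container, its field names, and the parsing rules are assumptions made for illustration and do not describe the system's actual interface.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Complexity:
    """Hypothetical container for the complexity controls."""
    object_counts: Dict[str, int] = field(default_factory=dict)
    occupancy: Optional[float] = None
    occlusion_threshold: Optional[float] = None


# Matches a single call such as name(arg1, arg2).
_CALL = re.compile(r"^\s*(\w+)\((.*)\)\s*$")


def parse_directives(lines):
    """Parse directive strings into a Complexity object; reject unknown names."""
    cfg = Complexity()
    for line in lines:
        match = _CALL.match(line)
        if match is None:
            raise ValueError(f"not a directive: {line!r}")
        name = match.group(1)
        args = [a.strip() for a in match.group(2).split(",")]
        if name == "set_object_count":
            object_class, count = args
            cfg.object_counts[object_class] = int(count)
        elif name == "set_occupancy":
            cfg.occupancy = float(args[0])
        elif name == "set_occlusion_threshold":
            cfg.occlusion_threshold = float(args[0])
        else:
            raise ValueError(f"unknown directive: {name}")
    return cfg


cfg = parse_directives([
    "set_object_count(Chair, 10)",
    "set_occupancy(0.30)",
    "set_occlusion_threshold(0.6)",
])
print(cfg)
```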

5. Task-Aware Camera Trajectory Optimization

Camera trajectory planning ensures that all task-relevant objects achieve visibility above a predefined threshold while minimizing path length and avoiding obstacles. The planner defines a sequence of keyframes, selecting candidate poses that maximize per-object visibility. For each unvisited target, the best admissible pose is chosen according to:

  • Obstacle-free navigation (based on a 2D floor plan).
  • Camera frustum enclosing object.
  • Visibility above the predefined threshold.

Keyframes are joined via shortest collision-free paths (e.g., Dijkstra), concatenated into a continuous trajectory for rendering.
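
The keyframe-joining step can be sketched with Dijkstra's algorithm on a grid floor plan. The 4-connected occupancy grid, unit step costs, and the example keyframes are assumptions for illustration; the source only states that keyframes are connected by shortest collision-free paths.

```python
import heapq


def dijkstra(grid, start, goal):
    """Shortest 4-connected path over free cells (0 = free, 1 = obstacle)."""
    rows, cols = len(grid), len(grid[0])
    dist = {start: 0}
    prev = {}
    heap = [(0, start)]
    while heap:
        d, cell = heapq.heappop(heap)
        if cell == goal:
            break
        if d > dist[cell]:
            continue  # stale heap entry
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                nd = d + 1
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = cell
                    heapq.heappush(heap, (nd, (nr, nc)))
    if goal not in dist:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


def join_keyframes(grid, keyframes):
    """Concatenate shortest paths between consecutive keyframes."""
    trajectory = [keyframes[0]]
    for a, b in zip(keyframes, keyframes[1:]):
        segment = dijkstra(grid, a, b)
        if segment is None:
            raise ValueError(f"no collision-free path from {a} to {b}")
        trajectory.extend(segment[1:])  # skip the repeated junction cell
    return trajectory


floor_plan = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
keyframes = [(0, 0), (2, 3), (0, 3)]
trajectory = join_keyframes(floor_plan, keyframes)
print(len(trajectory), trajectory[0], trajectory[-1])
```

A renderer would still need to interpolate camera orientation along this cell sequence and smooth the path, which this sketch does not attempt.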

6. Quantitative Evaluation and Benchmarking

InfiniBench’s outputs are evaluated on axes central to the fidelity and challenge of spatial benchmarks:

  • Prompt fidelity: Error in object count and in occupancy relative to the values requested in the prompt.
  • Text–image alignment: CLIPScore, computed as the expected cosine similarity between text/prompt and rendered view embeddings.
  • Physical plausibility: Number of out-of-bounds objects and colliding object pairs.
  • Layout realism: Realism score obtained via GPT-5-based evaluation.
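
The simpler of these metrics can be sketched in a few lines. The axis-aligned 2D boxes, the relative form of the count error, and the raw (unscaled) cosine similarity are assumptions; the source does not specify the exact formulas or the scaling behind the reported CLIPScore values, and real embeddings would come from a CLIP model rather than hand-written vectors.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def relative_count_error(requested, realized):
    """Assumed fidelity error: |realized - requested| / requested."""
    return abs(realized - requested) / requested


def count_out_of_bounds(boxes, room):
    """Boxes are (xmin, ymin, xmax, ymax); room is (width, depth)."""
    width, depth = room
    return sum(
        1 for x0, y0, x1, y1 in boxes
        if x0 < 0 or y0 < 0 or x1 > width or y1 > depth
    )


def count_colliding_pairs(boxes):
    """Number of box pairs whose interiors overlap."""
    def overlaps(a, b):
        return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]
    return sum(1 for a, b in combinations(boxes, 2) if overlaps(a, b))


boxes = [(0.0, 0.0, 1.0, 1.0), (0.5, 0.5, 1.5, 1.5), (3.5, 0.0, 4.5, 1.0)]
room = (4.0, 3.0)
print(count_out_of_bounds(boxes, room), count_colliding_pairs(boxes))
print(relative_count_error(requested=10, realized=9))
```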

Scene Quality by Generator

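| Method      | Fidelity | CLIP | Realism | #OB (out-of-bounds) | #CN (collisions) |
|-------------|----------|------|---------|---------------------|------------------|
| Infinigen   | 0.64     | 29.7 | 0.79    | 0                   | 0.0              |
| LayoutGPT   | 0.90     | 28.4 | 0.72    | 1.3                 | 2.4              |
| InfiniBench | 0.98     | 31.8 | 0.93    | 0                   | 0.0              |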

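Among the compared generators, InfiniBench attains the highest prompt fidelity (0.98), CLIPScore (31.8), and realism (0.93). It matches Infinigen in physical plausibility, with no out-of-bounds objects or collisions, whereas LayoutGPT reports 1.3 and 2.4 respectively (Wang et al., 22 Nov 2025).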

7. Spatial Reasoning Task Benchmarks and Model Insights

InfiniBench enables granular studies on several core spatial reasoning tasks:

  • Measurement: Quantifies object property estimation accuracy (e.g., “height in centimeters”) under partial occlusion.
  • Perspective-Taking: Probes counting and grounding (e.g., distractor-aware object numerosity).
  • Spatiotemporal Tracking: Assesses ordering and identity tracking (e.g., which cube appears first) as camera paths increase occlusion and viewpoint challenge.

VLM Performance vs. Scene Complexity

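| Model      | Measurement (Low/Med/High) | Perspective (Low/Med/High) | SpatioTemp (Low/Med/High) |
|------------|----------------------------|----------------------------|---------------------------|
| Gemini-2.5 | 69.2 / 68.9 / 66.4         | 70.6 / 67.2 / 66.9         | 87.9 / 70.1 / 56.2        |
| GPT-5      | 45.8 / 41.2 / 41.3         | 57.3 / 55.5 / 54.1         | 47.8 / 31.3 / 26.7        |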

As object count or occupancy ratio increases, VLMs exhibit graceful degradation in counting and reference grounding, while high viewpoint difficulty (e.g., camera-induced occlusion) increases order-tracking errors.
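
As a quick worked check of how much each score falls between the Low and High complexity settings, the snippet below subtracts the High value from the Low value for every entry in the performance table. The numbers are copied from the table; the dictionary layout and names are only for illustration.

```python
# Scores per model and task as (Low, Med, High), copied from the table above.
scores = {
    "Gemini-2.5": {
        "Measurement": (69.2, 68.9, 66.4),
        "Perspective": (70.6, 67.2, 66.9),
        "SpatioTemp": (87.9, 70.1, 56.2),
    },
    "GPT-5": {
        "Measurement": (45.8, 41.2, 41.3),
        "Perspective": (57.3, 55.5, 54.1),
        "SpatioTemp": (47.8, 31.3, 26.7),
    },
}

# Drop in points from the Low to the High setting, rounded to one decimal.
drops = {
    model: {task: round(low - high, 1) for task, (low, _, high) in tasks.items()}
    for model, tasks in scores.items()
}

for model, per_task in drops.items():
    print(model, per_task)
```

For both models the spatiotemporal tracking drop (31.7 and 21.1 points) is several times larger than the measurement and perspective-taking drops, which all stay under 5 points.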

These results illustrate InfiniBench’s capacity to expose fine-grained VLM failure modes under precise complexity controls, providing actionable signals for model diagnosis and improvement (Wang et al., 22 Nov 2025).
