
CrossPoint-Bench: Spatial VLM Benchmark

Updated 6 December 2025
  • CrossPoint-Bench is a hierarchical benchmark that evaluates vision-language models on fine-grained spatial grounding and cross-view point correspondence.
  • It employs a multi-level task suite ranging from single-view grounding to continuous coordinate prediction using a large affordance-focused dataset.
  • Baseline results highlight significant gaps in pixel-accurate correspondence, guiding future research in geometry-aware and multi-agent systems.

CrossPoint-Bench is a hierarchical benchmark designed to evaluate and advance the capability of vision-language models (VLMs) for Cross-View Point Correspondence (CVPC), addressing the challenge of precise spatial grounding and multi-view affordance reasoning in embodied AI contexts. Drawing on the human cognitive sequence of perceive, reason, and correspond, it provides a suite of tasks ranging from fine-grained single-view grounding to continuous cross-view point prediction, using a large-scale, affordance-focused dataset generated by a comprehensive automated pipeline. CrossPoint-Bench establishes new evaluation protocols, baseline results, and insights critical for future progress in actionable vision-language understanding (Wang et al., 4 Dec 2025).

1. Motivation and Problem Definition

CrossPoint-Bench targets two intrinsically interdependent capabilities fundamental for embodied agents and spatially-aware AI systems:

  • Fine-grained coordinate prediction: Current VLMs can localize object regions (e.g., "the chair") but lack the precision to identify exact pixel locations critical for manipulation tasks (e.g., "grasp the chair handle at this precise point").
  • Point-level cross-view correspondence: In multi-camera or multi-agent deployments, it is necessary to map a given point $p_a$ in view $I_a$ to the corresponding visibility status and precise position $p_b$ in another view $I_b$.

This focus responds to the demands of robotic manipulation, multi-agent collaboration, and teleoperation, where pixel-accurate, affordance-aware cues are indispensable for real-world task execution. Without robust geometric consistency and high-resolution correspondence, VLMs are fundamentally limited in downstream actionable intelligence (Wang et al., 4 Dec 2025).

2. Hierarchical Benchmark Structure

CrossPoint-Bench institutes a multi-level task suite mirroring the structure of human cognitive processing:

  • Level 1: Perceive — Fine-grained Grounding
    • Task: Given $I_a$ and instruction $T$, predict the 2D location $p_a = (x_a, y_a)$ in image $I_a$.
    • Formulation: Input $(I_a, T) \rightarrow p_a \in [0, W) \times [0, H)$.
  • Level 2: Reason — Visibility Reasoning
    • Task: For a source point $(I_a, p_a)$, determine whether the corresponding 3D point is visible in $I_b$, outputting $y_{\text{vis}} \in \{0, 1\}$.
    • Formulation: Input $(I_a, I_b, p_a) \rightarrow y_{\text{vis}}$.
  • Level 3: Correspond — Cross-view Point Matching
    • (a) Correspondence–Judgement: Discrete selection among $K$ candidates in $I_b$; output index $\hat\ell \in \{1, \dots, K\}$.
    • (b) Correspondence–Pointing: Continuous 2D point prediction in $I_b$, i.e., if visible, find the precise $p_b$.
    • Formulation: Input $(I_a, I_b, p_a) \rightarrow p_b$.

Each task is further stratified by affordance granularity:

  • General objects (e.g., "light switch")
  • Semantic parts (e.g., "door handle")

Task complexity escalates systematically from single-view point grounding, to binary visibility classification, to region-level matching, culminating in unconstrained coordinate generation (Wang et al., 4 Dec 2025).
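
To make the three input–output signatures concrete, the following minimal Python sketch encodes them as plain typed records; the class and field names are illustrative assumptions, not part of any released CrossPoint-Bench interface.

```python
# Hypothetical typed records for the three task levels (names are illustrative).
from dataclasses import dataclass
from typing import List, Optional, Tuple

import numpy as np

Point = Tuple[float, float]  # (x, y) pixel coordinates, x in [0, W), y in [0, H)


@dataclass
class GroundingSample:          # Level 1: Perceive
    image_a: np.ndarray         # source view I_a
    instruction: str            # language instruction T
    # expected output: p_a = (x_a, y_a) in image_a


@dataclass
class VisibilitySample:         # Level 2: Reason
    image_a: np.ndarray
    image_b: np.ndarray
    point_a: Point              # query point p_a in I_a
    # expected output: y_vis in {0, 1}


@dataclass
class CorrespondenceSample:     # Level 3: Correspond
    image_a: np.ndarray
    image_b: np.ndarray
    point_a: Point
    candidates: Optional[List[Point]] = None  # K candidates (judgement variant only)
    # expected output: candidate index (judgement) or continuous p_b (pointing)
```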

3. Evaluation Protocol and Metrics

CrossPoint-Bench comprises 1,000 QA samples derived from 100 held-out indoor scenes, strictly excluded from the training dataset CrossPoint-378K to enforce scene-level independence. Human annotators achieve an overall score of 91.75%.

Metrics are structured by task modality:

  • Discrete tasks (Grounding, Visibility, Judgement): Mean accuracy

$$\mathrm{Acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{\hat{y}_i = y_i\}$$

  • Continuous task (Correspondence–Pointing): In-mask hit rate

$$\mathrm{HitRate} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}\{p_{b,i} \in M_{b,i}\}$$

where $M_{b,i}$ represents the ground-truth target mask in image $I_b$.
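
As a reference for how these two metrics might be computed, the sketch below assumes predictions given as pixel coordinates and ground-truth masks given as boolean arrays; it is an illustrative re-implementation, not the benchmark's released evaluation code.

```python
import numpy as np


def accuracy(preds, labels):
    """Mean accuracy for the discrete tasks (Grounding, Visibility, Judgement)."""
    preds, labels = np.asarray(preds), np.asarray(labels)
    return float((preds == labels).mean())


def in_mask_hit_rate(pred_points, gt_masks):
    """In-mask hit rate for Correspondence-Pointing.

    pred_points: iterable of predicted points p_b = (x, y) in image I_b.
    gt_masks:    iterable of boolean arrays M_b of shape (H, W),
                 True inside the ground-truth target mask.
    A prediction scores a hit only if it falls inside the corresponding mask.
    """
    hits, total = 0, 0
    for (x, y), mask in zip(pred_points, gt_masks):
        xi, yi = int(round(x)), int(round(y))
        inside_image = 0 <= yi < mask.shape[0] and 0 <= xi < mask.shape[1]
        hits += int(inside_image and bool(mask[yi, xi]))
        total += 1
    return hits / total
```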

Strict fairness is imposed: all models use uniform prompts and may not use external retrieval, explicit 3D geometry, ensembling, or post-processing beyond their own finetuning regime (Wang et al., 4 Dec 2025).

4. Dataset Construction: CrossPoint-378K

CrossPoint-378K, supporting CrossPoint-Bench, is constructed using a robust automated pipeline motivated by the need for diversity, scale, and affordance relevance:

  • Scope: 378,000 QA pairs from 900 indoor scenes (~23,000 images)
  • Affordance focus: covering 162 verb-based interaction types and 8,000 object instances, emphasizing actionable regions such as handles and buttons.

Pipeline

  1. Image Sampling & Filtering: Downsample high-frame-rate ScanNet videos (average downsampling ratio of 1:50) to yield 23,000 high-quality images; Qwen2.5-VL-7B filters frames for quality.
  2. Affordance Segmentation: Gemini-2.5-Pro detects operable objects; Grounding-DINO provides bounding boxes; RoboBrain 2.0 extracts semantic parts; SAM 2.1 segments 58,000 fine-grained masks.
  3. Cross-view Pairing: Randomly sample pixels per mask, use ScanNet RGB-D back-projection or ScanNet++ COLMAP SfM tracks to recover the 3D point $P$, then reproject into all viable views (see the geometry sketch after this list), yielding 111,000 matched correspondences.
  4. QA Generation: Template-based auto-generation fills object class, action, and coordinate placeholders, yielding comprehensive coverage of all task types and spatial/depth variations.

This extensive dataset underpins both supervised training (notably for CroPond) and robust held-out evaluation (Wang et al., 4 Dec 2025).

5. Baseline Performance and Model Training

Evaluation on CrossPoint-Bench reveals significant disparities between state-of-the-art VLMs and human spatial abilities:

| Model | Fine-grained | Visibility | Correspondence | Overall |
|---|---|---|---|---|
| Human | 79.51 % | 92.73 % | 97.44 % | 91.75 % |
| Gemini-2.5-Pro | 32.92 % | 67.27 % | 60.26 % | 37.10 % |
| Qwen3-VL-235B-A22B-Think | 65.22 % | 62.27 % | 66.03 % | 52.70 % |
| CroPond-3B | 59.01 % | 72.27 % | 76.92 % | 71.60 % |
| CroPond-7B | 60.25 % | 79.09 % | 87.18 % | 76.80 % |

Key findings:

  • Gemini-2.5-Pro's overall score of 37.1% lags 54.65 points behind human annotators.
  • The Correspondence–Pointing task is the most challenging, with Gemini-2.5-Pro reporting a hit rate of 16.41% versus a human score of 93.63%.
  • CroPond-7B, trained on CrossPoint-378K and auxiliary data, achieves 76.8% overall and approaches 83% of human accuracy in correspondence pointing, outperforming Gemini-2.5-Pro by +39.7 percentage points.

CroPond Model Training Details

  • Backbone: Qwen2.5-VL (3B or 7B parameter configurations)
  • Supervision: Multi-source SFT jointly on CrossPoint-378K (378K), RefSpatial (200K), SAT (172K), MulSeT (104K), SPAR-7M (150K), LLaVA-1.5 instructions (400K)
  • Loss:

$$\mathcal{L}_{\mathrm{SFT}} = -\,\mathbb{E}_{(O,Q,A)\sim D} \sum_{t=1}^{T} \log \pi_\theta\left(y_t \mid O, Q, y_{<t}\right)$$

  • Optimization: Full-parameter finetuning with a learning rate of 1e-5 (cosine schedule), batch size 2 with gradient accumulation, bf16 precision, and DeepSpeed ZeRO-3. A minimal sketch of the SFT objective appears below.
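
The following PyTorch sketch shows the token-level SFT objective above, masking the observation and question tokens so that only answer tokens contribute to the loss; the function signature is an assumption for illustration, not CroPond's training code.

```python
import torch
import torch.nn.functional as F


def sft_loss(logits, token_ids, prompt_len):
    """Cross-entropy over answer tokens for one (O, Q, A) sequence.

    logits:     (T, V) next-token logits from the policy pi_theta.
    token_ids:  (T,)   token ids of the concatenated prompt (O, Q) and answer A.
    prompt_len: number of prompt tokens; those positions are ignored so the loss
                matches -sum_t log pi_theta(y_t | O, Q, y_<t) over answer tokens.
    """
    shift_logits = logits[:-1]                 # position t predicts token t+1
    shift_labels = token_ids[1:].clone()
    shift_labels[: prompt_len - 1] = -100      # mask out prompt positions
    return F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
```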

All resources (benchmark, dataset, CroPond models) are available at https://github.com/WangYipu2002/CrossPoint (Wang et al., 4 Dec 2025).

6. Key Insights and Directions

CrossPoint-Bench highlights persistent limitations and provides a roadmap for further development:

  • Continuous-coordinate degradation: Models lose nearly 37 percentage points when shifting from discrete region selection to continuous 2D point prediction.
  • Affordance gap: A consistent ~10-point performance gap separates semantic parts (e.g., "handle") from general object regions.
  • Observed failure modes:
    • Frame Transfer Failure: Incorrect handling of cross-view coordinate transformations.
    • Spatial Reconstruction Failure: Inability to form coherent multi-view 3D layouts.
    • Semantic–Point Decoupling: Disconnect between object-level identification and precise point association.

Recommended future research directions include geometry-aware training (multi-view consistency and contrastive 3D supervision), reinforcement learning with geometry-informed reward structures, architectural inductive biases such as explicit pose encoders or differentiable projection modules, and integration with multi-agent planning frameworks. These approaches aim to bridge the accuracy, generalization, and reasoning gaps exposed by CrossPoint-Bench and pave the way for actionable, fine-grained vision-language capabilities (Wang et al., 4 Dec 2025).
