CrossPoint-Bench: Spatial VLM Benchmark
- CrossPoint-Bench is a hierarchical benchmark that evaluates vision-language models on fine-grained spatial grounding and cross-view point correspondence.
- It employs a multi-level task suite ranging from single-view grounding to continuous coordinate prediction using a large affordance-focused dataset.
- Baseline results highlight significant gaps in pixel-accurate correspondence, guiding future research in geometry-aware and multi-agent systems.
CrossPoint-Bench is a hierarchical benchmark designed to evaluate and advance the capability of vision-language models (VLMs) for Cross-View Point Correspondence (CVPC), addressing the challenge of precise spatial grounding and multi-view affordance reasoning in embodied AI contexts. Drawing inspiration from human cognitive processing (perceive, reason, correspond), it provides a suite of tasks ranging from fine-grained single-view grounding to continuous cross-view point prediction, built on a large-scale, affordance-focused dataset generated by a comprehensive automated pipeline. CrossPoint-Bench establishes new evaluation protocols, baseline results, and insights critical for future progress in actionable vision-language understanding (Wang et al., 4 Dec 2025).
1. Motivation and Problem Definition
CrossPoint-Bench targets two intrinsically interdependent capabilities fundamental for embodied agents and spatially-aware AI systems:
- Fine-grained coordinate prediction: Current VLMs can localize object regions (e.g., "the chair") but lack the precision to identify exact pixel locations critical for manipulation tasks (e.g., "grasp the chair handle at this precise point").
- Point-level cross-view correspondence: In multi-camera or multi-agent deployments, it is necessary to map a given point in one view (the source image $I_s$) to its visibility status and precise position in another view (the target image $I_t$).
This focus responds to the demands of robotic manipulation, multi-agent collaboration, and teleoperation, where pixel-accurate, affordance-aware cues are indispensable for real-world task execution. Without robust geometric consistency and high-resolution correspondence, VLMs are fundamentally limited in downstream actionable intelligence (Wang et al., 4 Dec 2025).
2. Hierarchical Benchmark Structure
CrossPoint-Bench institutes a multi-level task suite mirroring the structure of human cognitive processing:
- Level 1: Perceive — Fine-grained Grounding
- Task: Given the source image $I_s$ and a textual instruction $T$, predict the 2D location $p_s = (x_s, y_s)$ in $I_s$.
- Formulation: Input $(I_s, T) \rightarrow p_s$.
- Level 2: Reason — Visibility Reasoning
- Task: For a source point $p_s$ in $I_s$, determine whether the corresponding 3D point is visible in the target image $I_t$, outputting $v \in \{\text{visible}, \text{invisible}\}$.
- Formulation: Input $(I_s, p_s, I_t) \rightarrow v$.
- Level 3: Correspond — Cross-view Point Matching
- (a) Correspondence–Judgement: Discrete selection among candidate regions in $I_t$; output the index $k$ of the matching candidate.
- (b) Correspondence–Pointing: Continuous 2D point prediction in $I_t$, i.e., if the point is visible, predict the precise target location $p_t = (x_t, y_t)$.
- Formulation: Input $(I_s, p_s, I_t) \rightarrow k$ (Judgement) or $(I_s, p_s, I_t) \rightarrow p_t$ (Pointing).
Each task is further stratified by affordance granularity:
- General objects (e.g., "light switch")
- Semantic parts (e.g., "door handle")
Task complexity escalates systematically from single-view point grounding, to binary visibility classification, to region-level matching, culminating in unconstrained coordinate generation (Wang et al., 4 Dec 2025).
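The three-level structure can be summarized as a simple input/output schema. The dataclasses below are a minimal illustrative sketch of that schema; the field names and types are assumptions for exposition, not the benchmark's actual data format.

```python
from dataclasses import dataclass
from typing import Literal, Optional, Tuple

Point2D = Tuple[float, float]  # (x, y) pixel coordinates

@dataclass
class GroundingTask:
    """Level 1 (Perceive): single-view fine-grained grounding."""
    source_image: str          # path or id of the source image I_s
    instruction: str           # e.g. "point to the door handle"
    answer: Point2D            # ground-truth point p_s in I_s

@dataclass
class VisibilityTask:
    """Level 2 (Reason): is the 3D point behind p_s visible in I_t?"""
    source_image: str
    source_point: Point2D      # p_s
    target_image: str          # I_t
    answer: Literal["visible", "invisible"]

@dataclass
class CorrespondenceTask:
    """Level 3 (Correspond): judgement (pick a candidate) or pointing (regress p_t)."""
    source_image: str
    source_point: Point2D
    target_image: str
    candidates: Optional[list] = None        # candidate regions (judgement variant)
    answer_index: Optional[int] = None       # index of the matching candidate (judgement)
    answer_point: Optional[Point2D] = None   # continuous target point p_t (pointing)
```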
3. Evaluation Protocol and Metrics
CrossPoint-Bench comprises 1,000 QA samples derived from 100 held-out indoor scenes, strictly excluded from the pretraining dataset CrossPoint-378K to enforce scene-level independence. Human annotators achieve an overall score of 91.75%.
Metrics are structured by task modality:
- Discrete tasks (Grounding, Visibility, Judgement): Mean accuracy
- Continuous task (Correspondence–Pointing): In-mask hit rate, i.e., the fraction of predicted points $\hat{p}_t$ that fall inside the ground-truth target mask $M_t$ of image $I_t$:
$$\mathrm{HitRate} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\!\left[\hat{p}_t^{(i)} \in M_t^{(i)}\right],$$
where $M_t$ represents the ground-truth target mask in image $I_t$ (a scoring sketch is given below).
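To make the continuous metric concrete, the function below scores predicted points against binary ground-truth masks. It is a minimal sketch assuming masks are stored as boolean (H, W) arrays; it is not the benchmark's official scorer.

```python
import numpy as np

def in_mask_hit_rate(pred_points, gt_masks):
    """Fraction of predicted points that land inside the ground-truth target mask.

    pred_points: list of (x, y) pixel coordinates predicted in the target image I_t
    gt_masks:    list of boolean arrays of shape (H, W); True marks the target mask M_t
    """
    hits = 0
    for (x, y), mask in zip(pred_points, gt_masks):
        h, w = mask.shape
        xi, yi = int(round(x)), int(round(y))
        # A prediction outside the image bounds counts as a miss.
        if 0 <= xi < w and 0 <= yi < h and mask[yi, xi]:
            hits += 1
    return hits / max(len(pred_points), 1)
```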
Strict fairness is enforced: all models receive uniform prompts and may not use external retrieval, explicit 3D geometry, ensembling, or post-processing beyond their own finetuning regime (Wang et al., 4 Dec 2025).
4. Dataset Construction: CrossPoint-378K
CrossPoint-378K, supporting CrossPoint-Bench, is constructed using a robust automated pipeline motivated by the need for diversity, scale, and affordance relevance:
- Scope: 378,000 QA pairs from 900 indoor scenes (~23,000 images)
- Affordance focus: covering 162 verb-based interaction types and 8,000 object instances, emphasizing actionable regions such as handles and buttons.
Pipeline
- Image Sampling & Filtering: Downsample high-frame-rate ScanNet videos (average 1:50) to yield 23,000 high-quality images; Qwen2.5-VL-7B filters for frame quality.
- Affordance Segmentation: Gemini-2.5-Pro detects operable objects; Grounding-DINO provides bounding boxes; RoboBrain 2.0 extracts semantic parts; SAM 2.1 segments 58,000 fine-grained masks.
- Cross-view Pairing: Randomly sample pixels per mask, use ScanNet RGB-D back-projection or ScanNet++ COLMAP SfM tracks to recover the corresponding 3D point $P$, then reproject it into all viable views, yielding 111,000 matched correspondences (a geometric sketch of this step follows the list).
- QA Generation: Template-based auto-generation fills object class, action, and coordinate placeholders, yielding comprehensive coverage of all task types and spatial/depth variations.
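The geometric core of the RGB-D variant of cross-view pairing is a pinhole back-projection followed by reprojection into another calibrated view. The NumPy function below is an illustrative sketch assuming 3x3 intrinsics and 4x4 camera-to-world poses in the usual ScanNet convention; it is not the authors' pipeline code, and the occlusion tolerance `tol` is an arbitrary placeholder.

```python
import numpy as np

def reproject_point(uv_s, depth_s, K_s, T_s, K_t, T_t, depth_t=None, tol=0.05):
    """Back-project pixel uv_s from the source view into 3D, then project into the target view.

    uv_s:    (u, v) pixel in the source image
    depth_s: metric depth at uv_s (from the RGB-D frame)
    K_s/K_t: 3x3 intrinsics of the source/target cameras
    T_s/T_t: 4x4 camera-to-world poses of the source/target cameras
    depth_t: optional target depth map for a simple occlusion (visibility) check
    Returns (u_t, v_t, visible).
    """
    # Pixel + depth -> 3D point in source camera coordinates.
    u, v = uv_s
    xyz_cam = depth_s * np.linalg.inv(K_s) @ np.array([u, v, 1.0])
    # Source camera -> world -> target camera.
    xyz_world = T_s @ np.append(xyz_cam, 1.0)
    xyz_tcam = np.linalg.inv(T_t) @ xyz_world
    if xyz_tcam[2] <= 0:  # point lies behind the target camera
        return None, None, False
    # Project into the target image plane.
    uvw = K_t @ xyz_tcam[:3]
    u_t, v_t = uvw[0] / uvw[2], uvw[1] / uvw[2]
    visible = True
    if depth_t is not None:
        h, w = depth_t.shape
        ui, vi = int(round(u_t)), int(round(v_t))
        in_bounds = 0 <= ui < w and 0 <= vi < h
        # Occluded if the observed target depth disagrees with the reprojected depth.
        visible = in_bounds and abs(depth_t[vi, ui] - xyz_tcam[2]) < tol
    return u_t, v_t, visible
```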
This extensive dataset underpins both supervised training (notably for CroPond) and robust held-out evaluation (Wang et al., 4 Dec 2025).
5. Baseline Performance and Model Training
Evaluation on CrossPoint-Bench reveals significant disparities between state-of-the-art VLMs and human spatial abilities:
| Model | Fine-grained (%) | Visibility (%) | Correspondence (%) | Overall (%) |
|---|---|---|---|---|
| Human | 79.51 | 92.73 | 97.44 | 91.75 |
| Gemini-2.5-Pro | 32.92 | 67.27 | 60.26 | 37.10 |
| Qwen3-VL-235B-A22B-Think | 65.22 | 62.27 | 66.03 | 52.70 |
| CroPond-3B | 59.01 | 72.27 | 76.92 | 71.60 |
| CroPond-7B | 60.25 | 79.09 | 87.18 | 76.80 |
Key findings:
- Gemini-2.5-Pro's overall score of 37.10% lags 54.65 percentage points behind human annotators.
- The Correspondence–Pointing task is the most challenging, with Gemini-2.5-Pro reporting a hit rate of 16.41% versus a human score of 93.63%.
- CroPond-7B, trained on CrossPoint-378K and auxiliary data, achieves 76.8% overall and approaches 83% of human accuracy in correspondence pointing, outperforming Gemini-2.5-Pro by +39.7 percentage points.
CroPond Model Training Details
- Backbone: Qwen2.5-VL (3B or 7B parameter configurations)
- Supervision: Multi-source SFT jointly on CrossPoint-378K (378K), RefSpatial (200K), SAT (172K), MulSeT (104K), SPAR-7M (150K), LLaVA-1.5 instructions (400K)
- Loss: supervised next-token cross-entropy over the target response (standard SFT objective)
- Optimization: Full-parameter finetuning with learning rate 1e-5 (cosine schedule), effective batch size 2 via gradient accumulation, bf16 precision, and DeepSpeed ZeRO-3 (an illustrative configuration sketch follows this list).
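As a point of reference, the reported recipe maps naturally onto a standard Hugging Face `TrainingArguments` setup. The configuration below is a hedged sketch under that assumption (the output directory, DeepSpeed config path, epoch count, and per-device batch size are placeholders), not the authors' released training script.

```python
from transformers import TrainingArguments

# Illustrative SFT configuration mirroring the reported hyperparameters
# (full-parameter finetuning, lr 1e-5 with cosine schedule, bf16, DeepSpeed ZeRO-3).
training_args = TrainingArguments(
    output_dir="cropond-7b-sft",        # placeholder output path
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=1,      # assumption: effective batch 2 via accumulation
    gradient_accumulation_steps=2,
    bf16=True,
    num_train_epochs=1,                 # assumption: epoch count not reported
    deepspeed="ds_zero3.json",          # placeholder ZeRO-3 config file
    logging_steps=10,
    save_strategy="epoch",
)
```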
All resources (benchmark, dataset, CroPond models) are available at https://github.com/WangYipu2002/CrossPoint (Wang et al., 4 Dec 2025).
6. Key Insights and Directions
CrossPoint-Bench highlights persistent limitations and provides a roadmap for further development:
- Continuous-coordinate degradation: Models lose nearly 37 percentage points when shifting from discrete region selection to continuous 2D point prediction.
- Affordance gap: A consistent ~10-percentage-point gap separates semantic parts (e.g., "handle") from general object regions.
- Observed failure modes:
- Frame Transfer Failure: Incorrect handling of cross-view coordinate transformations.
- Spatial Reconstruction Failure: Inability to form coherent multi-view 3D layouts.
- Semantic–Point Decoupling: Disconnect between object-level identification and precise point association.
Recommended future research directions include geometry-aware training (multi-view consistency and contrastive 3D supervision), reinforcement learning with geometry-informed reward structures, architectural inductive biases such as explicit pose encoders or differentiable projection modules, and integration with multi-agent planning frameworks. These approaches aim to bridge the accuracy, generalization, and reasoning gaps exposed by CrossPoint-Bench and pave the way for actionable, fine-grained vision-language capabilities (Wang et al., 4 Dec 2025).