CrossPoint-Bench: Spatial VLM Benchmark
- CrossPoint-Bench is a hierarchical benchmark that evaluates vision-language models on fine-grained spatial grounding and cross-view point correspondence.
- It employs a multi-level task suite ranging from single-view grounding to continuous coordinate prediction using a large affordance-focused dataset.
- Baseline results highlight significant gaps in pixel-accurate correspondence, guiding future research in geometry-aware and multi-agent systems.
CrossPoint-Bench is a hierarchical benchmark designed to evaluate and advance the capability of vision-language models (VLMs) for Cross-View Point Correspondence (CVPC), addressing the challenge of precise spatial grounding and multi-view affordance reasoning in embodied AI contexts. Drawing inspiration from human cognitive processing (perceive, reason, correspond), it provides a suite of tasks ranging from fine-grained single-view grounding to continuous cross-view point prediction, built on a large-scale, affordance-focused dataset generated by a comprehensive automated pipeline. CrossPoint-Bench establishes new evaluation protocols, baseline results, and insights critical for future progress in actionable vision-language understanding (Wang et al., 4 Dec 2025).
1. Motivation and Problem Definition
CrossPoint-Bench targets two intrinsically interdependent capabilities fundamental for embodied agents and spatially-aware AI systems:
- Fine-grained coordinate prediction: Current VLMs can localize object regions (e.g., "the chair") but lack the precision to identify exact pixel locations critical for manipulation tasks (e.g., "grasp the chair handle at this precise point").
- Point-level cross-view correspondence: In multi-camera or multi-agent deployments, it is necessary to map a given point in one view (the source image $I_s$) to its visibility status and precise position in another view (the target image $I_t$).
This focus responds to the demands of robotic manipulation, multi-agent collaboration, and teleoperation, where pixel-accurate, affordance-aware cues are indispensable for real-world task execution. Without robust geometric consistency and high-resolution correspondence, VLMs are fundamentally limited in downstream actionable intelligence (Wang et al., 4 Dec 2025).
2. Hierarchical Benchmark Structure
CrossPoint-Bench institutes a multi-level task suite mirroring the structure of human cognitive processing:
- Level 1: Perceive — Fine-grained Grounding
- Task: Given the source image $I_s$ and a textual instruction $T$, predict the 2D location $p_s = (x_s, y_s)$ in $I_s$.
- Formulation: Input $(I_s, T) \rightarrow p_s$.
- Level 2: Reason — Visibility Reasoning
- Task: For a source point $p_s$ in $I_s$, determine whether the corresponding 3D point is visible in the target image $I_t$, outputting $v \in \{\text{visible}, \text{invisible}\}$.
- Formulation: Input $(I_s, p_s, I_t) \rightarrow v$.
- Level 3: Correspond — Cross-view Point Matching
- (a) Correspondence–Judgement: Discrete selection among candidate regions in $I_t$; output the index $k$ of the matching candidate.
- (b) Correspondence–Pointing: Continuous 2D point prediction in $I_t$, i.e., if the point is visible, predict the precise target location $p_t = (x_t, y_t)$.
- Formulation: Input $(I_s, p_s, I_t) \rightarrow k$ (Judgement) or $(I_s, p_s, I_t) \rightarrow p_t$ (Pointing).
Each task is further stratified by affordance granularity:
- General objects (e.g., "light switch")
- Semantic parts (e.g., "door handle")
Task complexity escalates systematically from single-view point grounding, to binary visibility classification, to region-level matching, culminating in unconstrained coordinate generation (Wang et al., 4 Dec 2025).
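The three-level structure can be summarized as a simple input/output schema. The dataclasses below are a minimal illustrative sketch of that schema; the field names and types are assumptions for exposition, not the benchmark's actual data format.

```python
from dataclasses import dataclass
from typing import Literal, Optional, Tuple

Point2D = Tuple[float, float]  # (x, y) pixel coordinates

@dataclass
class GroundingTask:
    """Level 1 (Perceive): single-view fine-grained grounding."""
    source_image: str          # path or id of the source image I_s
    instruction: str           # e.g. "point to the door handle"
    answer: Point2D            # ground-truth point p_s in I_s

@dataclass
class VisibilityTask:
    """Level 2 (Reason): is the 3D point behind p_s visible in I_t?"""
    source_image: str
    source_point: Point2D      # p_s
    target_image: str          # I_t
    answer: Literal["visible", "invisible"]

@dataclass
class CorrespondenceTask:
    """Level 3 (Correspond): judgement (pick a candidate) or pointing (regress p_t)."""
    source_image: str
    source_point: Point2D
    target_image: str
    candidates: Optional[list] = None        # candidate regions (judgement variant)
    answer_index: Optional[int] = None       # index of the matching candidate (judgement)
    answer_point: Optional[Point2D] = None   # continuous target point p_t (pointing)
```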
3. Evaluation Protocol and Metrics
CrossPoint-Bench comprises 1,000 QA samples derived from 100 held-out indoor scenes, strictly excluded from the pretraining dataset CrossPoint-378K to enforce scene-level independence. Human annotators achieve an overall score of 91.75%.
Metrics are structured by task modality:
- Discrete tasks (Grounding, Visibility, Judgement): Mean accuracy
- Continuous task (Correspondence–Pointing): In-mask hit rate, i.e., the fraction of predicted points $\hat{p}_t$ that fall inside the ground-truth target mask $M_t$ of image $I_t$:
$$\mathrm{HitRate} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\!\left[\hat{p}_t^{(i)} \in M_t^{(i)}\right],$$
where $M_t$ represents the ground-truth target mask in image $I_t$ (a scoring sketch is given below).
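To make the continuous metric concrete, the function below scores predicted points against binary ground-truth masks. It is a minimal sketch assuming masks are stored as boolean (H, W) arrays; it is not the benchmark's official scorer.

```python
import numpy as np

def in_mask_hit_rate(pred_points, gt_masks):
    """Fraction of predicted points that land inside the ground-truth target mask.

    pred_points: list of (x, y) pixel coordinates predicted in the target image I_t
    gt_masks:    list of boolean arrays of shape (H, W); True marks the target mask M_t
    """
    hits = 0
    for (x, y), mask in zip(pred_points, gt_masks):
        h, w = mask.shape
        xi, yi = int(round(x)), int(round(y))
        # A prediction outside the image bounds counts as a miss.
        if 0 <= xi < w and 0 <= yi < h and mask[yi, xi]:
            hits += 1
    return hits / max(len(pred_points), 1)
```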
Strict fairness is enforced: all models receive uniform prompts and may not use external retrieval, explicit 3D geometry, ensembling, or post-processing beyond their own finetuning regime (Wang et al., 4 Dec 2025).
4. Dataset Construction: CrossPoint-378K
CrossPoint-378K, supporting CrossPoint-Bench, is constructed using a robust automated pipeline motivated by the need for diversity, scale, and affordance relevance:
- Scope: 378,000 QA pairs from 900 indoor scenes (~23,000 images)
- Affordance focus: covering 162 verb-based interaction types and 8,000 object instances, emphasizing actionable regions such as handles and buttons.
Pipeline
- Image Sampling & Filtering: Downsample high-frame-rate ScanNet videos (average 1:50) to yield 23,000 high-quality images; Qwen2.5-VL-7B filters for frame quality.
- Affordance Segmentation: Gemini-2.5-Pro detects operable objects; Grounding-DINO provides bounding boxes; RoboBrain 2.0 extracts semantic parts; SAM 2.1 segments 58,000 fine-grained masks.
- Cross-view Pairing: Randomly sample pixels per mask, use ScanNet RGB-D back-projection or ScanNet++ COLMAP SfM tracks to recover the corresponding 3D point $P$, then reproject it into all viable views, yielding 111,000 matched correspondences (a geometric sketch of this step follows the list).
- QA Generation: Template-based auto-generation fills object class, action, and coordinate placeholders, yielding comprehensive coverage of all task types and spatial/depth variations.
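The geometric core of the RGB-D variant of cross-view pairing is a pinhole back-projection followed by reprojection into another calibrated view. The NumPy function below is an illustrative sketch assuming 3x3 intrinsics and 4x4 camera-to-world poses in the usual ScanNet convention; it is not the authors' pipeline code, and the occlusion tolerance `tol` is an arbitrary placeholder.

```python
import numpy as np

def reproject_point(uv_s, depth_s, K_s, T_s, K_t, T_t, depth_t=None, tol=0.05):
    """Back-project pixel uv_s from the source view into 3D, then project into the target view.

    uv_s:    (u, v) pixel in the source image
    depth_s: metric depth at uv_s (from the RGB-D frame)
    K_s/K_t: 3x3 intrinsics of the source/target cameras
    T_s/T_t: 4x4 camera-to-world poses of the source/target cameras
    depth_t: optional target depth map for a simple occlusion (visibility) check
    Returns (u_t, v_t, visible).
    """
    # Pixel + depth -> 3D point in source camera coordinates.
    u, v = uv_s
    xyz_cam = depth_s * np.linalg.inv(K_s) @ np.array([u, v, 1.0])
    # Source camera -> world -> target camera.
    xyz_world = T_s @ np.append(xyz_cam, 1.0)
    xyz_tcam = np.linalg.inv(T_t) @ xyz_world
    if xyz_tcam[2] <= 0:  # point lies behind the target camera
        return None, None, False
    # Project into the target image plane.
    uvw = K_t @ xyz_tcam[:3]
    u_t, v_t = uvw[0] / uvw[2], uvw[1] / uvw[2]
    visible = True
    if depth_t is not None:
        h, w = depth_t.shape
        ui, vi = int(round(u_t)), int(round(v_t))
        in_bounds = 0 <= ui < w and 0 <= vi < h
        # Occluded if the observed target depth disagrees with the reprojected depth.
        visible = in_bounds and abs(depth_t[vi, ui] - xyz_tcam[2]) < tol
    return u_t, v_t, visible
```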
This extensive dataset underpins both supervised training (notably for CroPond) and robust held-out evaluation (Wang et al., 4 Dec 2025).
5. Baseline Performance and Model Training
Evaluation on CrossPoint-Bench reveals significant disparities between state-of-the-art VLMs and human spatial abilities:
| Model | Fine-grained (%) | Visibility (%) | Correspondence (%) | Overall (%) |
|---|---|---|---|---|
| Human | 79.51 | 92.73 | 97.44 | 91.75 |
| Gemini-2.5-Pro | 32.92 | 67.27 | 60.26 | 37.10 |
| Qwen3-VL-235B-A22B-Think | 65.22 | 62.27 | 66.03 | 52.70 |
| CroPond-3B | 59.01 | 72.27 | 76.92 | 71.60 |
| CroPond-7B | 60.25 | 79.09 | 87.18 | 76.80 |
Key findings:
- Gemini-2.5-Pro's overall score of 37.10% lags 54.65 percentage points behind human annotators.
- The Correspondence–Pointing task is the most challenging, with Gemini-2.5-Pro reporting a hit rate of 16.41% versus a human score of 93.63%.
- CroPond-7B, trained on CrossPoint-378K and auxiliary data, achieves 76.8% overall and approaches 83% of human accuracy in correspondence pointing, outperforming Gemini-2.5-Pro by +39.7 percentage points.
CroPond Model Training Details
- Backbone: Qwen2.5-VL (3B or 7B parameter configurations)
- Supervision: Multi-source SFT jointly on CrossPoint-378K (378K), RefSpatial (200K), SAT (172K), MulSeT (104K), SPAR-7M (150K), LLaVA-1.5 instructions (400K)
- Loss: supervised next-token cross-entropy over the target response (standard SFT objective)
- Optimization: Full-parameter finetuning with learning rate 1e-5 (cosine schedule), effective batch size 2 via gradient accumulation, bf16 precision, and DeepSpeed ZeRO-3 (an illustrative configuration sketch follows this list).
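As a point of reference, the reported recipe maps naturally onto a standard Hugging Face `TrainingArguments` setup. The configuration below is a hedged sketch under that assumption (the output directory, DeepSpeed config path, epoch count, and per-device batch size are placeholders), not the authors' released training script.

```python
from transformers import TrainingArguments

# Illustrative SFT configuration mirroring the reported hyperparameters
# (full-parameter finetuning, lr 1e-5 with cosine schedule, bf16, DeepSpeed ZeRO-3).
training_args = TrainingArguments(
    output_dir="cropond-7b-sft",        # placeholder output path
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=1,      # assumption: effective batch 2 via accumulation
    gradient_accumulation_steps=2,
    bf16=True,
    num_train_epochs=1,                 # assumption: epoch count not reported
    deepspeed="ds_zero3.json",          # placeholder ZeRO-3 config file
    logging_steps=10,
    save_strategy="epoch",
)
```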
All resources (benchmark, dataset, CroPond models) are available at https://github.com/WangYipu2002/CrossPoint (Wang et al., 4 Dec 2025).
6. Key Insights and Directions
CrossPoint-Bench highlights persistent limitations and provides a roadmap for further development:
- Continuous-coordinate degradation: Models lose nearly 37 percentage points when shifting from discrete region selection to continuous 2D point prediction.
- Affordance gap: A consistent ~10-percentage-point gap separates semantic parts (e.g., "handle") from general object regions.
- Observed failure modes:
- Frame Transfer Failure: Incorrect handling of cross-view coordinate transformations.
- Spatial Reconstruction Failure: Inability to form coherent multi-view 3D layouts.
- Semantic–Point Decoupling: Disconnect between object-level identification and precise point association.
Recommended future research directions include geometry-aware training (multi-view consistency and contrastive 3D supervision), reinforcement learning with geometry-informed reward structures, architectural inductive biases such as explicit pose encoders or differentiable projection modules, and integration with multi-agent planning frameworks. These approaches aim to bridge the accuracy, generalization, and reasoning gaps exposed by CrossPoint-Bench and pave the way for actionable, fine-grained vision-language capabilities (Wang et al., 4 Dec 2025).