VST-Perception: 3D Spatial Perception Dataset

Updated 11 November 2025
  • VST-Perception (VST-P) is a large-scale dataset designed to equip vision–language models with detailed 3D spatial and spatiotemporal perception skills.
  • It unifies 19 spatial skills across single-image, multi-image, and video modalities with diverse, rigorously annotated prompts and filtering methods.
  • Fine-tuning on VST-P significantly improves spatial benchmarks, establishing a strong foundation for advanced 3D reasoning and physically grounded tasks.

VST-Perception (VST-P) is a large-scale dataset created to endow vision-language models (VLMs) with comprehensive 3D spatial and spatiotemporal perception abilities. This dataset constitutes the foundational stage of the Visual Spatial Tuning (VST) framework, systematically targeting the gap between conventional 2D visual processing and the deeper, physically grounded understanding of spatial layouts, object relations, and scene dynamics required for high-level visuospatial intelligence. VST-P is designed expressly for supervised fine-tuning of general-purpose VLMs, and comprises 4.1 million annotated samples distributed across 19 specialized spatial “skills” encompassing single images, paired images (“multi-image”), and video sequences (Yang et al., 7 Nov 2025).

1. Conceptual Framework and Purpose

VST-P (“VST-Perception”) serves as the foundational perceptual corpus within the VST system, whose goal is to cultivate human-like visuospatial competencies in VLMs. Its explicit design objective is to bridge the gap from 2D image/language grounding to three-dimensional scene perception, object localization, depth reasoning, spatiotemporal change detection, and cross-view understanding. Supervised fine-tuning (SFT) on VST-P is intended as the first stage (preceding structured spatial reasoning via VST-R and reinforcement learning), teaching a model to map visual input sequences—and associated language or programmatic prompts—to precise, physically meaningful spatial outputs.

2. Spatial Skills Coverage and Dataset Structure

VST-P incorporates 19 spatial skills, formally grouped by input modality: single-image, multi-image, and video. Each “skill” is instantiated by a distinct prompting and output paradigm, promoting the learning of parameterized geometric, physical, or spatial scene properties. Modal coverage is dominated by single-image samples (64.8%, ≈2.66M), followed by multi-image (33.1%, ≈1.36M), and a video segment (2.1%, ≈86K). Within each skill, input/output formats and instructional templates are diversified for maximal generalization.

Modality     | #Skills | % of Samples | Typical Samples per Skill
Single-image | 11      | 64.8         | 240K – 300K
Multi-image  | 7       | 33.1         | 180K – 260K
Video        | 1–3†    | 2.1          | 20K – 30K

† Video skills are merged multiformat tasks.

Representative skills include depth ranking (visual, textual, point-based), 3D object detection (single view and multiview), camera-motion classification, measurement tasks (height, largest dimension), inter-object relational reasoning, and spatially-grounded scene captioning.

3. Data Sources, Curation, and Annotation Pipelines

VST-P aggregates annotated samples from a broad array of aligned real and synthetic 3D sources:

  • Single-image: Depth maps from ScanNet++ (real-world), Hypersim (synthetic), and COCO with expert model pseudo-labeling; 3D boxes from ScanNet, ARKitScenes, Hypersim, SUN-RGBD, Matterport3D, Objectron, harmonized with EmbodiedScan corrections; synthetic rare layout completion via Isaac Sim (GUTopia).
  • Multi-image: RGB-D scans from ScanNet, ScanNet++, and ARKitScenes; point-cloud sampling for fine correspondence; camera-pose (Euler angle) derivation for motion (illustrated in the sketch after this list).
  • Video: Derived by temporally augmenting the multi-image engine with explicit timestamp, object-appearance, and dialog templates; one-third of video samples reorganized from VLM-3R into multiturn dialogs.
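
For the camera-motion labels, the relative pose between two views can be reduced to Euler angles and a translation. The following Python sketch shows one way to do this; the 4x4 camera-to-world pose convention and the "xyz" Euler order are assumptions for illustration, not the dataset's documented convention.

import numpy as np
from scipy.spatial.transform import Rotation

def relative_motion(pose_a, pose_b):
    """pose_a, pose_b: 4x4 camera-to-world matrices for two views.
    Returns Euler angles (degrees) and the translation of view B in view A's frame."""
    rel = np.linalg.inv(pose_a) @ pose_b          # pose of B expressed in A's frame
    euler = Rotation.from_matrix(rel[:3, :3]).as_euler("xyz", degrees=True)
    translation = rel[:3, 3]
    return euler, translation

# Toy usage: view B is view A rotated 30 degrees about the vertical axis.
pose_a = np.eye(4)
pose_b = np.eye(4)
pose_b[:3, :3] = Rotation.from_euler("y", 30, degrees=True).as_matrix()
print(relative_motion(pose_a, pose_b))  # rotation component close to 30 degrees about y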

Annotation is governed by:

  • FoV unification: Projecting all data to a single synthetic focal length according to

$hfov = 2\arctan\left(\frac{W}{2f}\right),\quad wfov = 2\arctan\left(\frac{H}{2f}\right)$

$W_{\text{new}} = 2 f_{\text{new}} \tan\left(\frac{hfov}{2}\right),\quad H_{\text{new}} = 2 f_{\text{new}} \tan\left(\frac{wfov}{2}\right)$

  • Occlusion filtering: Correspondences retained when $0\leq u<W$, $0\leq v<H$, $z>0$, and $|z_{gt}-z_{depth}|/z_{gt}\leq 0.05$ (see the sketch after this list).
  • Scene captioning: Vetted using strong off-the-shelf VLMs and spot-checked by human annotators.
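
The FoV-unification and occlusion-filtering rules above can be written compactly. The sketch below follows the stated formulas; the function names, the choice of f_new, and treating z as the ground-truth depth are illustrative assumptions rather than the released VST-P pipeline.

import numpy as np

def unified_image_size(W, H, f, f_new):
    """Size an image of width W and height H (focal length f, in pixels) takes on
    when remapped to the shared synthetic focal length f_new, preserving its FoV."""
    hfov = 2.0 * np.arctan(W / (2.0 * f))
    wfov = 2.0 * np.arctan(H / (2.0 * f))
    W_new = 2.0 * f_new * np.tan(hfov / 2.0)
    H_new = 2.0 * f_new * np.tan(wfov / 2.0)
    return int(round(W_new)), int(round(H_new))

def keep_correspondence(u, v, z, z_depth, W, H, tol=0.05):
    """Occlusion filter: the projected point must fall inside the image, lie in
    front of the camera (z > 0), and agree with the depth map within 5%."""
    if not ((0 <= u < W) and (0 <= v < H) and (z > 0)):
        return False
    return abs(z - z_depth) / z <= tol

# Example: a 959x696 image captured at f = 640 px, remapped to f_new = 500 px.
print(unified_image_size(959, 696, f=640.0, f_new=500.0))  # -> (749, 544)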

Instruction templates and output formats for each skill are systematically diversified using text, points, boxes, and visual markers to minimize prompt overfitting.
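
As a concrete illustration of this diversification, a minimal sketch might sample from a per-skill pool of paraphrased instructions at data-construction time; the template strings below are invented for illustration, not drawn from VST-P.

import random

DEPTH_RANK_TEMPLATES = [
    "Which marked point is closer to the camera, {a} or {b}?",
    "Between point {a} and point {b}, which one lies nearer to the viewer?",
    "Compare the depths of {a} and {b}; answer with the closer label.",
]

def build_depth_rank_prompt(label_a="A", label_b="B"):
    # Sampling a template per example reduces the chance the model latches
    # onto one fixed phrasing (prompt overfitting).
    template = random.choice(DEPTH_RANK_TEMPLATES)
    return template.format(a=label_a, b=label_b)

print(build_depth_rank_prompt())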

4. Prompt-Response Formats and Representative Samples

Sample prompts and expected output formalizations are standardized per skill. For depth-sorting (visual-point), the model receives a human-formulated question referencing visually indicated points, and outputs the nearer label ("A" or "B"). For 3D detection:

{
  "camera_params": {"hfov":69.2,"vfov":53.2,"W":959,"H":696},
  "task":"Detect the 3D boxes of 'printer'."
}
and the output consists of a JSON array of 3D bounding box parameter vectors. Multi-view correspondence, motion, and relational tasks adopt similar paradigms, augmenting with cross-image reference or temporal context as needed.
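
The exact box parameterization is not reproduced in this summary, but an illustrative response to the prompt above, assuming a center/size/yaw layout in metric units and radians, might look like:

[
  {"label": "printer", "center": [1.42, 0.35, 0.78], "size": [0.45, 0.38, 0.30], "yaw": 1.57}
]

Actual field names and ordering follow the dataset's own schema.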

5. Training Objectives and Quality Constraints

During VLM SFT, the loss is the token-level cross entropy over text regions:

$\mathcal{L}_\theta(x) = -\sum_{i=2}^{L} \mathbb{1}_{\{x_i\in\mathrm{text}\}}\, w_i \log p_\theta(x_i \mid x_{<i})$

with weights $w_i$ potentially modulated by skill or prompt type. For 3D detection during reinforcement learning, the combined reward averages IoU and F1:

$\mathcal{R}_{\mathrm{3d}}(y, \hat y) = \alpha\,\mathcal{R}_{\mathrm{iou}}(y, \hat y) + (1-\alpha)\,\mathcal{R}_{\mathrm{F1}}(y, \hat y),\quad \alpha = 0.5$
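
A minimal PyTorch sketch of these two signals, assuming per-token masks and weights are available; names such as text_mask and the normalization by the number of text tokens are illustrative choices, not the paper's implementation.

import torch
import torch.nn.functional as F

def masked_weighted_ce(logits, targets, text_mask, weights):
    """Token-level cross entropy restricted to text positions.
    logits:    (T, V) next-token logits
    targets:   (T,)   ground-truth token ids
    text_mask: (T,)   1.0 where the token belongs to the text response, else 0.0
    weights:   (T,)   per-token weights (e.g. by skill or prompt type)"""
    per_token = F.cross_entropy(logits, targets, reduction="none")  # (T,)
    masked = per_token * text_mask * weights
    # Normalizing by the number of text tokens is a common convention assumed here.
    return masked.sum() / text_mask.sum().clamp(min=1)

def detection_reward(r_iou, r_f1, alpha=0.5):
    """Combined 3D-detection reward: equal-weight average of the IoU and F1 terms."""
    return alpha * r_iou + (1.0 - alpha) * r_f1

# Toy usage with random tensors.
T, V = 8, 32000
logits = torch.randn(T, V)
targets = torch.randint(0, V, (T,))
text_mask = torch.tensor([0, 0, 0, 1, 1, 1, 1, 1], dtype=torch.float)
weights = torch.ones(T)
print(masked_weighted_ce(logits, targets, text_mask, weights).item())
print(detection_reward(r_iou=0.62, r_f1=0.70))  # -> 0.66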

Template and modality diversity, combined with rigorous occlusion and viewpoint filtering, aim to minimize annotation artifacts and prompt-induced shortcut learning.

6. Baseline Performance and Empirical Impact

Stage 1 fine-tuning of a pretrained VLM on VST-P (mixed with one-third general-domain data) yields notable gains on spatial benchmarks:

Benchmark  | Pretrained | +VST-P (SFT) | Δ
CVBench-3D | 72.6       | 93.4         | 20.8
3DSRBench  | 50.5       | 53.2         | 2.7
MMSI-Bench | 26.1       | 28.8         | 2.7
BLINK      | 49.2       | 50.6         | 1.4
VSIBench   | 29.6       | 38.7         | 9.1

The spatial average rises from ~49.9 to ~56.4 (+6.5), confirming that VST-P provides a foundation for VLMs to reason about depth, layout, and 3D relationships (Yang et al., 7 Nov 2025). A plausible implication is that broad, modality-consistent spatial perception pretraining substantially improves transfer to downstream spatial and physical reasoning tasks, even before domain-specific reasoning fine-tuning.

7. Limitations and Significance within the VST Framework

VST-P, as defined, is limited to perception-centric (not high-level reasoning) tasks; it does not encode multi-step spatial reasoning or abstract planning, which are addressed by the subsequent VST-R corpus and RL-based curriculum. Video samples, comprising 2.1% of the total, are less numerous than static or two-frame examples, potentially limiting temporal generalization. The curation pipeline may still underrepresent rare compositions or extreme viewpoints, despite synthetic data augmentation. Nonetheless, the results establish VST-P as an empirical benchmark and pretraining resource, uniquely structured for injecting rich spatial priors into large-scale multimodal architectures. Within the VST system, the dataset’s breadth of spatial modalities and its harmonized annotation are foundational drivers of the framework’s state-of-the-art generalization to physically grounded spatial benchmarks.
