ScenePoint: Structured 3D Scene Understanding
- ScenePoint is a structured 3D point-map dataset that encodes explicit (x, y, z) coordinates on a dense 2D grid, enabling unified scene understanding.
- It combines 6,562 multi-view RGB-D room scenes with 1 million pseudo-3D single-view images, covering both real indoor environments and large-scale web imagery.
- The dataset bridges 2D vision transformers and geometric reasoning, supporting applications like embodied navigation, scene retrieval, and fine-grained localization.
ScenePoint refers to a large-scale point-map dataset and representational paradigm for structured 3D scene understanding via explicit 3D coordinates stored on a dense 2D grid. Developed in the context of the POMA-3D framework, ScenePoint enables 3D representation learning that bridges the gap between 2D computer vision foundation models and geometry-based 3D understanding, supporting applications ranging from embodied navigation to 3D question answering, scene retrieval, and fine-grained localization. The ScenePoint corpus comprises over 6,500 multi-view room-level RGB-D scenes and 1 million pseudo-3D single-view images, achieving comprehensive coverage of typical indoor environments and large-scale web imagery. By design, ScenePoint provides a geometric grounding that preserves global 3D structure and is naturally compatible with 2D vision transformer inputs, supporting canonical-coordinate alignment for multi-view learning (Mao et al., 20 Nov 2025).
1. Definition and Core Structure
ScenePoint constitutes a point-map dataset in which each point map is a structured 2D grid $P \in \mathbb{R}^{H \times W \times 3}$, with each grid element (pixel) storing explicit $(x, y, z)$ coordinates. In contrast to depth maps, which capture only relative per-view depth, each ScenePoint map encodes global 3D geometry aligned to a canonical coordinate frame.
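For concreteness, the sketch below contrasts a depth map with a point map as dense arrays; the resolution and variable names are illustrative and not taken from the released code.

```python
import numpy as np

H, W = 480, 640                     # illustrative image resolution

# A depth map stores one scalar (relative depth) per pixel.
depth_map = np.zeros((H, W), dtype=np.float32)

# A ScenePoint-style point map stores explicit (x, y, z) coordinates in the
# canonical scene frame per pixel, preserving global 3D structure on a 2D
# grid that 2D ViT backbones can consume directly.
point_map = np.zeros((H, W, 3), dtype=np.float32)

# Example: the canonical-frame 3D location of pixel (row=100, col=200).
xyz = point_map[100, 200]           # -> array([x, y, z])
```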
ScenePoint includes two principal subsets:
- 6,562 room-level RGB-D scenes sampled from ScanNet (1,499), 3RScan (1,204), and ARKitScenes (3,850). For each scene, multi-view RGB-D frames are sampled to maximize spatial coverage, with each view yielding a point map and an associated LLM-generated caption.
- 1,000,000 single-view "image scenes," produced by reconstructing web images from ConceptualCaptions into point maps using a pretrained 3D lifting model. These inherit image-level alt-texts.
The dataset’s focus on structured grid-based 3D encoding allows direct compatibility with 2D ViT backbones for downstream representation learning (Mao et al., 20 Nov 2025).
2. Canonical Space and View Dependency
All ScenePoint maps are expressed in a shared canonical (world) coordinate frame, ensuring that geometry across views and sources is co-registered. The unprojection pipeline transforms per-pixel depths $d(u, v)$, camera intrinsics $K$, and camera-to-world extrinsics $(R, t)$ into canonical scene coordinates:

$$P(u, v) = R \, d(u, v) \, K^{-1} [u, v, 1]^{\top} + t,$$

where $R$ is the rotation matrix and $t$ is the translation vector of the camera-to-world transform. Each multi-view point map is generated using its view-specific pose but references the same canonical space, allowing unified scene fusion.
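A minimal NumPy sketch of this unprojection, assuming a pinhole intrinsic matrix and camera-to-world extrinsics; the function name and conventions are illustrative rather than the released implementation.

```python
import numpy as np

def unproject_to_canonical(depth, K, R, t):
    """Lift a depth map (H, W) into a canonical-frame point map (H, W, 3).

    depth : per-pixel depth d(u, v)
    K     : 3x3 camera intrinsics
    R, t  : camera-to-world rotation (3x3) and translation (3,)
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))        # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)      # (H, W, 3) homogeneous pixels

    # Back-project to camera coordinates: d * K^{-1} [u, v, 1]^T
    cam = depth[..., None] * (pix @ np.linalg.inv(K).T)

    # Transform to the shared canonical (world) frame: R x_cam + t
    return cam @ R.T + t
```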
The room-level scans provide full-resolution point maps for each view (roughly 300,000 points per scene), enabling dense geometric supervision for 3D consistency. The single-view maps, though pseudo-3D, follow the same alignment pipeline.
3. Data Collection, Preprocessing, and Augmentation
Room-level subset data is sourced from multiple public RGB-D datasets. For each scan, a maximum-coverage sampler—derived from Video-3D-LLM methodology—selects 32 RGB-D frames that ensure spatial diversity. The corresponding point maps are constructed without downsampling, preserving full depth resolution. Sensor noise and other artifacts are intentionally left unprocessed, with robustness to such input expected to be handled by representation learning.
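The maximum-coverage sampling step can be pictured as a greedy set-cover over voxelized per-frame point sets. The sketch below illustrates that general idea under stated assumptions (voxel size is a placeholder); it is not the exact Video-3D-LLM procedure.

```python
import numpy as np

def greedy_max_coverage(frame_points, num_frames=32, voxel=0.1):
    """Greedily pick frames whose point maps cover the most new space.

    frame_points : list of (N_i, 3) arrays, one per candidate RGB-D frame
    num_frames   : number of frames to keep per scene
    voxel        : voxel edge length used to discretize coverage (placeholder)
    """
    # Discretize each frame's points into a set of occupied voxel indices.
    voxel_sets = [set(map(tuple, np.floor(p / voxel).astype(int)))
                  for p in frame_points]

    covered, selected = set(), []
    for _ in range(min(num_frames, len(voxel_sets))):
        # Pick the frame that adds the largest number of unseen voxels.
        gains = [len(s - covered) for s in voxel_sets]
        best = int(np.argmax(gains))
        if gains[best] == 0:
            break
        selected.append(best)
        covered |= voxel_sets[best]
    return selected
```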
For the single-view subset, web images from ConceptualCaptions are lifted into pseudo-3D via a pretrained VGGT depth model, estimating both depth and camera pose, which are then unprojected as described.
Caption curation proceeds via LLMs (InternVL3-14B), with individual captions ranked using FG-CLIP cosine similarity. Scene-level captions for room scans are inherited from the SceneVerse corpus.
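Caption ranking by image-text cosine similarity can be sketched generically with any CLIP-style encoder pair; the encoder callables below are placeholders standing in for FG-CLIP, not its actual API.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def rank_captions(image_feat_fn, text_feat_fn, image, captions):
    """Rank candidate captions for one image by CLIP-style cosine similarity.

    image_feat_fn / text_feat_fn : placeholder callables returning embeddings
                                   of shape (1, D) and (C, D) respectively.
    """
    img = F.normalize(image_feat_fn(image), dim=-1)        # (1, D)
    txt = F.normalize(text_feat_fn(captions), dim=-1)      # (C, D)
    sims = (txt @ img.T).squeeze(-1)                       # (C,) cosine similarities
    order = torch.argsort(sims, descending=True)
    return [(captions[i], sims[i].item()) for i in order]
```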
Random masking augmentation is employed during POMA-JEPA pretraining: each view receives a random block mask whose scale and aspect ratio are drawn from fixed ranges, supporting predictive learning under partial context (Mao et al., 20 Nov 2025).
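A minimal sketch of block-mask sampling for a JEPA-style objective; the scale and aspect-ratio ranges are left as parameters because the specific values from the source paper are not reproduced here.

```python
import math
import random
import numpy as np

def sample_block_mask(grid_h, grid_w, scale_range, aspect_range):
    """Sample one rectangular mask over a (grid_h, grid_w) token grid.

    scale_range  : (min, max) fraction of tokens to mask
    aspect_range : (min, max) height/width ratio of the masked block
    """
    scale = random.uniform(*scale_range)
    aspect = random.uniform(*aspect_range)
    area = scale * grid_h * grid_w

    h = min(grid_h, max(1, int(round(math.sqrt(area * aspect)))))
    w = min(grid_w, max(1, int(round(math.sqrt(area / aspect)))))

    top = random.randint(0, grid_h - h)
    left = random.randint(0, grid_w - w)

    mask = np.zeros((grid_h, grid_w), dtype=bool)
    mask[top:top + h, left:left + w] = True      # True = masked tokens
    return mask
```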
4. Pretraining Regime and Usage
ScenePoint is designed exclusively for pretraining and is not formally split into train/val/test sets. The POMA-3D regime leverages ScenePoint across two stages:
- Warm-up Stage: Self-supervised alignment on all 1M single-view point maps, optimizing the view-level CLIP alignment loss $\mathcal{L}_{\text{view}}$.
- Main Stage: Joint optimization on the 6.5K multi-view room scenes, combining
  - View-level CLIP alignment ($\mathcal{L}_{\text{view}}$)
  - Scene-level CLIP alignment ($\mathcal{L}_{\text{scene}}$)
  - Multi-view consistency via POMA-JEPA ($\mathcal{L}_{\text{JEPA}}$)
LoRA fine-tuning (rank 32) is applied to the context encoder to enable efficient model adaptation. The ScenePointDataset PyTorch class supports dynamic multi-view sampling and caption loading in this regime.
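A minimal sketch of how the alignment term and the stage-wise loss combination could look, assuming a standard symmetric InfoNCE formulation for the CLIP-style losses and unit weights; the temperature and weights are placeholders, not values from the paper.

```python
import torch
import torch.nn.functional as F

def clip_alignment_loss(pm_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE between point-map and caption embeddings.

    pm_emb, txt_emb : (B, D) paired embeddings; the temperature is a placeholder.
    Stands in for both the view-level and scene-level alignment terms.
    """
    pm = F.normalize(pm_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = pm @ txt.T / temperature                      # (B, B) similarity matrix
    labels = torch.arange(pm.size(0), device=pm.device)    # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.T, labels))

def total_loss(l_view, l_scene=None, l_jepa=None):
    """Warm-up stage: l_view only. Main stage: sum of all three terms (unit weights assumed)."""
    return l_view + sum(x for x in (l_scene, l_jepa) if x is not None)
```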
5. Downstream Impact and Benchmarks
Pretraining on ScenePoint enables POMA-3D to outperform prior specialist 3D models and vision-language models on several benchmarks using geometry-only input:
- 3D Question Answering (EM@1):
  - ScanQA: FG-CLIP 20.9 → POMA-3D (spec) 22.3
  - SQA3D: FG-CLIP 49.5 → POMA-3D (spec) 51.1
  - Hypo3D: FG-CLIP 31.1 → POMA-3D (spec) 33.4
- Embodied Navigation (Success Rate, 4-direction/8-direction):
  - FG-CLIP (pm) 39.3/20.4 → POMA-3D (spec) 40.4/21.2
- Scene Retrieval (R@1–1 / R@1–5 / R@5–1 / R@5–5 on ScanRefer):
  - FG-CLIP (pm) 0.50/2.00/0.25/2.81 → POMA-3D 9.31/27.9/29.4/59.4
- Embodied Localization: POMA-3D reliably retrieves the correct subset of point-map views given situational language, outperforming both specialist 3D models and generalist vision-language models.
These results demonstrate that point-map-based pretraining transfers strongly to geometric reasoning, cross-view correspondence, and text-guided localization, leveraging only explicit geometry.
6. Technical Implementation and Accessibility
ScenePoint is released under an MIT-style license alongside the full POMA-3D framework. Reference code exposes a "ScenePointDataset" for efficient multi-view sampling, caption access, and pretraining orchestration. The frozen vision-language backbone is FG-CLIP’s ViT-B/16, with encoders for both context and target point maps initialized from this model. No explicit denoising or mesh post-processing is performed.
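An illustrative PyTorch Dataset showing the general shape of multi-view point-map and caption loading; it mirrors the role of the released ScenePointDataset, but the class name, constructor arguments, and data layout below are assumptions rather than the actual interface.

```python
import random
import torch
from torch.utils.data import Dataset, DataLoader

class PointMapScenes(Dataset):
    """Illustrative stand-in for a multi-view point-map dataset.

    Each scene holds several (H, W, 3) canonical-frame point maps plus a
    caption; a random subset of views is drawn on every __getitem__ call.
    Assumes every scene provides at least `views_per_sample` views so that
    the default DataLoader collation can stack samples.
    """
    def __init__(self, scenes, views_per_sample=4):
        self.scenes = scenes                  # list of dicts: {"views": [...], "caption": str}
        self.views_per_sample = views_per_sample

    def __len__(self):
        return len(self.scenes)

    def __getitem__(self, idx):
        scene = self.scenes[idx]
        views = random.sample(scene["views"], self.views_per_sample)
        point_maps = torch.stack([torch.as_tensor(v, dtype=torch.float32)
                                  for v in views])          # (V, H, W, 3)
        return point_maps, scene["caption"]

# Usage: loader = DataLoader(PointMapScenes(scenes), batch_size=8, shuffle=True)
```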
7. Role within Emerging ScenePoint Paradigms
ScenePoint advances the broader paradigm of bidirectional point-level language–geometry interfaces. In systems such as "Talking Points: Describing and Localizing Pixels" (Rusanovsky et al., 16 Oct 2025), a ScenePoint-style instantiation maps between pixels and free-form language, enabling pixel-precise language grounding via two modules: a Descriptor (ScenePoint to language) and a Localizer (language to ScenePoint). This bidirectional interface supports controlled editing, AR anchoring, robotic manipulation, and fine-resolution scene understanding, suggesting that ScenePoint-style representations serve both as a pretraining substrate and as a foundation for pixel- and point-level semantic interaction. A plausible implication is that large-scale canonical point maps paired with language descriptions may become standard for training unified geometric vision-language models, particularly as view consistency and fine-grained control become more central to embodied AI and interactive scene interpretation (Mao et al., 20 Nov 2025, Rusanovsky et al., 16 Oct 2025).