Scene-30K: 3D Scene Reasoning Dataset
- Scene-30K is a synthetic Chain-of-Thought dataset that provides detailed intermediate reasoning steps for enhanced 3D scene understanding.
- It integrates multi-view visual and textual features through automatic annotation and prompt-based generation to deliver structured reasoning triples.
- Utilized in pretraining, Scene-30K improves performance by around 10% on tasks such as captioning, visual grounding, and spatial reasoning.
The Scene-30K dataset is a synthetically curated, high-quality Chain-of-Thought (CoT) dataset tailored for advancing 3D scene understanding, reasoning, and unified vision–language modeling. Introduced alongside the 3D-R1 model, it sits at the intersection of data-driven foundation modeling and step-wise reasoning in 3D environments, leveraging automatic annotation technologies, established 3D-VL benchmarks, and advanced prompt-based generation engines.
1. Dataset Definition and Structure
Scene-30K is explicitly designed for machine reasoning in 3D visual–language tasks and represents a departure from conventional captioning or answer-only benchmarks by offering detailed intermediate logical steps. Each example comprises:
- A scene identifier tied to a real or synthetic 3D environment.
- An associated natural language question referencing the scene or its objects.
- A machine-generated, multi-step reasoning chain, clearly demarcated via <think> ... </think> tags.
- A final answer, enclosed within <answer> ... </answer> tags.
Quantitatively, Scene-30K consists of approximately 1,500 reconstructed 3D scenes and nearly 33,000 annotated objects. Each sample encodes multimodal relational information paired with step-by-step language explanations, reflecting a structured template that is instrumental for training models to acquire not only correct answers but also interpretable, process-oriented outputs (Huang et al., 31 Jul 2025).
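For illustration, a minimal sketch of how one such record might be serialized is shown below; the field names, scene identifier, and content are hypothetical and do not reflect the released schema.

```python
# Illustrative Scene-30K-style record; field names and values are hypothetical,
# not the official schema of the released dataset.
sample = {
    "scene_id": "scannet_scene0042_00",          # identifier of the 3D environment
    "question": "Which chair is closest to the window?",
    "reasoning": (
        "<think>"
        "Step 1: Locate the window on the east wall. "
        "Step 2: Identify the three chairs in the room. "
        "Step 3: Compare their distances to the window; the brown chair is nearest."
        "</think>"
    ),
    "answer": "<answer>The brown chair next to the desk.</answer>",
}
```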
2. Data Synthesis and Generation Methodology
The dataset is assembled by synthesizing examples from pre-existing 3D vision–language sources, including ScanQA, ScanRefer, Nr3D, and SceneVerse. The creation pipeline includes:
- Parsing original scene data (point clouds and multi-view renderings).
- Employing a 3D scene description generator (a pre-trained vision–language model) to summarize the scene’s objects, spatial arrangements, and global layout into concise textual form.
- Feeding the scene description and a relevant question into Gemini 2.5 Pro alongside a specialized prompt; Gemini 2.5 Pro then outputs a structured, multi-step reasoning segment and a final succinct answer.
- Applying an automated filtering and quality-control layer: examples must follow strict output formatting, reach adequate sequence length, exhibit multi-step reasoning, and pass a consistency check (using normalized Levenshtein similarity between a regenerated answer and the original).
This process yields an initial pool of roughly 35,000 examples; after filtering for reasoning depth and output validity, 30,000 high-quality CoT samples remain.
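A simplified sketch of such a quality filter follows; the thresholds, the step-counting heuristic, and the function names are illustrative assumptions rather than the paper's exact implementation.

```python
import re

def levenshtein(a: str, b: str) -> int:
    """Standard edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_similarity(a: str, b: str) -> float:
    """1.0 for identical strings, 0.0 for maximally different ones."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def keep_sample(output: str, regenerated_answer: str,
                min_tokens: int = 50, sim_threshold: float = 0.8) -> bool:
    """Quality filter in the spirit of the Scene-30K pipeline.
    Thresholds and the multi-step heuristic are illustrative assumptions."""
    think = re.search(r"<think>(.*?)</think>", output, re.S)
    answer = re.search(r"<answer>(.*?)</answer>", output, re.S)
    if not (think and answer):                 # strict output formatting
        return False
    if len(output.split()) < min_tokens:       # adequate sequence length
        return False
    if think.group(1).count(".") < 2:          # crude multi-step reasoning check
        return False
    # Consistency: a regenerated answer should closely match the original one.
    return normalized_similarity(answer.group(1).strip(),
                                 regenerated_answer.strip()) >= sim_threshold
```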
3. Function and Integration in Model Training
Scene-30K is deployed as the cold-start, supervised “pretraining” dataset for the 3D-R1 foundation model. During this phase, the model acquires the dual ability to (a) solve 3D visual–language reasoning tasks and (b) generate structured intermediate reasoning that is both process- and answer-oriented. The dataset’s composition renders it suitable for tasks demanding explicit multi-hop reasoning about spatial, semantic, and relational aspects of complex 3D scenes.
This cold-start initialization is critical: models trained in this fashion not only converge to higher accuracy on downstream 3D-VL tasks but also internalize the ability to explicate reasoning, providing more interpretable outputs during inference.
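As a minimal sketch of this cold-start stage, a Scene-30K record (using the hypothetical layout from the earlier sketch) could be converted into a supervised (prompt, target) pair as below; the prompt format is an assumption, not the published 3D-R1 recipe.

```python
def build_sft_pair(sample: dict) -> tuple[str, str]:
    """Turn a Scene-30K-style record into a (prompt, target) pair for
    supervised cold-start training. The prompt/target formatting here is
    an assumption for illustration."""
    prompt = (
        f"Scene: {sample['scene_id']}\n"
        f"Question: {sample['question']}\n"
    )
    # The model is supervised on the full reasoning chain plus final answer,
    # so it learns to emit <think>...</think><answer>...</answer> outputs.
    target = f"{sample['reasoning']}\n{sample['answer']}"
    return prompt, target
```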
4. Reinforcement Learning and Reward-Structured Optimization
After supervised pretraining, the 3D-R1 system is further refined using a reinforcement learning framework known as Group Relative Policy Optimization (GRPO). Scene-30K’s outputs, owing to their strict structural template, support several reward signals designed to elevate both answer quality and reasoning reliability:
- Format Reward ($R_{\text{format}}$): binary reward (1 or 0), verifying that predictions adhere to the <think> ... </think> <answer> ... </answer> structure.
- Perception Reward ($R_{\text{per}}$): based on the IoU between predicted scene bounding boxes ($b_{\text{pred}}$) and ground-truth boxes ($b_{\text{gt}}$), i.e., $R_{\text{per}} = \mathrm{IoU}(b_{\text{pred}}, b_{\text{gt}})$.
- Semantic Similarity Reward ($R_{\text{sem}}$): calculated as the cosine similarity between CLIP-encoded answer embeddings, $R_{\text{sem}} = \cos\big(\mathrm{CLIP}(a_{\text{pred}}), \mathrm{CLIP}(a_{\text{gt}})\big)$.
By combining these rewards, the RL phase guides the model toward reasoning chains and answers that are both formally correct and semantically precise. This staged procedure ensures that Scene-30K not only serves as a direct data resource but as a structural anchor underpinning reward engineering in model optimization.
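A compact sketch of how the three reward signals could be computed and combined is given below; the axis-aligned box parameterization, the weighted-sum combination, and the weights are assumptions for illustration, and the answer embeddings are assumed to be CLIP-encoded upstream.

```python
import re
import numpy as np

def iou_3d(box_a: np.ndarray, box_b: np.ndarray) -> float:
    """Axis-aligned 3D IoU; each box is (xmin, ymin, zmin, xmax, ymax, zmax)."""
    lo = np.maximum(box_a[:3], box_b[:3])
    hi = np.minimum(box_a[3:], box_b[3:])
    inter = np.prod(np.clip(hi - lo, 0, None))
    vol_a = np.prod(box_a[3:] - box_a[:3])
    vol_b = np.prod(box_b[3:] - box_b[:3])
    return float(inter / (vol_a + vol_b - inter + 1e-8))

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

def total_reward(prediction: str, pred_box, gt_box,
                 pred_emb, gt_emb, w=(1.0, 1.0, 1.0)) -> float:
    """Combine format, perception, and semantic-similarity rewards.
    The weighted-sum combination and the weights `w` are assumptions."""
    well_formed = bool(re.search(r"<think>.*</think>\s*<answer>.*</answer>",
                                 prediction, re.S))
    r_format = 1.0 if well_formed else 0.0
    r_percep = iou_3d(np.asarray(pred_box, float), np.asarray(gt_box, float))
    r_sem = cosine(pred_emb, gt_emb)   # embeddings assumed CLIP-encoded upstream
    return w[0] * r_format + w[1] * r_percep + w[2] * r_sem
```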
5. Dynamic View Selection and Cross-Modal Alignment
Given that many 3D VLMs ultimately process 2D projections, Scene-30K-enabled pipelines incorporate a dynamic view selection strategy:
- For every 3D scene, multiple 2D image renderings are generated from various camera locations.
- For each rendered view $v_i$ and textual prompt $t$, the following scores are computed:
  - $S_{\text{text}}(v_i)$: text-to-3D relevance.
  - $S_{\text{3D}}(v_i)$: alignment of image features with 3D geometry.
  - $S_{\text{CLIP}}(v_i)$: CLIP-based image–text alignment.
- An overall view utility score is calculated as $S(v_i) = \alpha\, S_{\text{text}}(v_i) + \beta\, S_{\text{3D}}(v_i) + \gamma\, S_{\text{CLIP}}(v_i)$, where $\alpha$, $\beta$, $\gamma$ are learned weights.
- The top-N views with the highest $S(v_i)$ are retained for the core model, ensuring efficient and context-rich multi-view encoding (Huang et al., 31 Jul 2025).
This architecture, coupled with Scene-30K’s data design, enhances the VLM’s capability to leverage the most informative visual evidence for downstream scene reasoning.
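The following sketch illustrates the top-N selection step, assuming the three per-view scores have already been computed upstream; the equal default weights are placeholders, since 3D-R1 learns them.

```python
import numpy as np

def select_views(s_text: np.ndarray, s_geo: np.ndarray, s_clip: np.ndarray,
                 weights=(1.0, 1.0, 1.0), top_n: int = 4) -> np.ndarray:
    """Rank rendered views by a weighted utility score and keep the top-N.
    Per-view scores (text-to-3D relevance, image-geometry alignment, CLIP
    image-text similarity) are assumed to be computed upstream; the equal
    default weights are an assumption, as the actual weights are learned."""
    utility = (weights[0] * s_text
               + weights[1] * s_geo
               + weights[2] * s_clip)
    return np.argsort(-utility)[:top_n]   # indices of the most informative views

# Example: 8 candidate renderings of one scene, keep the 4 highest-utility views.
rng = np.random.default_rng(0)
best = select_views(rng.random(8), rng.random(8), rng.random(8))
```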
6. Experimental Outcomes and Benchmark Significance
Experiments integrating Scene-30K-centered initialization in 3D-R1 yield, on average, a 10% improvement across a variety of tasks, including 3D scene captioning, object captioning, visual grounding, question answering, and multi-step planning. Evaluation uses established metrics such as CIDEr, BLEU-4, METEOR, ROUGE-L for captioning and IoU-based measures for spatial reasoning and grounding.
This empirical performance is attributed to the dataset’s explicit inclusion of reasoning chains and its structural enforceability. Scene-30K thus operationalizes complex spatial-semantic reasoning in 3D environments—substantially advancing the performance and interpretability of contemporary 3D-VL models such as 3D-R1 (Huang et al., 31 Jul 2025).
7. Position Among Related Datasets
Scene-30K is similar in name to other “30K-scale” datasets, but is distinct in both modality and scope:
- It diverges from the Camera Scene Detection Dataset (CamSDD), which, although sometimes referred to as “Scene-30K” in downstream benchmarks, targets 2D photographic scene classification with a focus on real-world camera applications (Pouget et al., 2021).
- Mosaic3D-5.6M, by comparison, aggregates over 30K 3D scenes but emphasizes semantic segmentation via large-scale 3D mask–text pairings and does not provide stepwise reasoning chains (Lee et al., 4 Feb 2025).
The chief distinguishing attribute of Scene-30K is the formalized provision of question–reasoning–answer triples, calibrated through synthesis pipelines and rigorous post-generation filtering, to enable interpretable, multi-hop reasoning models in 3D vision–language contexts.
In summary, Scene-30K is a benchmark-scale synthetic dataset supporting advanced, interpretable reasoning for 3D scene understanding, serving as a foundational resource for cold-start training and subsequent reinforcement learning in state-of-the-art 3D vision–language models. Its unique synthesis and reward-oriented design contribute directly to significant empirical gains in spatial reasoning and unified 3D scene comprehension (Huang et al., 31 Jul 2025).