
Scene-30K: 3D Scene Reasoning Dataset

Updated 6 August 2025
  • Scene-30K is a synthetic Chain-of-Thought dataset that provides detailed intermediate reasoning steps for enhanced 3D scene understanding.
  • It integrates multi-view visual and textual features through automatic annotation and prompt-based generation to deliver structured reasoning triples.
  • Utilized in pretraining, Scene-30K improves performance by around 10% on tasks such as captioning, visual grounding, and spatial reasoning.

Scene-30K is a synthetically curated, high-quality Chain-of-Thought (CoT) dataset tailored for advancing 3D scene understanding, reasoning, and unified vision–language modeling. Introduced alongside the 3D-R1 model, it is positioned at the intersection of data-driven foundation modeling and step-wise reasoning in 3D environments, and it leverages automatic annotation technologies, established 3D-VL benchmarks, and advanced prompt-based generation engines.

1. Dataset Definition and Structure

Scene-30K is explicitly designed for machine reasoning in 3D visual–language tasks and represents a departure from conventional captioning or answer-only benchmarks by offering detailed intermediate logical steps. Each example comprises:

  • A scene identifier tied to a real or synthetic 3D environment.
  • An associated natural language question referencing the scene or its objects.
  • A machine-generated, multi-step reasoning chain, clearly demarcated via <think> ... </think> tags.
  • A final answer, enclosed within <answer> ... </answer> tags.

Quantitatively, Scene-30K spans approximately 1,500 reconstructed 3D scenes and nearly 33,000 annotated objects. Each sample encodes multimodal relational information paired with step-by-step language explanations, reflecting a structured template that is instrumental for training models to acquire not only correct answers but also interpretable, process-oriented outputs (Huang et al., 31 Jul 2025).
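To make the record structure concrete, the minimal sketch below parses the reasoning chain and final answer out of such a sample; the field names and the splitting helper are illustrative assumptions rather than artifacts released with Scene-30K.

```python
import re

# Hypothetical Scene-30K-style record; field names are illustrative only.
sample = {
    "scene_id": "scene0025_00",
    "question": "Which chair is closest to the window?",
    "response": (
        "<think>The scene contains three chairs. The window is on the north wall. "
        "Chair 2 sits roughly 0.5 m from that wall, while the others are near the door. "
        "Therefore chair 2 is closest to the window.</think>"
        "<answer>The chair next to the north-wall window (chair 2).</answer>"
    ),
}

def split_reasoning(response: str):
    """Separate the multi-step reasoning chain from the final answer."""
    think = re.search(r"<think>(.*?)</think>", response, flags=re.S)
    answer = re.search(r"<answer>(.*?)</answer>", response, flags=re.S)
    return (
        think.group(1).strip() if think else None,
        answer.group(1).strip() if answer else None,
    )

reasoning, answer = split_reasoning(sample["response"])
print("Reasoning:", reasoning)
print("Answer:", answer)
```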

2. Data Synthesis and Generation Methodology

The dataset is assembled by synthesizing examples from pre-existing 3D vision–language sources, including ScanQA, ScanRefer, Nr3D, and SceneVerse. The creation pipeline includes:

  • Parsing original scene data (point clouds and multi-view renderings).
  • Employing a 3D scene description generator (a pre-trained vision–language model) to summarize the scene’s objects, spatial arrangements, and global layout into concise textual form.
  • Feeding the scene description and a relevant question into Gemini 2.5 Pro alongside a specialized prompt. Gemini 2.5 Pro then outputs a structured, multi-step reasoning segment and a final succinct answer.
  • Applying an automated filtering and quality control layer: examples must follow strict output formatting, reach adequate sequence length, exhibit multi-step reasoning, and pass a consistency check (using normalized Levenshtein similarity between a regenerated answer and the original).

This process yields an initial 35,000 examples; after filtering for reasoning depth and output validity, 30,000 high-quality CoT samples remain.
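As a concrete illustration of this quality-control stage, the sketch below checks output format, sequence length, multi-step structure, and answer consistency via normalized Levenshtein similarity; the thresholds and helper names are assumptions, not values reported for Scene-30K.

```python
import re

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def normalized_similarity(a: str, b: str) -> float:
    """Normalized Levenshtein similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def keep_sample(response: str, regenerated_answer: str,
                min_tokens: int = 50, min_steps: int = 2,
                sim_threshold: float = 0.8) -> bool:
    """Illustrative filter: format, length, multi-step reasoning, consistency.

    The thresholds are placeholders, not values reported for Scene-30K."""
    think = re.search(r"<think>(.*?)</think>", response, flags=re.S)
    answer = re.search(r"<answer>(.*?)</answer>", response, flags=re.S)
    if not (think and answer):                          # strict output formatting
        return False
    reasoning = think.group(1).strip()
    if len(reasoning.split()) < min_tokens:             # adequate sequence length
        return False
    # crude multi-step check: at least `min_steps` sentences in the chain
    if len(re.split(r"(?<=[.!?])\s+", reasoning)) < min_steps:
        return False
    # consistency check: regenerated answer must closely match the original one
    return normalized_similarity(answer.group(1).strip(),
                                 regenerated_answer.strip()) >= sim_threshold
```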

3. Function and Integration in Model Training

Scene-30K is deployed as the cold-start, supervised “pretraining” dataset for the 3D-R1 foundation model. During this phase, the model acquires the dual ability to (a) solve 3D visual–language reasoning tasks and (b) generate structured intermediate reasoning that is both process- and answer-oriented. The dataset’s composition renders it suitable for tasks demanding explicit multi-hop reasoning about spatial, semantic, and relational aspects of complex 3D scenes.

This cold-start initialization is critical: models trained in this fashion not only converge to higher accuracy on downstream 3D-VL tasks but also internalize the ability to explicate reasoning, providing more interpretable outputs during inference.

4. Reinforcement Learning and Reward-Structured Optimization

After supervised pretraining, the 3D-R1 system is further refined using a reinforcement learning framework known as Group Relative Policy Optimization (GRPO). Scene-30K’s outputs, owing to their strict structural template, support several reward signals designed to elevate both answer quality and reasoning reliability:

  • Format Reward ($R_\mathrm{Format}$): a binary reward (1 or 0) verifying that predictions adhere to the <think> ... </think> <answer> ... </answer> structure.
  • Perception Reward ($R_p$): based on the IoU between the predicted scene bounding box ($b^*$) and the ground truth ($b$), $R_p = \mathrm{IoU}(b, b^*)$.
  • Semantic Similarity Reward ($R_\mathrm{similarity}$): calculated as the cosine similarity between CLIP-encoded answer embeddings, $R_\mathrm{similarity} = \operatorname{cos\_sim}(\mathrm{CLIP_{text}}(\hat{a}), \mathrm{CLIP_{text}}(a))$.

By combining these rewards, the RL phase guides the model toward reasoning chains and answers that are both formally correct and semantically precise. This staged procedure ensures that Scene-30K serves not only as a direct data resource but also as a structural anchor underpinning reward engineering in model optimization.
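A minimal sketch of how these three reward signals could be combined is shown below; the axis-aligned box representation, the equal default weighting, and the use of pre-computed CLIP text embeddings are illustrative assumptions rather than details confirmed for 3D-R1’s GRPO setup.

```python
import re
import numpy as np

def format_reward(prediction: str) -> float:
    """1.0 if the output follows the <think>...</think><answer>...</answer> template, else 0.0."""
    pattern = r"<think>.+?</think>\s*<answer>.+?</answer>"
    return 1.0 if re.fullmatch(pattern, prediction.strip(), flags=re.S) else 0.0

def iou_3d(box_a: np.ndarray, box_b: np.ndarray) -> float:
    """IoU of two axis-aligned 3D boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    lo = np.maximum(box_a[:3], box_b[:3])
    hi = np.minimum(box_a[3:], box_b[3:])
    inter = float(np.prod(np.clip(hi - lo, 0.0, None)))
    vol_a = float(np.prod(box_a[3:] - box_a[:3]))
    vol_b = float(np.prod(box_b[3:] - box_b[:3]))
    union = vol_a + vol_b - inter
    return inter / union if union > 0 else 0.0

def semantic_similarity_reward(emb_pred: np.ndarray, emb_gt: np.ndarray) -> float:
    """Cosine similarity between (pre-computed) CLIP text embeddings of the two answers."""
    return float(np.dot(emb_pred, emb_gt)
                 / (np.linalg.norm(emb_pred) * np.linalg.norm(emb_gt)))

def total_reward(prediction, pred_box, gt_box, emb_pred, emb_gt,
                 weights=(1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the three signals; the weighting here is illustrative."""
    w_f, w_p, w_s = weights
    return (w_f * format_reward(prediction)
            + w_p * iou_3d(pred_box, gt_box)
            + w_s * semantic_similarity_reward(emb_pred, emb_gt))
```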

5. Dynamic View Selection and Cross-Modal Alignment

Given that many 3D VLMs ultimately process 2D projections, Scene-30K-enabled pipelines incorporate a dynamic view selection strategy:

  • For every 3D scene, multiple 2D image renderings are generated from various camera locations.
  • For each rendered view $v$ and textual prompt $t$, the following scores are computed:
    • $S_{\mathrm{Text}\to 3D}(v, t)$: text-to-3D relevance.
    • $S_{\mathrm{Image}\to 3D}(v, t)$: alignment of image features with 3D geometry.
    • $S_\mathrm{CLIP}(v, t)$: CLIP-based image–text alignment.
  • An overall view utility score is calculated as $U(v) = w_t \cdot S_{\mathrm{Text}\to 3D}(v, t) + w_c \cdot S_{\mathrm{Image}\to 3D}(v, t) + w_\mathrm{clip} \cdot S_\mathrm{CLIP}(v, t)$, where $w_t$, $w_c$, and $w_\mathrm{clip}$ are learned weights.
  • The top-$N$ views with the highest $U(v)$ are retained for the core model, ensuring efficient and context-rich multi-view encoding (Huang et al., 31 Jul 2025); a selection sketch follows this list.
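A minimal selection sketch, assuming the three relevance scorers are available as callables and using placeholder weights in place of the learned ones:

```python
import numpy as np

def select_views(views, text_prompt, score_text3d, score_image3d, score_clip,
                 weights=(0.4, 0.3, 0.3), top_n=4):
    """Rank rendered views by the utility U(v) and keep the top-N.

    score_* are (view, text) -> float callables standing in for the text-to-3D,
    image-to-3D, and CLIP scorers; the weights are placeholders, not learned values.
    """
    w_t, w_c, w_clip = weights
    utilities = np.array([
        w_t * score_text3d(v, text_prompt)
        + w_c * score_image3d(v, text_prompt)
        + w_clip * score_clip(v, text_prompt)
        for v in views
    ])
    keep = np.argsort(-utilities)[:top_n]   # indices of the highest-utility views
    return [views[i] for i in keep], utilities[keep]
```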

This architecture, coupled with Scene-30K’s data design, enhances the VLM’s capability to leverage the most informative visual evidence for downstream scene reasoning.

6. Experimental Outcomes and Benchmark Significance

Experiments integrating Scene-30K-centered initialization in 3D-R1 yield, on average, a 10% improvement across a variety of tasks, including 3D scene captioning, object captioning, visual grounding, question answering, and multi-step planning. Evaluation uses established metrics such as CIDEr, BLEU-4, METEOR, and ROUGE-L for captioning, and IoU-based measures for spatial reasoning and grounding.

This empirical performance is attributed to the dataset’s explicit inclusion of reasoning chains and its structural enforceability. Scene-30K thus operationalizes complex spatial-semantic reasoning in 3D environments—substantially advancing the performance and interpretability of contemporary 3D-VL models such as 3D-R1 (Huang et al., 31 Jul 2025).

7. Relation to Similarly Named Datasets

Scene-30K shares its naming convention with other “30K-scale” datasets, but it is distinct in both modality and scope:

  • It diverges from the Camera Scene Detection Dataset (CamSDD), which, although sometimes referred to as “Scene-30K” in downstream benchmarks, targets 2D photographic scene classification with a focus on real-world camera applications (Pouget et al., 2021).
  • Mosaic3D-5.6M, by comparison, aggregates over 30K 3D scenes but emphasizes semantic segmentation via large-scale 3D mask–text pairings and does not provide stepwise reasoning chains (Lee et al., 4 Feb 2025).

The chief distinguishing attribute of Scene-30K is the formalized provision of question–reasoning–answer triples, calibrated through synthesis pipelines and rigorous post-generation filtering, to enable interpretable, multi-hop reasoning models in 3D vision–language contexts.


In summary, Scene-30K represents a benchmark-scale synthetic dataset supporting advanced, interpretable reasoning for 3D scene understanding, serving as a foundational resource for cold-start training and subsequent reinforcement learning in state-of-the-art 3D vision–language models. Its unique synthesis and reward-oriented design contribute directly to significant empirical gains in spatial reasoning and unified 3D scene comprehension (Huang et al., 31 Jul 2025).