
Scene-30K: 3D Scene Reasoning Dataset

Updated 6 August 2025
  • Scene-30K is a synthetic Chain-of-Thought dataset that provides detailed intermediate reasoning steps for enhanced 3D scene understanding.
  • It integrates multi-view visual and textual features through automatic annotation and prompt-based generation to deliver structured reasoning triples.
  • Utilized in pretraining, Scene-30K improves performance by around 10% on tasks such as captioning, visual grounding, and spatial reasoning.

The Scene-30K dataset is a synthetically curated, high-quality Chain-of-Thought (CoT) dataset tailored for advancing 3D scene understanding, reasoning, and unified vision–language modeling. Introduced alongside the 3D-R1 model, it sits at the intersection of data-driven foundation modeling and step-wise reasoning in 3D environments, and it leverages automatic annotation technologies, established 3D vision–language (3D-VL) benchmarks, and advanced prompt-based generation engines.

1. Dataset Definition and Structure

Scene-30K is explicitly designed for machine reasoning in 3D visual–language tasks and represents a departure from conventional captioning or answer-only benchmarks by offering detailed intermediate logical steps. Each example comprises:

  • A scene identifier tied to a real or synthetic 3D environment.
  • An associated natural language question referencing the scene or its objects.
  • A machine-generated, multi-step reasoning chain, clearly demarcated via <think> ... </think> tags.
  • A final answer, enclosed within <answer> ... </answer> tags.

Quantitatively, Scene-30K comprises approximately 1,500 reconstructed 3D scenes and nearly 33,000 annotated objects. Each sample encodes multimodal relational information paired with step-by-step language explanations, reflecting a structured template that is instrumental for training models to produce not only correct answers but also interpretable, process-oriented outputs (Huang et al., 31 Jul 2025).
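For concreteness, the following is a minimal sketch of how a single Scene-30K-style sample might be serialized and parsed. The field names and the example scene are hypothetical; only the <think> ... </think> and <answer> ... </answer> structure is specified by the source.

```python
import re

# Hypothetical serialization of a single Scene-30K-style sample; the field
# names are illustrative, only the <think>/<answer> template comes from the source.
sample = {
    "scene_id": "scene0000_00",
    "question": "Which object is closest to the door?",
    "response": (
        "<think>Step 1: locate the door along the north wall. "
        "Step 2: compare the distances of nearby objects to the door. "
        "Step 3: the coat rack is nearest.</think>"
        "<answer>the coat rack</answer>"
    ),
}

def parse_response(text: str) -> tuple[str, str]:
    """Split a response into its reasoning chain and final answer."""
    think = re.search(r"<think>(.*?)</think>", text, re.S)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.S)
    return (think.group(1).strip() if think else "",
            answer.group(1).strip() if answer else "")

reasoning, answer = parse_response(sample["response"])
print(answer)  # -> "the coat rack"
```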

2. Data Synthesis and Generation Methodology

The dataset is assembled by synthesizing examples from pre-existing 3D vision–language sources, including ScanQA, ScanRefer, Nr3D, and SceneVerse. The creation pipeline includes:

  • Parsing original scene data (point clouds and multi-view renderings).
  • Employing a 3D scene description generator (a pre-trained vision–language model) to summarize the scene’s objects, spatial arrangements, and global layout into concise textual form.
  • Feeding the scene description and a relevant question into Gemini 2.5 Pro alongside a specialized prompt; Gemini 2.5 Pro then outputs a structured, multi-step reasoning segment and a final succinct answer.
  • Applying an automated filtering and quality-control layer: examples must follow strict output formatting, reach adequate sequence length, exhibit multi-step reasoning, and pass a consistency check (using normalized Levenshtein similarity between a regenerated answer and the original).

This process yields an initial 35,000 examples, which, after filtering for reasoning depth and output validity, are reduced to 30,000 high-quality CoT samples.
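The filtering stage can be illustrated with a small sketch. The checks below mirror the four criteria described above (format, length, multi-step reasoning, and Levenshtein-based consistency), but the specific thresholds are assumptions made for illustration rather than values reported in the paper.

```python
import re

def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming edit distance."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]

def normalized_similarity(a: str, b: str) -> float:
    """1 - normalized edit distance, in [0, 1]."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

def keep_sample(response: str, regenerated_answer: str,
                min_tokens: int = 64, min_steps: int = 2,
                sim_threshold: float = 0.8) -> bool:
    """Apply the four filtering criteria; all thresholds are assumed values."""
    # 1. Strict output format: <think>...</think> followed by <answer>...</answer>.
    m = re.fullmatch(r"\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*",
                     response, re.S)
    if m is None:
        return False
    think, answer = m.group(1), m.group(2)
    # 2. Adequate sequence length of the reasoning chain.
    if len(think.split()) < min_tokens:
        return False
    # 3. Multi-step reasoning: require several sentence- or line-level steps.
    if len([s for s in re.split(r"[.\n]", think) if s.strip()]) < min_steps:
        return False
    # 4. Consistency between the original and a regenerated answer.
    return normalized_similarity(answer.strip().lower(),
                                 regenerated_answer.strip().lower()) >= sim_threshold
```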

3. Function and Integration in Model Training

Scene-30K is deployed as the cold-start, supervised “pretraining” dataset for the 3D-R1 foundation model. During this phase, the model acquires the dual ability to (a) solve 3D visual–language reasoning tasks and (b) generate structured intermediate reasoning that is both process- and answer-oriented. The dataset’s composition renders it suitable for tasks demanding explicit multi-hop reasoning about spatial, semantic, and relational aspects of complex 3D scenes.

This cold-start initialization is critical: models trained in this fashion not only converge to higher accuracy on downstream 3D-VL tasks but also internalize the ability to explicate reasoning, providing more interpretable outputs during inference.
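A minimal, text-only sketch of such a cold-start supervised step is shown below, assuming a generic Hugging Face causal LM checkpoint; the actual 3D-R1 model additionally conditions on 3D scene features, which are omitted here, and the prompt/target strings are invented for illustration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Generic checkpoint used purely for illustration; 3D-R1 itself is multimodal.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = ("Scene: a small office with a desk, a door, and a coat rack. "
          "Question: which object is closest to the door?")
target = ("<think>Step 1: locate the door. Step 2: the coat rack sits beside it."
          "</think><answer>the coat rack</answer>")

# Standard next-token cross-entropy over prompt + structured target; masking the
# prompt tokens out of the loss is a common (assumed) choice, not a paper detail.
ids = tok(prompt + " " + target, return_tensors="pt").input_ids
prompt_len = tok(prompt + " ", return_tensors="pt").input_ids.shape[1]
labels = ids.clone()
labels[:, :prompt_len] = -100  # ignore loss on the prompt portion

loss = model(input_ids=ids, labels=labels).loss
loss.backward()  # one supervised "cold-start" step (optimizer update omitted)
```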

4. Reinforcement Learning and Reward-Structured Optimization

After supervised pretraining, the 3D-R1 system is further refined using a reinforcement learning framework known as Group Relative Policy Optimization (GRPO). Scene-30K’s outputs, owing to their strict structural template, support several reward signals designed to elevate both answer quality and reasoning reliability:

  • Format Reward ($R_\mathrm{Format}$): binary reward (1 or 0), verifying that predictions adhere to the <think> ... </think> <answer> ... </answer> structure.
  • Perception Reward ($R_p$): based on the IoU between the predicted scene bounding box $b^*$ and the ground truth $b$, $R_p = \mathrm{IoU}(b, b^*)$.
  • Semantic Similarity Reward ($R_\mathrm{similarity}$): calculated as the cosine similarity between CLIP-encoded answer embeddings, $R_\mathrm{similarity} = \operatorname{cos\_sim}(\mathrm{CLIP_{text}}(\hat{a}), \mathrm{CLIP_{text}}(a))$.

By combining these rewards, the RL phase guides the model toward reasoning chains and answers that are both formally correct and semantically precise. This staged procedure ensures that Scene-30K not only serves as a direct data resource but as a structural anchor underpinning reward engineering in model optimization.
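A rough sketch of the three reward terms follows. The CLIP checkpoint name and the axis-aligned box representation are assumptions made for illustration; the source specifies only the reward definitions themselves.

```python
import re
import torch
from transformers import CLIPModel, CLIPTokenizer

def format_reward(response: str) -> float:
    """1.0 if the output follows <think>...</think><answer>...</answer>, else 0.0."""
    pattern = r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*"
    return 1.0 if re.fullmatch(pattern, response, re.S) else 0.0

def iou_reward(pred_box, gt_box) -> float:
    """Axis-aligned 3D IoU; boxes given as (xmin, ymin, zmin, xmax, ymax, zmax)."""
    lo = [max(pred_box[i], gt_box[i]) for i in range(3)]
    hi = [min(pred_box[i + 3], gt_box[i + 3]) for i in range(3)]
    inter = 1.0
    for l, h in zip(lo, hi):
        inter *= max(0.0, h - l)
    def vol(b):
        return (b[3] - b[0]) * (b[4] - b[1]) * (b[5] - b[2])
    union = vol(pred_box) + vol(gt_box) - inter
    return inter / union if union > 0 else 0.0

# CLIP text encoder for the semantic-similarity reward; the checkpoint is assumed.
_clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")

def similarity_reward(pred_answer: str, gt_answer: str) -> float:
    """Cosine similarity between CLIP text embeddings of predicted and reference answers."""
    with torch.no_grad():
        feats = _clip.get_text_features(**_tok([pred_answer, gt_answer],
                                               padding=True, return_tensors="pt"))
    feats = torch.nn.functional.normalize(feats, dim=-1)
    return float(feats[0] @ feats[1])
```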

5. Dynamic View Selection and Cross-Modal Alignment

Given that many 3D VLMs ultimately process 2D projections, Scene-30K-enabled pipelines incorporate a dynamic view selection strategy:

  • For every 3D scene, multiple 2D image renderings are generated from various camera locations.

  • For each rendered view $v$ and textual prompt $t$, the following scores are computed:
    • $S_{\mathrm{Text}\to 3D}(v, t)$: text-to-3D relevance.
    • $S_{\mathrm{Image}\to 3D}(v, t)$: alignment of image features with 3D geometry.
    • $S_{\mathrm{CLIP}}(v, t)$: CLIP-based image–text alignment.
  • An overall view utility score is calculated as $U(v) = w_t \cdot S_{\mathrm{Text}\to 3D}(v, t) + w_c \cdot S_{\mathrm{Image}\to 3D}(v, t) + w_{\mathrm{clip}} \cdot S_{\mathrm{CLIP}}(v, t)$, where $w_t$, $w_c$, and $w_{\mathrm{clip}}$ are learned weights.
  • The top-$N$ views with the highest $U(v)$ are retained for the core model, ensuring efficient and context-rich multi-view encoding (Huang et al., 31 Jul 2025).

This architecture, coupled with Scene-30K’s data design, enhances the VLM’s capability to leverage the most informative visual evidence for downstream scene reasoning.
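The selection step itself reduces to scoring and ranking candidate views. The sketch below assumes the three relevance scores are precomputed per view and uses placeholder weights; in the actual pipeline the weights are learned.

```python
def select_top_views(views, weights=(0.4, 0.3, 0.3), top_n=4):
    """
    Rank rendered views by the combined utility
        U(v) = w_t * S_Text->3D(v, t) + w_c * S_Image->3D(v, t) + w_clip * S_CLIP(v, t)
    and keep the top-N. Each entry of `views` is assumed to hold the three
    precomputed relevance scores; the weights here are placeholders.
    """
    w_t, w_c, w_clip = weights
    scored = [
        (w_t * v["s_text_3d"] + w_c * v["s_image_3d"] + w_clip * v["s_clip"], v)
        for v in views
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [v for _, v in scored[:top_n]]

# Example: keep the 2 most informative of 3 candidate renderings.
views = [
    {"name": "view_front", "s_text_3d": 0.71, "s_image_3d": 0.55, "s_clip": 0.62},
    {"name": "view_top",   "s_text_3d": 0.40, "s_image_3d": 0.48, "s_clip": 0.35},
    {"name": "view_side",  "s_text_3d": 0.66, "s_image_3d": 0.60, "s_clip": 0.58},
]
print([v["name"] for v in select_top_views(views, top_n=2)])  # -> ['view_front', 'view_side']
```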

6. Experimental Outcomes and Benchmark Significance

Experiments integrating Scene-30K-centered initialization in 3D-R1 yield, on average, a 10% improvement across a variety of tasks, including 3D scene captioning, object captioning, visual grounding, question answering, and multi-step planning. Evaluation uses established metrics such as CIDEr, BLEU-4, METEOR, and ROUGE-L for captioning, and IoU-based measures for spatial reasoning and grounding.

This empirical performance is attributed to the dataset’s explicit inclusion of reasoning chains and its structural enforceability. Scene-30K thus operationalizes complex spatial-semantic reasoning in 3D environments—substantially advancing the performance and interpretability of contemporary 3D-VL models such as 3D-R1 (Huang et al., 31 Jul 2025).

7. Relation to Similarly Named Datasets

Scene-30K shares its naming convention with other “30K-scale” datasets, but is distinct in both modality and scope:

  • It diverges from the Camera Scene Detection Dataset (CamSDD), which, although sometimes referred to as “Scene-30K” in downstream benchmarks, targets 2D photographic scene classification with a focus on real-world camera applications (Pouget et al., 2021).
  • Mosaic3D-5.6M, by comparison, aggregates over 30K 3D scenes but emphasizes semantic segmentation via large-scale 3D mask–text pairings and does not provide stepwise reasoning chains (Lee et al., 4 Feb 2025).

The chief distinguishing attribute of Scene-30K is the formalized provision of question–reasoning–answer triples, calibrated through synthesis pipelines and rigorous post-generation filtering, to enable interpretable, multi-hop reasoning models in 3D vision–language contexts.


In summary, Scene-30K represents a benchmark-scale synthetic dataset supporting advanced, interpretable reasoning for 3D scene understanding, serving as a foundational resource for cold-start training and subsequent reinforcement learning in state-of-the-art 3D vision–language models. Its unique synthesis and reward-oriented design contribute directly to significant empirical gains in spatial reasoning and unified 3D scene comprehension (Huang et al., 31 Jul 2025).
