SceneGPT: 3D Scene Modeling with LLMs

Updated 10 February 2026
  • SceneGPT is a framework that uses large language models for 3D scene understanding, generation, and interactive manipulation through structured scene graphs and tokenized representations.
  • It employs techniques like textual serialization of scene graphs and chain-of-thought querying, achieving high accuracy in spatial reasoning and geometric comparisons without explicit 3D supervision.
  • The paradigm supports autoregressive scene synthesis and modular interactive editing, enabling precise spatial arrangement, collision avoidance, and flexible 3D scene assembly in applications ranging from interior layout to industrial planning.

SceneGPT refers to a class of LLM-driven frameworks for 3D scene understanding, generation, and manipulation, leveraging LLMs for spatial reasoning, compositional synthesis, and interactive or task-driven scene applications. Recent approaches under the “SceneGPT” moniker span 3D spatial reasoning with pre-trained LLMs, end-to-end 3D scene assembly, graph-based spatial understanding, and practical scene construction. The core paradigm exploits the linguistic knowledge and sequence modeling of LLMs to overcome the data bottlenecks and task rigidity of supervised 3D vision models, enabling broader generalization, richer reasoning, and flexible spatial manipulation.

1. Foundational Paradigm: LLMs for 3D Scene Understanding

SceneGPT, in its most canonical form (Chandhok, 2024), demonstrates that pre-trained LLMs, when provided with structured, text-based representations of 3D scenes (typically as scene graphs), can perform complex spatial reasoning without explicit 3D supervision. The essential workflow, illustrated in the sketch after this list, is:

  • Scene Graph Construction: Represent the environment as a scene graph $G=(V,E)$, where nodes encode object attributes (e.g., bounding box $(x_i, y_i, z_i, w_i, h_i, d_i)$, class tag, appearance/caption, color/material) and edges indicate spatial relations or semantic connections.
  • Textual Serialization: The graph is serialized to a JSON-structured prompt, which may contain per-object semantic and geometric descriptors.
  • LLM Chain-of-Thought Querying: The LLM (e.g., GPT-4) receives the scene JSON, a system prompt defining field semantics, and step-by-step "few-shot" reasoning demonstrations. The query can target diverse tasks such as geometric comparison ("Which is bigger: the couch or the pillow?"), spatial location ("Where is the lamp relative to the bed?"), or affordance inference ("Which object can hold water?").
  • Zero 3D Supervision: No fine-tuning or pre-training is performed on 3D data; task specialization arises solely from in-context chain-of-thought examples.
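A minimal sketch of this workflow in Python: the JSON schema, system prompt, and few-shot demonstration below are illustrative assumptions, not the paper's exact format.

```python
import json

# Sketch: serialize a scene graph to JSON, then assemble a chain-of-thought
# prompt for the LLM. All field names and the few-shot demonstration are
# illustrative assumptions, not the paper's exact schema.

scene_graph = {
    "objects": [
        {"id": 0, "class": "couch", "bbox": [1.0, 0.0, 0.4, 2.1, 0.9, 0.8],
         "caption": "a grey fabric couch", "material": "fabric"},
        {"id": 1, "class": "pillow", "bbox": [1.2, 0.3, 0.8, 0.4, 0.4, 0.15],
         "caption": "a small white pillow", "material": "cotton"},
    ],
    "relations": [{"subject": 1, "predicate": "on top of", "object": 0}],
}

SYSTEM_PROMPT = (
    "You are given a 3D scene as JSON. Each object's bbox is "
    "[x, y, z, w, h, d] in metres (centre plus extents). Reason step by step."
)

FEW_SHOT = (
    "Q: Which is bigger: object A (volume 0.1 m^3) or object B (0.5 m^3)?\n"
    "A: A's volume is 0.1 m^3 and B's is 0.5 m^3, so B is bigger.\n"
)

def build_prompt(question: str) -> str:
    """Assemble system prompt, few-shot demonstration, scene JSON, and query."""
    return "\n\n".join(
        [SYSTEM_PROMPT, FEW_SHOT, json.dumps(scene_graph, indent=2),
         f"Q: {question}"]
    )

print(build_prompt("Which is bigger: the couch or the pillow?"))
```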

Empirical results illustrate that this approach achieves high accuracy in object retrieval (≈95%), spatial reasoning (≈90%), and geometric comparison (≈92%) for moderate-sized scenes. Limitations include front-end detection/captioning errors and LLM context window bottlenecks for very large scenes.

2. 3D Scene Synthesis and Autoregressive Generation

SceneGPT paradigms extend to scene synthesis, where LLMs or transformer architectures are trained to generate 3D scenes that are physically plausible and compositionally valid.

  • CasaGPT: Cuboid Arrangement and Scene Assembly (Feng et al., 28 Apr 2025)
    • Cuboid Decomposition: Meshes are voxelized and merged into axis-aligned cuboids, greatly reducing mesh complexity. Each object is tokenized into an entity token and a sequence of cuboid tokens (containing class, translation, size, and rotation).
    • Autoregressive Transformer Decoder: The model stacks 8 transformer decoder layers (Llama-3 backbone) to predict tokens sequentially, conditioned on previous tokens and features from a learned floorplan encoder.
    • Training and Fine-tuning: Initial training uses teacher forcing on the 3D-FRONT (or cleaned 3DFRONT-NC) dataset with logistic mixture likelihoods. A critical procedure is rejection-sampling-based fine-tuning (see the filtering sketch after this list): samples are generated, filtered by a cuboid intersection-over-union (IoU) threshold, and the model is iteratively fine-tuned on non-intersecting scenes.
    • Dataset Refinement: A novel collision-based optimization increases the non-intersection rate (NIRate) from 80.6% to 90.6% in bedroom scenes.
    • Results: CasaGPT sets a new state-of-the-art by reducing cuboid IoU (CIoU) and improving both NIRate and FID compared to ATISS and DiffuScene.
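A minimal sketch of the rejection-sampling filter, assuming each sampled scene arrives as a flat list of axis-aligned cuboids in centre-plus-extents form; the threshold value and the commented sample_scene/fine_tune hooks are hypothetical.

```python
import itertools

# Sketch of CasaGPT-style rejection sampling: keep only sampled scenes whose
# pairwise cuboid overlap stays under an IoU threshold. The scene format and
# threshold are illustrative assumptions.

Cuboid = tuple  # (cx, cy, cz, w, h, d): centre plus full extents

def cuboid_iou(a: Cuboid, b: Cuboid) -> float:
    """Intersection-over-union of two axis-aligned cuboids."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i] - a[i + 3] / 2, b[i] - b[i + 3] / 2)
        hi = min(a[i] + a[i + 3] / 2, b[i] + b[i + 3] / 2)
        if hi <= lo:
            return 0.0  # no overlap along this axis
        inter *= hi - lo
    vol_a = a[3] * a[4] * a[5]
    vol_b = b[3] * b[4] * b[5]
    return inter / (vol_a + vol_b - inter)

def keep_scene(cuboids: list, iou_threshold: float = 0.05) -> bool:
    """Accept a sampled scene only if no cuboid pair overlaps too much."""
    return all(cuboid_iou(a, b) <= iou_threshold
               for a, b in itertools.combinations(cuboids, 2))

# Hypothetical fine-tuning loop around the filter:
# accepted = [s for s in (sample_scene() for _ in range(1000)) if keep_scene(s)]
# fine_tune(model, accepted)
```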

3. Scene Graph–Centric Spatial Reasoning

SceneGPT techniques leverage scene graphs to mediate both scene understanding and composition.

  • GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs (Gao et al., 2023)
    • LLM-powered Scene Graph Extraction: A prompt to ChatGPT instructs extraction of all entities and their pairwise relationships from natural language, forming nodes (objects + attributes) and edges (relations).
    • SDF-based Geometry: Each node is encoded as a separate signed distance field (SDF); edge prompts encode relationship triplets for text-conditioned diffusion-based geometry synthesis.
    • Interpenetration Constraints: Strong penalties for multiple objects “claiming” the same physical space enforce compositional disentanglement (see the penalty sketch after this list).
    • Score-Distillation Sampling: Volume renders of objects, edge pairs, and the global scene are matched to frozen diffusion/text-image models via a multi-level SDS loss.
    • The approach allows for faithful translation of scene-graph semantics into 3D geometry, addressing the limitations of holistic text-to-3D synthesis.
  • GPT4SGG: SGG from Holistic and Region-Specific Narratives (Chen et al., 2023)
    • Uses detected object bounding boxes and captions (holistic and region-specific, via BLIP-2) to supply global and localized context to the LLM.
    • The LLM (e.g., GPT-4) outputs structured scene graphs by reasoning over this multi-granular textual input (prompt assembly is sketched after this list).
    • Quantitatively, this approach increases recall on long-tail predicates by up to 2×, demonstrating the value of LLM reasoning across partial views.
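For GraphDreamer's interpenetration constraint, a minimal sketch under an assumed soft-occupancy formulation; the sigmoid sharpness and squared-excess penalty are illustrative choices, not the paper's exact loss.

```python
import numpy as np

# Sketch of an SDF interpenetration penalty: sample points claimed as
# "inside" by more than one object's SDF incur a penalty. Negative signed
# distance means inside; beta controls the softness of the indicator.

def occupancy(sdf_values: np.ndarray, beta: float = 50.0) -> np.ndarray:
    """Soft inside/outside indicator from signed distances."""
    return 1.0 / (1.0 + np.exp(beta * sdf_values))

def interpenetration_penalty(sdfs_at_points: np.ndarray) -> float:
    """sdfs_at_points: (num_objects, num_points) signed distances at samples.
    Penalizes points whose summed occupancy across objects exceeds 1."""
    occ = occupancy(sdfs_at_points)                      # (O, P) in [0, 1]
    excess = np.clip(occ.sum(axis=0) - 1.0, 0.0, None)   # overlap beyond one
    return float((excess ** 2).mean())

# Two overlapping spheres (radius 0.5, centres 0.2 apart) => nonzero penalty.
pts = np.random.uniform(-1, 1, size=(1000, 3))
sdf_a = np.linalg.norm(pts - np.array([0.1, 0.0, 0.0]), axis=1) - 0.5
sdf_b = np.linalg.norm(pts - np.array([-0.1, 0.0, 0.0]), axis=1) - 0.5
print(interpenetration_penalty(np.stack([sdf_a, sdf_b])))
```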
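
And a sketch of GPT4SGG-style prompt assembly, packing holistic and region-specific captions into one textual context; the detection fields and output format are illustrative assumptions.

```python
# Sketch: pack detected boxes, a holistic caption, and region captions into
# one LLM context for scene-graph generation. All fields are illustrative.

detections = [
    {"id": 0, "label": "person", "box": [34, 50, 120, 300]},
    {"id": 1, "label": "horse", "box": [100, 80, 380, 310]},
]
holistic_caption = "A person is riding a horse in a field."
region_captions = {0: "a person wearing a helmet", 1: "a brown horse"}

def build_sgg_prompt() -> str:
    lines = ["Image-level caption: " + holistic_caption, "Objects:"]
    for det in detections:
        lines.append(f"  #{det['id']} {det['label']} at {det['box']}: "
                     f"{region_captions[det['id']]}")
    lines.append('Task: output scene-graph triplets as JSON, e.g. '
                 '[{"subject": 0, "predicate": "riding", "object": 1}].')
    return "\n".join(lines)

print(build_sgg_prompt())
```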

4. Vision-LLMs and 3D Video/Scene Understanding

  • GPT4Scene: Visual Prompting for 3D from Video (Qi et al., 2 Jan 2025)
    • Identifies that vanilla VLMs (e.g., LLaVA, GPT-4o) lack global-to-local scene correspondence, limiting 3D spatial reasoning.
    • BEV + Consistent Marker Paradigm: Videos are reconstructed into a fused point cloud, from which a BEV is rendered; object IDs are overlaid consistently in both the frames and the BEV (see the marker sketch after this list). This augmented input enables explicit alignment of scene entities.
    • Visual Prompt Injection: Marked frames and the BEV image are concatenated and input to the VLM; a single cross-entropy objective suffices for training.
    • Results: Zero-shot prompting improves GPT-4o’s performance on ScanQA BLEU-1 by 4.7 points. Fine-tuning with 165k samples enables Qwen2-VL-7B to achieve up to 43.4 BLEU-1 and 90.9 CIDEr on ScanQA—state-of-the-art for 3D QA, captioning, and grounding.
    • Intriguing Finding: Once trained with BEV+marker prompting, the model maintains strong spatial reasoning even without markers, showing intrinsic acquisition of 3D understanding.
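A minimal sketch of the consistent-marker idea using Pillow, assuming the 2D pixel locations of each object in both views are already known (the reconstruction and projection pipeline is omitted).

```python
from PIL import Image, ImageDraw

# Sketch: draw the same object ID at an object's location in a video frame
# and in the BEV render, so the VLM can align local and global views. The
# coordinates below are placeholders, not outputs of the paper's pipeline.

def draw_markers(img: Image.Image, locations: dict) -> Image.Image:
    """locations: {object_id: (u, v)} pixel coordinates for this view."""
    draw = ImageDraw.Draw(img)
    for obj_id, (u, v) in locations.items():
        draw.ellipse([u - 8, v - 8, u + 8, v + 8], outline="red", width=2)
        draw.text((u + 10, v - 6), str(obj_id), fill="red")
    return img

# The same IDs appear in a frame and in the BEV (placeholder coordinates).
frame = draw_markers(Image.new("RGB", (320, 240), "white"),
                     {0: (60, 120), 1: (200, 90)})
bev = draw_markers(Image.new("RGB", (256, 256), "white"),
                   {0: (40, 200), 1: (180, 60)})
# frame and bev would then be concatenated as the VLM's visual prompt.
```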

5. Interactive and Modular Scene Construction

SceneGPT principles are applied to user-facing scene creation tools that tightly couple LLM modularity and graphical controls.

  • MoGraphGPT: Modular LLMs for Interactive 2D Scenes (Ye et al., 7 Feb 2025)
    • Assigns each scene element its own “LLM module”, responsible for element-level code and state, overseen by a central interaction LLM.
    • The GUI supports spatial manipulation and graphical proxy assignment (points, curves, regions), and auto-generates UI sliders for numeric scene parameters, making them immediately actionable via code.
    • Quantitative results demonstrate substantial reductions in prompt count, edit-cycle time, and subjective user effort over non-modular code-generation baselines, supporting the value of LLM modularity for interactive scene editing and control.
  • SceneGenAgent: Agent-Based Industrial Scene Assembly (Xia et al., 2024)
    • A pipeline for precise, code-based 3D industrial scene generation, with explicit layout planning, constraint checking (absolute, relative, and collision-avoidance; sketched after this list), iterative error correction, and final rendering via C# and Siemens Process Simulate.
    • SceneInstruct, a large fine-tuning dataset of planning/verification/revision sub-dialogues, enables open LLMs (Llama3.1-70B) to approach closed-model performance (78.5% pass@1).
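A minimal sketch of such layout constraint checks in Python (the actual system emits C# for Siemens Process Simulate; the footprint-rectangle model and field names below are illustrative assumptions).

```python
from dataclasses import dataclass

# Sketch of absolute, relative, and collision-avoidance checks over 2D
# floor-plane footprints. A real pipeline would operate on full 3D layouts.

@dataclass
class Placement:
    name: str
    x: float   # centre position on the floor plane, metres
    y: float
    w: float   # footprint width, metres
    d: float   # footprint depth, metres

def inside_room(p: Placement, room_w: float, room_d: float) -> bool:
    """Absolute constraint: footprint fully inside the room."""
    return (p.x - p.w / 2 >= 0 and p.x + p.w / 2 <= room_w and
            p.y - p.d / 2 >= 0 and p.y + p.d / 2 <= room_d)

def left_of(a: Placement, b: Placement) -> bool:
    """Relative constraint: a sits entirely to the left of b along x."""
    return a.x + a.w / 2 <= b.x - b.w / 2

def collide(a: Placement, b: Placement) -> bool:
    """Collision avoidance: axis-aligned footprint overlap."""
    return (abs(a.x - b.x) < (a.w + b.w) / 2 and
            abs(a.y - b.y) < (a.d + b.d) / 2)

robot = Placement("robot_cell", 2.0, 1.5, 1.0, 1.0)
conveyor = Placement("conveyor", 4.0, 1.5, 1.5, 0.8)
assert inside_room(robot, 8.0, 5.0) and inside_room(conveyor, 8.0, 5.0)
assert left_of(robot, conveyor) and not collide(robot, conveyor)
# Violated constraints would be reported back to the agent for revision.
```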

6. Methodological Limitations and Prospects

  • Context and Token Limits: Fine-grained, large, or cluttered scenes surpass current LLM context windows; possible solutions include hierarchical summarization, retrieval-augmented chunking, and context-efficient protocols (Chandhok, 2024).
  • Grounding and Detection Uncertainty: Downstream reasoning accuracy is currently bounded by the reliability of earlier vision/captioning pipelines; propagating multiple hypotheses or Bayesian beliefs into LLM queries is an open direction.
  • Ergonomics, Physical Validity, Scalability: Current generative pipelines do not incorporate ergonomic, affordance, or full-geometry collision losses (Feng et al., 28 Apr 2025). Integration of RL-based intersection penalties and explicit ergonomic objectives is proposed as future work.
  • Real-Time, Large-Scale, and Multimodal Expansion: Latency of LLM-based graph prediction and the requirements for end-to-end differentiability motivate research into INT8 quantization, kv-cache optimization, and joint LLM-vision training (Zhu et al., 6 Sep 2025, Chen et al., 18 Apr 2025).

7. Synthesis and Outlook

SceneGPT frameworks unify the representational and inferential capabilities of LLMs with task-driven 3D spatial modeling. Whether applied to reasoning (question answering, retrieval, spatial logic), generative synthesis (object-wise and relation-wise), or interactive editing (modular code-by-proxy), SceneGPT implementations consistently demonstrate that manipulating structured scene graphs or tokenized object descriptions enables powerful and flexible scene intelligence across multiple domains. Ongoing challenges include context bottlenecks, cross-modal robustness, precise geometry synthesis, and scaling to real-time or “open world” settings. The modular, language-centric, and graph-based approaches of SceneGPT provide a foundation for embodied AI agents and design assistants with compositional, spatial, and semantic fluency in complex environments (Chandhok, 2024, Feng et al., 28 Apr 2025, Gao et al., 2023, Qi et al., 2 Jan 2025, Ye et al., 7 Feb 2025, Xia et al., 2024).
