Papers
Topics
Authors
Recent
Assistant
AI Research Assistant
Well-researched responses based on relevant abstracts and paper content.
Custom Instructions Pro
Preferences or requirements that you'd like Emergent Mind to consider when generating responses.
Gemini 2.5 Flash
Gemini 2.5 Flash 73 tok/s
Gemini 2.5 Pro 40 tok/s Pro
GPT-5 Medium 32 tok/s Pro
GPT-5 High 28 tok/s Pro
GPT-4o 75 tok/s Pro
Kimi K2 184 tok/s Pro
GPT OSS 120B 466 tok/s Pro
Claude Sonnet 4.5 35 tok/s Pro
2000 character limit reached

CasaGPT: Cuboid Arrangement and Scene Assembly for Interior Design (2504.19478v1)

Published 28 Apr 2025 in cs.CV

Abstract: We present a novel approach for indoor scene synthesis, which learns to arrange decomposed cuboid primitives to represent 3D objects within a scene. Unlike conventional methods that use bounding boxes to determine the placement and scale of 3D objects, our approach leverages cuboids as a straightforward yet highly effective alternative for modeling objects. This allows for compact scene generation while minimizing object intersections. Our approach, coined CasaGPT for Cuboid Arrangement and Scene Assembly, employs an autoregressive model to sequentially arrange cuboids, producing physically plausible scenes. By applying rejection sampling during the fine-tuning stage to filter out scenes with object collisions, our model further reduces intersections and enhances scene quality. Additionally, we introduce a refined dataset, 3DFRONT-NC, which eliminates significant noise presented in the original dataset, 3D-FRONT. Extensive experiments on the 3D-FRONT dataset as well as our dataset demonstrate that our approach consistently outperforms the state-of-the-art methods, enhancing the realism of generated scenes, and providing a promising direction for 3D scene synthesis.

Summary

  • The paper presents an autoregressive framework that leverages cuboid abstractions to generate intersection-free, physically plausible indoor scenes.
  • It integrates rejection sampling and curated datasets, achieving a 90.6% non-intersection rate in bedrooms compared to state-of-the-art methods.
  • CasaGPT employs a modified Llama-3 Transformer to sequence cuboid tokens, enabling realistic scene completion and re-arrangement.

CasaGPT: Cuboid Arrangement and Scene Assembly for Intersection-Free Indoor Scene Synthesis

Introduction and Motivation

CasaGPT introduces a novel autoregressive framework for indoor scene synthesis, focusing on intersection-free arrangement of 3D objects by decomposing them into cuboid primitives. The method addresses two major limitations in prior work: (1) the prevalence of object intersections in training data and generated scenes due to coarse bounding box representations, and (2) the inability of generative models to enforce physical plausibility during synthesis. By leveraging fine-grained cuboid abstractions and integrating rejection sampling into the training pipeline, CasaGPT achieves compact, realistic, and physically plausible scene layouts. Figure 1

Figure 1: Overview of the CasaGPT framework for indoor scene generation, illustrating model pre-training with cuboid tokens and iterative rejection sampling for collision-free scene refinement.

Cuboid-Based Scene Representation

CasaGPT replaces traditional bounding box representations with assemblies of cuboid primitives, derived via voxelization and coarse-graining of 3D meshes. The cuboid abstraction process merges contiguous occupied voxels along principal axes, followed by iterative merging based on a dynamic threshold that penalizes excessive volume inflation. This results in a compact, yet expressive, representation of object geometry that more accurately reflects mesh-level spatial relationships. Figure 2

Figure 2: Workflow of voxelization, coarse-graining, and cuboid merging, yielding compact cuboid representations for 3D objects.

The scene is encoded as a sequence of tokens: each object is represented by an entity token (class, size, translation, rotation) followed by its constituent cuboid tokens, sorted by height. This tokenization enables the autoregressive model to capture both global and local spatial dependencies.

Autoregressive Modeling and Rejection Sampling

CasaGPT employs a Llama-3-based Transformer decoder, modified with RMSNorm, SwiGLU, and learned position encoding, to model the sequential arrangement of cuboid tokens. During training, teacher forcing is used to accelerate convergence and improve token prediction accuracy.

To mitigate the probabilistic generation of intersecting objects, CasaGPT integrates rejection sampling during fine-tuning. Candidate scenes are generated and filtered based on 3D IoU thresholds between cuboids; only intersection-free samples are retained for further training. This iterative process distills a policy that increasingly favors physically plausible layouts.

Dataset Curation: 3DFRONT-NC

Recognizing the limitations of the 3D-FRONT dataset, which contains numerous object intersections, the authors curate 3DFRONT-NC ("noise clean") by applying cuboid-based intersection avoidance. The process involves computing the IoU matrix for all cuboids, followed by gradient-based translation updates to eliminate intersections without perturbing non-intersecting arrangements. Figure 3

Figure 3: Dataset refinement via intersection avoidance, preserving plausible arrangements while eliminating collisions.

Quantitative analysis demonstrates that 3DFRONT-NC exhibits significantly higher non-intersection rates (NIRate) across all room types, especially bedrooms and libraries. Figure 4

Figure 4: Non-intersection rate comparison between 3D-FRONT and 3DFRONT-NC, highlighting the effectiveness of dataset curation.

Object Retrieval and Scene Assembly

During inference, CasaGPT retrieves 3D models from the 3D-FUTURE database by matching generated cuboid assemblies to candidate objects via 3D IoU in voxel space. This fine-grained retrieval avoids the intersection issues inherent in bounding box-based nearest neighbor methods. Figure 5

Figure 5: Comparison of object retrieval methods, demonstrating superior intersection avoidance with cuboid-based retrieval.

Experimental Results

CasaGPT is evaluated on both the original and curated datasets, outperforming state-of-the-art methods (ATISS, DiffuScene, LayoutGPT) in FID, CIoU, and NIRate metrics. On 3DFRONT-NC, CasaGPT achieves a 90.6% non-intersection rate for bedrooms, a 31% improvement over previous methods. The model also demonstrates strong generalization to complex floor plans and customized layouts. Figure 6

Figure 6: Qualitative comparison of scene generation across methods, with CasaGPT exhibiting superior intersection avoidance and plausibility.

Ablation studies confirm the advantages of cuboid representation over bounding boxes for both dataset refinement and sequence modeling. Iterative rejection sampling further improves intersection metrics, with minimal impact on distributional fidelity (as measured by FID). Figure 7

Figure 7: Visual comparison of object arrangement using cuboid vs. bounding box representations for dataset refinement.

Applications: Scene Completion and Re-Arrangement

CasaGPT supports scene completion by auto-regressively predicting missing objects, and scene re-arrangement by resampling low-probability objects. In both tasks, CasaGPT outperforms ATISS in maintaining coherence and avoiding intersections. Figure 8

Figure 8: Scene re-arrangement results, with magenta indicating disrupted objects; CasaGPT achieves superior correction and plausibility.

Limitations and Future Directions

Despite substantial improvements, CasaGPT exhibits limitations in highly constrained scenes, occasionally placing objects outside floor boundaries or violating ergonomic principles. Failure cases include inaccessible regions and suboptimal object orientations. Figure 9

Figure 9: Failure cases including boundary crossing and ergonomic violations.

Future work may explore denoising diffusion models for cuboid generation and reinforcement learning approaches that directly optimize intersection volumes as reward signals.

Computational Considerations

CasaGPT's model size (27.4M parameters) and inference time (29.63ms per forward pass) are competitive, with only modest overhead in object retrieval compared to ATISS and DiffuScene. Training with rejection sampling increases GPU hours but yields substantial gains in scene quality.

Conclusion

CasaGPT advances the state of indoor scene synthesis by introducing a cuboid-based autoregressive framework with integrated rejection sampling and dataset curation. The approach achieves intersection-free, compact, and realistic scene layouts, validated by strong quantitative and qualitative results. The framework sets a new standard for physically plausible scene generation and opens avenues for further research in generative 3D modeling, dataset refinement, and reinforcement learning-based layout optimization.

List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.

Don't miss out on important new AI/ML research

See which papers are being discussed right now on X, Reddit, and more:

“Emergent Mind helps me see which AI papers have caught fire online.”

Philip

Philip

Creator, AI Explained on YouTube