
Planner3D: LLM-enhanced graph prior meets 3D indoor scene explicit regularization (2403.12848v2)

Published 19 Mar 2024 in cs.CV

Abstract: Compositional 3D scene synthesis has diverse applications across a spectrum of industries such as robotics, films, and video games, as it closely mirrors the complexity of real-world multi-object environments. Conventional works typically employ shape-retrieval-based frameworks, which naturally suffer from limited shape diversity. Recent progress has been made in object shape generation with generative models such as diffusion models, which increases shape fidelity. However, these approaches treat 3D shape generation and layout generation separately, and the synthesized scenes are often hampered by layout collisions, which suggests that scene-level fidelity remains under-explored. In this paper, we aim to generate realistic and reasonable 3D indoor scenes from scene graphs. To enrich the priors of the given scene graph inputs, an LLM is utilized to aggregate global features with local node-wise and edge-wise features. With a unified graph encoder, graph features are extracted to guide joint layout-shape generation, and additional regularization is introduced to explicitly constrain the produced 3D layouts. Benchmarked on the SG-FRONT dataset, our method achieves better 3D scene synthesis, especially in terms of scene-level fidelity. The source code will be released after publication.
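
The abstract's notion of explicitly regularizing 3D layouts against collisions can be illustrated with a minimal sketch. Assuming each object's layout is parameterized as an axis-aligned 3D bounding box (center and size), a differentiable penalty on pairwise intersection volumes is one plausible form such a constraint could take; the function name, tensor shapes, and PyTorch usage below are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a layout-collision penalty (assumption, not the paper's code).
# Boxes are axis-aligned, given as (N, 3) centers and (N, 3) sizes.
import torch

def overlap_penalty(centers: torch.Tensor, sizes: torch.Tensor) -> torch.Tensor:
    """Sum of pairwise 3D intersection volumes; zero when no boxes collide."""
    mins = centers - sizes / 2          # (N, 3) lower corners
    maxs = centers + sizes / 2          # (N, 3) upper corners

    # Pairwise intersection extents along each axis: (N, N, 3)
    inter_min = torch.maximum(mins[:, None, :], mins[None, :, :])
    inter_max = torch.minimum(maxs[:, None, :], maxs[None, :, :])
    inter_len = (inter_max - inter_min).clamp(min=0.0)

    inter_vol = inter_len.prod(dim=-1)  # (N, N) intersection volumes
    # Count each distinct pair once (upper triangle, excluding the diagonal).
    return torch.triu(inter_vol, diagonal=1).sum()

# Example: two unit boxes whose centers are 0.5 apart along x overlap by volume 0.5.
centers = torch.tensor([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0]])
sizes = torch.ones(2, 3)
print(overlap_penalty(centers, sizes))  # tensor(0.5000)
```

In a training setup, a term of this kind would typically be added to the layout-generation loss with a small weight, so gradients push colliding boxes apart while leaving non-overlapping layouts untouched.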

Authors (5)
  1. Yao Wei (18 papers)
  2. Martin Renqiang Min (44 papers)
  3. George Vosselman (23 papers)
  4. Li Erran Li (37 papers)
  5. Michael Ying Yang (70 papers)