Disentangled 3D Scene Generation with Layout Learning (2402.16936v1)

Published 26 Feb 2024 in cs.CV and cs.LG

Abstract: We introduce a method to generate 3D scenes that are disentangled into their component objects. This disentanglement is unsupervised, relying only on the knowledge of a large pretrained text-to-image model. Our key insight is that objects can be discovered by finding parts of a 3D scene that, when rearranged spatially, still produce valid configurations of the same scene. Concretely, our method jointly optimizes multiple NeRFs from scratch - each representing its own object - along with a set of layouts that composite these objects into scenes. We then encourage these composited scenes to be in-distribution according to the image generator. We show that despite its simplicity, our approach successfully generates 3D scenes decomposed into individual objects, enabling new capabilities in text-to-3D content creation. For results and an interactive demo, see our project page at https://dave.ml/layoutlearning/

Disentangled 3D Scene Generation with Unsupervised Layout Learning

Introduction

A long-standing goal in artificial intelligence is the ability to parse and understand complex scenes as collections of individual entities or objects. This paper introduces a method for generating 3D scenes that are automatically decomposed into their constituent objects, with no supervision required for the decomposition. The approach extends Neural Radiance Fields (NeRFs) from producing monolithic 3D representations to generating compositions of multiple objects that can be manipulated independently. A distinctive aspect of the work is that the disentanglement is guided entirely by the priors learned by a large pretrained text-to-image model.

Overview of Method

The method advances 3D scene generation by defining objects as components that can be independently manipulated while still yielding a "well-formed" scene. This is achieved by optimizing multiple NeRFs, each representing a different object within the scene, alongside a set of layouts that determine the spatial arrangement of these objects. Several layouts are learned jointly with the NeRFs, and requiring that every layout produce a plausible scene encourages a meaningful decomposition into identifiable objects. Renderings of the composited scenes are further optimized to match the distribution of images the pretrained text-to-image model associates with the text description, ensuring that the composed scenes are coherent and contextually relevant.
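Text-to-3D pipelines of this kind typically realize the distribution-matching step with the score distillation sampling (SDS) loss introduced by DreamFusion. As a sketch, assuming a diffusion-based text-to-image generator, the gradient used to update the NeRF and layout parameters $\theta$ takes the form

$$\nabla_\theta \mathcal{L}_{\mathrm{SDS}} = \mathbb{E}_{t,\epsilon}\Big[\, w(t)\,\big(\hat{\epsilon}_\phi(x_t;\, y,\, t) - \epsilon\big)\, \frac{\partial x}{\partial \theta} \Big],$$

where $x = g(\theta)$ is a rendering of the composited scene from a random camera, $x_t$ is its noised version at diffusion timestep $t$, $y$ is the text prompt, $\epsilon \sim \mathcal{N}(0, I)$, $\hat{\epsilon}_\phi$ is the frozen model's noise prediction, and $w(t)$ is a timestep weighting.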

Technical Contributions

The paper makes several key contributions:

  • Introduces an operational definition of objects as parts of a scene that can undergo independent spatial manipulations while preserving scene validity.
  • Implements a novel architecture for the generative composition of 3D scenes by learning a set of NeRFs together with their spatial layouts (a minimal code sketch follows this list).
  • Demonstrates the utility of the proposed method in a range of 3D scene generation and editing tasks without requiring explicit supervision such as object labels, bounding boxes, or external models.
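The training-loop sketch below illustrates how such a joint optimization could be structured, assuming PyTorch. The helpers `ObjectNeRF`, `sample_random_camera`, `render_composite`, and `sds_loss` are hypothetical placeholders standing in for the per-object radiance fields, camera sampling, compositing renderer, and score-distillation loss; the layout parameterization is likewise an assumption rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class Layout(nn.Module):
    """One learned arrangement: a rigid transform per object.
    The yaw-plus-translation parameterization is an assumption of this
    sketch, not necessarily the paper's exact choice."""
    def __init__(self, num_objects: int):
        super().__init__()
        self.yaw = nn.Parameter(torch.zeros(num_objects))            # rotation about the up axis
        self.translation = nn.Parameter(torch.zeros(num_objects, 3)) # object placement

K, N = 4, 3  # number of object NeRFs and of layouts (illustrative values)
text_prompt = "a backpack, a skateboard, and a bench"  # illustrative prompt
nerfs = nn.ModuleList([ObjectNeRF() for _ in range(K)])   # hypothetical per-object NeRF module
layouts = nn.ModuleList([Layout(K) for _ in range(N)])
opt = torch.optim.Adam([*nerfs.parameters(), *layouts.parameters()], lr=1e-3)

for step in range(10_000):
    layout = layouts[step % N]        # every learned layout must yield a valid scene
    camera = sample_random_camera()   # hypothetical random viewpoint sampler
    # Place each object NeRF according to the chosen layout, composite the
    # fields, and volume-render the full scene from the sampled camera.
    image = render_composite(nerfs, layout, camera)
    # Score-distillation-style loss: push the rendering toward the image
    # distribution the pretrained text-to-image model assigns to the prompt
    # (see the SDS gradient sketched above).
    loss = sds_loss(image, text_prompt)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Cycling through several layouts while sharing the same object NeRFs is what pushes each NeRF toward capturing a unit that remains plausible under different spatial arrangements.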

Evaluation and Findings

Quantitative and qualitative evaluations underscore the effectiveness of the layout learning approach in generating detailed 3D scenes that are accurately decomposed into individual objects. The method outperforms existing baselines in terms of the meaningfulness of the object-level decomposition, as evidenced by comparisons using CLIP scores. The paper also showcases the flexibility of the approach through applications in scene editing and object arrangement, further validating the practical utility of the proposed method.
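For reference, a CLIP score in this context is typically the cosine similarity between a CLIP image embedding of a rendering and a CLIP text embedding of a candidate description; per-object renders that score highly against object-level text suggest a meaningful decomposition. A minimal sketch using OpenAI's open-source `clip` package follows; the file name and text label are illustrative, and this is not necessarily the authors' exact evaluation protocol.

```python
import torch
import clip                     # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Illustrative inputs: a rendering of one recovered object and a candidate label.
image = preprocess(Image.open("object_render.png")).unsqueeze(0).to(device)
text = clip.tokenize(["a wooden stool"]).to(device)

with torch.no_grad():
    img_emb = model.encode_image(image)
    txt_emb = model.encode_text(text)
    # Normalize and take the cosine similarity as the CLIP score.
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    clip_score = (img_emb * txt_emb).sum(dim=-1).item()

print(f"CLIP score: {clip_score:.3f}")
```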

Practical Implications and Future Directions

This work presents a significant advancement in the text-to-3D domain, offering a new tool for the creation of complex, editable 3D scenes from textual descriptions alone. The ability to disentangle these scenes into constituent objects without any form of explicit supervision opens up new avenues for content creation, providing users with granular control over the components of their generated scenes.

Looking ahead, the paper speculates on future developments in AI that could build on this foundation, such as improved techniques for unsupervised learning of object properties and relationships or the integration of dynamic elements within generated scenes. The ongoing refinement of these methods holds promise not only for more sophisticated 3D content creation tools but also for advancing our understanding of the processes by which AI can interpret and manipulate complex environments.

Concluding Remarks

This paper represents a notable step forward in the generative modeling of 3D scenes, distinguished by its unsupervised approach to disentangling scenes into individual, manipulable objects. By leveraging the capabilities of pretrained text-to-image models in a novel architecture, the authors have opened new possibilities for the creative and practical applications of AI in 3D content generation. As the field continues to evolve, the principles and methods introduced here could play a significant role in shaping the future of generative AI and its intersection with 3D modeling and design.

Authors (5)
  1. Dave Epstein (9 papers)
  2. Ben Poole (46 papers)
  3. Ben Mildenhall (41 papers)
  4. Aleksander Holynski (37 papers)
  5. Alexei A. Efros (100 papers)