XCube: Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies

Published 6 Dec 2023 in cs.CV, cs.GR, and cs.LG | (2312.03806v2)

Abstract: We present XCube (abbreviated as $\mathcal{X}^3$), a novel generative model for high-resolution sparse 3D voxel grids with arbitrary attributes. Our model can generate millions of voxels with a finest effective resolution of up to $1024^3$ in a feed-forward fashion without time-consuming test-time optimization. To achieve this, we employ a hierarchical voxel latent diffusion model which generates progressively higher resolution grids in a coarse-to-fine manner using a custom framework built on the highly efficient VDB data structure. Apart from generating high-resolution objects, we demonstrate the effectiveness of XCube on large outdoor scenes at scales of 100m$\times$100m with a voxel size as small as 10cm. We observe clear qualitative and quantitative improvements over past approaches. In addition to unconditional generation, we show that our model can be used to solve a variety of tasks such as user-guided editing, scene completion from a single scan, and text-to-3D. The source code and more results can be found at https://research.nvidia.com/labs/toronto-ai/xcube/.


Authors (6)

Citations (12)


Summary

The paper introduces a hierarchical latent diffusion model leveraging sparse voxel hierarchies, enabling efficient high-resolution 3D generation without test-time optimization.
It combines a sparse variational autoencoder with latent diffusion, significantly improving speed, quality, and scalability over existing 3D generative methods.
Its capacity for user-guided editing and application to both object-level and scene-level tasks opens avenues for simulations, VR, and autonomous driving.

XCube: Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies

The paper introduces XCube ($\mathcal{X}^3$), a generative model for high-resolution sparse 3D voxel grids with arbitrary attributes. Using sparse voxel hierarchies, XCube produces 3D shapes with attributes such as surface normals, semantics, and truncated signed distance function (TSDF) values in a computationally efficient manner. The significance of this work lies in its ability to handle large-scale scenes while maintaining a high effective resolution, improving on existing methods in quality, diversity, and speed.

Overview of the Paper

The key contribution of this paper is a hierarchical voxel latent diffusion model that exploits the inherent sparsity of 3D data. The method builds a multi-scale generative model that progresses from coarse to fine voxel grids, with each successive level adding shape detail to the one before it. This hierarchical design is what allows the model to generate millions of voxels at an effective resolution of up to $1024^3$ in a feed-forward fashion, without test-time optimization.
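The coarse-to-fine idea can be illustrated with a toy sketch. This is not the paper's implementation: the `subdivide` function and the `on_plane` rule are hypothetical, with `on_plane` standing in for whatever learned model decides which finer voxels exist. The point is only that refinement touches the occupied voxels, so the work grows with the surface rather than with the full dense volume.

```python
from itertools import product


def subdivide(coarse, keep):
    """Split every occupied coarse voxel into its 8 children and keep
    the children accepted by `keep` (a hypothetical stand-in for a
    learned model that predicts the finer level)."""
    fine = set()
    for (x, y, z) in coarse:
        for dx, dy, dz in product((0, 1), repeat=3):
            child = (2 * x + dx, 2 * y + dy, 2 * z + dz)
            if keep(child):
                fine.add(child)
    return fine


def on_plane(voxel):
    # Toy occupancy rule (assumption): the "surface" is the z == 0 plane.
    return voxel[2] == 0


# Start from a flat 4 x 4 slab of occupied voxels in a 4^3 grid.
grid = {(x, y, 0) for x in range(4) for y in range(4)}
resolution = 4
history = [(resolution, len(grid))]

# Refine twice: 4^3 -> 8^3 -> 16^3.
for _ in range(2):
    grid = subdivide(grid, on_plane)
    resolution *= 2
    history.append((resolution, len(grid)))

for res, count in history:
    print(f"{res}^3 grid: {count} occupied of {res ** 3} dense cells")
```

In this toy case the occupied count grows by a factor of 4 per level while the dense cell count grows by a factor of 8, which is the kind of gap a sparse hierarchy exploits.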

XCube is implemented in a custom framework built on the highly efficient VDB data structure, which stores and manages sparse voxel data in a memory-efficient way and keeps processing fast. The approach combines a sparse variational autoencoder (VAE) with latent diffusion models, so that each level of the hierarchy is generated conditioned on the preceding, coarser level. The result is a flexible and scalable modeling process with clear qualitative improvements over other methods, as demonstrated on object-level datasets (e.g., ShapeNet) and on large-scale outdoor scene generation (e.g., the Waymo Open Dataset).
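One way to read this level-by-level conditioning is as a chain factorization of the joint distribution over grids. The notation here is an assumption introduced for illustration: $G_l$ denotes the sparse voxel grid (with its attributes) at level $l$, $G_1$ is the coarsest level, $L$ is the number of levels, and $G_0$ is an empty placeholder for the unconditioned first level.

$$
p(G_1, \dots, G_L) = \prod_{l=1}^{L} p(G_l \mid G_{l-1}), \qquad G_0 = \varnothing
$$

Under this reading, each factor $p(G_l \mid G_{l-1})$ corresponds to one level's latent diffusion model operating in the latent space of that level's sparse VAE, and sampling proceeds from $l = 1$ to $l = L$.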

Results and Implications

Quantitatively, XCube shows clear improvements over past approaches in both object generation and scene-level applications, as shown by the metrics and user studies presented. Beyond unconditional generation, the model can be applied to text-to-3D generation and to scene completion from sparse inputs such as a single scan. It also supports user-guided editing: a coarse voxel grid can be manipulated and the finer levels regenerated to give a plausible high-resolution output, which opens the door to interactive applications in gaming and virtual reality.

The implications of XCube range from practical applications, such as urban planning and simulation for autonomous driving, to broader questions of how to improve the scalability and efficiency of 3D generative models. The ability to generate and manipulate data at these scales without prohibitive computational demands marks a significant advance.
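The scales quoted in the abstract can be made concrete with a little arithmetic. The sketch below uses only the abstract's figures (100 m scenes, 10 cm voxels, $1024^3$ effective resolution); the occupied-voxel count of one million is an illustrative assumption standing in for "millions of voxels", not a number reported by the paper.

```python
# Figures quoted in the abstract.
scene_side_cm = 100 * 100   # a 100 m scene side, expressed in centimetres
voxel_size_cm = 10          # finest voxel size of 10 cm

cells_per_side = scene_side_cm // voxel_size_cm   # cells along one side
footprint_cells = cells_per_side ** 2             # cells in the ground plane

# A dense grid at the finest effective object resolution.
dense_cells = 1024 ** 3

# Illustrative assumption: one million occupied voxels.
occupied = 1_000_000
fraction = occupied / dense_cells

print(f"cells per scene side: {cells_per_side}")
print(f"ground-plane footprint: {footprint_cells:,} cells")
print(f"dense 1024^3 grid: {dense_cells:,} cells")
print(f"occupied fraction per million voxels: {fraction:.4%}")
```

Each million occupied voxels is under 0.1% of a dense $1024^3$ grid, which is why storing and processing only the occupied voxels matters at this resolution.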

Future Developments

Future expansions of this work are likely to explore enhanced integration of text and image-conditioned 3D generation, leveraging the existing structure to guide more complex generative tasks. Another intriguing development could incorporate feedback from interactive editing directly into the generative process, making the system more responsive and adaptable to real-time changes. Furthermore, extending the framework to accommodate varying levels of detail on demand could offer even more versatility in fields where rapid and high-fidelity content creation is crucial.

In conclusion, this paper presents substantial progress in 3D generative modeling, with capabilities that matter for both academic research and practical use. XCube demonstrates the potential of hierarchical structures and sparse representations, and lays a foundation for further work on large-scale 3D generation.
