- The paper introduces CLAY, a scalable generative model that integrates a multi-resolution VAE with a diffusion transformer to produce detailed 3D assets from diverse inputs.
- It standardizes heterogeneous 3D data through a remeshing protocol that yields watertight meshes, paired with GPT-4V-based annotation to produce consistent, reliable training labels.
- CLAY enhances asset quality by synthesizing physically-based rendering textures via a two-stage pipeline, delivering production-ready digital models.
Overview of "CLAY: A Controllable Large-scale Generative Model for Creating High-quality 3D Assets"
The paper introduces CLAY, a large-scale generative model designed to create high-quality 3D digital assets under a variety of controls. CLAY addresses the accessibility barriers of professional 3D creation tools and the quality limits of existing generative approaches, emphasizing scalability and the ability to convert text and image inputs into highly detailed 3D geometries. The model represents a significant step toward jointly achieving geometric accuracy and realistic texturing in 3D asset generation.
3D Model Framework
Central to CLAY’s architecture is the pairing of a multi-resolution Variational Autoencoder (VAE) with a minimalist Diffusion Transformer (DiT), which lets the model extract rich 3D priors from a wide range of geometric representations. The VAE encodes sampled point clouds into a compact latent space while preserving surface continuity and detail; the DiT, operating on these latents, scales up through progressive training on growing data volumes and latent resolutions.
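To make the VAE-plus-DiT flow concrete, here is a minimal numpy sketch of the two stages: a latent set cross-attends to point features to compress an arbitrary-size point cloud into fixed-size latent tokens, and a self-attention step stands in for one DiT denoising pass. All weights are random and the dimensions (`DIM`, `NUM_LATENTS`) are toy values chosen for illustration, not CLAY's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 32          # latent channel width (toy value, not the paper's)
NUM_LATENTS = 64  # size of the latent set (toy value)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def encode_points(points):
    """VAE-encoder stand-in: a learnable latent set cross-attends to
    point features, compressing a point cloud of any size down to a
    fixed number of latent tokens (random weights for illustration)."""
    feats = points @ rng.normal(size=(3, DIM))        # (N, DIM)
    queries = rng.normal(size=(NUM_LATENTS, DIM))     # latent set
    attn = softmax(queries @ feats.T / np.sqrt(DIM))  # (L, N)
    return attn @ feats                               # (L, DIM)

def denoise_step(z_t, t):
    """One DiT-style denoising step: self-attention over the latent
    set followed by a pointwise nonlinearity. A real DiT would also
    condition on the timestep t; it is unused in this toy version."""
    attn = softmax(z_t @ z_t.T / np.sqrt(DIM)) @ z_t
    h = attn + z_t                                    # residual
    return h + np.tanh(h @ rng.normal(size=(DIM, DIM))) * 0.1

cloud = rng.normal(size=(2048, 3))      # surface samples of one shape
z = encode_points(cloud)                # fixed-size latent set
z_noisy = z + rng.normal(size=z.shape)  # forward diffusion (one step)
z_denoised = denoise_step(z_noisy, t=0.5)
print(z.shape, z_denoised.shape)        # (64, 32) (64, 32)
```

The key property the sketch illustrates is that the latent set has a fixed shape regardless of how many points the input cloud contains, which is what lets a transformer train over shapes of wildly varying complexity.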
The architecture’s adaptability allows CLAY to generate 3D geometries from diverse inputs, with attention mechanisms that preserve spatial accuracy. The resulting models respect the input geometry while adapting smoothly across different resolutions and levels of complexity.
Data Standardization
A standout component of CLAY is its robust data processing pipeline. Recognizing the inconsistent quality of available 3D datasets like ShapeNet and Objaverse, the authors implement a standardization protocol built on remeshing methods that convert every asset into a watertight mesh, so that inside/outside queries on the surface are well defined during training.
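The watertightness criterion the remeshing protocol targets is easy to state: in a closed triangle mesh, every undirected edge must be shared by exactly two faces. The following stdlib-only sketch checks that property (it is a validity check, not CLAY's remeshing algorithm itself):

```python
from collections import Counter

def is_watertight(faces):
    """A closed (watertight) triangle mesh has every undirected edge
    shared by exactly two faces; a boundary edge appears only once."""
    edges = Counter()
    for a, b, c in faces:
        for e in ((a, b), (b, c), (c, a)):
            edges[tuple(sorted(e))] += 1
    return all(count == 2 for count in edges.values())

# A tetrahedron is closed...
tet = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
print(is_watertight(tet))      # True
# ...but removing one face opens a hole.
print(is_watertight(tet[:3]))  # False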
By leveraging the capabilities of the GPT-4V model, CLAY integrates precise annotation schemes to enhance data quality further. This approach ensures a comprehensive and consistent dataset foundation, pivotal for training large-scale models such as CLAY.
Asset Enhancement and Material Synthesis
CLAY’s asset enhancement extends beyond mere geometry generation to include advanced material synthesis. The model generates physically-based rendering (PBR) textures, covering diffuse, roughness, and metallic modalities, far surpassing the scope of previous methods. This integration is powered by a targeted multi-view material diffusion model, allowing for efficient, high-quality texture generation.
The two-stage pipeline—including mesh quadrification and atlasing—prepares the mesh for realistic rendering, enabling seamless integration into existing digital production environments. This approach underlines the model’s ability to produce directly deployable production-quality assets.
Model Adaptation and Conditional Generations
CLAY’s versatility is showcased in its ability to handle various conditional generation tasks, supporting inputs like text, images, and more complex 3D primitives.
The model’s conditioning scheme, facilitated by cross-attention techniques, allows it to adaptively generate 3D content from diverse modalities—demonstrating superiority in both creativity and fidelity. Critical evaluations reveal that CLAY’s multi-view conditioning capabilities result in robust, high-fidelity reconstructions consistent with provided textual or visual inputs.
Implications and Future Directions
CLAY exemplifies a highly adaptable generative framework, addressing key challenges in creative and entertainment industries. Its emphasis on scalable, high-quality 3D generation could significantly reduce the bottleneck of creative content creation, enabling more interactive and pervasive digital environments.
While CLAY’s advancements are notable, further developments could focus on making the pipeline more integrated to combine geometry and material generation seamlessly. Additionally, leveraging larger datasets and refining the model’s ability to handle highly complex composed objects could enhance its applicability.
In conclusion, CLAY sets a benchmark for future research in 3D asset generation, opening doors for exploration in dynamic and interactive digital environments using advanced generative modeling frameworks.