Overview of "CLIP-Mesh: Generating Textured Meshes from Text Using Pretrained Image-Text Models"
The paper "CLIP-Mesh: Generating Textured Meshes from Text Using Pretrained Image-Text Models" presents a novel approach for generating 3D models directly from textual prompts without any 3D supervision. The methodology leverages a pretrained CLIP model, known for its capability to encode textual and visual information into a shared embedding space, enabling a zero-shot generation process. Unlike prior work which often relies on stylization or requires comprehensive training of generative models, CLIP-Mesh directly optimizes mesh parameters to generate both shape and texture.
Methodological Innovations
The paper proposes a solution that integrates several key techniques to manage the intricacies associated with 3D mesh generation:
- Differentiable Rendering and Optimization: The method employs a differentiable rendering pipeline that renders the evolving mesh and compares the rendered images against the text description within the CLIP embedding space. This allows the mesh parameters to be optimized directly so that the generated 3D object matches the input text (a minimal optimization-loop sketch follows this list).
- Constrained Optimization: The process introduces constraints to keep the generated meshes and textures plausible. Regularization techniques such as Laplacian smoothing, together with image augmentations applied to the rendered views, are used to prevent degenerate geometry and texture artifacts (a regularizer sketch follows this list).
- Subdivision Surface Regularization: Loop subdivision surfaces provide a smooth, continuous representation of the mesh: a coarse control mesh is optimized while its subdivided surface is rendered, which yields high-quality results at modest computational cost (a simplified subdivision sketch follows this list).
- Generative Priors: A pretrained diffusion prior maps text prompts to likely CLIP image embeddings, providing an additional optimization target that, in conjunction with the CLIP model, improves the fidelity and variety of the generated 3D objects.
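To make the optimization loop in the first bullet concrete, here is a minimal sketch. It is not the paper's implementation: it assumes the OpenAI `clip` package, replaces a real differentiable rasterizer (e.g., nvdiffrast) with a toy Gaussian-splat projection so the snippet runs end to end, and optimizes per-vertex colors instead of a UV texture and normal map. The prompt, vertex count, learning rate, and the `render_view` helper are all illustrative.

```python
import math
import torch
import torch.nn.functional as F
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float().eval()
for p in model.parameters():          # CLIP stays frozen; only the shape
    p.requires_grad_(False)           # and color parameters are optimized

# Learnable "mesh" parameters (a real implementation optimizes the control
# vertices of a subdivision surface plus texture and normal maps).
n_verts = 512
verts = torch.nn.Parameter(0.3 * torch.randn(n_verts, 3, device=device))
colors = torch.nn.Parameter(torch.rand(n_verts, 3, device=device))

def render_view(verts, colors, angle, res=112, sigma=0.03):
    """Toy differentiable renderer: rotate around the y axis, project
    orthographically, and splat each vertex as an isotropic Gaussian."""
    x = verts[:, 0] * math.cos(angle) + verts[:, 2] * math.sin(angle)
    y = verts[:, 1]
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, res, device=verts.device),
        torch.linspace(-1, 1, res, device=verts.device),
        indexing="ij",
    )
    d2 = (xs[..., None] - x) ** 2 + (ys[..., None] - y) ** 2    # (res, res, V)
    w = torch.exp(-d2 / (2 * sigma ** 2))
    img = torch.einsum("hwv,vc->chw", w, colors).clamp(0, 1)    # (3, res, res)
    return img

# CLIP's expected input normalization.
clip_mean = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device)
clip_std = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device)

prompt = "a matte green armchair"
text_emb = F.normalize(model.encode_text(clip.tokenize([prompt]).to(device)), dim=-1)
# The paper additionally feeds the text embedding through a diffusion prior to
# predict a plausible *image* embedding and uses that as the target; here the
# raw text embedding is used directly for brevity.

opt = torch.optim.Adam([verts, colors], lr=1e-2)
for step in range(500):
    angle = 2 * math.pi * torch.rand(1).item()        # random camera per step
    img = render_view(verts, colors, angle)
    img = F.interpolate(img[None], size=224, mode="bilinear", align_corners=False)
    img = (img - clip_mean.view(1, 3, 1, 1)) / clip_std.view(1, 3, 1, 1)
    img_emb = F.normalize(model.encode_image(img), dim=-1)
    loss = 1.0 - (img_emb * text_emb).sum()           # cosine distance in CLIP space
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In the full method the loss on the rendered views would be combined with regularization terms such as the Laplacian penalty sketched next, and the target embedding would come from the diffusion prior rather than the raw text embedding.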
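Next, a minimal sketch of a uniform Laplacian smoothing regularizer of the kind mentioned in the second bullet, assuming a triangle mesh given as a vertex tensor and a face-index tensor; the function name and the uniform (unweighted) Laplacian are illustrative choices, not the paper's exact formulation.

```python
import torch

def laplacian_smoothing_loss(verts: torch.Tensor, faces: torch.Tensor) -> torch.Tensor:
    """verts: (V, 3) float tensor; faces: (F, 3) long tensor of vertex indices."""
    V = verts.shape[0]
    # Directed edges from the face list, one per neighbor relation, deduplicated.
    e = torch.cat([faces[:, [0, 1]], faces[:, [1, 2]], faces[:, [2, 0]]], dim=0)
    e = torch.unique(torch.cat([e, e.flip(1)], dim=0), dim=0)
    src, dst = e[:, 0], e[:, 1]
    # Average position of each vertex's neighbors.
    neighbor_sum = torch.zeros_like(verts).index_add_(0, src, verts[dst])
    degree = torch.zeros(V, device=verts.device).index_add_(
        0, src, torch.ones(src.shape[0], device=verts.device)
    )
    centroid = neighbor_sum / degree.clamp(min=1).unsqueeze(1)
    # Penalize each vertex's offset from its neighborhood centroid.
    return ((verts - centroid) ** 2).sum(dim=1).mean()
```

In practice such a term is added to the image-text loss with a small weight, so it discourages spiky, degenerate geometry without overpowering the text-driven signal.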
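Finally, a sketch of the subdivision idea from the third bullet: a coarse control mesh is what gets optimized, while a refined mesh is what gets rendered. For brevity this uses plain midpoint (1-to-4) subdivision rather than true Loop subdivision, which additionally re-weights vertex positions toward a smooth limit surface; the function name and tensor layout are illustrative.

```python
import torch

def midpoint_subdivide(verts: torch.Tensor, faces: torch.Tensor):
    """verts: (V, 3) float; faces: (F, 3) long. Returns the refined (verts, faces)."""
    V, F_ = verts.shape[0], faces.shape[0]
    # Collect unique undirected edges and give each a new midpoint vertex.
    edges_all = torch.cat([faces[:, [0, 1]], faces[:, [1, 2]], faces[:, [2, 0]]], dim=0)
    edges, inverse = torch.unique(edges_all.sort(dim=1).values, dim=0, return_inverse=True)
    midpoints = verts[edges].mean(dim=1)              # (E, 3), differentiable in verts
    new_verts = torch.cat([verts, midpoints], dim=0)
    # Midpoint-vertex index for edges (v0,v1), (v1,v2), (v2,v0) of each face.
    m01 = inverse[:F_] + V
    m12 = inverse[F_:2 * F_] + V
    m20 = inverse[2 * F_:] + V
    v0, v1, v2 = faces[:, 0], faces[:, 1], faces[:, 2]
    # Split every triangle into four.
    new_faces = torch.cat([
        torch.stack([v0, m01, m20], dim=1),
        torch.stack([m01, v1, m12], dim=1),
        torch.stack([m20, m12, v2], dim=1),
        torch.stack([m01, m12, m20], dim=1),
    ], dim=0)
    return new_verts, new_faces
```

Because the midpoints are differentiable functions of the control vertices, gradients from the rendered, refined mesh flow back to the small set of coarse parameters being optimized.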
Results
The experimental validation indicates that CLIP-Mesh outperforms baselines such as Dream Fields in generating recognizable 3D shapes from abstract descriptions. Demonstrated applications include detailed household objects and complex landmark structures, illustrating the model's versatility and robustness.
Implications and Future Directions
CLIP-Mesh represents a significant step forward in democratizing 3D content creation by enabling automatic generation of textured meshes from plain textual input. This has immediate implications for fields like gaming, virtual reality, and augmented reality where rapid prototyping and object integration are key.
However, inherent limitations such as genus constraints originating from initial mesh topology and potential issues with semantic misinterpretations by the CLIP model (e.g., unintended text artifacts on models) offer avenues for future exploration. Enhancing the control users have over the generative process and integrating more sophisticated shape-based constraints could address these challenges.
In conclusion, CLIP-Mesh introduces a practical and efficient framework for 3D asset generation from text, leveraging the broad knowledge encapsulated within large-scale vision-language models. Extending such methodologies could ultimately allow seamless, high-fidelity, and context-aware 3D content synthesis, transforming creative and industrial digital workflows.