- The paper introduces a novel neural style field (NSF) that transforms 3D mesh appearance through text-driven cues using CLIP embeddings.
- It employs multi-view rendering and Fourier feature mappings to achieve high-quality, semantically consistent stylizations even on low-quality meshes.
- The approach enables rapid, intuitive stylization of digital assets with applications in entertainment, prototyping, and education.
An Overview of "TEXTureSTYLE: Text-Driven Neural Stylization for Meshes"
The paper "TEXTureSTYLE: Text-Driven Neural Stylization for Meshes" introduces a novel approach for the stylization of 3D meshes driven by textual descriptions. This method addresses the challenge of transforming the appearance of 3D models—specifically in terms of color and local geometric details—based on descriptive textual input, without requiring pre-trained generative models or specialized datasets.
Summary
TEXTureSTYLE performs stylization with a neural network termed the neural style field (NSF). The system takes two inputs: the 3D content, given as a mesh, and a stylization instruction written in natural language. The NSF adapts the shape's style by predicting per-point colors and geometric displacements, guided by the joint text-image embedding space of the CLIP model.
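To make the architecture concrete, the sketch below shows what such a style field might look like as a coordinate MLP in PyTorch. The class name `StyleField`, the layer widths, and the output ranges are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class StyleField(nn.Module):
    """Minimal sketch of a neural style field: maps a 3D surface point
    to an RGB color and a scalar displacement along the vertex normal.
    Depth, width, and output scaling are assumptions for illustration."""

    def __init__(self, in_dim=3, hidden=256, depth=4):
        super().__init__()
        layers, dim = [], in_dim
        for _ in range(depth):
            layers += [nn.Linear(dim, hidden), nn.ReLU()]
            dim = hidden
        self.backbone = nn.Sequential(*layers)
        self.color_head = nn.Linear(hidden, 3)  # per-point RGB
        self.disp_head = nn.Linear(hidden, 1)   # per-point displacement magnitude

    def forward(self, points):
        h = self.backbone(points)
        color = torch.sigmoid(self.color_head(h))    # colors constrained to [0, 1]
        disp = 0.1 * torch.tanh(self.disp_head(h))   # small, bounded displacements
        return color, disp
```

Applying the field to every vertex and displacing each along its normal, `styled_verts = verts + disp * normals`, yields the geometric component of the style, while the predicted colors provide the appearance component.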
The method stands out for its ability to handle low-quality meshes, including those with non-manifold edges or high genus, and it requires no UV parameterization, which is often difficult to obtain for such inputs. During optimization, the stylized mesh is rendered from multiple 2D viewpoints, the renders are embedded into CLIP's semantic space, and the style is updated so that the renders align with the given text prompt.
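The optimization signal can be sketched as below, assuming the official `clip` package (github.com/openai/CLIP) and some differentiable renderer, not shown here, that produces view batches already resized and normalized the way CLIP expects. The loss is the mean negative cosine similarity between the rendered views and the prompt in CLIP's embedding space; the prompt string is just an example.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)
model = model.float()  # keep weights in fp32 so gradients flow through fp32 renders
for p in model.parameters():
    p.requires_grad_(False)  # CLIP stays frozen; only the style field is optimized

# Encode the style prompt once; its embedding stays fixed during optimization.
tokens = clip.tokenize(["a vase made of colorful crochet"]).to(device)
with torch.no_grad():
    text_emb = model.encode_text(tokens)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

def clip_style_loss(rendered_views):
    """rendered_views: (N, 3, 224, 224) differentiably rendered, CLIP-normalized
    images of the stylized mesh from N viewpoints (assumed upstream)."""
    img_emb = model.encode_image(rendered_views)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    # Maximize cosine similarity to the prompt, i.e. minimize its negation.
    return -(img_emb @ text_emb.T).mean()
```

Each optimization step would render the currently stylized mesh from several viewpoints, evaluate this loss, and backpropagate through the renderer into the style field's weights.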
Technical Contributions
- Content and Style Representation: By treating the 3D mesh as a representation of content and the NSF as a representation of style, the approach effectively disentangles the overall shape from its appearance.
- Use of CLIP for Style Guidance: Computing a similarity score between the text prompt and multi-view images of the stylized mesh in CLIP's embedding space (as sketched above) bypasses the need for generative models pre-trained specifically on mesh data.
- Neural Style Field Network: The NSF maps points on the mesh surface to their style attributes, namely RGB colors and geometric displacements (see the sketch above), significantly broadening the range of stylizations a single architecture can produce.
- Avoidance of Degenerate Solutions: Left unconstrained, coordinate networks of this kind tend to converge to degenerate stylizations, such as overly smooth or noisy outputs. TEXTureSTYLE counters this with neural regularization and Fourier feature mappings of the input coordinates (sketched below), maintaining high-quality, semantically consistent results.
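The Fourier feature mapping in the last point can be sketched as a fixed random-frequency encoding applied to the input coordinates before the MLP, in the spirit of random Fourier features for coordinate networks; the feature count and the bandwidth parameter `sigma` here are illustrative assumptions.

```python
import math
import torch

class FourierFeatures(torch.nn.Module):
    """Maps x to [cos(2*pi*xB), sin(2*pi*xB)] with a fixed random matrix B.
    High-frequency input encodings help coordinate MLPs express fine detail
    instead of collapsing to overly smooth, degenerate stylizations."""

    def __init__(self, in_dim=3, num_features=256, sigma=5.0):
        super().__init__()
        # sigma controls the frequency bandwidth; larger values favor finer detail.
        self.register_buffer("B", torch.randn(in_dim, num_features) * sigma)

    def forward(self, x):
        proj = 2.0 * math.pi * (x @ self.B)
        return torch.cat([torch.cos(proj), torch.sin(proj)], dim=-1)
```

In the `StyleField` sketch above, this mapping would replace the raw `points` input, with `in_dim` widened to `2 * num_features`.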
Implications
The practical implications of TEXTureSTYLE are substantial: it provides a flexible way to create diverse 3D model appearances driven purely by text. This has clear utility in entertainment, prototyping, and education, where varied model aesthetics are often needed quickly and intuitively. Theoretically, the approach lays groundwork for further use of text embeddings in multimodal applications, particularly the synthesis and transformation of digital assets.
The model's ability to integrate text, potentially alongside other media such as image-based inputs, demonstrates its robustness and suggests extensibility to more diverse multimedia domains. Future work could pursue content manipulation alongside stylization, enabling comprehensive transformation of 3D model attributes from descriptive language alone.
Conclusion
TEXTureSTYLE presents a significant advance in mesh stylization, showing that text, combined with a powerful neural representation and an existing image-text embedding such as CLIP, can direct complex graphics tasks. The findings and methodology set the stage for further exploration of text-driven media transformation, with broad implications for how we interact with and manipulate digital content, and they point toward tighter integration between language and digital creation in creative workflows and beyond.