MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D (2411.02336v1)

Published 4 Nov 2024 in cs.CV

Abstract: Texturing is a crucial step in the 3D asset production workflow, which enhances the visual appeal and diversity of 3D assets. Despite recent advancements in Text-to-Texture (T2T) generation, existing methods often yield subpar results, primarily due to local discontinuities, inconsistencies across multiple views, and their heavy dependence on UV unwrapping outcomes. To tackle these challenges, we propose a novel generation-refinement 3D texturing framework called MVPaint, which can generate high-resolution, seamless textures while emphasizing multi-view consistency. MVPaint mainly consists of three key modules. 1) Synchronized Multi-view Generation (SMG). Given a 3D mesh model, MVPaint first simultaneously generates multi-view images by employing an SMG model, which leads to coarse texturing results with unpainted parts due to missing observations. 2) Spatial-aware 3D Inpainting (S3I). To ensure complete 3D texturing, we introduce the S3I method, specifically designed to effectively texture previously unobserved areas. 3) UV Refinement (UVR). Furthermore, MVPaint employs a UVR module to improve the texture quality in the UV space, which first performs a UV-space Super-Resolution, followed by a Spatial-aware Seam-Smoothing algorithm for revising spatial texturing discontinuities caused by UV unwrapping. Moreover, we establish two T2T evaluation benchmarks: the Objaverse T2T benchmark and the GSO T2T benchmark, based on selected high-quality 3D meshes from the Objaverse dataset and the entire GSO dataset, respectively. Extensive experimental results demonstrate that MVPaint surpasses existing state-of-the-art methods. Notably, MVPaint could generate high-fidelity textures with minimal Janus issues and highly enhanced cross-view consistency.

References (64)
  1. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  2. Stability AI. Stable Diffusion 2, 2022. Accessed: 2024-10-24.
  3. MultiDiffusion: Fusing diffusion paths for controlled image generation. In ICML, 2023.
  4. Meta 3D TextureGen: Fast and consistent texture generation for 3D objects. arXiv preprint arXiv:2407.02430, 2024.
  5. Demystifying MMD GANs. In ICLR, 2018.
  6. Text2Tex: Text-driven texture synthesis via diffusion models. In CVPR, 2023.
  7. MeshXL: Neural coordinate field for generative 3D foundation models. arXiv preprint arXiv:2405.20853, 2024.
  8. MeshAnything: Artist-created mesh generation with autoregressive transformers. arXiv preprint arXiv:2406.10163, 2024.
  9. MeshAnything V2: Artist-created mesh generation with adjacent mesh tokenization. arXiv preprint arXiv:2408.02555, 2024.
  10. IT3D: Improved text-to-3D generation with explicit view synthesis. In AAAI, 2024.
  11. Emu: Enhancing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807, 2023.
  12. Objaverse: A universe of annotated 3D objects. In CVPR, 2023.
  13. FlashTex: Fast relightable mesh texturing with LightControlNet. In ECCV, 2024.
  14. Diffusers. ControlNet Depth SDXL 1.0. https://huggingface.co/diffusers/controlnet-depth-sdxl-1.0, 2023.
  15. CogView: Mastering text-to-image generation via transformers. In NeurIPS, 2021.
  16. Google Scanned Objects: A high-quality dataset of 3D scanned household items. In ICRA, 2022.
  17. Make-A-Scene: Scene-based text-to-image generation with human priors. In ECCV, 2022.
  18. Generative adversarial nets. In NeurIPS, 2014.
  19. CLIPScore: A reference-free evaluation metric for image captioning. In EMNLP, 2021.
  20. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
  21. Denoising diffusion probabilistic models. In NeurIPS, 2020.
  22. Cascaded diffusion models for high fidelity image generation. JMLR, 2022.
  23. CogVLM2: Visual language models for image and video understanding. arXiv preprint arXiv:2408.16500, 2024.
  24. LoRA: Low-rank adaptation of large language models. In ICML, 2022.
  25. Adversarial texture optimization from RGB-D scans. In CVPR, 2020.
  26. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
  27. FlexiTex: Enhancing texture generation with visual guidance. arXiv preprint arXiv:2409.12431, 2024.
  28. Direct visibility of point sets. In SIGGRAPH, 2007.
  29. Solid texture synthesis from 2D exemplars. In SIGGRAPH, 2007.
  30. SyncDiffusion: Coherent montage via synchronized joint diffusions. In NeurIPS, 2023.
  31. Appearance-space texture synthesis. ACM TOG, 2006.
  32. MVControl: Adding conditional control to multi-view diffusion for controllable text-to-3D generation. arXiv preprint arXiv:2311.14494, 2023.
  33. Zero-1-to-3: Zero-shot one image to 3D object. In ICCV, 2023.
  34. Text-guided texturing by synchronized multi-view diffusion. arXiv preprint arXiv:2311.12891, 2023.
  35. Wonder3D: Single image to 3D using cross-domain diffusion. In CVPR, 2024.
  36. Mehdi Mirza. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  37. CLIP-Mesh: Generating textured meshes from text using pretrained image-text models. In SIGGRAPH Asia, 2022.
  38. EASI-Tex: Edge-aware mesh texturing from single image. ACM TOG, 2024.
  39. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  40. DreamFusion: Text-to-3D using 2D diffusion. In ICLR, 2023.
  41. Alec Radford. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  42. Learning transferable visual models from natural language supervision. In ICML, 2021.
  43. TEXTure: Text-guided texturing of 3D shapes. In SIGGRAPH, 2023.
  44. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
  45. CLIP-Forge: Towards zero-shot text-to-shape generation. In CVPR, 2022.
  46. Zero123++: A single image to consistent multi-view diffusion base model. arXiv preprint arXiv:2310.15110, 2023.
  47. MVDream: Multi-view diffusion for 3D generation. In ICLR, 2024.
  48. Denoising diffusion implicit models. In ICLR, 2021.
  49. DreamGaussian: Generative Gaussian splatting for efficient 3D content creation. In ICLR, 2024.
  50. Greg Turk. Texture synthesis on surfaces. In SIGGRAPH, 2001.
  51. A. Vaswani. Attention is all you need. In NeurIPS, 2017.
  52. ImageDream: Image-prompt multi-view diffusion for 3D generation. arXiv preprint arXiv:2312.02201, 2023.
  53. Texture synthesis over arbitrary manifold surfaces. In SIGGRAPH, 2001.
  54. State of the art in example-based texture synthesis. Eurographics STAR, 2009.
  55. Unique3D: High-quality and efficient 3D mesh generation from a single image. arXiv preprint arXiv:2405.20343, 2024.
  56. Xinsir. ControlNet Tile SDXL 1.0. https://huggingface.co/xinsir/controlnet-tile-sdxl-1.0, 2023.
  57. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023.
  58. Consistent-1-to-3: Consistent image to 3D view synthesis via geometry-aware diffusion models. In 3DV, 2024.
  59. Jonathan Young. xatlas: A library for mesh parameterization. GitHub repository, 2018.
  60. Paint3D: Paint anything 3D with lighting-less texture diffusion models. In CVPR, 2024.
  61. TexPainter: Generative mesh texturing with multi-view consistency. In SIGGRAPH, 2024.
  62. Adding conditional control to text-to-image diffusion models. In CVPR, 2023.
  63. CLAY: A controllable large-scale generative model for creating high-quality 3D assets. ACM TOG, 2024.
  64. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.

Summary

  • The paper presents the MVPaint framework that overcomes multi-view inconsistencies and the Janus problem through synchronized diffusion for 3D texture generation.
  • It introduces a three-stage process—synchronized multi-view generation, spatial-aware 3D inpainting, and UV refinement—to address challenges in UV mapping and texture completion.
  • Experimental evaluations on the Objaverse T2T and GSO T2T benchmarks demonstrate performance superior to existing text-to-texture methods, supporting more advanced 3D asset creation for games, VR, and animation.

Insights on Synchronized Multi-View Diffusion for 3D Texture Generation

The paper presents MVPaint, a comprehensive framework for generating high-fidelity 3D textures from textual descriptions. It addresses persistent challenges in 3D texturing, particularly the need for seamless texture generation across multiple views with minimal reliance on UV unwrapping quality. The method is organized into three stages, each targeting a distinct failure mode of prior text-to-texture pipelines, with an eye toward large-scale 3D asset production.

MVPaint's architecture targets fundamental issues in existing methods, such as local discontinuities and the Janus problem, which arise when multiple views are generated independently. The experimental results underscore the model's effectiveness in generating consistent and detailed textures, outperforming existing state-of-the-art techniques.

Key Components of MVPaint

MVPaint operates through three major stages:

  1. Synchronized Multi-View Generation (SMG): The first stage employs a multi-view diffusion model to generate images of the mesh from several viewpoints simultaneously, conditioned on a text prompt. By combining cross-attention with synchronization through the shared UV space, this stage produces consistent low-resolution multi-view images and thereby mitigates the Janus problem. Performing the synchronization in image space rather than latent space avoids the UV-mapping complications that latent-space operations tend to suffer from (a minimal sketch of the synchronization idea follows this list).
  2. Spatial-aware 3D Inpainting (S3I): After the initial multi-view generation, MVPaint completes the texture with an inpainting method that exploits spatial relationships among 3D points sampled from the mesh surface. This learning-free approach propagates color from observed to unobserved points, making it robust to complex UV unwrapping and occlusions (see the second sketch below).
  3. UV Refinement (UVR): The final stage refines the coarse 3D texture in UV space, performing UV-space super-resolution followed by spatial-aware seam smoothing to produce a high-resolution texture map. This restores consistency and detail where earlier stages or UV-mapping irregularities introduce discrepancies (see the third sketch below).
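To make the synchronization idea concrete, below is a minimal, self-contained sketch of how per-view color estimates could be merged through a shared UV texture at each denoising step. It illustrates the general synchronized multi-view pattern rather than MVPaint's actual SMG module; the function name and the precomputed per-view UV coordinates and visibility masks are hypothetical stand-ins for rasterizer outputs.

```python
# Sketch only: merge per-view color estimates through a shared UV texture so that
# every view agrees on surface color. Not the paper's implementation; uv_coords and
# vis_masks stand in for quantities a mesh rasterizer would provide.
import torch

def sync_views_via_uv(views, uv_coords, vis_masks, tex_res=256):
    """views: (V, H, W, 3) per-view color estimates.
    uv_coords: (V, H, W, 2) UV coordinate of the surface point seen by each pixel, in [0, 1).
    vis_masks: (V, H, W) bool, True where the pixel sees the mesh."""
    V = views.shape[0]
    tex_sum = torch.zeros(tex_res, tex_res, 3)
    tex_cnt = torch.zeros(tex_res, tex_res, 1)

    # "Unproject": scatter-average every visible pixel into the shared texture.
    texel = (uv_coords * tex_res).long().clamp(0, tex_res - 1)
    flat_idx = texel[..., 1] * tex_res + texel[..., 0]
    for v in range(V):
        idx = flat_idx[v][vis_masks[v]]
        tex_sum.view(-1, 3).index_add_(0, idx, views[v][vis_masks[v]])
        tex_cnt.view(-1, 1).index_add_(0, idx, torch.ones(idx.numel(), 1))
    texture = tex_sum / tex_cnt.clamp(min=1)

    # "Re-render": gather texture colors back into every view.
    synced = views.clone()
    for v in range(V):
        idx = flat_idx[v][vis_masks[v]]
        synced[v][vis_masks[v]] = texture.view(-1, 3)[idx]
    return synced, texture

# Toy usage with random tensors standing in for rasterizer outputs.
views = torch.rand(4, 64, 64, 3)
uv_coords = torch.rand(4, 64, 64, 2)
vis_masks = torch.rand(4, 64, 64) > 0.3
synced_views, texture = sync_views_via_uv(views, uv_coords, vis_masks)
```

Applying a merge of this kind between denoising steps is what keeps the views from drifting apart; the exact frequency and resolution of the synchronization are details of the paper's SMG module not reproduced here.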
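The spatial propagation behind S3I can be illustrated with a simple inverse-distance-weighted nearest-neighbor scheme over surface samples. This is an assumption about the general idea, not the paper's exact algorithm; `inpaint_point_colors` and its arguments are hypothetical.

```python
# Sketch only: fill colors of unobserved surface points from their nearest painted
# neighbors in 3D, using inverse-distance weighting. Not the paper's S3I algorithm.
import numpy as np
from scipy.spatial import cKDTree

def inpaint_point_colors(points, colors, painted_mask, k=8, eps=1e-8):
    """points: (N, 3) surface sample positions; colors: (N, 3) RGB in [0, 1];
    painted_mask: (N,) bool, True where a color was observed from some view."""
    painted_pts = points[painted_mask]
    painted_cols = colors[painted_mask]
    missing_idx = np.flatnonzero(~painted_mask)

    tree = cKDTree(painted_pts)
    dists, nn = tree.query(points[missing_idx], k=k)  # (M, k) distances and indices
    weights = 1.0 / (dists + eps)                     # closer painted points count more
    weights /= weights.sum(axis=1, keepdims=True)
    filled = colors.copy()
    filled[missing_idx] = (weights[..., None] * painted_cols[nn]).sum(axis=1)
    return filled

# Toy usage: 10,000 random surface samples, roughly 70% already painted.
rng = np.random.default_rng(0)
points = rng.random((10_000, 3))
colors = rng.random((10_000, 3))
painted = rng.random(10_000) > 0.3
completed = inpaint_point_colors(points, colors, painted, k=8)
```

Because the propagation happens among 3D points rather than in UV space, it is unaffected by how the mesh was unwrapped, which matches the motivation given for S3I.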
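For the super-resolution half of UVR, one plausible off-the-shelf setup is to run an SDXL tile ControlNet in image-to-image mode over the coarse UV texture; reference 56 lists such a checkpoint. The snippet below is an assumption about how this could be wired up with diffusers, not the authors' pipeline; the base model, prompt, strength, and file names are illustrative.

```python
# Sketch only: refine a coarse UV texture with an SDXL tile ControlNet via diffusers.
# The checkpoint, prompt, and hyperparameters are assumptions, not MVPaint's settings.
import torch
from diffusers import ControlNetModel, StableDiffusionXLControlNetImg2ImgPipeline
from PIL import Image

controlnet = ControlNetModel.from_pretrained(
    "xinsir/controlnet-tile-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Hypothetical input: the coarse UV texture produced by the earlier stages.
coarse = Image.open("coarse_uv_texture.png").convert("RGB").resize((2048, 2048))

refined = pipe(
    prompt="a clean, detailed texture map, high quality",
    image=coarse,                      # img2img initialization
    control_image=coarse,              # tile ControlNet condition
    strength=0.35,                     # keep the result close to the input colors
    controlnet_conditioning_scale=0.8,
    num_inference_steps=30,
).images[0]
refined.save("refined_uv_texture.png")
```

The seam-smoothing half of UVR then blends colors across texels that are adjacent on the mesh surface but separated in UV space; that step is geometric rather than generative and is not sketched here.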

Evaluation and Implications

Two benchmarks, Objaverse T2T and GSO T2T, were established to evaluate the framework. They highlight the model's strength in maintaining cross-view consistency and handling diverse textures, as reflected in FID and KID scores as well as user studies. The model shows clear gains over previous methods such as Paint3D, SyncMVD, and TEXTure, mainly due to its comprehensive UV refinement and geometry-aware synthesis.
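For context on the image-space metrics, the following is a generic way to compute FID and KID over rendered views with torchmetrics; it is not the authors' evaluation code, and the random tensors stand in for real renderings of textured and reference meshes.

```python
# Generic FID/KID computation over rendered views (sketch, not the paper's code).
# torchmetrics' image metrics expect uint8 tensors of shape (N, 3, H, W) by default.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

real_renders = torch.randint(0, 256, (100, 3, 299, 299), dtype=torch.uint8)
fake_renders = torch.randint(0, 256, (100, 3, 299, 299), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)
kid = KernelInceptionDistance(subset_size=50)  # subset size must not exceed the sample count

fid.update(real_renders, real=True)
fid.update(fake_renders, real=False)
kid.update(real_renders, real=True)
kid.update(fake_renders, real=False)

print("FID:", fid.compute().item())
kid_mean, kid_std = kid.compute()
print("KID:", kid_mean.item(), "+/-", kid_std.item())
```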

MVPaint also broadens practical use by texturing meshes produced by generative models such as MeshXL or MeshAnything. By addressing these long-standing challenges, it extends beyond game design into areas such as virtual reality and animation, promising more detailed and lifelike 3D asset generation.

Limitations and Future Work

The paper acknowledges avenues for improvement, such as raising aesthetic quality beyond the baseline set by models like SDXL. While the current implementation relies on text prompts for texture guidance, extending it to image prompts, for example via image-conditioned adapters or image-to-multi-view diffusion models, could make the framework more versatile.

MVPaint marks a step toward more efficient and consistent texture generation for 3D content. Its generation-refinement design provides a foundation for further exploration and adaptation in AI-driven texture creation.
