MVPaint: Synchronized Multi-View Diffusion for Painting Anything 3D (2411.02336v1)

Published 4 Nov 2024 in cs.CV

Abstract: Texturing is a crucial step in the 3D asset production workflow, which enhances the visual appeal and diversity of 3D assets. Despite recent advancements in Text-to-Texture (T2T) generation, existing methods often yield subpar results, primarily due to local discontinuities, inconsistencies across multiple views, and their heavy dependence on UV unwrapping outcomes. To tackle these challenges, we propose a novel generation-refinement 3D texturing framework called MVPaint, which can generate high-resolution, seamless textures while emphasizing multi-view consistency. MVPaint mainly consists of three key modules. 1) Synchronized Multi-view Generation (SMG). Given a 3D mesh model, MVPaint first simultaneously generates multi-view images by employing an SMG model, which leads to coarse texturing results with unpainted parts due to missing observations. 2) Spatial-aware 3D Inpainting (S3I). To ensure complete 3D texturing, we introduce the S3I method, specifically designed to effectively texture previously unobserved areas. 3) UV Refinement (UVR). Furthermore, MVPaint employs a UVR module to improve the texture quality in the UV space, which first performs a UV-space Super-Resolution, followed by a Spatial-aware Seam-Smoothing algorithm for revising spatial texturing discontinuities caused by UV unwrapping. Moreover, we establish two T2T evaluation benchmarks: the Objaverse T2T benchmark and the GSO T2T benchmark, based on selected high-quality 3D meshes from the Objaverse dataset and the entire GSO dataset, respectively. Extensive experimental results demonstrate that MVPaint surpasses existing state-of-the-art methods. Notably, MVPaint could generate high-fidelity textures with minimal Janus issues and highly enhanced cross-view consistency.

References (64)
  1. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  2. Stability AI. Stable Diffusion 2, 2022. Accessed: 2024-10-24.
  3. MultiDiffusion: Fusing diffusion paths for controlled image generation. In ICML, 2023.
  4. Meta 3D TextureGen: Fast and consistent texture generation for 3D objects. arXiv preprint arXiv:2407.02430, 2024.
  5. Demystifying MMD GANs. In ICLR, 2018.
  6. Text2Tex: Text-driven texture synthesis via diffusion models. In CVPR, 2023.
  7. MeshXL: Neural coordinate field for generative 3D foundation models. arXiv preprint arXiv:2405.20853, 2024.
  8. MeshAnything: Artist-created mesh generation with autoregressive transformers. arXiv preprint arXiv:2406.10163, 2024.
  9. MeshAnything V2: Artist-created mesh generation with adjacent mesh tokenization. arXiv preprint arXiv:2408.02555, 2024.
  10. IT3D: Improved text-to-3D generation with explicit view synthesis. In AAAI, 2024.
  11. Emu: Enhancing image generation models using photogenic needles in a haystack. arXiv preprint arXiv:2309.15807, 2023.
  12. Objaverse: A universe of annotated 3D objects. In CVPR, 2023.
  13. FlashTex: Fast relightable mesh texturing with LightControlNet. In ECCV, 2024.
  14. Diffusers. ControlNet Depth SDXL 1.0. https://huggingface.co/diffusers/controlnet-depth-sdxl-1.0, 2023.
  15. CogView: Mastering text-to-image generation via transformers. In NeurIPS, 2021.
  16. Google Scanned Objects: A high-quality dataset of 3D scanned household items. In ICRA, 2022.
  17. Make-A-Scene: Scene-based text-to-image generation with human priors. In ECCV, 2022.
  18. Generative adversarial nets. In NeurIPS, 2014.
  19. CLIPScore: A reference-free evaluation metric for image captioning. In EMNLP, 2021.
  20. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In NeurIPS, 2017.
  21. Denoising diffusion probabilistic models. In NeurIPS, 2020.
  22. Cascaded diffusion models for high fidelity image generation. JMLR, 2022.
  23. CogVLM2: Visual language models for image and video understanding. arXiv preprint arXiv:2408.16500, 2024.
  24. LoRA: Low-rank adaptation of large language models. In ICML, 2022.
  25. Adversarial texture optimization from RGB-D scans. In CVPR, 2020.
  26. Mistral 7B. arXiv preprint arXiv:2310.06825, 2023.
  27. FlexiTex: Enhancing texture generation with visual guidance. arXiv preprint arXiv:2409.12431, 2024.
  28. Direct visibility of point sets. In SIGGRAPH, 2007.
  29. Solid texture synthesis from 2D exemplars. In SIGGRAPH, 2007.
  30. SyncDiffusion: Coherent montage via synchronized joint diffusions. In NeurIPS, 2023.
  31. Appearance-space texture synthesis. ACM TOG, 2006.
  32. MVControl: Adding conditional control to multi-view diffusion for controllable text-to-3D generation. arXiv preprint arXiv:2311.14494, 2023.
  33. Zero-1-to-3: Zero-shot one image to 3D object. In ICCV, 2023.
  34. Text-guided texturing by synchronized multi-view diffusion. arXiv preprint arXiv:2311.12891, 2023.
  35. Wonder3D: Single image to 3D using cross-domain diffusion. In CVPR, 2024.
  36. Mehdi Mirza. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  37. CLIP-Mesh: Generating textured meshes from text using pretrained image-text models. In SIGGRAPH Asia, 2022.
  38. EASI-Tex: Edge-aware mesh texturing from single image. ACM TOG, 2024.
  39. SDXL: Improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  40. DreamFusion: Text-to-3D using 2D diffusion. In ICLR, 2023.
  41. Alec Radford. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  42. Learning transferable visual models from natural language supervision. In ICML, 2021.
  43. TEXTure: Text-guided texturing of 3D shapes. In SIGGRAPH, 2023.
  44. High-resolution image synthesis with latent diffusion models. In CVPR, 2022.
  45. CLIP-Forge: Towards zero-shot text-to-shape generation. In CVPR, 2022.
  46. Zero123++: A single image to consistent multi-view diffusion base model. arXiv preprint arXiv:2310.15110, 2023.
  47. MVDream: Multi-view diffusion for 3D generation. In ICLR, 2024.
  48. Denoising diffusion implicit models. In ICLR, 2021.
  49. DreamGaussian: Generative Gaussian splatting for efficient 3D content creation. In ICLR, 2024.
  50. Greg Turk. Texture synthesis on surfaces. In SIGGRAPH, 2001.
  51. A. Vaswani. Attention is all you need. In NeurIPS, 2017.
  52. ImageDream: Image-prompt multi-view diffusion for 3D generation. arXiv preprint arXiv:2312.02201, 2023.
  53. Texture synthesis over arbitrary manifold surfaces. In SIGGRAPH, 2001.
  54. State of the art in example-based texture synthesis. Eurographics STAR, 2009.
  55. Unique3D: High-quality and efficient 3D mesh generation from a single image. arXiv preprint arXiv:2405.20343, 2024.
  56. Xinsir. ControlNet Tile SDXL 1.0. https://huggingface.co/xinsir/controlnet-tile-sdxl-1.0, 2023.
  57. IP-Adapter: Text compatible image prompt adapter for text-to-image diffusion models. arXiv preprint arXiv:2308.06721, 2023.
  58. Consistent-1-to-3: Consistent image to 3D view synthesis via geometry-aware diffusion models. In 3DV, 2024.
  59. Jonathan Young. xatlas: A library for mesh parameterization. GitHub repository, 2018.
  60. Paint3D: Paint anything 3D with lighting-less texture diffusion models. In CVPR, 2024.
  61. TexPainter: Generative mesh texturing with multi-view consistency. In SIGGRAPH, 2024.
  62. Adding conditional control to text-to-image diffusion models. In CVPR, 2023.
  63. CLAY: A controllable large-scale generative model for creating high-quality 3D assets. ACM TOG, 2024.
  64. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.

Summary

  • The paper presents the MVPaint framework that overcomes multi-view inconsistencies and the Janus problem through synchronized diffusion for 3D texture generation.
  • It introduces a three-stage process—synchronized multi-view generation, spatial-aware 3D inpainting, and UV refinement—to address challenges in UV mapping and texture completion.
  • Experimental evaluations on the Objaverse T2T and GSO T2T benchmarks demonstrate performance superior to existing text-to-texture methods, supporting more advanced 3D asset creation for games, VR, and animation.

Insights on Synchronized Multi-View Diffusion for 3D Texture Generation

The paper presents MVPaint, a comprehensive framework for generating high-fidelity 3D textures from textual descriptions. It addresses persistent challenges in 3D texturing, particularly the need for seamless texture generation across multiple views with minimal reliance on UV unwrapping quality. The method is organized into three stages, each targeting a distinct failure mode of prior text-to-texture pipelines, with an eye toward large-scale 3D asset production.

MVPaint's architecture targets fundamental issues in existing methods, such as local discontinuities and the Janus problem, which arise when multiple views are generated independently. The experimental results underscore the model's effectiveness in generating consistent and detailed textures, outperforming existing state-of-the-art techniques.

Key Components of MVPaint

MVPaint operates through three major stages:

  1. Synchronized Multi-View Generation (SMG): The first stage employs a multi-view diffusion model to generate images of the mesh from several viewpoints simultaneously, conditioned on a text prompt. By combining cross-attention with synchronization through the shared UV space, this stage produces consistent low-resolution multi-view images and thereby mitigates the Janus problem. Performing the synchronization in image space rather than latent space avoids the UV-mapping complications that latent-space operations tend to suffer from (a minimal sketch of the synchronization idea follows this list).
  2. Spatial-aware 3D Inpainting (S3I): After the initial multi-view generation, MVPaint completes the texture with an inpainting method that exploits spatial relationships among 3D points sampled from the mesh surface. This learning-free approach propagates color from observed to unobserved points, making it robust to complex UV unwrapping and occlusions (see the second sketch below).
  3. UV Refinement (UVR): The final stage refines the coarse 3D texture in UV space, performing UV-space super-resolution followed by spatial-aware seam smoothing to produce a high-resolution texture map. This restores consistency and detail where earlier stages or UV-mapping irregularities introduce discrepancies (see the third sketch below).
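To make the synchronization idea concrete, below is a minimal, self-contained sketch of how per-view color estimates could be merged through a shared UV texture at each denoising step. It illustrates the general synchronized multi-view pattern rather than MVPaint's actual SMG module; the function name and the precomputed per-view UV coordinates and visibility masks are hypothetical stand-ins for rasterizer outputs.

```python
# Sketch only: merge per-view color estimates through a shared UV texture so that
# every view agrees on surface color. Not the paper's implementation; uv_coords and
# vis_masks stand in for quantities a mesh rasterizer would provide.
import torch

def sync_views_via_uv(views, uv_coords, vis_masks, tex_res=256):
    """views: (V, H, W, 3) per-view color estimates.
    uv_coords: (V, H, W, 2) UV coordinate of the surface point seen by each pixel, in [0, 1).
    vis_masks: (V, H, W) bool, True where the pixel sees the mesh."""
    V = views.shape[0]
    tex_sum = torch.zeros(tex_res, tex_res, 3)
    tex_cnt = torch.zeros(tex_res, tex_res, 1)

    # "Unproject": scatter-average every visible pixel into the shared texture.
    texel = (uv_coords * tex_res).long().clamp(0, tex_res - 1)
    flat_idx = texel[..., 1] * tex_res + texel[..., 0]
    for v in range(V):
        idx = flat_idx[v][vis_masks[v]]
        tex_sum.view(-1, 3).index_add_(0, idx, views[v][vis_masks[v]])
        tex_cnt.view(-1, 1).index_add_(0, idx, torch.ones(idx.numel(), 1))
    texture = tex_sum / tex_cnt.clamp(min=1)

    # "Re-render": gather texture colors back into every view.
    synced = views.clone()
    for v in range(V):
        idx = flat_idx[v][vis_masks[v]]
        synced[v][vis_masks[v]] = texture.view(-1, 3)[idx]
    return synced, texture

# Toy usage with random tensors standing in for rasterizer outputs.
views = torch.rand(4, 64, 64, 3)
uv_coords = torch.rand(4, 64, 64, 2)
vis_masks = torch.rand(4, 64, 64) > 0.3
synced_views, texture = sync_views_via_uv(views, uv_coords, vis_masks)
```

Applying a merge of this kind between denoising steps is what keeps the views from drifting apart; the exact frequency and resolution of the synchronization are details of the paper's SMG module not reproduced here.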
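The spatial propagation behind S3I can be illustrated with a simple inverse-distance-weighted nearest-neighbor scheme over surface samples. This is an assumption about the general idea, not the paper's exact algorithm; `inpaint_point_colors` and its arguments are hypothetical.

```python
# Sketch only: fill colors of unobserved surface points from their nearest painted
# neighbors in 3D, using inverse-distance weighting. Not the paper's S3I algorithm.
import numpy as np
from scipy.spatial import cKDTree

def inpaint_point_colors(points, colors, painted_mask, k=8, eps=1e-8):
    """points: (N, 3) surface sample positions; colors: (N, 3) RGB in [0, 1];
    painted_mask: (N,) bool, True where a color was observed from some view."""
    painted_pts = points[painted_mask]
    painted_cols = colors[painted_mask]
    missing_idx = np.flatnonzero(~painted_mask)

    tree = cKDTree(painted_pts)
    dists, nn = tree.query(points[missing_idx], k=k)  # (M, k) distances and indices
    weights = 1.0 / (dists + eps)                     # closer painted points count more
    weights /= weights.sum(axis=1, keepdims=True)
    filled = colors.copy()
    filled[missing_idx] = (weights[..., None] * painted_cols[nn]).sum(axis=1)
    return filled

# Toy usage: 10,000 random surface samples, roughly 70% already painted.
rng = np.random.default_rng(0)
points = rng.random((10_000, 3))
colors = rng.random((10_000, 3))
painted = rng.random(10_000) > 0.3
completed = inpaint_point_colors(points, colors, painted, k=8)
```

Because the propagation happens among 3D points rather than in UV space, it is unaffected by how the mesh was unwrapped, which matches the motivation given for S3I.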
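For the super-resolution half of UVR, one plausible off-the-shelf setup is to run an SDXL tile ControlNet in image-to-image mode over the coarse UV texture; reference 56 lists such a checkpoint. The snippet below is an assumption about how this could be wired up with diffusers, not the authors' pipeline; the base model, prompt, strength, and file names are illustrative.

```python
# Sketch only: refine a coarse UV texture with an SDXL tile ControlNet via diffusers.
# The checkpoint, prompt, and hyperparameters are assumptions, not MVPaint's settings.
import torch
from diffusers import ControlNetModel, StableDiffusionXLControlNetImg2ImgPipeline
from PIL import Image

controlnet = ControlNetModel.from_pretrained(
    "xinsir/controlnet-tile-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetImg2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Hypothetical input: the coarse UV texture produced by the earlier stages.
coarse = Image.open("coarse_uv_texture.png").convert("RGB").resize((2048, 2048))

refined = pipe(
    prompt="a clean, detailed texture map, high quality",
    image=coarse,                      # img2img initialization
    control_image=coarse,              # tile ControlNet condition
    strength=0.35,                     # keep the result close to the input colors
    controlnet_conditioning_scale=0.8,
    num_inference_steps=30,
).images[0]
refined.save("refined_uv_texture.png")
```

The seam-smoothing half of UVR then blends colors across texels that are adjacent on the mesh surface but separated in UV space; that step is geometric rather than generative and is not sketched here.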

Evaluation and Implications

Two benchmarks, Objaverse T2T and GSO T2T, were established to evaluate the framework. They highlight the model's strength in maintaining cross-view consistency and handling diverse textures, as reflected in FID and KID scores as well as user studies. The model shows clear gains over previous methods such as Paint3D, SyncMVD, and TEXTure, mainly due to its comprehensive UV refinement and geometry-aware synthesis.
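For context on the image-space metrics, the following is a generic way to compute FID and KID over rendered views with torchmetrics; it is not the authors' evaluation code, and the random tensors stand in for real renderings of textured and reference meshes.

```python
# Generic FID/KID computation over rendered views (sketch, not the paper's code).
# torchmetrics' image metrics expect uint8 tensors of shape (N, 3, H, W) by default.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.kid import KernelInceptionDistance

real_renders = torch.randint(0, 256, (100, 3, 299, 299), dtype=torch.uint8)
fake_renders = torch.randint(0, 256, (100, 3, 299, 299), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)
kid = KernelInceptionDistance(subset_size=50)  # subset size must not exceed the sample count

fid.update(real_renders, real=True)
fid.update(fake_renders, real=False)
kid.update(real_renders, real=True)
kid.update(fake_renders, real=False)

print("FID:", fid.compute().item())
kid_mean, kid_std = kid.compute()
print("KID:", kid_mean.item(), "+/-", kid_std.item())
```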

MVPaint also broadens practical use by texturing meshes produced by generative models such as MeshXL or MeshAnything. By addressing these long-standing challenges, it extends beyond game design into areas such as virtual reality and animation, promising more detailed and lifelike 3D asset generation.

Limitations and Future Work

The paper acknowledges avenues for improvement, such as raising aesthetic quality beyond the baseline set by models like SDXL. While the current implementation relies on text prompts for texture guidance, extending it to image prompts, for example via image-conditioned adapters or image-to-multi-view diffusion models, could make the framework more versatile.

MVPaint marks a step toward more efficient and consistent texture generation for 3D content. Its generation-refinement design provides a foundation for further exploration and adaptation in AI-driven texture creation.
