Paint3D: Paint Anything 3D with Lighting-Less Texture Diffusion Models (2312.13913v2)

Published 21 Dec 2023 in cs.CV

Abstract: This paper presents Paint3D, a novel coarse-to-fine generative framework that is capable of producing high-resolution, lighting-less, and diverse 2K UV texture maps for untextured 3D meshes conditioned on text or image inputs. The key challenge addressed is generating high-quality textures without embedded illumination information, which allows the textures to be re-lighted or re-edited within modern graphics pipelines. To achieve this, our method first leverages a pre-trained depth-aware 2D diffusion model to generate view-conditional images and perform multi-view texture fusion, producing an initial coarse texture map. However, as 2D models cannot fully represent 3D shapes and disable lighting effects, the coarse texture map exhibits incomplete areas and illumination artifacts. To resolve this, we train separate UV Inpainting and UVHD diffusion models specialized for the shape-aware refinement of incomplete areas and the removal of illumination artifacts. Through this coarse-to-fine process, Paint3D can produce high-quality 2K UV textures that maintain semantic consistency while being lighting-less, significantly advancing the state-of-the-art in texturing 3D objects.
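The abstract describes a three-stage, coarse-to-fine pipeline: multi-view image generation with a depth-aware 2D diffusion model, fusion of those views into a coarse UV texture, and UV-space refinement with inpainting and high-definition (UVHD) diffusion models. The following is a minimal, hypothetical sketch of that control flow only; every function name (render_depth_maps, depth_conditioned_diffusion, fuse_views_to_uv, and so on) is an illustrative placeholder rather than Paint3D's actual API, and the stubs simply return arrays of plausible shape.

```python
# Hypothetical sketch of a Paint3D-style coarse-to-fine texturing pipeline.
# All functions are illustrative placeholders, not the authors' implementation.
import numpy as np

UV_RES = 2048  # target 2K UV texture resolution

def render_depth_maps(mesh, camera_poses):
    """Placeholder: render a depth map of the mesh from each camera pose."""
    return [np.zeros((512, 512), dtype=np.float32) for _ in camera_poses]

def depth_conditioned_diffusion(prompt, depth_map):
    """Placeholder: pre-trained depth-aware 2D diffusion model producing one view image."""
    return np.random.rand(512, 512, 3).astype(np.float32)

def fuse_views_to_uv(mesh, views, camera_poses):
    """Placeholder: back-project the multi-view images into UV space and blend them."""
    texture = np.zeros((UV_RES, UV_RES, 3), dtype=np.float32)
    visible = np.zeros((UV_RES, UV_RES), dtype=bool)  # texels seen by at least one view
    return texture, visible

def uv_inpainting_diffusion(texture, visible_mask):
    """Placeholder: shape-aware UV diffusion model that fills texels unseen by any view."""
    return texture

def uvhd_diffusion(texture):
    """Placeholder: UVHD diffusion model that removes baked-in lighting and refines detail."""
    return texture

def paint3d_pipeline(mesh, prompt, camera_poses):
    # Stage 1: coarse texture from view-conditional 2D generation + multi-view fusion.
    depths = render_depth_maps(mesh, camera_poses)
    views = [depth_conditioned_diffusion(prompt, d) for d in depths]
    coarse_tex, visible = fuse_views_to_uv(mesh, views, camera_poses)

    # Stage 2: refine in UV space (inpaint occluded regions, strip illumination).
    inpainted = uv_inpainting_diffusion(coarse_tex, visible)
    return uvhd_diffusion(inpainted)  # lighting-less 2K UV texture map

if __name__ == "__main__":
    tex = paint3d_pipeline(mesh=None, prompt="a weathered leather armchair",
                           camera_poses=[np.eye(4) for _ in range(6)])
    print(tex.shape)  # (2048, 2048, 3)
```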

Authors (9)
  1. Xianfang Zeng (24 papers)
  2. Xin Chen (457 papers)
  3. Zhongqi Qi (1 paper)
  4. Wen Liu (55 papers)
  5. Zibo Zhao (21 papers)
  6. Zhibin Wang (53 papers)
  7. Bin Fu (74 papers)
  8. Yong Liu (721 papers)
  9. Gang Yu (114 papers)
Citations (40)

Summary

  • The paper presents Paint3D, a coarse-to-fine generative framework that produces high-resolution, lighting-less 2K UV texture maps for untextured 3D meshes conditioned on text or image inputs.
  • It first leverages a pre-trained depth-aware 2D diffusion model for view-conditional image generation and multi-view texture fusion, then refines the coarse result with specialized UV Inpainting and UVHD diffusion models.
  • Because the generated textures carry no baked-in illumination, they can be re-lit or re-edited in modern graphics pipelines, advancing the state of the art in 3D object texturing.

Efficient Motion Latent-Based Diffusion Model for Motion Generation

This paper presents an efficient motion latent-based diffusion (MLD) model for motion generation, extending latent diffusion, previously successful in image generation, to motion sequences. The primary innovation is a lower-dimensional motion latent space with higher semantic information density, which speeds up model convergence and reduces the computational cost of generating motion sequences.

The motivation behind this research is the inherent complexity in applying diffusion models to motion data due to the requirement for domain-specific knowledge and careful architecture design. Previous approaches, such as MDM, conducted the diffusion process on raw motion sequences, which are often noisy and lack physical plausibility, necessitating additional constraint priors. In contrast, the MLD model leverages a motion VAE that incorporates these constraints implicitly, mapping a latent code to plausible motion sequences. Consequently, the diffusion process operates in a more efficient motion latent space.
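Conceptually, MLD compresses a motion sequence into a low-dimensional latent code with a motion VAE and then runs the denoising diffusion process on that code rather than on raw frames. The snippet below is a schematic sketch of that idea using standard DDPM-style updates; the tiny MLP denoiser, latent size, and noise schedule are illustrative assumptions (text conditioning is omitted), not the paper's architecture.

```python
# Schematic sketch: diffusion in a VAE latent space instead of on raw motion frames.
# The denoiser, latent size, and schedule are illustrative, not MLD's actual design.
import torch
import torch.nn as nn

LATENT_DIM, T = 256, 100  # assumed latent size and number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product of (1 - beta_t)

denoiser = nn.Sequential(                        # toy epsilon-predictor on latents
    nn.Linear(LATENT_DIM + 1, 512), nn.SiLU(), nn.Linear(512, LATENT_DIM))

def q_sample(z0, t, noise):
    """Forward process: corrupt a clean latent z0 to diffusion step t."""
    a = alphas_bar[t].sqrt().unsqueeze(-1)
    s = (1 - alphas_bar[t]).sqrt().unsqueeze(-1)
    return a * z0 + s * noise

def train_step(z0):
    """One denoising-objective step on latents produced by the (frozen) motion VAE."""
    t = torch.randint(0, T, (z0.shape[0],))
    noise = torch.randn_like(z0)
    zt = q_sample(z0, t, noise)
    t_emb = (t.float() / T).unsqueeze(-1)
    pred = denoiser(torch.cat([zt, t_emb], dim=-1))
    return nn.functional.mse_loss(pred, noise)

@torch.no_grad()
def sample(batch=4):
    """Reverse process: start from Gaussian noise and iteratively denoise the latent."""
    z = torch.randn(batch, LATENT_DIM)
    for t in reversed(range(T)):
        t_emb = torch.full((batch, 1), t / T)
        eps = denoiser(torch.cat([z, t_emb], dim=-1))
        alpha_t, abar_t = 1 - betas[t], alphas_bar[t]
        z = (z - betas[t] / (1 - abar_t).sqrt() * eps) / alpha_t.sqrt()
        if t > 0:
            z = z + betas[t].sqrt() * torch.randn_like(z)
    return z  # decode with the motion VAE decoder to obtain a motion sequence

loss = train_step(torch.randn(8, LATENT_DIM))    # stand-in for VAE-encoded motions
z_gen = sample()
print(loss.item(), z_gen.shape)
```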

Computational Efficiency and Performance Metrics

One of the key strengths of the MLD model is its computational efficiency. The diffusion-scheduler ablation study reports a significant reduction in inference time and floating-point operations (FLOPs) compared to MDM:

  • To generate 2,048 motion clips, MLD required 16.38 seconds of total inference time (with 100 diffusion steps), compared to 456.70 seconds for MDM.
  • FLOPs were also far lower: 33.12G for MLD versus 1195.94G for MDM under the same conditions.
  • Fidelity also favors MLD, with an FID of 0.426 versus 5.990 for MDM at 100 steps.

These results, roughly a 28x reduction in inference time and a 36x reduction in FLOPs, underscore MLD's efficiency and the practicality of running the diffusion process in a latent space for motion generation tasks.

Latent Space and Network Architecture

The authors emphasize latent space visualization as a way to understand the properties of the generated motions. A t-SNE projection of the latent codes illustrates the high semantic density of MLD's latent space and shows how the codes evolve over the diffusion process. This dense representation accelerates convergence and improves the model's efficiency.
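Inspecting latent codes with t-SNE follows a simple, generic recipe; the sketch below uses randomly generated vectors as stand-ins for latents collected at different diffusion steps and assumes scikit-learn and matplotlib. It is not the authors' plotting code.

```python
# Generic recipe for visualizing diffusion-step latents with t-SNE.
# Random vectors stand in for real latent codes; not the paper's plotting code.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
steps = [0, 25, 50, 75, 99]                      # assumed diffusion steps to inspect
codes, labels = [], []
for i, t in enumerate(steps):
    # Stand-in data: latents drift toward a data-like cluster as t decreases.
    codes.append(rng.normal(loc=i, scale=1.0, size=(200, 256)))
    labels += [t] * 200

emb = TSNE(n_components=2, perplexity=30, init="pca",
           random_state=0).fit_transform(np.concatenate(codes))

sc = plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="viridis", s=5)
plt.colorbar(sc, label="diffusion step")
plt.title("t-SNE of latent codes across diffusion steps (illustrative)")
plt.show()
```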

The motion VAE, a sequential transformer-based architecture, plays a critical role: it injects temporal information through positional encoding and decodes latent codes back into motion sequences. This explicit encoding and decoding mechanism helps the generated sequences maintain temporal coherence and physical plausibility.
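A minimal transformer-based sequence VAE along these lines might look like the sketch below; the dimensions, layer counts, and learned pooling token are assumptions for illustration rather than the paper's exact architecture.

```python
# Minimal transformer sequence-VAE sketch (illustrative dimensions, not MLD's exact design).
import math
import torch
import torch.nn as nn

class SinusoidalPE(nn.Module):
    """Standard sinusoidal positional encoding added to frame embeddings."""
    def __init__(self, d_model, max_len=512):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):                       # x: (B, T, d_model)
        return x + self.pe[: x.shape[1]]

class MotionVAE(nn.Module):
    def __init__(self, feat_dim=263, d_model=256, latent_dim=256, max_len=196):
        super().__init__()
        self.in_proj = nn.Linear(feat_dim, d_model)
        self.pos_enc = SinusoidalPE(d_model, max_len + 1)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.decoder = nn.TransformerEncoder(layer, num_layers=4)
        self.to_mu = nn.Linear(d_model, latent_dim)
        self.to_logvar = nn.Linear(d_model, latent_dim)
        self.from_z = nn.Linear(latent_dim, d_model)
        self.out_proj = nn.Linear(d_model, feat_dim)
        self.query = nn.Parameter(torch.randn(1, 1, d_model))  # learned pooling token

    def encode(self, motion):                   # motion: (B, T, feat_dim)
        h = self.pos_enc(self.in_proj(motion))
        tok = self.query.expand(h.shape[0], -1, -1)
        h = self.encoder(torch.cat([tok, h], dim=1))[:, 0]     # pooled summary token
        return self.to_mu(h), self.to_logvar(h)

    def decode(self, z, length):
        # Broadcast the latent across time, add positions, and decode frame features.
        h = self.from_z(z).unsqueeze(1).expand(-1, length, -1)
        return self.out_proj(self.decoder(self.pos_enc(h)))

    def forward(self, motion):
        mu, logvar = self.encode(motion)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        return self.decode(z, motion.shape[1]), mu, logvar

vae = MotionVAE()
recon, mu, logvar = vae(torch.randn(2, 60, 263))
print(recon.shape, mu.shape)                    # (2, 60, 263) (2, 256)
```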

Comparison with Existing Models

While MLD excels in computational performance, its evaluation against MotionDiffuse, raised in the reviewer discussion, reveals trade-offs: MLD achieves a better FID score but lags in R-Precision and multimodality metrics. The authors acknowledge these differences and argue that the gains in inference time may justify the trade-off in some use cases.

Implications and Future Directions

The proposed MLD model has significant implications for the field of motion generation, offering a pathway to more computationally efficient and semantically rich motion generation systems. The reliance on motion VAE's latent space for the diffusion process paves the way for future research to explore even more sophisticated latent variable models and their applications in generating long-range, realistic motion sequences.

Future research could extend this work by:

  • Increasing the amount and diversity of motion training data, addressing the current limitation posed by smaller datasets like HumanML3D.
  • Exploring the integration of more complex constraints and priors into the motion VAE to enhance the physical realism and diversity of the generated motions.
  • Benchmarking against a broader array of state-of-the-art motion generation models to comprehensively assess performance across various metrics.

In conclusion, the motion latent-based diffusion model introduces an efficient alternative for motion generation, demonstrating significant gains in computational efficiency while producing high-quality motion sequences. This paper lays a robust foundation for subsequent advances in generating physically plausible and semantically meaningful motion data.
