Progressive Rendering Distillation: Adapting Stable Diffusion for Instant Text-to-Mesh Generation without 3D Data (2503.21694v1)

Published 27 Mar 2025 in cs.GR, cs.AI, and cs.CV

Abstract: It is highly desirable to obtain a model that can generate high-quality 3D meshes from text prompts in just seconds. While recent attempts have adapted pre-trained text-to-image diffusion models, such as Stable Diffusion (SD), into generators of 3D representations (e.g., Triplane), they often suffer from poor quality due to the lack of sufficient high-quality 3D training data. Aiming at overcoming the data shortage, we propose a novel training scheme, termed Progressive Rendering Distillation (PRD), eliminating the need for 3D ground-truths by distilling multi-view diffusion models and adapting SD into a native 3D generator. In each iteration of training, PRD uses the U-Net to progressively denoise the latent from random noise for a few steps, and in each step it decodes the denoised latent into a 3D output. Multi-view diffusion models, including MVDream and RichDreamer, are used jointly with SD to distill text-consistent textures and geometries into the 3D outputs through score distillation. Since PRD supports training without 3D ground-truths, we can easily scale up the training data and improve generation quality for challenging text prompts with creative concepts. Meanwhile, PRD reduces inference of the generation model to just a few steps. With PRD, we train a Triplane generator, namely TriplaneTurbo, which adds only $2.5\%$ trainable parameters to adapt SD for Triplane generation. TriplaneTurbo outperforms previous text-to-3D generators in both efficiency and quality. Specifically, it can produce high-quality 3D meshes in 1.2 seconds and generalize well to challenging text input. The code is available at https://github.com/theEricMa/TriplaneTurbo.

Summary

Progressive Rendering Distillation: Adapting Stable Diffusion for Instant Text-to-Mesh Generation without 3D Data

The paper presents an approach to generating high-quality 3D meshes from text prompts in seconds. The proposed training scheme, Progressive Rendering Distillation (PRD), eliminates the need for 3D ground-truth data: rather than supervising on 3D assets, it distills knowledge from multi-view diffusion models while using Stable Diffusion (SD) as the backbone. Text-to-mesh generation is typically constrained by the scarcity of high-quality 3D training data, which limits the quality of existing solutions. PRD sidesteps this limitation and adapts SD into a native 3D generator through distillation alone.

The paper's central innovation is the PRD training scheme, which requires no 3D ground truth. In each training iteration, the SD U-Net progressively denoises a latent from random noise over a few steps, and at every step the intermediate latent is decoded into a 3D output. Multi-view teacher models, MVDream and RichDreamer, are used alongside SD to supervise renderings of these outputs through score distillation, transferring text-consistent textures and geometry into the generator.
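
The training loop can be pictured with a short sketch. This is not the authors' code: unet, decode_triplane, render_views, teacher_score, and the one-step denoising update are placeholder assumptions standing in for the SD U-Net with its adapter, the triplane decoder, the differentiable renderer, and the MVDream/RichDreamer score-distillation signal described in the paper.

```python
import torch

def prd_training_step(unet, decode_triplane, render_views, teacher_score,
                      text_emb, timesteps, optimizer):
    """One PRD iteration: denoise from pure noise for a few steps, decode each
    intermediate latent into a 3D output, and distill the teachers' scores on
    rendered views back into the generator (no 3D ground truth involved)."""
    latent = torch.randn(1, 4, 64, 64)        # start from random noise
    total_loss = torch.zeros(())
    for t in timesteps:                       # e.g. a few denoising steps
        eps = unet(latent, t, text_emb)       # predicted noise at step t
        latent = latent - eps                 # simplified denoising update
        triplane = decode_triplane(latent)    # decode the latent into a 3D output
        views = render_views(triplane)        # differentiable multi-view rendering
        # Score distillation: the teachers supply a gradient target for the
        # rendered views; (views * grad).sum() has exactly that gradient w.r.t. views.
        grad = teacher_score(views, text_emb, t).detach()
        total_loss = total_loss + (views * grad).sum()
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return float(total_loss)
```

Supervising every denoising step, rather than only the final one, exposes the generator to partially denoised latents during training, which aligns with the few-step inference the paper reports.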

Another significant contribution is the resulting generator, TriplaneTurbo, which adapts SD into a native Triplane generator by adding only 2.5% trainable parameters. TriplaneTurbo surpasses prior text-to-3D generators in both quality and speed, producing high-fidelity 3D meshes in 1.2 seconds. Because PRD trains without 3D data, the training corpus can be scaled up easily, which improves generation quality on challenging prompts with creative concepts.
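
For intuition about the 1.2-second figure, inference reduces to a handful of U-Net passes followed by a single triplane decode and mesh extraction. The sketch below reuses the hypothetical interfaces from the training sketch, plus an assumed extract_mesh step (e.g. marching cubes); it is not the released TriplaneTurbo API.

```python
import torch

@torch.no_grad()
def generate_mesh(unet, decode_triplane, extract_mesh, text_emb, timesteps):
    """Few-step generation: only len(timesteps) U-Net evaluations are needed,
    which keeps text-to-mesh generation at the second scale."""
    latent = torch.randn(1, 4, 64, 64)               # start from random noise
    for t in timesteps:
        latent = latent - unet(latent, t, text_emb)  # simplified denoising update
    triplane = decode_triplane(latent)               # latent -> triplane representation
    return extract_mesh(triplane)                    # e.g. marching cubes over the geometry
```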

A noteworthy aspect of the implementation is the Parameter-Efficient Triplane Adapter (PETA), the module that converts the 2D diffusion backbone into a Triplane generator. By adding only a small set of trainable parameters, PETA adapts the network for 3D generation while retaining the capabilities of the pre-trained SD model.
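
The paper's exact adapter design is not reproduced here, but the general recipe of parameter-efficient adaptation can be illustrated with a generic low-rank adapter: freeze the pre-trained weights and learn a small low-rank correction. The class below is purely illustrative and is not PETA itself.

```python
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Wraps a frozen linear layer with a small trainable low-rank branch, so
    only a few percent of the parameters need to be trained."""
    def __init__(self, base: nn.Linear, rank: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False               # keep the pre-trained weights frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)            # adapter starts as a no-op

    def forward(self, x):
        return self.base(x) + self.up(self.down(x))

# A frozen 320->320 projection gains only 2 * 320 * 8 trainable weights,
# a small fraction of the original 320 * 320 (+ bias) parameters.
layer = LowRankAdapter(nn.Linear(320, 320), rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.1%}")
```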

By decoupling 3D generator training from 3D data requirements, the research addresses a fundamental challenge in AI-driven 3D model creation. Practically, this shortens the path to training such generators and broadens their potential applications. Future developments might extend the adaptability of SD-based architectures to broader domains such as 3D scene generation.

Despite the strong results, the paper acknowledges limitations, notably in generating multiple distinct objects within a single prompt and in capturing fine detail for complex characters. Improvements in the multi-view teacher models could address these weaknesses within the same PRD framework.

In conclusion, this work presents a methodology that departs from the data-dependent conventions of text-to-3D mesh generation. By adapting Stable Diffusion into a 3D generator through distillation rather than 3D supervision, it enables scalable, high-quality 3D content creation with minimal data requirements, and it suggests a template for future systems that adapt 2D foundation models to other generative settings.
