- The paper introduces VFusion3D, a novel approach repurposing video diffusion models to generate large-scale, high-quality 3D data.
- It fine-tunes an EMU Video model into a multi-view video generator, producing 2.7 million synthetic multi-view videos that are then used, together with a multi-stage training recipe, to build a feed-forward 3D model that improves both speed and quality.
- Evaluation demonstrates over 70% user preference and rapid single-image 3D reconstruction, highlighting its strong practical potential.
VFusion3D: A Scalable Approach for 3D Generative Modeling Utilizing Pre-trained Video Diffusion Models
Introduction
Recent advances in 3D datasets and neural rendering have opened new horizons in computer vision and graphics, spurring interest in foundation 3D generative models capable of producing high-quality 3D assets. However, the scarcity of accessible 3D data has been a significant bottleneck, limiting the achievable scale compared with the vast data available in other domains. This paper introduces VFusion3D, a novel methodology that leverages pre-trained video diffusion models to generate large-scale synthetic multi-view datasets. This approach enables the training of feed-forward 3D generative models that surpass current state-of-the-art (SOTA) models in both speed and quality, as evidenced by a user preference rate of over 70%.
The Challenge of Scarcity in 3D Data
Collecting and accessing high-quality 3D data is inherently challenging, with limitations in both the quantity and quality of available assets. The largest public repositories offer a finite collection that often includes duplicates and assets lacking texture or detail. This stark contrast with the data-rich environments of other foundation models has hindered progress toward scalable and efficient 3D generative models.
VFusion3D: Methodology
VFusion3D addresses the data scarcity challenge by repurposing an EMU Video diffusion model, originally trained on a diverse corpus of texts, images, and videos, as a 3D data generator. This is accomplished by:
- Fine-tuning EMU Video on rendered multi-view videos from an internal dataset of 100K artist-created 3D objects, turning it into a powerful multi-view video generator.
- Generating synthetic multi-view data at scale with the fine-tuned EMU Video model, yielding 2.7 million high-quality, 3D-consistent multi-view videos (see the generation-loop sketch below).
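Conceptually, the data-generation stage is a loop over prompt images: the fine-tuned video model renders an orbit of views around each object, and only sufficiently consistent clips are kept. The sketch below is a minimal, hypothetical illustration; `sample_orbit_video` and the consistency filter are stand-ins for the paper's internal pipeline, not released code.

```python
# Hypothetical sketch of the large-scale multi-view data generation loop.
# The video model's API and the quality filter are assumptions standing in
# for the internal fine-tuned EMU Video pipeline described in the text.
from pathlib import Path

import torch


def passes_consistency_filter(frames: torch.Tensor) -> bool:
    """Placeholder quality gate; a real pipeline would score 3D consistency."""
    return bool(torch.isfinite(frames).all().item())


def generate_multiview_dataset(prompt_images, video_model, out_dir, n_views=16):
    """Render an orbit of views per prompt image and keep consistent clips."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    kept = 0
    for i, image in enumerate(prompt_images):
        with torch.no_grad():
            # One frame per camera pose on a fixed orbit around the object
            # (hypothetical API of the fine-tuned video diffusion model).
            frames = video_model.sample_orbit_video(image, num_frames=n_views)
        if passes_consistency_filter(frames):
            torch.save(frames, out_dir / f"object_{i:07d}.pt")
            kept += 1
    return kept
```

Running such a loop over millions of prompt images, with the filter discarding low-quality clips, is what yields the 2.7 million multi-view videos used for training.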
Training VFusion3D
VFusion3D adopts the architecture of the Large Reconstruction Model (LRM) and is trained on the generated synthetic dataset. A series of improved training strategies, including a multi-stage training schedule, image-level supervision, and camera noise injection, adapts the LRM to training on synthetic multi-view data. Additionally, fine-tuning VFusion3D on a subset of the original 3D dataset further improves its performance.
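Two of these strategies are easy to picture: camera noise injection perturbs the supervision camera poses so the reconstructor does not overfit to the slightly imperfect geometry of synthetic views, and image-level supervision compares rendered views against the synthetic target frames. The sketch below assumes a generic LRM-style reconstructor; the `reconstructor` interface and the plain L1 term are illustrative assumptions, not the paper's exact loss mix.

```python
# Minimal sketch of camera noise injection and image-level supervision for an
# LRM-style reconstructor; the model interface is an assumption, not the
# paper's actual training code.
import torch
import torch.nn.functional as F


def add_camera_noise(cameras: torch.Tensor, std: float = 0.01) -> torch.Tensor:
    """Perturb camera parameters (e.g. flattened extrinsics) with Gaussian
    noise so the model tolerates slightly inconsistent synthetic viewpoints."""
    return cameras + std * torch.randn_like(cameras)


def training_step(reconstructor, input_image, target_views, target_cameras):
    # Jitter the cameras used to render the supervision views.
    noisy_cameras = add_camera_noise(target_cameras)
    # Feed-forward prediction of a 3D representation (e.g. a triplane) from a
    # single input image, then differentiable rendering at the noisy cameras.
    triplane = reconstructor(input_image)
    rendered = reconstructor.render_views(triplane, noisy_cameras)
    # Image-level supervision: compare renders to the synthetic multi-view
    # frames. A perceptual term (e.g. LPIPS) would typically be added here.
    return F.l1_loss(rendered, target_views)
```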
Evaluation and Results
VFusion3D outperforms several distillation-based and feed-forward 3D generative models in both user studies and automated metrics. Notably, it reconstructs a 3D model from a single image rapidly and accurately.
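The speed follows from the feed-forward design: single-image inference is one forward pass plus rendering, with no per-asset optimization as in distillation-based methods. The snippet below is a hypothetical usage sketch; the class and method names are assumptions rather than the released API.

```python
# Hypothetical single-image inference sketch for a feed-forward LRM-style model.
import torch


@torch.no_grad()
def image_to_3d(model, image: torch.Tensor, cameras: torch.Tensor):
    """One forward pass predicts a 3D representation; rendering novel views
    needs no per-asset optimization, which is why inference is fast."""
    triplane = model(image.unsqueeze(0))           # single image -> triplane
    views = model.render_views(triplane, cameras)  # novel views at given poses
    return triplane, views
```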
Implications and Future Directions
This paper showcases a promising avenue toward solving the foundational challenge of data scarcity in the field of 3D generative models. VFusion3D not only achieves impressive results but also opens up the possibility of scaling 3D model training to unprecedented levels. Future research could explore the integration of advances in video diffusion models, the expansion of 3D data availability, and the refinement of 3D feed-forward generative architectures. The scalability of VFusion3D indicates a significant step forward in our journey towards foundation 3D generative models, setting the stage for a wide array of applications in computer graphics, virtual reality, and beyond.
Analysis and Discussion
The paper provides a thorough analysis comparing training on real 3D data versus synthetic multi-view data, illustrating the strengths and limitations of each. The scalability trends observed with VFusion3D reaffirm the potential for synthetic data to play a pivotal role in the advancement of 3D generative models. However, remaining difficulties with certain object categories, such as vehicles, highlight areas for future improvement.
Conclusion
VFusion3D represents a novel and effective strategy to circumvent the limitations posed by the scarcity of 3D data. By leveraging the capabilities of video diffusion models as a multi-view data generator, this approach unlocks new possibilities for the large-scale training of 3D generative models. The promising results and scalability of VFusion3D hint at a bright future for foundation 3D modeling, paving the way for more complex and diverse 3D content creation.