- The paper introduces a fast method for converting a single 2D image into a detailed 3D textured mesh by ensuring consistent multi-view generation.
- It integrates fine-tuning of 2D diffusion models with a two-stage 3D diffusion process to enhance geometric fidelity and texture quality.
- Experiments on GSO and Objaverse datasets show significant improvements in F-Score, CLIP similarity, and user preference over existing methods.
An Analysis of "One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion"
The paper "One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion" presents an innovative approach for converting a single 2D image into a detailed 3D textured mesh within approximately one minute. This research addresses the challenges faced by existing image-to-3D methods, which often struggle to balance generation speed with fidelity to the input image.
Methodology
Consistent Multi-View Generation:
The authors fine-tune a 2D diffusion model for multi-view image generation, enforcing consistency across views. By tiling six views into a single image, the model denoises all views jointly, producing cohesive multi-view images that strengthen the subsequent 3D reconstruction. Camera poses are defined by fixed absolute elevation angles combined with azimuth angles relative to the input view, which resolves orientation ambiguity without requiring a separate elevation-estimation step.
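To make the tiling and pose convention concrete, here is a minimal NumPy sketch. The 3x2 grid layout, camera radius, and the specific elevation and azimuth values are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def tile_six_views(views):
    """Tile six HxWx3 view images into one 3x2 grid image so a 2D
    diffusion model can denoise all views jointly."""
    assert len(views) == 6
    rows = [np.concatenate(views[i:i + 2], axis=1) for i in range(0, 6, 2)]
    return np.concatenate(rows, axis=0)

def camera_position(elevation_deg, azimuth_deg, radius=1.5):
    """Camera position on a sphere, from a fixed absolute elevation and
    an azimuth defined relative to the (unknown) input-view azimuth."""
    el, az = np.deg2rad(elevation_deg), np.deg2rad(azimuth_deg)
    return radius * np.array([np.cos(el) * np.cos(az),
                              np.cos(el) * np.sin(az),
                              np.sin(el)])

# Illustrative pose set: alternating elevations with relative azimuths
# spaced 60 degrees apart (assumed values, not the paper's).
poses = [camera_position(e, a)
         for e, a in zip([30, -20] * 3, range(0, 360, 60))]
```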
3D Diffusion Process:
The research leverages a two-stage 3D diffusion model. A coarse-to-fine strategy first generates a low-resolution 3D occupancy volume, then a high-resolution sparse volume of signed distance function (SDF) values and colors restricted to the occupied regions. The model is conditioned on the generated multi-view images, which guide the lifting of 2D representations into 3D. This multi-view-conditioned 3D diffusion network significantly improves robustness and generalization over previous generalizable NeRF-based methods.
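The sketch below outlines one way such a coarse-to-fine, multi-view-conditioned pipeline could be structured in PyTorch. The `coarse_net` and `fine_net` denoiser interfaces, step count, and resolutions are hypothetical placeholders, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def coarse_to_fine_3d_diffusion(mv_cond, coarse_net, fine_net,
                                steps=50, low_res=64, high_res=256):
    """Stage 1: denoise a dense low-resolution occupancy volume.
    Stage 2: denoise SDF + RGB only on voxels the coarse stage marked
    occupied, i.e. a sparse high-resolution volume."""
    # Stage 1: dense occupancy logits at low resolution.
    occ = torch.randn(1, 1, low_res, low_res, low_res)
    for t in reversed(range(steps)):
        occ = coarse_net(occ, t, cond=mv_cond)   # hypothetical denoiser
    occupied = occ.sigmoid() > 0.5

    # Upsample the occupancy mask to select sparse high-res voxels.
    mask = F.interpolate(occupied.float(), size=(high_res,) * 3,
                         mode="nearest").bool()
    idx = mask[0, 0].nonzero()                   # (N, 3) occupied coords

    # Stage 2: per-voxel features on sparse sites: [sdf, r, g, b].
    feats = torch.randn(idx.shape[0], 4)
    for t in reversed(range(steps)):
        feats = fine_net(feats, idx, t, cond=mv_cond)
    return idx, feats  # extract the mesh via marching cubes on the SDF
```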
Texture Refinement:
Final texture refinement is performed with a lightweight optimization that uses the generated multi-view images as supervision, improving texture quality at low computational cost.
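A minimal sketch of such a photometric refinement loop, assuming a hypothetical differentiable renderer `render_fn` over the fixed geometry and a tensor of optimizable color parameters:

```python
import torch

def refine_texture(render_fn, target_views, color_params,
                   iters=300, lr=1e-2):
    """Optimize color parameters (e.g., per-vertex colors or a small
    color field) so differentiable renders match the generated
    multi-view images; the geometry stays fixed."""
    params = color_params.clone().requires_grad_(True)
    opt = torch.optim.Adam([params], lr=lr)
    for _ in range(iters):
        loss = sum((render_fn(params, i) - tgt).abs().mean()  # L1 photo loss
                   for i, tgt in enumerate(target_views))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return params.detach()
```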
Experimental Results
The authors conduct comprehensive evaluations on the GSO and Objaverse datasets, demonstrating gains in both geometric fidelity and visual quality. One-2-3-45++ achieves higher F-Score and CLIP similarity than both optimization-based and feed-forward baselines. User study results further confirm its advantage, with notable improvements in preference scores.
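For reference, the F-Score compares predicted and ground-truth surface point clouds at a distance threshold: precision is the fraction of predicted points within the threshold of the ground truth, and recall is the symmetric quantity. A minimal brute-force sketch (the threshold value is illustrative):

```python
import numpy as np

def f_score(pred_pts, gt_pts, tau=0.05):
    """F-Score between (N,3) and (M,3) point clouds at threshold tau.
    O(N*M) pairwise distances, kept simple for clarity."""
    d = np.linalg.norm(pred_pts[:, None, :] - gt_pts[None, :, :], axis=-1)
    precision = (d.min(axis=1) < tau).mean()
    recall = (d.min(axis=0) < tau).mean()
    return 2 * precision * recall / (precision + recall + 1e-8)
```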
Implications and Future Directions
The results suggest practical implications for game development and virtual reality, where rapid and precise conversion from 2D to 3D is invaluable. The method offers potential for broader applicability, such as augmented reality and robotics, where real-time 3D generation can enhance interaction with dynamic environments.
Looking forward, integrating more comprehensive guiding conditions from 2D diffusion models could further improve geometry robustness and detail. Exploring additional domain-specific priors and conditions could make the model applicable to a wider range of applications, particularly those requiring higher levels of detail and accuracy.
In conclusion, One-2-3-45++ marks a significant advancement in 3D generation from a single image and sets the stage for further exploration into efficient, high-fidelity 3D content creation. The paper's approach of harnessing multi-view consistency and 3D diffusion models presents a promising direction for addressing current limitations in image-based 3D reconstruction.