One-2-3-45: Any Single Image to 3D Mesh in 45 Seconds without Per-Shape Optimization

Published 29 Jun 2023 in cs.CV, cs.AI, and cs.RO | (2306.16928v1)

Abstract: Single image 3D reconstruction is an important but challenging task that requires extensive knowledge of our natural world. Many existing methods solve this problem by optimizing a neural radiance field under the guidance of 2D diffusion models but suffer from lengthy optimization time, 3D inconsistency results, and poor geometry. In this work, we propose a novel method that takes a single image of any object as input and generates a full 360-degree 3D textured mesh in a single feed-forward pass. Given a single image, we first use a view-conditioned 2D diffusion model, Zero123, to generate multi-view images for the input view, and then aim to lift them up to 3D space. Since traditional reconstruction methods struggle with inconsistent multi-view predictions, we build our 3D reconstruction module upon an SDF-based generalizable neural surface reconstruction method and propose several critical training strategies to enable the reconstruction of 360-degree meshes. Without costly optimizations, our method reconstructs 3D shapes in significantly less time than existing methods. Moreover, our method favors better geometry, generates more 3D consistent results, and adheres more closely to the input image. We evaluate our approach on both synthetic data and in-the-wild images and demonstrate its superiority in terms of both mesh quality and runtime. In addition, our approach can seamlessly support the text-to-3D task by integrating with off-the-shelf text-to-image diffusion models.

Summary

The paper demonstrates a method that reconstructs 3D textured meshes from a single image in 45 seconds without per-shape optimization.
It combines Zero123, a view-conditioned 2D diffusion model, with an SDF-based generalizable neural surface reconstruction module, so that synthesized multi-view images are lifted to a mesh in a single feed-forward pass.
Experiments on synthetic data and in-the-wild images show better mesh quality and shorter runtime than existing methods, which is relevant to applications in AR/VR and robotics.

Single Image 3D Reconstruction with Zero123 and SDF-Based Techniques

This paper presents One-2-3-45, a method that reconstructs a 3D textured mesh from a single image in about 45 seconds without per-shape optimization. It combines a 2D diffusion model, Zero123, with a custom 3D reconstruction framework. The task is hard because the 3D structure of regions not visible in the input image must be predicted. By producing geometry that is more consistent across views, the work contributes to computer vision, 3D modeling, and immersive technologies such as AR/VR.

Methodology

The approach has three core components: multi-view synthesis, camera pose estimation, and 3D reconstruction. The first step uses Zero123, a fine-tuned version of Stable Diffusion, to generate multi-view images from the single input image, conditioned on relative camera transformations. This step exploits the prior knowledge encoded in 2D diffusion models, which are trained on large-scale internet datasets.
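The idea of conditioning on relative camera transformations can be illustrated with a minimal sketch. It assumes a camera pose is described by elevation, azimuth, and radius on a sphere around the object; the `SphericalPose` type, the function names, and the evenly spaced ring of target views are illustrative choices, not the interface of Zero123 or the paper's actual view layout.

```python
from typing import List, NamedTuple


class SphericalPose(NamedTuple):
    """Camera pose on a sphere around the object (hypothetical container)."""
    elevation_deg: float
    azimuth_deg: float
    radius: float


def relative_transform(src: SphericalPose, dst: SphericalPose) -> SphericalPose:
    """Return the change in (elevation, azimuth, radius) from src to dst.

    The azimuth difference is wrapped into [-180, 180) so that a small
    rotation across the 0/360 boundary stays small.
    """
    d_az = (dst.azimuth_deg - src.azimuth_deg + 180.0) % 360.0 - 180.0
    return SphericalPose(dst.elevation_deg - src.elevation_deg,
                         d_az,
                         dst.radius - src.radius)


def ring_of_targets(src: SphericalPose, n_views: int,
                    elevation_deg: float) -> List[SphericalPose]:
    """Place n_views target cameras evenly in azimuth at a fixed elevation.

    Assumption: an evenly spaced ring is used only for illustration.
    """
    step = 360.0 / n_views
    return [SphericalPose(elevation_deg,
                          (src.azimuth_deg + i * step) % 360.0,
                          src.radius)
            for i in range(n_views)]


if __name__ == "__main__":
    source = SphericalPose(elevation_deg=30.0, azimuth_deg=350.0, radius=1.5)
    for target in ring_of_targets(source, n_views=4, elevation_deg=0.0):
        print(relative_transform(source, target))
```

Each relative transform would be passed, together with the input image, to the view-conditioned diffusion model to request one synthesized view.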

A key innovation lies in the 3D reconstruction component, which avoids per-shape optimization and its long runtimes. Instead, the authors build on SparseNeuS, an SDF-based generalizable neural surface reconstruction method that works from a sparse set of views. With SDF-based rendering and several training strategies designed to enable 360-degree meshes, the paper reports improved mesh quality and geometric consistency.
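To make the notion of SDF-based rendering concrete, here is a minimal sketch of how signed distance values sampled along one camera ray can be turned into volume-rendering weights. It follows the discrete opacity formula from NeuS, on the assumption that the reconstruction module renders in this style; the function names and the sharpness parameter `s` are illustrative, and the summary itself does not give the rendering equations.

```python
import math
from typing import List, Sequence


def logistic(x: float) -> float:
    """Numerically stable logistic sigmoid."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def sdf_to_alpha(sdf_samples: Sequence[float], s: float) -> List[float]:
    """Convert SDF values at consecutive ray samples into per-interval opacity.

    NeuS-style rule: alpha_i = max((Phi(f_i) - Phi(f_{i+1})) / Phi(f_i), 0),
    where Phi(x) = logistic(s * x) and f_i is the SDF at sample i.
    """
    alphas = []
    for f0, f1 in zip(sdf_samples, sdf_samples[1:]):
        c0 = logistic(s * f0)
        c1 = logistic(s * f1)
        # Guard against underflow deep inside the surface.
        alphas.append(max((c0 - c1) / c0, 0.0) if c0 > 0.0 else 0.0)
    return alphas


def render_weights(alphas: Sequence[float]) -> List[float]:
    """Accumulate opacity front to back: w_i = T_i * alpha_i."""
    weights = []
    transmittance = 1.0
    for a in alphas:
        weights.append(transmittance * a)
        transmittance *= 1.0 - a
    return weights


if __name__ == "__main__":
    # A ray that starts outside the surface (positive SDF) and crosses it.
    sdf_along_ray = [0.5, 0.3, 0.1, -0.1, -0.3]
    print(render_weights(sdf_to_alpha(sdf_along_ray, s=20.0)))
```

The weights concentrate on the interval where the SDF changes sign, which is why this kind of rendering ties the rendered color to an explicit surface that can later be extracted as a mesh.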

To address the limitations of existing methods in geometric fidelity and runtime, the authors train the reconstruction network to tolerate imperfect, inconsistent multi-view predictions. The system replaces costly iterative optimization with direct feed-forward inference, which allows rapid mesh reconstruction that adheres more closely to the input image.

The paper also addresses an often-ignored requirement: the camera pose of the input image, which is needed to lift the synthesized views to 3D. Zero123 works in a canonical spherical coordinate frame and takes relative camera transformations, so the elevation of the input view must be known before poses can be assigned to the generated views. The authors estimate this elevation with a heuristic based on synthesized nearby views.
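The role of elevation in a spherical camera frame can be sketched as follows. The code converts spherical coordinates to a camera position and searches candidate elevations for the one that minimizes a caller-supplied error. Everything here is an assumption made for illustration: the z-up axis convention, the coarse-to-fine grid, and the `error_fn` callback (a stand-in for whatever consistency measure is computed from the nearby views) are not taken from the summary.

```python
import math
from typing import Callable, Tuple


def camera_position(elevation_deg: float, azimuth_deg: float,
                    radius: float) -> Tuple[float, float, float]:
    """Spherical to Cartesian, assuming a z-up frame centred on the object."""
    el = math.radians(elevation_deg)
    az = math.radians(azimuth_deg)
    return (radius * math.cos(el) * math.cos(az),
            radius * math.cos(el) * math.sin(az),
            radius * math.sin(el))


def estimate_elevation(error_fn: Callable[[float], float],
                       lo: float = -90.0, hi: float = 90.0,
                       coarse_step: float = 10.0,
                       fine_step: float = 1.0) -> float:
    """Pick the candidate elevation (degrees) with the smallest error.

    error_fn is hypothetical: it should score how well a candidate
    elevation explains the synthesized nearby views (lower is better).
    """
    def best_on_grid(start: float, stop: float, step: float) -> float:
        count = int(round((stop - start) / step))
        candidates = [start + i * step for i in range(count + 1)]
        return min(candidates, key=error_fn)

    coarse = best_on_grid(lo, hi, coarse_step)
    # Refine around the best coarse candidate, staying inside [lo, hi].
    return best_on_grid(max(lo, coarse - coarse_step),
                        min(hi, coarse + coarse_step),
                        fine_step)


if __name__ == "__main__":
    # Toy error with a known minimum at 27 degrees.
    elevation = estimate_elevation(lambda e: abs(e - 27.0))
    print(elevation, camera_position(elevation, 0.0, 1.5))
```

Once an elevation is chosen, the relative transformations used for view synthesis can be converted into absolute camera positions for the reconstruction step.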

Results

The experimental evaluation includes qualitative and quantitative comparisons and shows improvements over competing methods such as Point-E and Shap-E. In terms of F-Score, a measure of geometric accuracy, One-2-3-45 outperforms these baselines on datasets such as Objaverse and Google Scanned Objects, which indicates that it generalizes across diverse object classes. The results also adhere closely to the input image, which reflects the visual priors inherited from the diffusion model.
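For readers unfamiliar with the metric, the following is a minimal sketch of an F-Score between two point sets sampled from the predicted and ground-truth meshes. It uses the common definition (harmonic mean of precision and recall at a distance threshold); the threshold value, the sampling density, and any alignment step used in the paper are not stated in this summary, so the numbers below are toy values.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def nearest_distance(p: Point, cloud: Sequence[Point]) -> float:
    """Distance from p to its nearest neighbour in cloud (brute force)."""
    return min(math.dist(p, q) for q in cloud)


def f_score(pred: Sequence[Point], gt: Sequence[Point],
            threshold: float) -> float:
    """Harmonic mean of precision and recall at the given distance threshold.

    precision: fraction of predicted points within threshold of the ground truth
    recall:    fraction of ground-truth points within threshold of the prediction
    """
    if not pred or not gt:
        return 0.0
    precision = sum(nearest_distance(p, gt) <= threshold for p in pred) / len(pred)
    recall = sum(nearest_distance(q, pred) <= threshold for q in gt) / len(gt)
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    predicted = [(0.0, 0.0, 0.01), (5.0, 0.0, 0.0)]
    print(f_score(predicted, ground_truth, threshold=0.1))
```

The brute-force nearest-neighbour search is quadratic in the number of points; a spatial index such as a k-d tree would be the usual choice for dense samples.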

Implications and Future Work

The proposed technique provides a more efficient pipeline for 3D reconstruction, with potential applications in rapid 3D content creation and in robotics, where quick mapping of the environment is essential. In practice, the method could serve as a building block for applications that need fast scene understanding and manipulation, such as autonomous driving or augmented reality systems.

On a theoretical level, the combination of 2D diffusion models and 3D reconstruction shows how priors learned from large-scale 2D datasets can be transferred to 3D tasks. This work could prompt further research into hybrid models that use multi-modal data to improve the accuracy and applicability of generated 3D models in complex real-world scenarios.

Further work could address the remaining challenges, such as consistent texturing and occlusion handling in more complex scenes. Exploring automated scene understanding could also extend the framework to dynamic environments.

Overall, One-2-3-45 combines careful model design with efficient inference and marks a significant step forward for 3D reconstruction from a single image.
