- The paper introduces a two-stage process leveraging video diffusion models to generate multi-view consistent, high-resolution images for improved 3D reconstruction.
- It integrates camera pose conditioning and depth-aware refinement to enhance texture details and geometric accuracy.
- Experimental results on the GSO dataset show significant improvements in PSNR, SSIM, and Chamfer Distance over state-of-the-art methods.
Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models
In the paper titled "Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models," the authors tackle a persistent challenge in image-to-3D generation: existing methods struggle to maintain multi-view consistency and to produce high-resolution textures. Their work introduces a novel paradigm that leverages video diffusion models for high-fidelity 3D mesh reconstruction from single-view images.
Methodology
The core idea of Hi3D revolves around a two-stage process that integrates video diffusion models to improve both the resolution and consistency of generated multi-view images, ultimately leading to better 3D reconstructions.
- Stage-1: Basic Multi-View Generation
- The first stage transforms a single image into a sequence of low-resolution, 3D-aware multi-view images using a pre-trained video diffusion model. Unlike traditional 2D diffusion models, which lack 3D contextual understanding, video diffusion models are trained on sequential frames and therefore capture temporal consistency; Hi3D exploits this by treating an orbital sequence of views around the object as a video, so that temporal consistency translates into geometric consistency across views.
- This stage introduces camera pose as an additional condition to the pre-trained video diffusion model (a conditioning sketch follows this list). The model is fine-tuned on a dataset of multi-view images rendered from 3D assets, enabling it to generate geometrically consistent sequences from single images.
- Stage-2: 3D-Aware Multi-View Refinement
- The second stage enhances the resolution and detail of the images produced in the first stage. This is achieved with a 3D-aware video-to-video refiner that takes depth maps as additional input conditions; the depth information acts as a crucial 3D cue, refining geometry and texture details across views (a depth-conditioning sketch also follows this list).
- The multi-view images are scaled up to a high resolution of 1,024 × 1,024 pixels, ensuring rich detail and consistent visual quality suitable for high-fidelity 3D reconstruction.
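The paper does not detail the Stage-1 conditioning mechanism at code level, so the following is a minimal PyTorch sketch under a common assumption: per-frame camera parameters (elevation, azimuth, distance) are embedded by a small MLP and added to each frame's diffusion timestep embedding. `CameraPoseEmbedding` and the commented-out UNet hookup are illustrative placeholders, not Hi3D's actual implementation.

```python
import torch
import torch.nn as nn

class CameraPoseEmbedding(nn.Module):
    """Hypothetical module: embed per-frame camera poses (elevation,
    azimuth, distance) so they can be added to each video frame's
    diffusion timestep embedding."""

    def __init__(self, embed_dim: int = 1024):
        super().__init__()
        # 4 inputs: sin/cos of azimuth (periodic), elevation, distance.
        self.mlp = nn.Sequential(
            nn.Linear(4, embed_dim),
            nn.SiLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, elevation, azimuth, distance):
        feats = torch.stack(
            [torch.sin(azimuth), torch.cos(azimuth), elevation, distance],
            dim=-1,
        )
        return self.mlp(feats)  # (num_frames, embed_dim)

# One full orbit of 16 views at fixed elevation and camera distance.
pose_embed = CameraPoseEmbedding(embed_dim=1024)
azimuths = torch.linspace(0.0, 2.0 * torch.pi, steps=16)
cond = pose_embed(torch.zeros(16), azimuths, torch.full((16,), 1.5))
print(cond.shape)  # torch.Size([16, 1024])
# Hypothetical hookup: t_emb = unet.time_embedding(t) + cond  (per frame)
```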
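The Stage-2 refiner's depth conditioning is likewise not specified at code level; a common pattern is to encode the depth map and concatenate it channel-wise with the noisy latents (ControlNet-style injection is another option). The sketch below assumes input concatenation, with a single convolution standing in for the full denoising network; both modules are hypothetical.

```python
import torch
import torch.nn as nn

class DepthConditionedRefiner(nn.Module):
    """Hypothetical sketch of a 3D-aware video-to-video refiner: each
    frame's depth map is encoded to latent resolution and concatenated
    channel-wise with the noisy latents before denoising."""

    def __init__(self, latent_channels: int = 4, depth_channels: int = 4):
        super().__init__()
        # Small encoder mapping a 1-channel 1024px depth map to 128px latents.
        self.depth_encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, depth_channels, 3, stride=4, padding=1),
        )
        # Stand-in for the denoising UNet; note the widened input channels.
        self.denoiser = nn.Conv2d(
            latent_channels + depth_channels, latent_channels, 3, padding=1
        )

    def forward(self, noisy_latents, depth_maps):
        # noisy_latents: (frames, C, 128, 128); depth_maps: (frames, 1, 1024, 1024)
        depth_latents = self.depth_encoder(depth_maps)
        x = torch.cat([noisy_latents, depth_latents], dim=1)
        return self.denoiser(x)

refiner = DepthConditionedRefiner()
latents = torch.randn(16, 4, 128, 128)   # 16 frames, 1024px -> 128px latents
depths = torch.rand(16, 1, 1024, 1024)   # depth maps as 3D cues
print(refiner(latents, depths).shape)    # torch.Size([16, 4, 128, 128])
```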
3D Reconstruction Pipeline
After generating high-resolution multi-view images, the reconstruction of the 3D mesh is executed using the following steps:
- 3D Gaussian Splatting (3DGS): To address the challenge of reconstructing high-quality meshes from sparse views, Hi3D integrates 3DGS. This technique represents a scene through a set of 3D Gaussian primitives, facilitating real-time rendering and efficient optimization.
- Dense View Augmentation: The 3DGS model allows the generation of additional views at interpolation points, providing a dense set of multi-view images.
- SDF-Based Reconstruction: Using the augmented dense views, a Signed Distance Function (SDF) approach extracts the final high-quality 3D meshes. Both the view interpolation and the mesh extraction are sketched below.
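To make the last two steps concrete, here is a sketch of dense view augmentation and SDF mesh extraction, assuming spherical linear interpolation (slerp) between sparse key cameras. The 3DGS rasterization call is omitted, `interpolate_poses` and `extract_mesh` are illustrative helpers, and the paper's SDF reconstruction is a learned optimization rather than the toy analytic SDF in the demo; marching cubes (here via scikit-image) is the standard final step for turning an SDF grid into a triangle mesh.

```python
import numpy as np
from scipy.spatial.transform import Rotation, Slerp
from skimage import measure

def interpolate_poses(key_rots: Rotation, key_pos: np.ndarray, n_dense: int):
    """Densify sparse camera poses: slerp the rotations and linearly
    interpolate the positions (a simplification; a true orbit would
    interpolate along the circle rather than its chords)."""
    key_times = np.linspace(0.0, 1.0, len(key_rots))
    dense_times = np.linspace(0.0, 1.0, n_dense)
    dense_rots = Slerp(key_times, key_rots)(dense_times)
    dense_pos = np.stack(
        [np.interp(dense_times, key_times, key_pos[:, i]) for i in range(3)],
        axis=1,
    )
    return dense_rots, dense_pos  # render the fitted 3DGS at these poses

def extract_mesh(sdf_grid: np.ndarray, voxel_size: float):
    """Extract the zero level set of a dense SDF grid as a triangle mesh."""
    verts, faces, _, _ = measure.marching_cubes(sdf_grid, level=0.0)
    return verts * voxel_size, faces

# Toy demo: densify 4 orbit cameras to 60, then mesh a sphere SDF.
key_rots = Rotation.from_euler("y", [0, 90, 180, 270], degrees=True)
key_pos = np.array([[1.5, 0, 0], [0, 0, 1.5], [-1.5, 0, 0], [0, 0, -1.5]])
rots, pos = interpolate_poses(key_rots, key_pos, n_dense=60)

grid = np.linspace(-1.0, 1.0, 64)
x, y, z = np.meshgrid(grid, grid, grid, indexing="ij")
sdf = np.sqrt(x**2 + y**2 + z**2) - 0.5  # analytic SDF of a sphere
verts, faces = extract_mesh(sdf, voxel_size=2.0 / 63)
print(len(rots), verts.shape, faces.shape)
```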
Experimental Results
The authors validate Hi3D on two primary tasks, novel view synthesis and single-view reconstruction, using the Google Scanned Objects (GSO) dataset for quantitative evaluation.
- Novel View Synthesis:
- Hi3D generates high-resolution, multi-view consistent images, outperforming state-of-the-art methods such as EpiDiff and SyncDreamer on PSNR (24.26), SSIM (0.864), and LPIPS (0.119).
- Single View Reconstruction:
- Hi3D excels at reconstructing 3D meshes with rich detail and geometric accuracy, as reflected in its Chamfer Distance (0.0172) and Volume IoU (0.6631) scores; a sketch of both metrics follows.
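Since Chamfer Distance conventions vary (squared vs. unsquared distances, sum vs. mean of the two directions), the following is a generic NumPy/SciPy sketch of how the two reconstruction metrics are commonly computed; the paper's exact evaluation protocol (sample counts, normalization) may differ.

```python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer Distance between point sets of shape (N, 3) and
    (M, 3): mean nearest-neighbor distance in both directions, summed."""
    d_ab, _ = cKDTree(points_b).query(points_a)  # a -> b
    d_ba, _ = cKDTree(points_a).query(points_b)  # b -> a
    return d_ab.mean() + d_ba.mean()

def volume_iou(occ_a, occ_b):
    """Volume IoU between two boolean occupancy grids of the same shape."""
    inter = np.logical_and(occ_a, occ_b).sum()
    union = np.logical_or(occ_a, occ_b).sum()
    return inter / union

# Sanity check: identical inputs give Chamfer Distance 0 and IoU 1.
pts = np.random.rand(2048, 3)
print(chamfer_distance(pts, pts))  # 0.0
occ = np.random.rand(32, 32, 32) > 0.5
print(volume_iou(occ, occ))        # 1.0
```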
Implications and Future Directions
The introduction of Hi3D marks a substantial advance in image-to-3D generation. By harnessing the 3D-aware capabilities of video diffusion models, Hi3D addresses the limitations of previous methods, such as resolution constraints and multi-view inconsistency, setting a new benchmark for high-fidelity 3D reconstruction from single images.
The multi-stage approach adopted by Hi3D, particularly the use of 3D-aware video-to-video refinement with depth cues, opens new avenues for future research. Potential directions include exploring more sophisticated depth estimation techniques, integrating additional 3D priors, and expanding the applicability of Hi3D to more diverse and complex datasets.
In conclusion, Hi3D provides a compelling solution to the challenges of generating high-resolution, geometrically consistent multi-view images for 3D reconstruction. Its methodological innovations and strong numerical results highlight its effectiveness and pave the way for further advancements in this field.