Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models (2409.07452v1)

Published 11 Sep 2024 in cs.CV and cs.MM

Abstract: Despite having tremendous progress in image-to-3D generation, existing methods still struggle to produce multi-view consistent images with high-resolution textures in detail, especially in the paradigm of 2D diffusion that lacks 3D awareness. In this work, we present High-resolution Image-to-3D model (Hi3D), a new video diffusion based paradigm that redefines a single image to multi-view images as 3D-aware sequential image generation (i.e., orbital video generation). This methodology delves into the underlying temporal consistency knowledge in video diffusion model that generalizes well to geometry consistency across multiple views in 3D generation. Technically, Hi3D first empowers the pre-trained video diffusion model with 3D-aware prior (camera pose condition), yielding multi-view images with low-resolution texture details. A 3D-aware video-to-video refiner is learnt to further scale up the multi-view images with high-resolution texture details. Such high-resolution multi-view images are further augmented with novel views through 3D Gaussian Splatting, which are finally leveraged to obtain high-fidelity meshes via 3D reconstruction. Extensive experiments on both novel view synthesis and single view reconstruction demonstrate that our Hi3D manages to produce superior multi-view consistency images with highly-detailed textures. Source code and data are available at https://github.com/yanghb22-fdu/Hi3D-Official.

Summary

  • The paper introduces a two-stage process leveraging video diffusion models to generate multi-view consistent, high-resolution images for improved 3D reconstruction.
  • It integrates camera pose conditioning and depth-aware refinement to enhance texture details and geometric accuracy.
  • Experimental results on the GSO dataset show significant improvements in PSNR, SSIM, and Chamfer Distance over state-of-the-art methods.

Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models

In the paper titled "Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models," the authors address a persistent challenge in image-to-3D generation: existing methods struggle to maintain multi-view consistency and to produce high-resolution textures. Their work introduces a novel paradigm that leverages video diffusion models for high-fidelity 3D mesh reconstruction from single-view images.

Methodology

The core idea of Hi3D revolves around a two-stage process that integrates video diffusion models to improve both the resolution and consistency of generated multi-view images, ultimately leading to better 3D reconstructions.

  1. Stage-1: Basic Multi-View Generation
    • The first stage transforms a single image into a sequence of low-resolution, 3D-aware multi-view images using a pre-trained video diffusion model. Unlike traditional 2D diffusion models, which lack 3D contextual understanding, video diffusion models are trained on sequential data and therefore inherently capture temporal consistency; Hi3D maps this temporal consistency onto geometric consistency across views in 3D generation.
    • This stage introduces camera pose as an additional condition to the pre-trained video diffusion model. The model is fine-tuned on a dataset of multi-view images rendered from 3D assets, enabling it to generate geometrically consistent sequences from single images (a minimal pose-conditioning sketch follows this list).
  2. Stage-2: 3D-Aware Multi-View Refinement
    • The second stage enhances the resolution and detail of the images produced in the first stage. This is achieved with a 3D-aware video-to-video refiner that takes depth maps as additional input conditions; the depth information acts as a crucial 3D cue, refining geometry and texture details across views (a depth-conditioning sketch also follows this list).
    • The multi-view images are scaled up to a high resolution of 1,024 × 1,024 pixels, ensuring rich detail and consistent visual quality suitable for high-fidelity 3D reconstruction.
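
To make the Stage-1 camera-pose conditioning concrete, here is a minimal PyTorch sketch assuming a per-frame (elevation, azimuth) parameterization fused into the diffusion timestep embedding. The name `PoseConditioner` and the embedding sizes are hypothetical; Hi3D's actual conditioning scheme may differ.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(x: torch.Tensor, dim: int) -> torch.Tensor:
    """Map scalar angles (radians) to a dim-dimensional Fourier embedding."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half, dtype=x.dtype) / half)
    args = x[..., None] * freqs                                    # (..., half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)   # (..., dim)

class PoseConditioner(nn.Module):
    """Embed per-frame (elevation, azimuth) and fuse it with the timestep embedding."""
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.embed_dim = embed_dim
        self.mlp = nn.Sequential(
            nn.Linear(2 * embed_dim, embed_dim),
            nn.SiLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, elevation: torch.Tensor, azimuth: torch.Tensor,
                t_emb: torch.Tensor) -> torch.Tensor:
        # elevation, azimuth: (batch, frames) in radians
        # t_emb: (batch, frames, embed_dim) diffusion timestep embedding
        pose = torch.cat([
            sinusoidal_embedding(elevation, self.embed_dim),
            sinusoidal_embedding(azimuth, self.embed_dim),
        ], dim=-1)
        return t_emb + self.mlp(pose)   # conditioned embedding fed to the U-Net blocks

# 16 orbital frames at a fixed elevation with evenly spaced azimuths.
frames = 16
elev = torch.full((1, frames), math.radians(20.0))
azim = torch.linspace(0.0, 2.0 * math.pi, frames + 1)[:-1].unsqueeze(0)
cond = PoseConditioner()(elev, azim, torch.zeros(1, frames, 256))
print(cond.shape)  # torch.Size([1, 16, 256])
```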

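For Stage-2, one common way to wire a depth condition into a video-to-video refiner is to concatenate per-frame depth maps with the noisy low-resolution latents along the channel axis before the first convolution. The sketch below illustrates that pattern under assumed channel counts and a latent-space refiner; it is not Hi3D's exact architecture.

```python
import torch
import torch.nn as nn

latent_ch, depth_ch, hidden = 4, 1, 320
# First conv of a (hypothetical) refiner U-Net, widened to accept a depth channel.
conv_in = nn.Conv2d(latent_ch + depth_ch, hidden, kernel_size=3, padding=1)

frames, h, w = 16, 128, 128
latents = torch.randn(frames, latent_ch, h, w)   # encoded Stage-1 multi-view frames
depth = torch.rand(frames, depth_ch, h, w)       # estimated depth maps in [0, 1]
features = conv_in(torch.cat([latents, depth], dim=1))
print(features.shape)  # torch.Size([16, 320, 128, 128])
```
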
3D Reconstruction Pipeline

After generating high-resolution multi-view images, the reconstruction of the 3D mesh is executed using the following steps:

  • 3D Gaussian Splatting (3DGS): To address the challenge of reconstructing high-quality meshes from sparse views, Hi3D integrates 3DGS. This technique represents a scene through a set of 3D Gaussian primitives, facilitating real-time rendering and efficient optimization.
  • Dense View Augmentation: The fitted 3DGS model is used to render additional views at interpolated camera poses, yielding a dense set of multi-view images (a pose-interpolation sketch follows this list).
  • SDF-Based Reconstruction: Utilizing the augmented dense views, a Signed Distance Function (SDF) approach is employed for the final extraction of high-quality 3D meshes.
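
To illustrate the dense-view augmentation step, here is a hedged NumPy sketch that interpolates extra look-at camera poses along the same orbit; each pose would then be passed to the fitted 3DGS rasterizer to render an in-between view. The orbit radius, elevation, and OpenGL-style camera convention are illustrative assumptions, not values from the paper.

```python
import numpy as np

def orbit_camera(azimuth_deg: float, elevation_deg: float,
                 radius: float = 2.0) -> np.ndarray:
    """Return a 4x4 camera-to-world matrix on an orbit, looking at the origin."""
    az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
    eye = radius * np.array([np.cos(el) * np.cos(az),
                             np.cos(el) * np.sin(az),
                             np.sin(el)])
    forward = -eye / np.linalg.norm(eye)              # toward the origin
    right = np.cross(forward, np.array([0.0, 0.0, 1.0]))
    right /= np.linalg.norm(right)
    up = np.cross(right, forward)
    c2w = np.eye(4)
    # OpenGL-style: camera looks down its local -z axis.
    c2w[:3, 0], c2w[:3, 1], c2w[:3, 2], c2w[:3, 3] = right, up, -forward, eye
    return c2w

# 16 generated key views -> a dense ring of 64 interpolated azimuths.
dense_azimuths = np.linspace(0.0, 360.0, 64, endpoint=False)
dense_poses = [orbit_camera(a, elevation_deg=20.0) for a in dense_azimuths]
```

As a toy illustration of the final SDF-based extraction, the mesh is the zero level set of a signed distance field, which can be extracted with marching cubes (shown here on an analytic sphere rather than a learned SDF):

```python
import numpy as np
from skimage import measure

# Toy SDF: a sphere of radius 0.5 sampled on a 64^3 grid over [-1, 1]^3.
n = 64
xs = np.linspace(-1.0, 1.0, n)
grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1)
sdf = np.linalg.norm(grid, axis=-1) - 0.5

# Mesh the zero level set of the SDF.
verts, faces, normals, _ = measure.marching_cubes(
    sdf, level=0.0, spacing=(xs[1] - xs[0],) * 3)
print(verts.shape, faces.shape)
```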

Experimental Results

The authors validate Hi3D on two primary tasks, novel view synthesis and single view reconstruction, using the Google Scanned Objects (GSO) dataset for quantitative evaluation.

  • Novel View Synthesis:
    • Hi3D demonstrates superiority in generating high-resolution, multi-view consistent images, achieving a PSNR of 24.26, an SSIM of 0.864, and an LPIPS of 0.119 (lower is better), improving on state-of-the-art methods such as EpiDiff and SyncDreamer.
  • Single View Reconstruction:
    • Hi3D excels in reconstructing 3D meshes with rich detail and geometric accuracy, achieving a Chamfer Distance of 0.0172 (lower is better) and a Volume IoU of 0.6631 (reference implementations of PSNR and Chamfer Distance are sketched below).
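
To make the reported metrics concrete, below are minimal reference implementations of PSNR and a symmetric Chamfer Distance (one common variant; some papers use squared distances), assuming images in [0, 1] and point clouds as (N, 3) arrays. SSIM, LPIPS, and Volume IoU are omitted for brevity; this is generic metric code, not Hi3D's evaluation script.

```python
import numpy as np

def psnr(pred: np.ndarray, gt: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio between two images in [0, max_val]."""
    mse = np.mean((pred - gt) ** 2)
    return float(10.0 * np.log10(max_val ** 2 / mse))

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between two (N, 3) point sets (brute force)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

rng = np.random.default_rng(0)
print(psnr(rng.random((64, 64, 3)), rng.random((64, 64, 3))))
print(chamfer_distance(rng.random((128, 3)), rng.random((128, 3))))
```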

Implications and Future Directions

The introduction of Hi3D marks substantial progress in image-to-3D generation. By harnessing the 3D-aware potential of video diffusion models, Hi3D addresses the limitations of previous methods, such as resolution constraints and multi-view inconsistency, setting a new benchmark for high-fidelity 3D reconstruction from single images.

The multi-stage approach adopted by Hi3D, particularly the use of 3D-aware video-to-video refinement with depth cues, opens new avenues for future research. Potential directions include exploring more sophisticated depth estimation techniques, integrating additional 3D priors, and expanding the applicability of Hi3D to more diverse and complex datasets.

In conclusion, Hi3D provides a compelling solution to the challenges of generating high-resolution, geometrically consistent multi-view images for 3D reconstruction. Its methodological innovations and strong numerical results highlight its effectiveness and pave the way for further advancements in this field.