Human-VDM: Learning Single-Image 3D Human Gaussian Splatting from Video Diffusion Models
This document discusses Human-VDM, a novel framework for generating lifelike 3D human models from a single RGB image using video diffusion models (VDMs). The framework addresses the core challenges of reconstructing 3D humans from single-view images, namely achieving consistent views and high-quality textures, a problem of long-standing interest in computer vision for applications such as filmmaking, gaming, and human-robot interaction.
Overview of Human-VDM
Human-VDM introduces a three-module pipeline (a high-level code sketch follows the list):
- View-Consistent Human Video Diffusion Module: This module takes a single image of a human as input and utilizes a fine-tuned VDM to generate a coherent, view-consistent orbital video of the subject.
- Video Augmentation Module: This video is further processed using super-resolution and frame interpolation techniques to enhance the quality and smoothness of the generated frames.
- 3D Human Gaussian Splatting Module: Finally, this module employs 3D Gaussian splatting, leveraging the high-quality frames from the previous step to produce the final 3D human model with lifelike textures and detailed geometry.
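To make the data flow concrete, here is a minimal orchestration sketch of the three stages. The function names (`generate_orbital_video`, `augment_video`, `fit_gaussian_human`) and their signatures are hypothetical placeholders standing in for the respective modules, not the authors' released code.

```python
# Hypothetical orchestration of the three Human-VDM stages; the function names
# below are illustrative placeholders, not the authors' released code.
from typing import List
import numpy as np


def generate_orbital_video(image: np.ndarray, num_views: int = 21) -> List[np.ndarray]:
    """Stage 1: fine-tuned SV3D-based module renders a view-consistent orbit."""
    raise NotImplementedError("placeholder for the human video diffusion module")


def augment_video(frames: List[np.ndarray]) -> List[np.ndarray]:
    """Stage 2: 4x super-resolution (CodeFormer) then frame interpolation (PerVFI)."""
    raise NotImplementedError("placeholder for the video augmentation module")


def fit_gaussian_human(frames: List[np.ndarray]):
    """Stage 3: optimize a 3D Gaussian Splatting human from the augmented frames."""
    raise NotImplementedError("placeholder for the 3D human Gaussian Splatting module")


def human_vdm(image: np.ndarray):
    video = generate_orbital_video(image)   # single image -> orbital video
    video = augment_video(video)            # sharper, smoother frames
    return fit_gaussian_human(video)        # frames -> lifelike 3D human
```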
Key Contributions and Modules
- Human Video Diffusion Module:
- The module builds upon a fine-tuned SV3D model, specifically adapted to generate high-quality videos of humans using the THuman 2.0 dataset. SV3D originally targets orbital video generation for generic objects and required adaptation to handle detailed human textures and geometry.
- The fine-tuning process ensures that the diffusion model generates videos with temporally consistent views, which is essential for accurate 3D reconstruction (a toy sketch of the training objective follows this list).
- Video Augmentation Module:
- Super-Resolution: Using CodeFormer, the module upscales each frame by a factor of four, significantly improving texture detail.
- Frame Interpolation: PerVFI interpolates additional frames within the video, creating smoother transitions and providing denser, more consistent visual supervision for the 3D reconstruction stage (a minimal data-flow sketch follows this list).
- 3D Human Gaussian Splatting Module:
- Utilizes 3D Gaussian Splatting, a point-based representation that supports real-time rendering. The module optimizes a learnable feature tensor together with SMPL body-model parameters to produce high-quality texture and geometry.
- Linear Blend Skinning (LBS) and an appearance network drive realistic articulation and surface detail (see the LBS sketch after this list).
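To make the fine-tuning idea in the first module concrete, the toy example below shows a generic noise-prediction (epsilon) diffusion loss for a video model conditioned on a reference image. This is only an illustrative sketch: `TinyVideoDenoiser` is a stand-in for the actual SV3D backbone, which uses a different (EDM-style) parameterization and far more capacity.

```python
# Toy illustration of fine-tuning a video diffusion model on orbital human
# renders. NOT the authors' training code: SV3D uses an EDM-style objective;
# the generic epsilon-prediction loss below only makes the idea concrete.
import torch
import torch.nn as nn


class TinyVideoDenoiser(nn.Module):
    """Stand-in for the (much larger) fine-tuned video diffusion backbone."""

    def __init__(self, channels: int = 3):
        super().__init__()
        # Condition on the reference image by concatenating it to every frame.
        self.net = nn.Conv3d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, noisy_video, ref_image, t):
        # noisy_video: (B, C, F, H, W); ref_image: (B, C, H, W).
        # A real model would also embed the timestep t; omitted here.
        cond = ref_image.unsqueeze(2).expand_as(noisy_video)
        return self.net(torch.cat([noisy_video, cond], dim=1))


def diffusion_loss(model, video, ref_image, num_steps: int = 1000):
    """One training step of the standard epsilon-prediction objective."""
    b = video.shape[0]
    t = torch.randint(0, num_steps, (b,), device=video.device)
    # Simple linear alpha-bar schedule, purely illustrative.
    alpha_bar = (1.0 - (t.float() + 1) / num_steps).view(b, 1, 1, 1, 1)
    noise = torch.randn_like(video)
    noisy = alpha_bar.sqrt() * video + (1 - alpha_bar).sqrt() * noise
    pred = model(noisy, ref_image, t)
    return nn.functional.mse_loss(pred, noise)


model = TinyVideoDenoiser()
video = torch.rand(2, 3, 8, 64, 64)   # batch of 8-frame orbital clips
ref = video[:, :, 0]                  # first frame as the conditioning image
loss = diffusion_loss(model, video, ref)
loss.backward()
```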
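The video augmentation stage amounts to a simple per-frame and per-pair data flow. In the sketch below, bicubic upscaling and linear blending are deliberately crude placeholders for CodeFormer and PerVFI; only the structure of the stage is meant to be representative, not its restoration quality.

```python
# Minimal stand-in for the video augmentation stage. Bicubic resizing and
# linear blending are ONLY placeholders for CodeFormer (4x super-resolution)
# and PerVFI (frame interpolation); the real models are far more capable.
from typing import List

import cv2
import numpy as np


def upscale_4x(frame: np.ndarray) -> np.ndarray:
    """Placeholder for CodeFormer-based 4x restoration."""
    h, w = frame.shape[:2]
    return cv2.resize(frame, (4 * w, 4 * h), interpolation=cv2.INTER_CUBIC)


def interpolate_pair(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Placeholder for PerVFI: synthesize the frame halfway between a and b."""
    return ((a.astype(np.float32) + b.astype(np.float32)) / 2).astype(a.dtype)


def augment_video(frames: List[np.ndarray]) -> List[np.ndarray]:
    sharp = [upscale_4x(f) for f in frames]
    out: List[np.ndarray] = []
    for a, b in zip(sharp, sharp[1:]):
        out += [a, interpolate_pair(a, b)]  # insert one in-between frame per pair
    out.append(sharp[-1])
    return out
```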
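The LBS step can be summarized in a few lines: each canonical Gaussian center is moved by a weighted combination of per-joint rigid transforms from an SMPL-style skeleton, x' = Σ_j w_j (R_j x + t_j). The sketch below assumes illustrative tensor shapes and omits the learned appearance network and per-Gaussian features.

```python
# Sketch of Linear Blend Skinning (LBS) used to pose 3D Gaussian centers with
# an SMPL-style skeleton. Shapes are illustrative; the appearance network and
# per-Gaussian features of the actual module are omitted.
import torch


def lbs_deform(points: torch.Tensor,        # (N, 3) canonical Gaussian centers
               skin_weights: torch.Tensor,  # (N, J) blend weights, rows sum to 1
               joint_rot: torch.Tensor,     # (J, 3, 3) joint rotations
               joint_trans: torch.Tensor    # (J, 3) joint translations
               ) -> torch.Tensor:
    """Posed centers: x' = sum_j w_j (R_j x + t_j)."""
    # Apply every joint transform to every point: (J, N, 3).
    per_joint = torch.einsum("jab,nb->jna", joint_rot, points) + joint_trans[:, None, :]
    # Blend the per-joint results with the skinning weights: (N, 3).
    return torch.einsum("nj,jna->na", skin_weights, per_joint)


# Tiny usage example with random data (24 joints, as in SMPL).
N, J = 1000, 24
pts = torch.randn(N, 3)
w = torch.softmax(torch.randn(N, J), dim=-1)
R = torch.eye(3).expand(J, 3, 3).clone()   # identity pose
t = torch.zeros(J, 3)
posed = lbs_deform(pts, w, R, t)
assert torch.allclose(posed, pts, atol=1e-5)  # identity transforms leave points unchanged
```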
Experimental Results
The method demonstrated superior performance over state-of-the-art (SOTA) models through extensive qualitative and quantitative evaluations. Noteworthy points include:
- Quantitative Metrics: Human-VDM achieved higher scores than competing methods on SSIM (0.9228), CLIP similarity (0.9235), and PSNR (20.068 dB), together with the lowest LPIPS (0.0957), indicating high-fidelity 3D human generation.
- User Studies: In a study with 30 volunteers, Human-VDM was preferred most often for geometry quality (58.67%), texture quality (57.67%), and overall quality (53.66%), surpassing the other approaches.
Implications and Future Work
Practical Implications
- Applications: Human-VDM can significantly impact industries like virtual reality, gaming, and film production by enabling the rapid creation of high-quality 3D human models from minimal input.
- Efficiency: The combination of VDMs with Gaussian splatting presents a scalable approach, potentially reducing computational overhead compared to traditional multi-view methods.
Theoretical Implications
- Model Robustness: The work illustrates how diffusion models can be adapted and fine-tuned for specialized tasks, contributing to the broader understanding of VDMs in diverse applications.
- Novel View Synthesis: The work highlights a practical pathway to overcoming inconsistent view synthesis, a common challenge in many image-to-3D applications.
Speculation on Future Developments
- Real-time Performance: Future research should explore optimizing video diffusion and Gaussian splatting modules for real-time 3D human generation.
- Detailed Geometries: Addressing the noted limitations, particularly the accurate generation of small-scale geometry such as fingers and of complex interactions (e.g., hand-face contact), would be critical for improving model precision.
- Extended Dataset Utilization: Utilizing more diverse human datasets could improve the model’s generalization across varied human appearances, poses, and clothing types.
This technical overview underscores Human-VDM's substantial contributions to single-image 3D human generation, positioning it as a notable advance in computer vision and rendering. The systematic integration of video diffusion techniques and Gaussian splatting presents a promising direction for future research and practical applications.