- The paper introduces a dual-stage framework that combines diffusion transformers with 4D Gaussian Splatting to generate animatable 3D avatars from a single image.
- The method significantly outperforms state-of-the-art approaches on PSNR, SSIM, and LPIPS for both multi-view synthesis and animation.
- The approach enables real-time animation with robust shape regularization, paving the way for personalized digital avatars in VR and gaming.
Animatable Gaussian Avatar from a Single Image with Inconsistent Gaussian Reconstruction
The paper presents AniGS, a system for creating animatable 3D human avatars from a single image that addresses key limitations of existing methods. It introduces a two-stage approach combining generative multi-view image synthesis with robust reconstruction, bridging the gap between static 3D reconstruction and animatable human modeling.
Methodology and Contributions
The core innovation in AniGS lies in its two-stage architecture: multi-view image generation followed by robust 3D reconstruction. First, the framework employs a reference image-guided video generation model to produce high-quality multi-view images of the subject in a canonical pose, together with corresponding normal maps. This stage builds on a diffusion transformer adapted to synthesize multi-view canonical-pose human images from a single in-the-wild input. Because the model is pre-trained on extensive real-world video data, it bypasses the need for synthetic 3D datasets.
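To make the first stage concrete, here is a minimal sampling sketch, assuming a toy reference-conditioned multi-view denoiser; `MultiViewDenoiser`, the latent dimensions, and the noise schedule below are illustrative placeholders rather than the paper's actual diffusion-transformer setup.

```python
# Illustrative sketch of reference-conditioned multi-view generation (stage 1).
# MultiViewDenoiser and the schedule below are simplified placeholders, not the
# paper's actual architecture.
import torch
import torch.nn as nn

class MultiViewDenoiser(nn.Module):
    """Toy denoiser: predicts noise for V view latents given a reference embedding."""
    def __init__(self, latent_dim=64, ref_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + ref_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, x, ref, t):
        # x: (V, latent_dim) noisy view latents, ref: (ref_dim,), t: scalar in [0, 1]
        V = x.shape[0]
        cond = torch.cat([x, ref.expand(V, -1), t.expand(V, 1)], dim=-1)
        return self.net(cond)

@torch.no_grad()
def sample_canonical_views(model, ref_embed, num_views=4, latent_dim=64, steps=50):
    """DDPM-style ancestral sampling over all views jointly (simplified)."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    x = torch.randn(num_views, latent_dim)          # start from pure noise
    for i in reversed(range(steps)):
        t = torch.tensor([i / steps])
        eps = model(x, ref_embed, t)                # predicted noise, conditioned on reference
        alpha, a_bar = 1.0 - betas[i], alphas_bar[i]
        x = (x - betas[i] / torch.sqrt(1.0 - a_bar) * eps) / torch.sqrt(alpha)
        if i > 0:
            x = x + torch.sqrt(betas[i]) * torch.randn_like(x)
    return x                                        # (num_views, latent_dim) canonical-pose latents

model = MultiViewDenoiser()
views = sample_canonical_views(model, ref_embed=torch.randn(128))
print(views.shape)  # torch.Size([4, 64])
```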
In the second stage, AniGS addresses inconsistencies in the generated multi-view images by recasting 3D reconstruction as a 4D problem: each generated view is treated as an observation at its own timestamp, so cross-view discrepancies can be absorbed as temporal variation instead of corrupting a single static shape. It introduces a 4D Gaussian Splatting (4DGS) model optimized under this formulation, yielding a high-fidelity avatar suitable for real-time animation. The model additionally incorporates shape regularization to suppress spikes and artifacts during animation.
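The sketch below illustrates this idea under one reading of the method: canonical Gaussian parameters are shared across views, small per-view offsets absorb the inconsistency, and regularization keeps offsets small and Gaussian shapes well-conditioned. Parameter names, loss weights, and the omitted differentiable rasterizer are assumptions of this sketch, not the paper's exact formulation.

```python
# Illustrative regularizers for an inconsistency-tolerant 4DGS fit.
# Parameter names and weights are assumptions for this sketch.
import torch

def deformation_regularizer(per_view_offsets, weight=1.0):
    """Penalize per-view positional offsets so that inconsistency across generated
    views is absorbed as *small* temporal deformation, not large geometry changes.
    per_view_offsets: (num_views, num_gaussians, 3)"""
    return weight * per_view_offsets.pow(2).mean()

def shape_regularizer(log_scales, max_ratio=10.0, weight=1.0):
    """Discourage needle-like Gaussians (a common source of spike artifacts under
    animation) by penalizing extreme anisotropy of the per-axis scales.
    log_scales: (num_gaussians, 3)"""
    scales = log_scales.exp()
    ratio = scales.max(dim=-1).values / scales.min(dim=-1).values.clamp_min(1e-8)
    return weight * torch.relu(ratio - max_ratio).mean()

# Toy usage: canonical Gaussians plus one small offset field per generated view.
num_views, num_gaussians = 8, 1024
canonical_xyz = torch.randn(num_gaussians, 3, requires_grad=True)
log_scales = torch.zeros(num_gaussians, 3, requires_grad=True)
view_offsets = torch.zeros(num_views, num_gaussians, 3, requires_grad=True)

# In the full pipeline a differentiable rasterizer would render each view from
# (canonical_xyz + view_offsets[v]) and add a photometric loss; here we only
# show the regularization terms.
loss = deformation_regularizer(view_offsets, 0.1) + shape_regularizer(log_scales, weight=0.01)
loss.backward()
```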
Results and Evaluation
The paper demonstrates that AniGS significantly outperforms existing state-of-the-art methods such as CHAMP and MagicMan, particularly in the consistency and quality of the generated avatars. Evaluations on synthetic datasets show marked improvements in PSNR, SSIM, and LPIPS for both multi-view image generation and animation.
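For reference, here is a minimal sketch of how these three metrics are commonly computed, using scikit-image for PSNR/SSIM and the lpips package for LPIPS; the random arrays stand in for rendered and ground-truth views, and this is not the paper's evaluation code.

```python
# Minimal metric sketch: PSNR/SSIM via scikit-image, LPIPS via the lpips package.
# The random images below are stand-ins for rendered vs. ground-truth frames.
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

rng = np.random.default_rng(0)
gt = rng.random((256, 256, 3)).astype(np.float32)      # ground-truth view, values in [0, 1]
pred = np.clip(gt + 0.05 * rng.standard_normal(gt.shape).astype(np.float32), 0, 1)

psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)

# LPIPS expects NCHW tensors scaled to [-1, 1].
loss_fn = lpips.LPIPS(net="alex")
to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1).unsqueeze(0) * 2 - 1
lpips_val = loss_fn(to_tensor(gt), to_tensor(pred)).item()

print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.4f}, LPIPS: {lpips_val:.4f}")
```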
Notably, the robustness of the 4DGS model is validated through experiments in which the system generates high-quality animatable avatars from single in-the-wild images and supports real-time rendering without compromising photorealism. This capability underscores its relevance to domains such as virtual reality and gaming, where real-time interaction is pivotal.
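Real-time animation of such avatars typically amounts to deforming the canonical Gaussians with a skeletal pose, for example via linear blend skinning driven by a parametric body model. The sketch below shows that general mechanism with toy inputs and is not the paper's exact animation procedure.

```python
# Illustrative linear-blend-skinning step for animating canonical 3D Gaussians.
# Skinning weights and bone transforms are toy inputs; a real pipeline would take
# them from a fitted body model such as SMPL-X.
import torch

def skin_gaussians(xyz, covariances, weights, bone_rot, bone_trans):
    """Deform canonical Gaussian means and covariances by blended bone transforms.
    xyz:         (N, 3)    canonical Gaussian centers
    covariances: (N, 3, 3) canonical covariance matrices
    weights:     (N, B)    per-Gaussian skinning weights (rows sum to 1)
    bone_rot:    (B, 3, 3) bone rotations,  bone_trans: (B, 3) bone translations
    """
    # Blend each bone's affine transform per Gaussian: A_i = sum_b w_ib [R_b | t_b].
    R = torch.einsum("nb,bij->nij", weights, bone_rot)        # (N, 3, 3)
    t = torch.einsum("nb,bj->nj", weights, bone_trans)        # (N, 3)
    new_xyz = torch.einsum("nij,nj->ni", R, xyz) + t
    # Transform covariances as R * Sigma * R^T (approximate: the blended R is not
    # exactly a rotation matrix).
    new_cov = R @ covariances @ R.transpose(-1, -2)
    return new_xyz, new_cov

# Toy usage with a random pose.
N, B = 1024, 24
xyz = torch.randn(N, 3)
cov = torch.eye(3).expand(N, 3, 3) * 0.01
weights = torch.softmax(torch.randn(N, B), dim=-1)
angles = torch.randn(B) * 0.1
# Simple rotations about the z-axis as stand-in bone rotations.
cos, sin = angles.cos(), angles.sin()
bone_rot = torch.stack([
    torch.stack([cos, -sin, torch.zeros(B)], dim=-1),
    torch.stack([sin,  cos, torch.zeros(B)], dim=-1),
    torch.stack([torch.zeros(B), torch.zeros(B), torch.ones(B)], dim=-1),
], dim=-2)                                                    # (B, 3, 3)
bone_trans = torch.randn(B, 3) * 0.05
posed_xyz, posed_cov = skin_gaussians(xyz, cov, weights, bone_rot, bone_trans)
print(posed_xyz.shape, posed_cov.shape)  # torch.Size([1024, 3]) torch.Size([1024, 3, 3])
```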
Implications and Future Directions
The implications of AniGS are substantial. The ability to reconstruct an animatable avatar from a single image opens new avenues for creating personalized digital avatars quickly and efficiently. Moreover, the paper's strategy of handling inconsistencies through a 4D representation offers a scalable solution to broader dynamic scene reconstruction challenges.
Future research could explore more efficient feed-forward reconstruction techniques to reduce the preprocessing time further, as identified in the current limitations. Additionally, expanding the training dataset to include broader postures and clothing styles could enhance the generalization capabilities of the model.
In conclusion, the paper contributes a significant advance in animatable avatar generation, offering a method that combines the strengths of generative modeling with robust reconstruction to achieve real-time, high-fidelity outputs. As digital human modeling continues to evolve, such approaches will likely play a crucial role in shaping the next generation of interactive virtual environments.