- The paper introduces EVA, which combines SMPL-X alignment and 3D Gaussian splatting to accurately capture detailed hand and facial expressions from monocular video.
- It features a plug-and-play alignment module and context-aware adaptive density control that mitigate misalignment and optimize the Gaussian representation across body parts.
- Extensive experiments on the XHumans and UPB datasets demonstrate relative LPIPS improvements of up to 22.5%, validating EVA's superior performance.
Expressive Gaussian Human Avatars from Monocular RGB Video
Introduction
The paper "Expressive Gaussian Human Avatars from Monocular RGB Video" from researchers at the University of Texas at Austin and the University of Cambridge introduces EVA (Expressive Visual Avatars), a human avatar modeling approach. EVA aims to enhance digital human representations, focusing on fine-grained hand and facial expressions, by utilizing a monocular RGB video as input. This modeling technique is based on 3D Gaussian splatting and the SMPL-X parametric human model.
Key Contributions
The authors state three major contributions of their work:
- Plug-and-Play Alignment Module: a module that substantially improves the alignment of the SMPL-X model with the RGB frames, addressing misalignment issues.
- Context-Aware Adaptive Density Control: a new strategy for 3D Gaussian optimization that adaptively adjusts granularity across different body parts.
- Benchmarking and Superiority: extensive experiments on two datasets (XHumans and UPB) demonstrating EVA's superiority both quantitatively and qualitatively, particularly in the fine-grained details of hands and faces.
Technical Approach
EVA combines several technical components to deliver its expressiveness and accuracy:
SMPL-X Alignment
The SMPL-X model is a parametric model that captures body, hand, and facial articulation. Accurately aligning it with real-world RGB frames, however, poses significant challenges. The authors introduce a robust fitting procedure that leverages pseudo ground truth (GT) from multiple sources, including 2D keypoints and 3D hand parameters, to minimize fitting errors and provide a better initialization for the subsequent avatar optimization.
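As a rough illustration, such a fitting objective can be sketched as a weighted sum of pseudo-GT terms. The names below (the hypothetical `cam.project` projection, the loss weights, the hand-regressor outputs) are assumptions for the sketch, not the paper's implementation:

```python
import torch

def fitting_loss(smplx_joints_3d, cam, kp2d_gt, kp2d_conf,
                 hand_pose_pred, hand_pose_gt, w_2d=1.0, w_hand=0.5):
    """Sketch of a pseudo-GT fitting objective (weights illustrative).

    kp2d_gt / kp2d_conf: 2D keypoints and per-joint confidences from an
    off-the-shelf detector; hand_pose_gt: 3D hand parameters from a
    hand-pose regressor. `cam.project` is a hypothetical pinhole projection.
    """
    # Reproject the model's 3D joints and penalize the distance to detected
    # 2D keypoints, down-weighting joints the detector is unsure about.
    joints_2d = cam.project(smplx_joints_3d)                       # (J, 2)
    loss_2d = (kp2d_conf * (joints_2d - kp2d_gt).norm(dim=-1)).mean()
    # Keep the SMPL-X hand articulation consistent with the 3D hand expert.
    loss_hand = (hand_pose_pred - hand_pose_gt).abs().mean()
    return w_2d * loss_2d + w_hand * loss_hand
```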
Gaussian Splatting
The avatar is modeled with 3D Gaussian splatting, which represents the scene as a set of discrete 3D Gaussians and offers real-time rendering speeds, in contrast to slower NeRF-based methods, while handling dynamic scenes efficiently. Gaussian optimization is carried out in a canonical space, with transformations to each frame's posed space performed via Linear Blend Skinning (LBS).
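A minimal sketch of this canonical-to-posed warp via LBS, assuming per-Gaussian skinning weights and per-joint rigid transforms (all names illustrative):

```python
import torch

def lbs_transform(means_canonical, skin_weights, joint_transforms):
    """Warp canonical Gaussian centers into the posed frame with LBS.

    means_canonical:  (N, 3) Gaussian centers in canonical space.
    skin_weights:     (N, J) per-Gaussian blend weights (rows sum to 1).
    joint_transforms: (J, 4, 4) rigid transforms of the SMPL-X joints.
    """
    # Blend the per-joint transforms by each Gaussian's skinning weights.
    T = torch.einsum('nj,jab->nab', skin_weights, joint_transforms)  # (N, 4, 4)
    # Apply the blended transform in homogeneous coordinates.
    ones = torch.ones_like(means_canonical[:, :1])
    homo = torch.cat([means_canonical, ones], dim=-1)                # (N, 4)
    posed = torch.einsum('nab,nb->na', T, homo)                      # (N, 4)
    return posed[:, :3]
```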
Adaptive Density Control
The context-aware adaptive density control strategy is pivotal for handling the varying granularity that different body parts require. It monitors Gaussian positional gradients and adjusts densification thresholds per body part, which significantly enhances EVA's capability to maintain the detailed textures and high degrees of freedom necessary for expressive rendering.
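Conceptually, this can be sketched as per-part thresholds applied to the usual positional-gradient densification criterion. The threshold values and part labels below are assumptions for the sketch, not the paper's settings:

```python
import torch

# Illustrative per-part densification thresholds (assumed values): parts
# that need finer detail (hands, face) densify at smaller gradients.
PART_THRESHOLDS = {'body': 2e-4, 'hand': 5e-5, 'face': 5e-5}

def densify_mask(pos_grad_norm, part_labels):
    """Select Gaussians whose accumulated positional gradient exceeds the
    threshold of the body part they belong to.

    pos_grad_norm: (N,) running average of positional gradient magnitudes.
    part_labels:   list of N strings, each a key of PART_THRESHOLDS.
    """
    thresh = torch.tensor([PART_THRESHOLDS[p] for p in part_labels])
    return pos_grad_norm > thresh  # boolean mask of Gaussians to split/clone
```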
Confidence-Aware Loss
To further improve robustness, the authors propose a confidence-aware loss. It assigns per-pixel confidence scores based on the rendered images and depths and weights the per-pixel photometric consistency accordingly, mitigating the effects of noise and inconsistencies in the RGB video inputs.
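A minimal sketch of such a loss, assuming a per-pixel confidence map in [0, 1] derived from the rendered outputs (the paper's exact scoring rule may differ):

```python
import torch

def confidence_weighted_rgb_loss(rendered, target, confidence):
    """Per-pixel photometric loss scaled by a confidence map.

    rendered / target: (H, W, 3) images; confidence: (H, W) in [0, 1],
    e.g. derived from rendered color/depth agreement, so that noisy or
    misaligned pixels contribute less to the gradient.
    """
    per_pixel = (rendered - target).abs().mean(dim=-1)  # (H, W) L1 over RGB
    return (confidence * per_pixel).mean()
```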
Experimental Validation
The evaluation of EVA is conducted on two datasets:
- XHumans: A controlled environment dataset providing high-quality SMPL-X annotations.
- UPB: A real-world dataset of sign-language videos collected from the web, featuring complex hand gestures and lacking pose annotations.
The results, both quantitative and qualitative, demonstrate EVA's superior performance. On XHumans, EVA achieves relative LPIPS gains of 19.7%, 17.3%, and 22.5% on the full-body, hand, and face regions, respectively. On the more challenging real-world UPB dataset, the margin widens further, with over 25% relative LPIPS gains in hand regions, validating the robustness of the SMPL-X alignment and adaptive density control.
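For context, a relative LPIPS gain is presumably computed in the standard way,

$$\text{gain} = \frac{\text{LPIPS}_{\text{baseline}} - \text{LPIPS}_{\text{EVA}}}{\text{LPIPS}_{\text{baseline}}} \times 100\%,$$

so the 22.5% face-region figure means EVA's face LPIPS is 22.5% lower than the baseline's.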
Implications and Future Work
The advancements presented in this paper hold substantial implications for AI-driven digital human modeling. Practically, EVA can enhance applications in VR/AR, movie production, and video games by providing highly expressive and detailed digital avatars. Theoretically, this work underscores the importance of detailed part-level modeling and adaptive optimization strategies in handling the inherent complexity of human anatomy and expressiveness.
Future research could explore the integration of non-rigid elements like cloth and hair, posing additional challenges in dynamic digital human modeling. Furthermore, improving robustness against occlusions and enabling few-shot learning scenarios could significantly broaden the applicability of such models in more diverse real-world settings.
Conclusion
EVA represents a methodological advancement in expressive human avatar modeling from monocular RGB video inputs. The incorporation of adaptive density control and a robust SMPL-X alignment mechanism has successfully addressed longstanding issues in fine-grained detail capture, particularly for hands and facial regions. This work provides a solid foundation for future explorations into more dynamic and multimodal human representation systems.