- The paper introduces DEGAS, which integrates detailed facial expressions into full-body 3D avatars using a conditional variational autoencoder and Gaussian maps.
- It employs body signals from SMPL-X and face signals from a pre-trained expression encoder to drive photorealistic 3D Gaussian Splatting.
- Experiments on the ActorsHQ and DREAMS-Avatar datasets show that DEGAS outperforms previous methods in PSNR and SSIM while rendering in real time at 30 FPS.
Detailed Expressions on Full-Body Gaussian Avatars
The paper "DEGAS: Detailed Expressions on Full-Body Gaussian Avatars" introduces a novel method in the field of 3D avatar modeling. The research described in the paper bridges a notable gap in the existing literature, focusing on the integration of detailed facial expressions into full-body avatars via 3D Gaussian Splatting (3DGS).
The method, referred to as DEGAS (Detailed Expressions on Full-Body Gaussian Avatars), leverages a conditional variational autoencoder (cVAE) to model full-body avatars with rich facial expressions. Trained on multiview videos of a subject, the model learns to generate Gaussian maps in the UV layout, driven by both body motion and facial expressions.
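To make the idea of a Gaussian map concrete, here is a minimal, hypothetical PyTorch sketch: each texel of a UV-space tensor stores the parameters of one 3D Gaussian, and a convolutional decoder predicts that tensor from a conditioning feature map. The channel layout, network depth, and resolutions are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

# Assumed per-texel channel layout: 3D offset, rotation quaternion, scale,
# opacity, and RGB color.
GAUSSIAN_CHANNELS = 3 + 4 + 3 + 1 + 3  # = 14

class GaussianMapDecoder(nn.Module):
    """Toy convolutional decoder that turns a conditioning feature map into a
    Gaussian map in the UV layout (illustrative, not the paper's network)."""
    def __init__(self, in_channels: int, base: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, base, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(base, base, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(base, GAUSSIAN_CHANNELS, 3, padding=1),
        )

    def forward(self, cond_feat: torch.Tensor) -> torch.Tensor:
        # cond_feat: (B, C, H, W) features built from the body and face signals
        return self.net(cond_feat)  # (B, 14, 2H, 2W) Gaussian map
```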
Methodology
Driving Signal
The approach divides the driving signals into two categories (a short sketch of assembling them follows the list):
- Body Signal: Derived from the body pose parameters of SMPL-X, encapsulating body motion control.
- Face Signal: Uses a pre-trained expression encoder from DPE (Disentanglement of Pose and Expression), which captures expression-related appearance variations from 2D portrait images. This diverges from traditional 3D Morphable Models (3DMMs), which are limited in expressiveness and often suffer from efficiency issues.
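The following sketch shows one plausible way to assemble the two signals. The SMPL-X pose is taken as axis-angle joint rotations; `expression_encoder` stands in for the pre-trained 2D expression encoder (e.g. DPE's), and its interface here is an assumption.

```python
import torch

def build_driving_signals(smplx_pose: torch.Tensor,        # (B, J*3) axis-angle joint rotations
                          face_crop: torch.Tensor,         # (B, 3, 256, 256) frontal portrait
                          expression_encoder: torch.nn.Module):
    """Assemble the body and face driving signals (illustrative sketch)."""
    body_signal = smplx_pose                               # joint angles drive body motion
    with torch.no_grad():                                  # the 2D encoder stays frozen
        face_signal = expression_encoder(face_crop)        # latent expression code, e.g. (B, D)
    return body_signal, face_signal
```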
cVAE and Gaussian Maps
The cVAE integrates both body and face signals through a series of encoders and a convolutional decoder (a simplified fusion sketch follows the list):
- Pose θ Embedding: Encodes joint angles into UV layouts.
- Posed Vertex Map Encoders: Encode the vertex map of SMPL-X in both a spatially aware and a globally aware manner.
- Expression Encoder and Injection: Extracts expressions from multiple frontal views and integrates them into the body signal.
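A simplified view of how these conditioning streams could be fused before the convolutional decoder is sketched below. The feature dimensions and the injection mechanism (plain concatenation) are assumptions for illustration; the paper's exact architecture may differ.

```python
import torch
import torch.nn as nn

class ConditionFusion(nn.Module):
    """Fuse pose, posed-vertex, and expression features into one UV feature map."""
    def __init__(self, pose_dim: int = 165, expr_dim: int = 128, uv_res: int = 64):
        super().__init__()
        self.uv_res = uv_res
        # Pose embedding: joint angles -> coarse feature map in the UV layout
        self.pose_mlp = nn.Linear(pose_dim, 32 * uv_res * uv_res)
        # Spatially aware encoder of the posed vertex map (rasterized to UV space)
        self.vert_conv = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True))
        # Globally aware encoder: pool the same features to a single vector
        self.vert_global = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 32))
        # Expression code is projected, broadcast over the UV grid, and injected
        self.expr_proj = nn.Linear(expr_dim, 32)

    def forward(self, pose, posed_vertex_map, expr_code):
        # pose: (B, pose_dim); posed_vertex_map: (B, 3, uv_res, uv_res); expr_code: (B, expr_dim)
        B = pose.shape[0]
        pose_feat = self.pose_mlp(pose).view(B, 32, self.uv_res, self.uv_res)
        vert_feat = self.vert_conv(posed_vertex_map)                   # (B, 32, R, R)
        glob_feat = self.vert_global(vert_feat)[:, :, None, None]      # (B, 32, 1, 1)
        expr_feat = self.expr_proj(expr_code)[:, :, None, None]        # (B, 32, 1, 1)
        glob_feat = glob_feat.expand(-1, -1, self.uv_res, self.uv_res)
        expr_feat = expr_feat.expand(-1, -1, self.uv_res, self.uv_res)
        # Concatenated features feed the convolutional Gaussian-map decoder
        return torch.cat([pose_feat, vert_feat, glob_feat, expr_feat], dim=1)
```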
DEGAS initializes base Gaussian maps in a canonical space and applies these maps to the posed space through linear blend skinning (LBS), facilitating photorealistic rendering via 3DGS.
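The skinning step itself is standard. The sketch below applies linear blend skinning to per-Gaussian canonical positions, assuming per-Gaussian skinning weights taken from the template mesh's UV layout and posed 4x4 bone transforms from SMPL-X; the 3DGS rasterization that follows is not shown.

```python
import torch

def lbs_transform(canonical_xyz: torch.Tensor,    # (N, 3) canonical Gaussian centers
                  skin_weights: torch.Tensor,     # (N, J) per-Gaussian skinning weights
                  bone_transforms: torch.Tensor   # (J, 4, 4) posed bone matrices
                  ) -> torch.Tensor:
    """Warp canonical Gaussian centers to the posed space via LBS (sketch)."""
    # Blend the bone transforms per Gaussian: (N, 4, 4)
    blended = torch.einsum("nj,jab->nab", skin_weights, bone_transforms)
    # Homogeneous coordinates, then apply the blended transform
    ones = torch.ones(canonical_xyz.shape[0], 1, device=canonical_xyz.device)
    homo = torch.cat([canonical_xyz, ones], dim=-1)            # (N, 4)
    posed = torch.einsum("nab,nb->na", blended, homo)[:, :3]   # (N, 3)
    return posed
```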
Results
Experiments conducted on the ActorsHQ and the newly proposed DREAMS-Avatar datasets validate the efficacy of DEGAS. The results show clear improvements in rendering quality and expressiveness of avatars. Specifically, quantitative measures such as PSNR, SSIM, LPIPS, and FID indicate that DEGAS consistently outperforms state-of-the-art methods like AnimatableGaussians, 3DGS-Avatar, and GaussianAvatar. For instance, DEGAS achieves a PSNR of 31.1 and an SSIM of 0.9708 on the ActorsHQ dataset, surpassing all compared methods.
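For reference, PSNR is the only one of these metrics simple enough to reproduce in a few lines; the sketch below assumes images normalized to [0, 1], while SSIM, LPIPS, and FID would normally be computed with established packages (e.g. torchmetrics, lpips) rather than by hand.

```python
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    # pred, target: (B, 3, H, W) rendered and ground-truth images in [0, max_val]
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```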
Moreover, qualitative results reveal the method's capability in rendering high-quality details and capturing nuanced facial expressions. Reenactment tests on both same-identity and cross-identity sequences demonstrate the method's robustness and versatility in animating avatars under diverse scenarios.
Implications
DEGAS holds significant theoretical and practical implications. Theoretically, it introduces a new avenue for integrating 2D facial expression models into 3D full-body avatar systems. Practically, the incorporation of detailed facial expressions into animatable full-body avatars can enhance various applications, including telepresence, virtual companionship, and XR storytelling.
The method's ability to render in real time at 30 FPS makes it suitable for real-world interactive applications. Despite its success, the current implementation's dependence on the quality of the 2D facial expression model suggests that future work could integrate more advanced models, such as VASA, to further enhance realism.
Conclusion
DEGAS represents a notable advancement in 3D avatar modeling by effectively integrating expressive facial details into full-body avatars using a 3DGS-based approach. The empirical success demonstrated through extensive experiments underscores its potential applicability in interactive AI agents and various digital communication platforms. Future research could extend this work by exploring alternative and more sophisticated expression encoders, addressing the modeling of loose clothing, and ensuring pose and identity disentanglement to further refine the reenactment quality.