- The paper introduces DEGAS, which integrates detailed facial expressions into full-body 3D avatars using a conditional variational autoencoder and Gaussian maps.
- It employs body signals from SMPL-X and face signals from a pre-trained expression encoder to drive photorealistic 3D Gaussian Splatting.
- Experiments on the ActorsHQ and DREAMS-Avatar datasets show that DEGAS outperforms previous methods in PSNR and SSIM while rendering in real time at 30 FPS.
Detailed Expressions on Full-Body Gaussian Avatars
The paper "DEGAS: Detailed Expressions on Full-Body Gaussian Avatars" introduces a novel method in the field of 3D avatar modeling. The research described in the paper bridges a notable gap in the existing literature, focusing on the integration of detailed facial expressions into full-body avatars via 3D Gaussian Splatting (3DGS).
The method, referred to as DEGAS (Detailed Expressions on Full-Body Gaussian Avatars), leverages a conditional variational autoencoder (cVAE) to model full-body avatars with rich facial expressions. Trained on multiview videos of a subject, the model learns to generate Gaussian maps in the UV layout, driven by both body motion and facial expressions.
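To make the idea of a Gaussian map concrete, here is a minimal, hypothetical PyTorch sketch: each texel of a UV-space tensor stores the parameters of one 3D Gaussian, and a convolutional decoder predicts that tensor from a conditioning feature map. The channel layout, network depth, and resolutions are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

# Assumed per-texel channel layout: 3D offset, rotation quaternion, scale,
# opacity, and RGB color.
GAUSSIAN_CHANNELS = 3 + 4 + 3 + 1 + 3  # = 14

class GaussianMapDecoder(nn.Module):
    """Toy convolutional decoder that turns a conditioning feature map into a
    Gaussian map in the UV layout (illustrative, not the paper's network)."""
    def __init__(self, in_channels: int, base: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, base, 3, padding=1), nn.ReLU(inplace=True),
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(base, base, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(base, GAUSSIAN_CHANNELS, 3, padding=1),
        )

    def forward(self, cond_feat: torch.Tensor) -> torch.Tensor:
        # cond_feat: (B, C, H, W) features built from the body and face signals
        return self.net(cond_feat)  # (B, 14, 2H, 2W) Gaussian map
```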
Methodology
Driving Signal
The approach divides the driving signals into two categories (a short sketch of assembling them follows the list):
- Body Signal: Derived from the body pose parameters of SMPL-X, encapsulating body motion control.
- Face Signal: Uses a pre-trained expression encoder from DPE (Disentanglement of Pose and Expression), which captures expression-related appearance variations from 2D portrait images. This diverges from traditional 3D Morphable Models (3DMMs), which are limited in expressiveness and often suffer from efficiency issues.
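The following sketch shows one plausible way to assemble the two signals. The SMPL-X pose is taken as axis-angle joint rotations; `expression_encoder` stands in for the pre-trained 2D expression encoder (e.g. DPE's), and its interface here is an assumption.

```python
import torch

def build_driving_signals(smplx_pose: torch.Tensor,        # (B, J*3) axis-angle joint rotations
                          face_crop: torch.Tensor,         # (B, 3, 256, 256) frontal portrait
                          expression_encoder: torch.nn.Module):
    """Assemble the body and face driving signals (illustrative sketch)."""
    body_signal = smplx_pose                               # joint angles drive body motion
    with torch.no_grad():                                  # the 2D encoder stays frozen
        face_signal = expression_encoder(face_crop)        # latent expression code, e.g. (B, D)
    return body_signal, face_signal
```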
cVAE and Gaussian Maps
The cVAE integrates both body and face signals through a series of encoders and a convolutional decoder (a simplified fusion sketch follows the list):
- Pose θ Embedding: Encodes joint angles into UV layouts.
- Posed Vertex Map Encoders: Encode the vertex map of SMPL-X in both a spatially aware and a globally aware manner.
- Expression Encoder and Injection: Extracts expressions from multiple frontal views and integrates them into the body signal.
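A simplified view of how these conditioning streams could be fused before the convolutional decoder is sketched below. The feature dimensions and the injection mechanism (plain concatenation) are assumptions for illustration; the paper's exact architecture may differ.

```python
import torch
import torch.nn as nn

class ConditionFusion(nn.Module):
    """Fuse pose, posed-vertex, and expression features into one UV feature map."""
    def __init__(self, pose_dim: int = 165, expr_dim: int = 128, uv_res: int = 64):
        super().__init__()
        self.uv_res = uv_res
        # Pose embedding: joint angles -> coarse feature map in the UV layout
        self.pose_mlp = nn.Linear(pose_dim, 32 * uv_res * uv_res)
        # Spatially aware encoder of the posed vertex map (rasterized to UV space)
        self.vert_conv = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(inplace=True))
        # Globally aware encoder: pool the same features to a single vector
        self.vert_global = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 32))
        # Expression code is projected, broadcast over the UV grid, and injected
        self.expr_proj = nn.Linear(expr_dim, 32)

    def forward(self, pose, posed_vertex_map, expr_code):
        # pose: (B, pose_dim); posed_vertex_map: (B, 3, uv_res, uv_res); expr_code: (B, expr_dim)
        B = pose.shape[0]
        pose_feat = self.pose_mlp(pose).view(B, 32, self.uv_res, self.uv_res)
        vert_feat = self.vert_conv(posed_vertex_map)                   # (B, 32, R, R)
        glob_feat = self.vert_global(vert_feat)[:, :, None, None]      # (B, 32, 1, 1)
        expr_feat = self.expr_proj(expr_code)[:, :, None, None]        # (B, 32, 1, 1)
        glob_feat = glob_feat.expand(-1, -1, self.uv_res, self.uv_res)
        expr_feat = expr_feat.expand(-1, -1, self.uv_res, self.uv_res)
        # Concatenated features feed the convolutional Gaussian-map decoder
        return torch.cat([pose_feat, vert_feat, glob_feat, expr_feat], dim=1)
```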
DEGAS initializes base Gaussian maps in a canonical space and applies these maps to the posed space through linear blend skinning (LBS), facilitating photorealistic rendering via 3DGS.
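The skinning step itself is standard. The sketch below applies linear blend skinning to per-Gaussian canonical positions, assuming per-Gaussian skinning weights taken from the template mesh's UV layout and posed 4x4 bone transforms from SMPL-X; the 3DGS rasterization that follows is not shown.

```python
import torch

def lbs_transform(canonical_xyz: torch.Tensor,    # (N, 3) canonical Gaussian centers
                  skin_weights: torch.Tensor,     # (N, J) per-Gaussian skinning weights
                  bone_transforms: torch.Tensor   # (J, 4, 4) posed bone matrices
                  ) -> torch.Tensor:
    """Warp canonical Gaussian centers to the posed space via LBS (sketch)."""
    # Blend the bone transforms per Gaussian: (N, 4, 4)
    blended = torch.einsum("nj,jab->nab", skin_weights, bone_transforms)
    # Homogeneous coordinates, then apply the blended transform
    ones = torch.ones(canonical_xyz.shape[0], 1, device=canonical_xyz.device)
    homo = torch.cat([canonical_xyz, ones], dim=-1)            # (N, 4)
    posed = torch.einsum("nab,nb->na", blended, homo)[:, :3]   # (N, 3)
    return posed
```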
Results
Experiments conducted on the ActorsHQ and the newly proposed DREAMS-Avatar datasets validate the efficacy of DEGAS. The results show clear improvements in rendering quality and expressiveness of avatars. Specifically, quantitative measures such as PSNR, SSIM, LPIPS, and FID indicate that DEGAS consistently outperforms state-of-the-art methods like AnimatableGaussians, 3DGS-Avatar, and GaussianAvatar. For instance, DEGAS achieves a PSNR of 31.1 and an SSIM of 0.9708 on the ActorsHQ dataset, surpassing all compared methods.
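For reference, PSNR is the only one of these metrics simple enough to reproduce in a few lines; the sketch below assumes images normalized to [0, 1], while SSIM, LPIPS, and FID would normally be computed with established packages (e.g. torchmetrics, lpips) rather than by hand.

```python
import torch

def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    # pred, target: (B, 3, H, W) rendered and ground-truth images in [0, max_val]
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```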
Moreover, qualitative results reveal the method's capability in rendering high-quality details and capturing nuanced facial expressions. Reenactment tests on both same-identity and cross-identity sequences demonstrate the method's robustness and versatility in animating avatars under diverse scenarios.
Implications
DEGAS holds significant theoretical and practical implications. Theoretically, it introduces a new avenue for integrating 2D facial expression models into 3D full-body avatar systems. Practically, the incorporation of detailed facial expressions into animatable full-body avatars can enhance various applications, including telepresence, virtual companionship, and XR storytelling.
The method's ability to render in real time at 30 FPS makes it suitable for real-world interactive applications. Despite its success, the current implementation's dependence on the quality of the 2D facial expression model suggests that future work could integrate more advanced models, such as VASA, to further enhance realism.
Conclusion
DEGAS represents a notable advancement in 3D avatar modeling by effectively integrating expressive facial details into full-body avatars using a 3DGS-based approach. The empirical success demonstrated through extensive experiments underscores its potential applicability in interactive AI agents and various digital communication platforms. Future research could extend this work by exploring alternative and more sophisticated expression encoders, addressing the modeling of loose clothing, and ensuring pose and identity disentanglement to further refine the reenactment quality.