- The paper introduces a novel method using 3D Gaussians and a learnable 3DMM prior to achieve realistic reanimation of facial expressions and poses.
- It initializes the scene by combining a COLMAP-based point cloud for the background with a FLAME mesh for the dynamic regions, and trains and renders roughly 50× faster than previous models.
- Quantitative results demonstrate improved PSNR, SSIM, LPIPS, and DISTS scores, confirming superior rendering quality and practical value across digital media applications.
Introduction
The capability to create controllable 3D human portraits from casual monocular video footage is a considerable advancement, with significant implications for augmented and virtual reality, telepresence, film production, and educational applications. Such technology allows users to re-animate captured subjects with novel facial expressions and head poses, and to render novel viewpoints of the entire scene.
Rig3DGS Methodology
Rig3DGS advances both rendering quality and training efficiency. It represents the scene as a collection of 3D Gaussians in a canonical space, which a learnable deformation model transforms into expression- and pose-dependent deformed spaces for differentiable rendering. A key novelty is that the deformation is guided by a learnable prior derived from a 3D morphable model (3DMM); this prior acts as an anchor that the network's deformations adhere to, leading to more realistic reanimation outcomes.
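The sketch below illustrates this canonical-to-deformed mapping under simplifying assumptions; names such as `DeformNet`, `canonical_means`, `prior_offsets`, and `cond` are hypothetical and do not correspond to the authors' actual implementation.

```python
# Minimal sketch of prior-guided Gaussian deformation, NOT the authors' code.
# `DeformNet`, `canonical_means`, `prior_offsets`, and `cond` are hypothetical.
import torch
import torch.nn as nn

class DeformNet(nn.Module):
    """Predicts a per-Gaussian residual offset, conditioned on the target
    3DMM expression/pose parameters."""
    def __init__(self, cond_dim: int, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, canonical_means: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # canonical_means: (N, 3); cond: (cond_dim,) broadcast to every Gaussian
        cond = cond.expand(canonical_means.shape[0], -1)
        return self.mlp(torch.cat([canonical_means, cond], dim=-1))

def deform_gaussians(canonical_means, prior_offsets, deform_net, cond):
    """Canonical -> deformed space: a coarse 3DMM-derived prior offset plus a
    learned residual that stays anchored to that prior."""
    return canonical_means + prior_offsets + deform_net(canonical_means, cond)
```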
The process begins with a static approximation of the scene in a predefined canonical frame: a point cloud is initialized by combining background points from a structure-from-motion pipeline (COLMAP) with points from the FLAME mesh covering the dynamic face region. The dynamic content can then be re-animated across different expressions and poses by a deformation model confined to the space spanned by the 3DMM's vertex deformations, realized as a weighted sum of those deformations whose weights are optimized with a photometric loss.
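As a rough illustration of this setup (with assumed variable names and array shapes, not the paper's code), the point-cloud initialization and the prior-constrained deformation could look like:

```python
# Illustrative sketch with assumed array shapes; not the paper's implementation.
import numpy as np

def init_point_cloud(colmap_points: np.ndarray, flame_vertices: np.ndarray) -> np.ndarray:
    """Seed the canonical Gaussians: static background points from COLMAP's
    structure-from-motion reconstruction, concatenated with FLAME mesh
    vertices covering the dynamic head region. Both arrays are (N, 3)."""
    return np.concatenate([colmap_points, flame_vertices], axis=0)

def prior_constrained_deformation(basis_deformations: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Deformation confined to the span of 3DMM vertex deformations: a
    weighted sum over a deformation basis of shape (K, N, 3), where the
    weights (K,) are the quantities optimized against a photometric loss."""
    return np.tensordot(weights, basis_deformations, axes=1)  # -> (N, 3)
```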
Performance and Comparison
Rig3DGS outperforms previous state-of-the-art models both quantitatively and qualitatively. Notably, it delivers substantial improvements in rendering quality for face reanimation, achieving high fidelity to target expressions and poses, and it trains and renders roughly 50 times faster than existing MLP-based neural radiance fields such as RigNeRF.
In controlled tests, confounding factors that could unfairly penalize rendering quality (such as varying illumination and subject motion) were minimized to better isolate the model's performance. Quantitative analysis indicates that Rig3DGS consistently outperforms its counterparts across metrics including PSNR (Peak Signal-to-Noise Ratio), SSIM (Structural Similarity Index), LPIPS (Learned Perceptual Image Patch Similarity), and DISTS (Deep Image Structure and Texture Similarity).
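For reference, these metrics can be computed with common open-source packages; the snippet below is an illustrative evaluation sketch, not the paper's protocol, and assumes images normalized to [0, 1].

```python
# Illustrative metric computation using common open-source packages;
# this is not the evaluation code used in the paper.
import numpy as np
import torch
import lpips  # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(pred: np.ndarray, gt: np.ndarray) -> dict:
    """pred, gt: float32 images in [0, 1] with shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)

    # LPIPS expects NCHW tensors scaled to [-1, 1].
    to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2 - 1
    lpips_val = lpips.LPIPS(net='alex')(to_tensor(pred), to_tensor(gt)).item()

    # DISTS can be computed analogously with a DISTS implementation
    # (e.g. the metric authors' PyTorch package); omitted here for brevity.
    return {"PSNR": psnr, "SSIM": ssim, "LPIPS": lpips_val}
```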
Ablation Studies
A crucial component of Rig3DGS is the learnable prior that guides the deformation. Ablation studies highlight its integral role in achieving photorealistic reanimation: models trained without the learnable prior, or with a fixed prior (akin to RigNeRF), produce substantially less accurate results, evidenced by failed reanimation or blurry reproductions of facial expressions and head poses.
Future Work and Conclusion
The presented approach, Rig3DGS, marks a significant contribution to the domain of generative AI and controllable 3D portraiture. Nevertheless, challenges such as handling non-uniform lighting and maintaining stability when the subject moves significantly during video capture remain open avenues for future work.
Rig3DGS's demonstrated capacity for detailed facial animation, its support for novel view synthesis, and its training efficiency bode well for its applicability in real-world scenarios across industries that require lifelike digital human representation.