- The paper introduces a novel method using 3D Gaussians and a learnable 3DMM prior to achieve realistic reanimation of facial expressions and poses.
- It initializes the scene by combining a COLMAP-based point cloud for the background with a FLAME mesh for the dynamic regions, and trains and renders roughly 50× faster than previous models.
- Quantitative results demonstrate improved PSNR, SSIM, LPIPS, and DISTS scores, confirming superior rendering quality and practical value across digital media applications.
Introduction
The capability to create controllable 3D human portraits from casual monocular video footage is a considerable advancement, with significant implications for augmented and virtual reality, telepresence, film production, and educational applications. Such technology allows users to re-animate captured subjects with novel facial expressions and head poses, and to render novel viewpoints of the entire scene.
Rig3DGS Methodology
Rig3DGS advances both rendering quality and training efficiency. It represents the scene as a collection of 3D Gaussians in a canonical space, which a learnable deformation model transforms into expression- and pose-dependent deformed spaces for differentiable rendering. A key novelty is that the deformation is guided by a learnable prior derived from a 3D morphable model (3DMM); this prior acts as an anchor that the network's deformations adhere to, leading to more realistic reanimation outcomes.
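The sketch below illustrates this canonical-to-deformed mapping under simplifying assumptions; names such as `DeformNet`, `canonical_means`, `prior_offsets`, and `cond` are hypothetical and do not correspond to the authors' actual implementation.

```python
# Minimal sketch of prior-guided Gaussian deformation, NOT the authors' code.
# `DeformNet`, `canonical_means`, `prior_offsets`, and `cond` are hypothetical.
import torch
import torch.nn as nn

class DeformNet(nn.Module):
    """Predicts a per-Gaussian residual offset, conditioned on the target
    3DMM expression/pose parameters."""
    def __init__(self, cond_dim: int, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, canonical_means: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # canonical_means: (N, 3); cond: (cond_dim,) broadcast to every Gaussian
        cond = cond.expand(canonical_means.shape[0], -1)
        return self.mlp(torch.cat([canonical_means, cond], dim=-1))

def deform_gaussians(canonical_means, prior_offsets, deform_net, cond):
    """Canonical -> deformed space: a coarse 3DMM-derived prior offset plus a
    learned residual that stays anchored to that prior."""
    return canonical_means + prior_offsets + deform_net(canonical_means, cond)
```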
The process begins with a static approximation of the scene in a predefined canonical frame: a point cloud is initialized by combining background points from a structure-from-motion pipeline (COLMAP) with points from the FLAME mesh covering the dynamic face region. The dynamic content can then be re-animated across different expressions and poses by a deformation model confined to the space spanned by the 3DMM's vertex deformations, realized as a weighted sum of those deformations whose weights are optimized with a photometric loss.
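As a rough illustration of this setup (with assumed variable names and array shapes, not the paper's code), the point-cloud initialization and the prior-constrained deformation could look like:

```python
# Illustrative sketch with assumed array shapes; not the paper's implementation.
import numpy as np

def init_point_cloud(colmap_points: np.ndarray, flame_vertices: np.ndarray) -> np.ndarray:
    """Seed the canonical Gaussians: static background points from COLMAP's
    structure-from-motion reconstruction, concatenated with FLAME mesh
    vertices covering the dynamic head region. Both arrays are (N, 3)."""
    return np.concatenate([colmap_points, flame_vertices], axis=0)

def prior_constrained_deformation(basis_deformations: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Deformation confined to the span of 3DMM vertex deformations: a
    weighted sum over a deformation basis of shape (K, N, 3), where the
    weights (K,) are the quantities optimized against a photometric loss."""
    return np.tensordot(weights, basis_deformations, axes=1)  # -> (N, 3)
```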
Performance and Comparison
Rig3DGS outperforms previous state-of-the-art models both quantitatively and qualitatively. Notably, it delivers substantial improvements in rendering quality for face reanimation, achieving high fidelity to target expressions and poses, and it trains and renders roughly 50 times faster than existing MLP-based neural radiance fields such as RigNeRF.
In controlled tests, confounding factors that could unfairly penalize rendering quality (such as varying illumination and subject motion) were minimized to better isolate the model's performance. Quantitative analysis indicates that Rig3DGS consistently outperforms its counterparts across metrics including PSNR (Peak Signal-to-Noise Ratio), SSIM (Structural Similarity Index), LPIPS (Learned Perceptual Image Patch Similarity), and DISTS (Deep Image Structure and Texture Similarity).
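For reference, these metrics can be computed with common open-source packages; the snippet below is an illustrative evaluation sketch, not the paper's protocol, and assumes images normalized to [0, 1].

```python
# Illustrative metric computation using common open-source packages;
# this is not the evaluation code used in the paper.
import numpy as np
import torch
import lpips  # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(pred: np.ndarray, gt: np.ndarray) -> dict:
    """pred, gt: float32 images in [0, 1] with shape (H, W, 3)."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)

    # LPIPS expects NCHW tensors scaled to [-1, 1].
    to_tensor = lambda x: torch.from_numpy(x).permute(2, 0, 1)[None].float() * 2 - 1
    lpips_val = lpips.LPIPS(net='alex')(to_tensor(pred), to_tensor(gt)).item()

    # DISTS can be computed analogously with a DISTS implementation
    # (e.g. the metric authors' PyTorch package); omitted here for brevity.
    return {"PSNR": psnr, "SSIM": ssim, "LPIPS": lpips_val}
```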
Ablation Studies
A crucial component of Rig3DGS is the learnable prior that guides the deformation. Ablation studies highlight its integral role in achieving photorealistic reanimation: models trained without the learnable prior, or with a fixed prior (akin to RigNeRF), produce substantially less accurate results, evidenced by failed reanimation or blurry reproductions of facial expressions and head poses.
Future Work and Conclusion
The presented approach, Rig3DGS, marks a significant contribution to the domain of generative AI and controllable 3D portraiture. Nevertheless, challenges such as handling non-uniform lighting and maintaining stability when the subject moves significantly during video capture remain open avenues for future work.
Rig3DGS's demonstrated capacity for detailed facial animation, its support for novel view synthesis, and its training efficiency bode well for its applicability in real-world scenarios across industries that require lifelike digital human representation.