- The paper presents a novel framework using 3D Gaussian splatting to achieve real-time, high-fidelity talking head synthesis driven by audio inputs.
- It integrates a spatial-audio attention module that synchronizes facial deformations with audio cues while efficiently encoding and manipulating Gaussian attributes.
- Experimental results demonstrate significant improvements in lip synchronization, with rendering speeds of up to 120 FPS, setting a new standard for interactive digital human generation.
Overview of GaussianTalker: Real-Time High-Fidelity Talking Head Synthesis with Audio-Driven 3D Gaussian Splatting
GaussianTalker presents a novel framework for real-time synthesis of high-fidelity, pose-controllable talking heads driven by audio inputs. The approach capitalizes on the efficiency and expressiveness of 3D Gaussian Splatting (3DGS), an increasingly popular alternative to Neural Radiance Fields (NeRF) that maintains comparable visual quality while rendering significantly faster.
Methodological Advancements
At the core of GaussianTalker is the application of 3DGS to audio-driven dynamic facial animation. By building a static canonical 3D Gaussian representation of the head and synchronizing its deformations with audio cues, the authors address limitations of previous methods in maintaining spatial cohesion and manipulating Gaussian parameters efficiently.
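To make the canonical-plus-deformation idea concrete, here is a minimal sketch in PyTorch. It is not the authors' code: the class name, network sizes, and attribute parameterization are illustrative assumptions.

```python
# Minimal sketch (not the paper's implementation) of a canonical 3D Gaussian
# head whose attributes are deformed per frame by an audio feature.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CanonicalGaussianHead(nn.Module):
    def __init__(self, num_gaussians: int = 30_000, audio_dim: int = 64):
        super().__init__()
        # Static canonical attributes, shared across all frames.
        self.xyz = nn.Parameter(torch.randn(num_gaussians, 3) * 0.1)  # positions
        quat = torch.zeros(num_gaussians, 4)
        quat[:, 0] = 1.0                                    # identity quaternions
        self.rotation = nn.Parameter(quat)
        self.scale = nn.Parameter(torch.zeros(num_gaussians, 3))      # log-scales
        self.opacity = nn.Parameter(torch.zeros(num_gaussians, 1))    # pre-sigmoid
        # Small MLP mapping (canonical position, audio feature) to frame-wise
        # offsets on position, rotation, and scale (3 + 4 + 3 = 10 values).
        self.deform = nn.Sequential(
            nn.Linear(3 + audio_dim, 128), nn.ReLU(),
            nn.Linear(128, 10),
        )

    def forward(self, audio_feat: torch.Tensor) -> dict:
        # audio_feat: (audio_dim,) feature for the current frame.
        n = self.xyz.shape[0]
        cond = torch.cat([self.xyz, audio_feat.expand(n, -1)], dim=-1)
        d_xyz, d_rot, d_scale = self.deform(cond).split([3, 4, 3], dim=-1)
        # Deformed attributes for this frame; a 3DGS rasterizer would consume these.
        return {
            "xyz": self.xyz + d_xyz,
            "rotation": F.normalize(self.rotation + d_rot, dim=-1),
            "scale": torch.exp(self.scale + d_scale),
            "opacity": torch.sigmoid(self.opacity),
        }
```

At inference, one such forward pass per audio frame yields the deformed Gaussians that the rasterizer renders, which is what makes real-time frame rates attainable.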
Gaussian attributes are encoded into a shared implicit feature space and merged with audio features, allowing all attributes to be manipulated jointly and consistently. This design leverages spatially-aware features and strengthens interactions among neighboring points through a multi-resolution triplane. The spatial-audio attention module, a key component of this architecture, predicts frame-wise offsets for each Gaussian's attributes, offering greater stability than simple concatenation or multiplication of features.
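The triplane lookup and attention step could look roughly like the sketch below. The single-resolution plane (the paper uses multi-resolution), the use of nn.MultiheadAttention, and all dimensions are my assumptions rather than the paper's exact design.

```python
# Hedged sketch of the spatial-audio attention idea: per-Gaussian spatial
# features sampled from a triplane serve as attention queries, audio features
# serve as keys/values, and the attended output is decoded into frame-wise
# attribute offsets.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAudioAttention(nn.Module):
    def __init__(self, feat_dim: int = 32, audio_dim: int = 64, out_dim: int = 10):
        super().__init__()
        # Three axis-aligned feature planes (XY, XZ, YZ); a multi-resolution
        # version would keep several such triplanes at different resolutions.
        self.planes = nn.Parameter(torch.randn(3, feat_dim, 64, 64) * 0.01)
        self.audio_proj = nn.Linear(audio_dim, feat_dim)
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)
        self.to_offsets = nn.Linear(feat_dim, out_dim)

    def triplane_features(self, xyz: torch.Tensor) -> torch.Tensor:
        # xyz: (N, 3) canonical positions, assumed to lie in [-1, 1]^3.
        coords = [xyz[:, [0, 1]], xyz[:, [0, 2]], xyz[:, [1, 2]]]
        feats = []
        for plane, uv in zip(self.planes, coords):
            grid = uv.view(1, -1, 1, 2)                    # (1, N, 1, 2)
            sampled = F.grid_sample(plane[None], grid,     # (1, C, N, 1)
                                    align_corners=True)
            feats.append(sampled[0, :, :, 0].t())          # (N, C)
        return sum(feats)  # fuse the three planes by summation

    def forward(self, xyz: torch.Tensor, audio_tokens: torch.Tensor) -> torch.Tensor:
        # audio_tokens: (T, audio_dim) features from a window of audio frames.
        q = self.triplane_features(xyz).unsqueeze(0)       # (1, N, C) queries
        kv = self.audio_proj(audio_tokens).unsqueeze(0)    # (1, T, C) keys/values
        attended, _ = self.attn(q, kv, kv)                 # (1, N, C)
        return self.to_offsets(attended[0])                # (N, out_dim) offsets
```

Predicting offsets through attention lets each Gaussian weigh the audio tokens by their relevance to its spatial location, which is the stability advantage claimed over plain concatenation or multiplication.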
Numerical Results and Capabilities
GaussianTalker shows marked improvements in facial fidelity, lip synchronization, and rendering speed. Experimental evaluations report rendering speeds of up to 120 FPS, well beyond prior methods, positioning GaussianTalker as a strong choice for generating photorealistic, audio-driven facial animations.
Implications and Future Directions
GaussianTalker's advances matter for applications reliant on digital human generation, including virtual avatars, teleconferencing, and entertainment. Its real-time rendering combined with high-quality output holds substantial potential for interactive digital experiences.
Future research could explore expanding GaussianTalker's applicability across multiple identities without additional per-identity training, enhancing its scalability and utility. Furthermore, investigating robust multi-view training and generalization techniques could unlock capabilities for full free-viewpoint synthesis, a current limitation of the model.
Conclusion
In summary, GaussianTalker represents a substantial contribution to the domain of real-time, high-fidelity talking head synthesis. By leveraging the inherent advantages of 3D Gaussian Splatting and overcoming its previous limitations through innovative technical solutions, GaussianTalker delivers high-quality facial animations, establishing a new standard in the synthesis of digital humans.