- The paper introduces a Universal Head Avatar Prior (UHAP) that encodes both geometric and dynamic appearance variations for photorealistic avatar synthesis.
- It employs a monocular encoder for efficient personalization, enabling rapid adaptation to new subjects with minimal data.
- The diffusion-based speech model maps audio features to latent expression codes, achieving lifelike facial animations with high lip-sync accuracy.
Audio-Driven Universal Gaussian Head Avatars
The paper "Audio-Driven Universal Gaussian Head Avatars" proposes a novel framework for synthesizing photorealistic 3D head avatars driven by speech input. This work introduces a Universal Head Avatar Prior (UHAP), enabling the generation of high-fidelity avatars with effective lip synchronization and expressive facial motions across multiple identities. The authors address the limitations of previous approaches that generally focus only on geometric deformations, neglecting dynamic appearance variations induced by audio.
Key Contributions
Universal Head Avatar Prior (UHAP)
UHAP is a person-agnostic prior trained on multi-view videos of many identities. Identity-specific detail enters through features extracted from a neutral scan, while a shared latent expression space drives the animation. Unlike traditional methods, UHAP encodes both geometric and appearance variations within this latent expression space, allowing for nuanced animations such as eyebrow movements, gaze shifts, and mouth interior dynamics.
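The paper provides no reference code; the PyTorch sketch below only illustrates the idea of decoding identity features plus a latent expression code into Gaussian avatar parameters. The module name, dimensions, and the 14-value-per-Gaussian parameterization are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class UHAPDecoderSketch(nn.Module):
    """Toy stand-in for a universal avatar prior: identity comes from neutral
    features, motion and appearance come from a shared latent expression code.
    All sizes are illustrative, not the paper's."""

    def __init__(self, id_dim=256, expr_dim=128, num_gaussians=4096):
        super().__init__()
        self.num_gaussians = num_gaussians
        hidden = 512
        self.backbone = nn.Sequential(
            nn.Linear(id_dim + expr_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # Per-Gaussian attributes: 3 position offsets, 4 rotation (quaternion),
        # 3 scale, 3 color, 1 opacity = 14 values per Gaussian.
        self.head = nn.Linear(hidden, num_gaussians * 14)

    def forward(self, id_feat, expr_code):
        h = self.backbone(torch.cat([id_feat, expr_code], dim=-1))
        params = self.head(h).view(-1, self.num_gaussians, 14)
        offset, rot, scale, color, opacity = params.split([3, 4, 3, 3, 1], dim=-1)
        return {
            "xyz_offset": offset,                                  # geometric deformation
            "rotation": torch.nn.functional.normalize(rot, dim=-1),
            "scale": torch.exp(scale),                             # keep scales positive
            "color": torch.sigmoid(color),                         # appearance varies with expression
            "opacity": torch.sigmoid(opacity),
        }

# Example: one identity feature vector and one expression code -> Gaussian parameters.
decoder = UHAPDecoderSketch()
gaussians = decoder(torch.randn(1, 256), torch.randn(1, 128))
print(gaussians["xyz_offset"].shape)  # torch.Size([1, 4096, 3])
```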
Monocular Encoder for Efficient Personalization
For efficient personalization to new subjects, the authors employ a monocular encoder that performs lightweight regression of dynamic expression variations from video frames. This process facilitates rapid adaptation of the UHAP model to new identities, requiring only minimal data inputs such as a static scan or short video.
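A minimal sketch of such a per-frame regressor, assuming a simple convolutional backbone and a 128-dimensional expression code (both illustrative choices, not the paper's architecture):

```python
import torch
import torch.nn as nn

class MonocularExpressionEncoderSketch(nn.Module):
    """Illustrative regressor from a single RGB frame to a latent expression code."""

    def __init__(self, expr_dim=128):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.regressor = nn.Linear(128, expr_dim)

    def forward(self, frame):              # frame: (B, 3, H, W)
        h = self.features(frame).flatten(1)
        return self.regressor(h)           # (B, expr_dim)

# Per-frame expression codes from a short monocular clip (batch = frames).
encoder = MonocularExpressionEncoderSketch()
codes = encoder(torch.rand(16, 3, 256, 256))   # 16 frames -> 16 expression codes
```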
Audio-Driven Synthesis
The proposed diffusion-based speech model maps audio features directly into the UHAP latent expression space. Decoding these audio-driven expression codes through UHAP yields avatars whose motion is realistic and whose appearance changes dynamically in sync with the audio input.
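The sketch below conveys the general idea as a DDPM-style reverse loop that denoises a sequence of expression codes conditioned on per-frame audio features. The paper's model uses self- and cross-attention (see Architecture below); here the denoiser is collapsed into an MLP, and the 768-dimensional audio features, noise schedule, and step count are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class AudioConditionedDenoiserSketch(nn.Module):
    """Predicts the noise added to a window of expression codes, conditioned on
    aligned audio features and a diffusion timestep. Purely illustrative."""

    def __init__(self, expr_dim=128, audio_dim=768, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(expr_dim + audio_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, expr_dim),
        )

    def forward(self, noisy_expr, audio_feat, t):
        # noisy_expr: (B, T, expr_dim), audio_feat: (B, T, audio_dim), t: (B,)
        t_emb = t.float().view(-1, 1, 1).expand(-1, noisy_expr.shape[1], 1)
        return self.net(torch.cat([noisy_expr, audio_feat, t_emb], dim=-1))

@torch.no_grad()
def sample_expressions(denoiser, audio_feat, expr_dim=128, steps=50):
    """Minimal DDPM-style reverse loop: start from noise and iteratively denoise
    into a sequence of expression codes aligned with the audio features."""
    B, T, _ = audio_feat.shape
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(B, T, expr_dim)
    for t in reversed(range(steps)):
        eps = denoiser(x, audio_feat, torch.full((B,), t))
        # Posterior mean under the standard DDPM noise parameterization.
        x = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x  # (B, T, expr_dim) expression codes to decode through UHAP

codes = sample_expressions(AudioConditionedDenoiserSketch(), torch.randn(1, 100, 768))
```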
Implementation Overview
Architecture
- Expression Encoder: Utilizes variational autoencoder techniques to map deviations from the neutral texture and geometry states into a latent expression code (a minimal sketch follows this list).
- UHAP Decoder: Comprises a Neutral Decoder for identity features, a Guide Mesh Decoder for vertex positions, and a Gaussian Avatar Decoder for rendering the avatar.
- Speech Model: Adopts a diffusion model to predict expression codes from audio features, leveraging self-attention and cross-attention mechanisms.
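The expression encoder's VAE mapping can be sketched as follows; the input here is a flattened residual between a frame's texture/geometry and the neutral state, with all dimensions chosen purely for illustration:

```python
import torch
import torch.nn as nn

class ExpressionVAEEncoderSketch(nn.Module):
    """Maps per-frame deviations from the neutral state (flattened texture and
    geometry residuals) to a Gaussian posterior over expression codes and
    samples via the reparameterization trick. Dimensions are illustrative."""

    def __init__(self, in_dim=2048, expr_dim=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU())
        self.mu = nn.Linear(512, expr_dim)
        self.logvar = nn.Linear(512, expr_dim)

    def forward(self, deviation):
        h = self.trunk(deviation)
        mu, logvar = self.mu(h), self.logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterize
        # KL term regularizing the latent expression space toward a unit Gaussian.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, kl

encoder = ExpressionVAEEncoderSketch()
z, kl = encoder(torch.randn(4, 2048))  # 4 frames of neutral-state deviations
```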
Training and Personalization
The universal prior is trained on a large-scale multi-view video dataset spanning many identities, which underpins the high-fidelity synthesis. Personalization then fine-tunes the UHAP decoder on subject-specific data, efficiently optimizing identity features together with expression dynamics.
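In the paper, personalization compares rendered Gaussians against captured views; the toy loop below only shows the structure of such fine-tuning, replacing rendering with a placeholder decoder that regresses flattened pixels. Every tensor, dimension, and hyperparameter here is a placeholder.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: a pretrained universal decoder and a new subject's short
# capture (precomputed expression codes plus target frames, flattened 3x64x64 pixels).
decoder = nn.Sequential(nn.Linear(256 + 128, 512), nn.ReLU(), nn.Linear(512, 3 * 64 * 64))
id_feat = torch.randn(1, 256, requires_grad=True)      # identity features, optimized per subject
expr_codes = torch.randn(200, 128)                      # from the monocular encoder
target_frames = torch.rand(200, 3 * 64 * 64)            # subject's video frames (flattened)

optimizer = torch.optim.Adam(list(decoder.parameters()) + [id_feat], lr=1e-4)

for step in range(100):                                  # short fine-tuning run
    idx = torch.randint(0, expr_codes.shape[0], (8,))    # mini-batch of frames
    pred = decoder(torch.cat([id_feat.expand(8, -1), expr_codes[idx]], dim=-1))
    loss = torch.nn.functional.l1_loss(pred, target_frames[idx])  # photometric loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```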
Experimental Evaluation
Qualitative and Quantitative Results
The method outperforms state-of-the-art geometry-only baselines on lip-sync accuracy, image quality, and perceptual realism metrics. Qualitative assessments demonstrate the system's ability to produce sharp and detailed facial animations, including difficult-to-model regions such as the mouth interior and facial hair.
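This summary does not reproduce the paper's exact metric suite; as a reference point, a standard image-quality metric such as PSNR can be computed as follows:

```python
import torch

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio over image tensors with values in [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

# Compare a rendered frame against ground truth (both in [0, 1]).
print(psnr(torch.rand(3, 256, 256), torch.rand(3, 256, 256)).item())
```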
Figure 1: Overview of the Audio-Driven Universal Gaussian Avatar pipeline.
Ablation Studies
The paper conducts thorough ablation studies to assess the contributions of the neutral features and the pretraining of the monocular encoder. These studies reveal the importance of disentangling identity-specific details during training for achieving high-quality avatar synthesis.
Conclusion
The proposed framework sets a new benchmark for audio-driven avatar synthesis, combining high-fidelity geometric and appearance modeling. It demonstrates versatility in synthesizing realistic head avatars from audio input while adapting to new identities from sparse data. Future work may focus on improving robustness to challenging capture conditions, potentially driving further advancements in virtual communication and digital entertainment applications.