UniTalker: Scaling up Audio-Driven 3D Facial Animation through A Unified Model (2408.00762v1)

Published 1 Aug 2024 in cs.CV

Abstract: Audio-driven 3D facial animation aims to map input audio to realistic facial motion. Despite significant progress, limitations arise from inconsistent 3D annotations, restricting previous models to training on specific annotations and thereby constraining the training scale. In this work, we present UniTalker, a unified model featuring a multi-head architecture designed to effectively leverage datasets with varied annotations. To enhance training stability and ensure consistency among multi-head outputs, we employ three training strategies, namely, PCA, model warm-up, and pivot identity embedding. To expand the training scale and diversity, we assemble A2F-Bench, comprising five publicly available datasets and three newly curated datasets. These datasets contain a wide range of audio domains, covering multilingual speech voices and songs, thereby scaling the training data from commonly employed datasets, typically less than 1 hour, to 18.5 hours. With a single trained UniTalker model, we achieve substantial lip vertex error reductions of 9.2% for BIWI dataset and 13.7% for Vocaset. Additionally, the pre-trained UniTalker exhibits promise as the foundation model for audio-driven facial animation tasks. Fine-tuning the pre-trained UniTalker on seen datasets further enhances performance on each dataset, with an average error reduction of 6.3% on A2F-Bench. Moreover, fine-tuning UniTalker on an unseen dataset with only half the data surpasses prior state-of-the-art models trained on the full dataset. The code and dataset are available at the project page https://github.com/X-niper/UniTalker.

Summary

  • The paper introduces a unified model that integrates diverse 3D annotation formats to enhance audio-driven facial animation.
  • It employs a multi-head encoder-decoder with PCA and a Pivot Identity Embedding to mitigate bias and reduce lip vertex error by up to 13.7%.
  • The model leverages the expanded A2F-Bench dataset, aggregating over 18.5 hours of multilingual audio from multiple sources to boost training diversity.

An Analytical Overview of UniTalker's Unified Model for Audio-Driven 3D Facial Animation

UniTalker presents a unified model for audio-driven 3D facial animation that can be trained across datasets with inconsistent 3D annotations. The paper addresses a key limitation of previous models, which were constrained to a single annotation format and therefore to small training sets. UniTalker's central advance is its capacity to consume diverse audio inputs and output motion in multiple 3D facial annotation conventions through a single unified mechanism.

Model Architecture and Core Innovations

UniTalker is built on an encoder-decoder architecture with a multi-head output design, allowing it to learn from multiple datasets with differing annotation formats and thereby expanding training scale and diversity. The model incorporates several training strategies to counter training instability and dataset bias. Principal Component Analysis (PCA) reduces the dimensionality of vertex-based annotations so that parameter counts are balanced across the motion decoder heads, stabilizing training when some annotations comprise thousands of 3D vertex coordinates while others use only a small set of parameters.
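To make the idea concrete, the following is a minimal PyTorch-style sketch of a multi-head motion decoder that predicts PCA coefficients per annotation convention; the class, dimension values, and dataset names are illustrative assumptions rather than the authors' implementation.

```python
# Hypothetical sketch: one decoder head per annotation convention, each
# predicting PCA coefficients rather than raw vertex offsets, so head sizes
# stay comparable across datasets (names and dimensions are assumptions).
import torch
import torch.nn as nn

class MultiHeadMotionDecoder(nn.Module):
    def __init__(self, hidden_dim: int, pca_dims: dict):
        super().__init__()
        # One lightweight linear head per dataset/annotation convention.
        self.heads = nn.ModuleDict({
            name: nn.Linear(hidden_dim, dim) for name, dim in pca_dims.items()
        })

    def forward(self, motion_feat: torch.Tensor, dataset: str) -> torch.Tensor:
        # motion_feat: (batch, frames, hidden_dim) from the shared decoder trunk.
        # Output: PCA coefficients, later mapped back to vertices or parameters
        # with the per-dataset PCA basis fitted offline on training annotations.
        return self.heads[dataset](motion_feat)

# Usage sketch
decoder = MultiHeadMotionDecoder(hidden_dim=512,
                                 pca_dims={"vocaset": 256, "biwi": 256})
audio_motion_features = torch.randn(2, 100, 512)            # batch of 2, 100 frames
coeffs = decoder(audio_motion_features, dataset="vocaset")  # (2, 100, 256)
```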

Additionally, the paper introduces a Pivot Identity Embedding (PIE) mechanism, which uses a "pseudo identity" to mitigate annotation bias across the different motion decoder heads. The approach draws inspiration from classifier-free guidance and is critical to ensuring that the generated facial animation does not become biased towards the annotation conventions of the dominant training dataset.
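One plausible realization of this idea, sketched below under stated assumptions: a pseudo identity is reserved at a fixed index and substituted for the true speaker identity with a small probability during training, mirroring how classifier-free guidance drops conditioning. The pivot index, replacement probability, and embedding layout are illustrative, not UniTalker's exact choices.

```python
# Illustrative sketch of a pivot identity embedding (PIE). The reserved
# pivot index and the replacement probability are assumptions for clarity.
import torch
import torch.nn as nn

class PivotIdentityEmbedding(nn.Module):
    PIVOT_ID = 0  # pseudo identity shared across all datasets

    def __init__(self, num_identities: int, dim: int, pivot_prob: float = 0.1):
        super().__init__()
        self.embed = nn.Embedding(num_identities + 1, dim)  # +1 slot for the pivot
        self.pivot_prob = pivot_prob

    def forward(self, identity_ids: torch.Tensor) -> torch.Tensor:
        if self.training:
            # Randomly map some samples to the shared pivot identity so that
            # no decoder head ties its output style to the identities of its
            # own (possibly dominant) dataset.
            drop = torch.rand(identity_ids.shape, device=identity_ids.device) < self.pivot_prob
            identity_ids = torch.where(
                drop, torch.full_like(identity_ids, self.PIVOT_ID), identity_ids
            )
        return self.embed(identity_ids)
```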

A2F-Bench: Dataset Compilation and Scaling

UniTalker expands the scale and diversity of training data by assembling A2F-Bench, which combines five publicly available datasets with three newly curated ones. These datasets span a variety of audio domains, including multilingual speech and songs, raising the total training data to 18.5 hours from the sub-1-hour datasets typically used previously. The enriched collection also serves as a comprehensive benchmark for evaluating audio-driven 3D facial animation methods.

Numerical Results and Evaluation

In experiments, UniTalker outperforms prior models numerically: a single model trained on the full A2F-Bench reduces the Lip Vertex Error (LVE) by 9.2% on the BIWI dataset and 13.7% on Vocaset, and fine-tuning on the seen datasets further reduces error by an average of 6.3% across A2F-Bench. Fine-tuning the pre-trained model on an unseen dataset with only half of its data also surpasses previous state-of-the-art models trained on the full dataset, demonstrating its value as a pre-trained foundation for new annotation conventions.
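For context on the headline metric, the sketch below follows the definition of lip vertex error commonly used in prior audio-to-face work (the maximal deviation over a dataset-specific set of lip vertices in each frame, averaged over frames); the exact convention, such as squared versus unsquared distance and the choice of lip indices, varies between papers and is an assumption here.

```python
# Sketch of lip vertex error (LVE) under one common convention:
# maximal per-frame L2 deviation over lip vertices, averaged over frames.
import numpy as np

def lip_vertex_error(pred: np.ndarray, gt: np.ndarray, lip_idx: np.ndarray) -> float:
    """pred, gt: (frames, vertices, 3) meshes; lip_idx: dataset-specific lip vertex indices."""
    diff = pred[:, lip_idx, :] - gt[:, lip_idx, :]   # (frames, |lip|, 3)
    dist = np.linalg.norm(diff, axis=-1)             # L2 distance per lip vertex
    per_frame_max = dist.max(axis=-1)                # worst lip vertex in each frame
    return float(per_frame_max.mean())               # average over frames
```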

Implications and Potential Applications

The implications of UniTalker's unified model are significant both theoretically and practically. Theoretically, it sets a precedent for incorporating multiple datasets with inconsistent annotations into a single training framework, a challenge that has constrained prior research. Practically, UniTalker can serve as a foundation model for various audio-to-face tasks, particularly when the data scale is limited. Its ability to generalize across different audio types and languages suggests potential applications in immersive media, game development, and virtual human modeling in AI-driven interactions.

Future Developments and Research Directions

Future work could focus on easing the remaining trade-off between training scale and dataset variety by further increasing model capacity. Incorporating larger-scale datasets of lower annotation quality is another prospective direction that could improve generalization across broader audio domains. Furthermore, adapting the UniTalker framework to 2D facial animation and extending it to handle larger head poses could significantly broaden its applicability.

Overall, the paper presents a robust approach to multi-annotation training through a unified model and establishes a new benchmark for audio-driven facial animation research. Its contributions to dataset integration and its training strategies mark a notable advancement in the field.
