- The paper introduces FADA, a diffusion distillation method that combines mixed supervision with a robust teacher-student framework to accelerate audio-driven talking avatar synthesis.
- It achieves speedups from 4.17× to 12.5× while maintaining strong audio-video synchronization and high image quality.
- Its multi-Classifier-Free Guidance (multi-CFG) distillation with learnable tokens reduces the number of guided inference passes, making the model practical for real-time applications such as virtual assistants and entertainment.
Overview of FADA: Fast Diffusion Avatar Synthesis with Mixed-Supervised Multi-CFG Distillation
This paper introduces FADA, a framework for accelerating audio-driven talking avatar synthesis with diffusion models. Recent diffusion-based approaches in this domain have shown remarkable generative capability, producing high-quality, expressive videos synchronized with audio input. However, these methods generally suffer from slow inference, limiting their applicability in real-time or resource-constrained settings. The research addresses this by proposing a diffusion distillation technique that substantially accelerates inference while preserving quality.
Methodology
The FADA framework combines several techniques to improve both the efficiency and robustness of talking avatar synthesis:
- Mixed-Supervised Learning: FADA introduces a mixed-supervised loss function that capitalizes on datasets of varying quality. This strategy mitigates the weaknesses of naive diffusion distillation, such as degraded audio-video correlation and reduced robustness to diverse input images. Training on a wider spectrum of data improves the model's fidelity and generalization (a loss sketch follows this list).
- Multi-CFG Distillation with Learnable Tokens: The paper details a multi-Classifier-Free Guidance (multi-CFG) distillation process that integrates learnable tokens. These tokens let the student mimic the multi-CFG output, preserving the intricate relationship between the audio and reference-image conditions while cutting the number of computationally intensive guided forward passes per denoising step, which yields a further speedup without significant quality degradation (see the guidance sketch after this list).
- Robust Teacher-Student Framework: A well-trained teacher model supervises the student through distillation. The teacher is trained on a rigorously curated high-quality dataset, while the student learns from a mixed dataset through adaptive weighting of the distillation and ground-truth losses.
- Multi-Condition Audio-Driven Synthesis: The architecture retains audio attention and temporal consistency layers, which are essential for synthesizing natural-looking, expressive avatars under diverse conditions.
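To make the mixed-supervised objective concrete, the following is a minimal sketch of an adaptively weighted training step: high-quality samples lean on the ground-truth denoising target, while lower-quality samples lean on the frozen teacher. The model signatures, the binary quality mask, and the loss weights are illustrative assumptions, not the paper's exact formulation (FADA's adaptive weighting may be continuous rather than a hard mask).

```python
import torch
import torch.nn.functional as F

def mixed_supervised_loss(student, teacher, noisy_latents, timesteps,
                          audio_feats, ref_feats, target_noise,
                          is_high_quality, w_gt=1.0, w_distill=1.0):
    """Adaptively weighted mix of distillation and ground-truth diffusion losses.

    is_high_quality: per-sample boolean mask; curated clips trust the ground-truth
    target, while noisier in-the-wild clips trust the teacher's prediction.
    """
    # Student prediction: a single forward pass on the noisy video latents.
    student_pred = student(noisy_latents, timesteps, audio_feats, ref_feats)

    # Teacher prediction: frozen, no gradients flow into the teacher.
    with torch.no_grad():
        teacher_pred = teacher(noisy_latents, timesteps, audio_feats, ref_feats)

    # Per-sample MSE losses, averaged over all non-batch dimensions.
    distill = F.mse_loss(student_pred, teacher_pred, reduction="none").flatten(1).mean(dim=1)
    gt = F.mse_loss(student_pred, target_noise, reduction="none").flatten(1).mean(dim=1)

    # Adaptive weighting: ground truth on high-quality data, distillation elsewhere.
    hq = is_high_quality.float()
    return (w_gt * hq * gt + w_distill * (1.0 - hq) * distill).mean()
```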
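The multi-CFG distillation can likewise be illustrated with a sketch. The teacher-style inference below combines unconditional, reference-only, and audio-plus-reference predictions with two guidance scales, so it needs three forward passes per denoising step; the distilled student instead consumes a small set of learnable tokens and produces a comparable output in a single pass. The nested guidance formula, token shapes, and model signatures are assumptions for illustration and may differ from FADA's implementation.

```python
import torch
import torch.nn as nn

def teacher_multi_cfg(model, x_t, t, audio, ref, s_audio=3.5, s_ref=2.0):
    """Teacher-style multi-CFG: three forward passes per denoising step.
    Guidance scales are placeholder values."""
    eps_uncond = model(x_t, t, audio=None, ref=None)   # fully unconditional
    eps_ref = model(x_t, t, audio=None, ref=ref)       # reference image only
    eps_full = model(x_t, t, audio=audio, ref=ref)     # audio + reference
    return (eps_uncond
            + s_ref * (eps_ref - eps_uncond)           # guide toward the reference
            + s_audio * (eps_full - eps_ref))          # then toward the audio

class GuidanceTokens(nn.Module):
    """Learnable tokens that stand in for the guidance combination, letting the
    distilled student mimic the multi-CFG output without extra passes."""
    def __init__(self, num_tokens=4, dim=768):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)

    def forward(self, batch_size):
        return self.tokens.unsqueeze(0).expand(batch_size, -1, -1)

def student_single_pass(student, guidance_tokens, x_t, t, audio, ref):
    """Distilled student: one forward pass per step, conditioned on the tokens."""
    tok = guidance_tokens(x_t.shape[0])
    return student(x_t, t, audio=audio, ref=ref, guidance_tokens=tok)
```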
Experimental Results
FADA demonstrates substantial improvements over existing models in speed while preserving video quality. Quantitatively, the framework achieves a speedup of 4.17× to 12.5× in NFE (number of function evaluations) over competing diffusion-based approaches while maintaining comparable quality, as measured by standard metrics such as IQA, Sync-D, FVD, and FID on standard datasets. The combination of multi-step and mixed-data distillation strategies is particularly instrumental in realizing these gains.
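As a purely illustrative calculation of how speedups of this magnitude arise from reduced NFE, the snippet below assumes a 25-step teacher distilled into 6-step and 2-step students; these step counts are assumptions chosen only to reproduce the reported ratios, not values taken from the paper.

```python
# Illustrative NFE speedup arithmetic (assumed step counts, not from the paper).
teacher_nfe = 25                      # hypothetical teacher denoising steps
for student_nfe in (6, 2):            # hypothetical distilled student steps
    print(f"{student_nfe}-step student: {teacher_nfe / student_nfe:.2f}x fewer NFE")
# 6-step student: 4.17x fewer NFE
# 2-step student: 12.50x fewer NFE
```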
Practical Implications and Future Work
FADA's implications extend beyond raw performance gains: it offers a path to making high-quality talking avatar systems viable in real-time applications such as live interactive systems and online platforms. The fast, efficient framework can be key to scaling real-time virtual assistants, media production, and entertainment.
Future research could explore further tuning of mixed data quality and adaptive supervision to refine model outputs, and could extend the framework to other domains that require fast yet robust generative models. Interdisciplinary applications could also capitalize on the model's ability to generate rich, synchronized audiovisual output.
In sum, FADA represents a significant step in diffusion-model research for avatar synthesis, combining distillation, mixed supervision, and multi-CFG techniques into a practical tool for advancing the real-world deployment of talking avatars.