- The paper introduces a novel neural rendering technique using audio-conditioned textures and adversarial loss to achieve granular emotion control.
- It employs a 3D morphable model and conditional GANs to convert audio features into precise facial parameters, ensuring accurate lip synchronization.
- The method outperforms state-of-the-art approaches on visual quality, lip synchronization, and emotional clarity, and introduces a quantifiable emotion-capture metric, paving the way for advanced interactive avatar applications.
Introducing READ Avatars: A Leap Forward in Emotion-Controllable Audio-Driven Avatar Generation
Overview of READ Avatars
The "READ Avatars" paper introduces a method for generating high-quality, audio-driven 2D avatars with direct and granular control over expressed emotions. Leveraging a 3D-based approach, the technique addresses the challenge of the many-to-many nature of audio to expression mappings by incorporating an adversarial loss within the audio-to-expression generation process. This significant step not only improves the realism and expressiveness of avatars but also tackles the complexity of generating mouth interiors directly from audio inputs. A standout feature is the introduction of audio-conditioned neural textures that are independent of resolution, enhancing the detail and quality of the mouth's visual representation.
Key Contributions
- Neural Rendering Algorithm: A novel approach that uses neural textures conditioned on audio to enhance the representation of mouth interiors, operating directly on UV coordinates.
- Adversarial Loss: A GAN loss in the audio-to-expression network helps resolve the many-to-many ambiguity commonly faced in audio-to-expression generation (a minimal loss sketch follows this list).
- Emotion Capture Metric: The paper proposes a metric to evaluate the accuracy with which an actor's emotions are captured and reconstructed in the avatar, enriching the quantitative analysis toolkit for future research in this domain.
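To make the adversarial component concrete, below is a minimal PyTorch sketch of how a GAN loss can be attached to an audio-to-expression generator. The module names, layer sizes, feature dimensions, and the L1 weighting are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative dimensions: 80-d audio features per frame, 8 emotion classes,
# 50 expression coefficients per frame (assumed values, not from the paper).
AUDIO_DIM, EMOTION_DIM, EXPR_DIM = 80, 8, 50

class AudioToExpression(nn.Module):
    """Maps per-frame audio features plus an emotion one-hot to expression params."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(AUDIO_DIM + EMOTION_DIM, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, EXPR_DIM),
        )

    def forward(self, audio, emotion):
        return self.net(torch.cat([audio, emotion], dim=-1))

class ExpressionDiscriminator(nn.Module):
    """Scores whether an (audio, expression) pairing looks like real data."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(AUDIO_DIM + EXPR_DIM, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1),
        )

    def forward(self, audio, expr):
        return self.net(torch.cat([audio, expr], dim=-1))

def generator_loss(G, D, audio, emotion, real_expr, l1_weight=10.0):
    """A pure reconstruction loss averages over the many plausible expressions
    for a given sound; the adversarial term pushes outputs back onto the
    manifold of realistic, sharp expressions."""
    fake_expr = G(audio, emotion)
    logits = D(audio, fake_expr)
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    recon = F.l1_loss(fake_expr, real_expr)
    return adv + l1_weight * recon
```

A real system would operate on temporal windows of audio and parameters; the per-frame modules above are only meant to show where the adversarial term enters the objective.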
Novel Components and Methodology
The method proceeds in three stages: fitting a 3D morphable model to the input videos, generating morphable-model parameters from audio using adversarial training, and training an audio-conditioned deferred neural renderer for photorealistic output.
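The hand-offs between these stages can be pictured with a couple of small data containers. The field names and dimensions below are illustrative assumptions about what a FLAME-based pipeline passes around, not the authors' actual data structures.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class FaceParams:
    """Per-frame 3DMM parameters: recovered by monocular reconstruction in
    stage 1, then predicted from audio by the generator in stage 2."""
    shape: np.ndarray       # (100,)          identity coefficients, fixed per actor
    expression: np.ndarray  # (n_frames, 50)  expression coefficients
    pose: np.ndarray        # (n_frames, 6)   head and jaw pose

@dataclass
class RendererInputs:
    """What the audio-conditioned deferred neural renderer consumes in stage 3."""
    uv_maps: np.ndarray         # (n_frames, H, W, 2)  rasterized UV coordinates
    audio_features: np.ndarray  # (n_frames, 80)       conditions the neural texture
```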
Key Techniques:
- Monocular Reconstruction: A low-dimensional set of parameters is extracted from the input video sequences using the FLAME morphable model, capturing facial expressions and dynamics.
- Audio-to-Parameter Generator: A conditional GAN converts MFCC audio features and explicit emotion labels into target animation parameters, providing fine-grained emotion control.
- Neural Renderer: By encoding audio information onto the mesh surface via a neural texture conditioned on audio, the renderer produces photorealistic frames, significantly improving the visual quality and expressiveness of the generated avatars (see the sketch after this list).
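To give a rough picture of what an audio-conditioned neural texture "operating directly on UV coordinates" can look like, the sketch below uses a small MLP that maps a UV coordinate plus the current frame's audio features to a per-pixel feature vector, which is what makes the texture resolution-independent. Layer sizes, feature dimensions, and the overall structure are assumptions for illustration; the paper's renderer is more involved.

```python
import torch
import torch.nn as nn

AUDIO_DIM, FEAT_DIM = 80, 16  # illustrative sizes

class AudioConditionedTexture(nn.Module):
    """Implicit neural texture: an MLP maps a UV coordinate plus the current
    audio feature vector to a per-pixel feature, so the texture has no fixed
    resolution and can react to speech (e.g. inside the mouth)."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 + AUDIO_DIM, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, FEAT_DIM),
        )

    def forward(self, uv, audio):
        # uv:    (H, W, 2) rasterized UV coordinates for one frame
        # audio: (AUDIO_DIM,) audio features for the same frame
        h, w, _ = uv.shape
        audio_grid = audio.expand(h, w, AUDIO_DIM)
        return self.mlp(torch.cat([uv, audio_grid], dim=-1))  # (H, W, FEAT_DIM)

# The sampled feature image would then be decoded by an image-to-image network
# (the deferred neural renderer) into the final RGB frame.
texture = AudioConditionedTexture()
uv = torch.rand(256, 256, 2)     # stand-in for rasterized UVs
audio = torch.randn(AUDIO_DIM)   # stand-in for one frame of audio features
features = texture(uv, audio)    # (256, 256, 16)
```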
Evaluation and Findings
The method has been evaluated with metrics for visual quality, lip synchronization, and emotional clarity, outperforming existing state-of-the-art methods on these criteria. Notably, the introduction of emotion-specific metrics, the arousal and valence Emotional Mean Distance (A/V-EMD), provides a quantifiable measure of how accurately emotion is reconstructed, a contribution that raises the assessment standard for audio-driven avatar generation research.
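To make the EMD-style metrics concrete, the helper below sketches the comparison step: per-frame valence/arousal predictions from a pretrained emotion estimator, run on the generated and reference videos, are compared by their mean absolute distance per dimension. This is a hypothetical illustration; the paper's exact estimator and distance definition may differ.

```python
import numpy as np

def emotional_mean_distance(pred_va, ref_va):
    """pred_va, ref_va: (n_frames, 2) arrays of (valence, arousal) per frame,
    as produced by an emotion-recognition model run on the generated and
    reference videos respectively.
    Returns (V-EMD, A-EMD): mean absolute distance for each dimension."""
    pred_va, ref_va = np.asarray(pred_va), np.asarray(ref_va)
    dist = np.abs(pred_va - ref_va).mean(axis=0)
    return float(dist[0]), float(dist[1])

# Toy example: lower values mean the generated video's emotion trajectory
# stays closer to the actor's reference performance.
v_emd, a_emd = emotional_mean_distance(
    pred_va=[[0.2, 0.5], [0.3, 0.6]],
    ref_va=[[0.25, 0.55], [0.35, 0.65]],
)
```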
Implications and Future Directions
The research presents a considerable advancement in the generation of audio-driven avatars, with potential applications ranging from virtual assistants to digital characters in entertainment and gaming. The method's ability to deliver both high visual quality and accurate lip synchronization, coupled with nuanced emotional expressiveness, sets a new benchmark in the field.
The introduction of adversarial learning and audio-conditioned neural textures within this domain opens avenues for further exploration, particularly in improving avatar realism and expressiveness. Moreover, future work could explore expanding this approach to include pose generation and even longer video generation capabilities, broadening the method's applicability and versatility.
Conclusion
"READ Avatars" marks a significant step forward in the generation of realistic, emotionally expressive audio-driven avatars. Through its innovative use of adversarial learning and neural textures conditioned on audio, it achieves an unprecedented level of realism and expressiveness. This research not only advances the field of digital avatar creation but also opens new pathways for future exploration and development in generative AI and 3D modeling.