Robust Facial Reactions Generation: An Emotion-Aware Framework with Modality Compensation (2407.15798v2)

Published 22 Jul 2024 in cs.CV

Abstract: The objective of the Multiple Appropriate Facial Reaction Generation (MAFRG) task is to produce contextually appropriate and diverse listener facial behavioural responses based on the multimodal behavioural data of the conversational partner (i.e., the speaker). Current methodologies typically assume continuous availability of speech and facial modality data, neglecting real-world scenarios where these data may be intermittently unavailable, which often results in model failures. Furthermore, despite utilising advanced deep learning models to extract information from the speaker's multimodal inputs, these models fail to adequately leverage the speaker's emotional context, which is vital for eliciting appropriate facial reactions from human listeners. To address these limitations, we propose an Emotion-aware Modality Compensatory (EMC) framework. This versatile solution can be seamlessly integrated into existing models, thereby preserving their advantages while significantly enhancing performance and robustness in scenarios with missing modalities. Our framework ensures resilience against missing modality data through the Compensatory Modality Alignment (CMA) module, and generates more appropriate emotion-aware reactions via the Emotion-aware Attention (EA) module, which incorporates the speaker's emotional information throughout the encoding and decoding process. Experimental results demonstrate that our framework improves the appropriateness metric FRCorr by an average of 57.2% compared to the original model structure. When the speech modality is missing, generation appropriateness actually improves, and when facial data is missing, performance degrades only minimally.
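The abstract names two components, the CMA module for missing-modality compensation and the EA module for emotion-conditioned attention, but gives no implementation detail. Below is a minimal PyTorch sketch of how such modules could plausibly look. All class names, layer choices, and the MSE alignment loss are assumptions for illustration, not the paper's actual design.

```python
# Hypothetical sketch of the EMC framework's two modules, based only on the
# abstract's description. Every name and design choice here is an assumption.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CompensatoryModalityAlignment(nn.Module):
    """Projects the available modality into the missing one's feature space
    (a guess at what the CMA module does)."""

    def __init__(self, dim: int):
        super().__init__()
        # One compensation network per direction: face->audio and audio->face.
        self.audio_from_face = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.face_from_audio = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, audio, face):
        # audio/face: (batch, seq, dim), or None when that modality is missing.
        if audio is None and face is not None:
            audio = self.audio_from_face(face)   # compensate missing speech
        if face is None and audio is not None:
            face = self.face_from_audio(audio)   # compensate missing facial data
        return audio, face

    def alignment_loss(self, audio, face):
        # On complete-modality training data, push the compensated features
        # toward the real ones so either stream can stand in at inference.
        loss_a = F.mse_loss(self.audio_from_face(face), audio.detach())
        loss_f = F.mse_loss(self.face_from_audio(audio), face.detach())
        return loss_a + loss_f


class EmotionAwareAttention(nn.Module):
    """Cross-attention whose queries are biased by a speaker-emotion
    embedding, so decoding is conditioned on emotional context (again an
    assumption about the EA module)."""

    def __init__(self, dim: int, n_heads: int = 4):
        super().__init__()
        self.emotion_proj = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, listener_q, speaker_kv, emotion):
        # listener_q: (batch, T, dim); speaker_kv: (batch, S, dim);
        # emotion: (batch, dim) summary of the speaker's emotional state.
        q = listener_q + self.emotion_proj(emotion).unsqueeze(1)
        out, _ = self.attn(q, speaker_kv, speaker_kv)
        return out


if __name__ == "__main__":
    dim, B, T = 128, 2, 16
    cma = CompensatoryModalityAlignment(dim)
    ea = EmotionAwareAttention(dim)
    face = torch.randn(B, T, dim)
    audio, face = cma(None, face)            # speech modality missing
    fused = torch.cat([audio, face], dim=1)  # (B, 2T, dim) speaker context
    reaction = ea(torch.randn(B, T, dim), fused, torch.randn(B, dim))
    print(reaction.shape)                    # torch.Size([2, 16, 128])
```

In a setup like this, the compensators would be trained jointly with the rest of the model on complete-modality data via the alignment loss, so that at inference either direction can substitute for a missing stream without retraining.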

