REACT2023: the first Multi-modal Multiple Appropriate Facial Reaction Generation Challenge (2306.06583v1)
Abstract: The Multi-modal Multiple Appropriate Facial Reaction Generation Challenge (REACT2023) is the first competition event focused on evaluating multimedia processing and machine learning techniques for generating human-appropriate facial reactions in various dyadic interaction scenarios, with all participants competing strictly under the same conditions. The goal of the challenge is to provide the first benchmark test set for multi-modal information processing and to foster collaboration among the audio, visual, and audio-visual affective computing communities, enabling a fair comparison of the relative merits of different approaches to automatic appropriate facial reaction generation under various spontaneous dyadic interaction conditions. This paper presents: (i) the novelties, contributions, and guidelines of the REACT2023 challenge; (ii) the dataset utilized in the challenge; and (iii) the performance of the baseline systems on the two proposed sub-challenges: Offline Multiple Appropriate Facial Reaction Generation and Online Multiple Appropriate Facial Reaction Generation. The challenge baseline code is publicly available at \url{https://github.com/reactmultimodalchallenge/baseline_react2023}.