REACT 2024: the Second Multiple Appropriate Facial Reaction Generation Challenge (2401.05166v1)
Abstract: In dyadic interactions, humans communicate their intentions and state of mind using verbal and non-verbal cues, and multiple different facial reactions might be appropriate in response to a specific speaker behaviour. Developing a Machine Learning (ML) model that can automatically generate multiple appropriate, diverse, realistic and synchronised human facial reactions from a previously unseen speaker behaviour is therefore a challenging task. Following the successful organisation of the first REACT challenge (REACT 2023), this edition of the challenge (REACT 2024) employs a subset of the data used in the previous challenge, consisting of segmented 30-second dyadic interaction clips originally recorded as part of the NOXI and RECOLA datasets. Participants are encouraged to develop and benchmark ML models that can generate multiple appropriate facial reactions (including facial image sequences and their attributes) given an input conversational partner's stimulus under various dyadic video conference scenarios. This paper presents: (i) the guidelines of the REACT 2024 challenge; (ii) the dataset utilised in the challenge; and (iii) the performance of the baseline systems on the two proposed sub-challenges: Offline Multiple Appropriate Facial Reaction Generation and Online Multiple Appropriate Facial Reaction Generation, respectively. The challenge baseline code is publicly available at https://github.com/reactmultimodalchallenge/baseline_react2024.
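The distinction between the two sub-challenges can be sketched as follows. This is a minimal illustrative stub, not the challenge baseline: the function names, the 25 fps frame rate, and the 25-dimensional reaction-attribute layout (action units plus valence/arousal plus expression probabilities) are assumptions for illustration, and a placeholder random generator stands in for a real model. The key contrast is that the offline setting may condition on the entire 30-second speaker clip at once, while the online setting must produce each reaction frame causally, using only speaker frames seen so far.

```python
import random

FPS = 25                      # assumed frame rate
NUM_FRAMES = FPS * 30         # frames in one 30-second clip
REACTION_DIM = 25             # assumed per-frame facial-attribute dimensionality

def generate_offline(speaker_clip, num_samples=3, seed=0):
    """Offline setting: the full speaker clip is available before any
    reaction frame is produced. Returns `num_samples` alternative
    reaction sequences (placeholder: random attribute values)."""
    rng = random.Random(seed)
    assert len(speaker_clip) == NUM_FRAMES
    return [
        [[rng.random() for _ in range(REACTION_DIM)] for _ in range(NUM_FRAMES)]
        for _ in range(num_samples)
    ]

def generate_online(speaker_frames, num_samples=3, seed=0):
    """Online setting: reaction frame t may depend only on speaker
    frames 0..t, so frames are emitted one at a time as input arrives."""
    rng = random.Random(seed)
    reactions = [[] for _ in range(num_samples)]
    history = []
    for frame in speaker_frames:
        history.append(frame)  # only past and present speaker frames visible
        for reaction in reactions:
            reaction.append([rng.random() for _ in range(REACTION_DIM)])
    return reactions

# Dummy speaker clip: NUM_FRAMES frames of zeroed attributes.
speaker_clip = [[0.0] * REACTION_DIM for _ in range(NUM_FRAMES)]
offline = generate_offline(speaker_clip)
online = generate_online(speaker_clip)
```

Both stubs return multiple candidate reaction sequences for one speaker input, reflecting the "multiple appropriate" framing of the challenge; a real submission would replace the random sampler with a generative model and render the attribute sequences into facial image sequences.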