Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations (2403.02090v3)
Abstract: Understanding social interactions involving both verbal and non-verbal cues is essential for effectively interpreting social situations. However, most prior works on multimodal social cues focus predominantly on single-person behaviors or rely on holistic visual representations that are not aligned to utterances in multi-party environments. Consequently, they are limited in modeling the intricate dynamics of multi-party interactions. In this paper, we introduce three new challenging tasks to model the fine-grained dynamics between multiple people: speaking target identification, pronoun coreference resolution, and mentioned player prediction. We contribute extensive data annotations to curate these new challenges in social deduction game settings. Furthermore, we propose a novel multimodal baseline that leverages densely aligned language-visual representations by synchronizing visual features with their corresponding utterances. This facilitates concurrently capturing verbal and non-verbal cues pertinent to social reasoning. Experiments demonstrate the effectiveness of the proposed approach with densely aligned multimodal representations in modeling fine-grained social interactions. Project website: https://sangmin-git.github.io/projects/MMSI.
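The core idea of the densely aligned representation — synchronizing visual features with the utterance during which they occur — can be sketched as below. This is a minimal illustration only, not the paper's actual implementation: all names (`Utterance`, `align_utterances`) and the mean-pooling choice are assumptions for exposition.

```python
# Hedged sketch: densely align per-frame visual features with utterances
# by pooling the frames that fall inside each utterance's time span.
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Utterance:
    text: str
    start: float  # start time in seconds
    end: float    # end time in seconds

def align_utterances(frame_feats: np.ndarray, fps: float,
                     utterances: List[Utterance]) -> List[np.ndarray]:
    """For each utterance, average the visual frames spoken during it,
    yielding one visual feature per utterance (dense language-visual alignment)."""
    aligned = []
    for u in utterances:
        lo = int(u.start * fps)
        hi = max(lo + 1, int(u.end * fps))  # keep at least one frame
        aligned.append(frame_feats[lo:hi].mean(axis=0))
    return aligned

# Toy usage: 100 frames of 512-d visual features at 25 fps.
feats = np.random.randn(100, 512)
utts = [Utterance("I think it's you.", 0.0, 1.2),
        Utterance("No, I was with her.", 1.2, 3.0)]
pooled = align_utterances(feats, 25.0, utts)
print(len(pooled), pooled[0].shape)  # 2 (512,)
```

Each pooled visual feature can then be paired with the embedding of its utterance, so verbal and non-verbal cues for the same moment of the interaction are processed together.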