
Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations (2403.02090v3)

Published 4 Mar 2024 in cs.CV, cs.CL, and cs.LG

Abstract: Understanding social interactions involving both verbal and non-verbal cues is essential for effectively interpreting social situations. However, most prior works on multimodal social cues focus predominantly on single-person behaviors or rely on holistic visual representations that are not aligned to utterances in multi-party environments. Consequently, they are limited in modeling the intricate dynamics of multi-party interactions. In this paper, we introduce three new challenging tasks to model the fine-grained dynamics between multiple people: speaking target identification, pronoun coreference resolution, and mentioned player prediction. We contribute extensive data annotations to curate these new challenges in social deduction game settings. Furthermore, we propose a novel multimodal baseline that leverages densely aligned language-visual representations by synchronizing visual features with their corresponding utterances. This facilitates concurrently capturing verbal and non-verbal cues pertinent to social reasoning. Experiments demonstrate the effectiveness of the proposed approach with densely aligned multimodal representations in modeling fine-grained social interactions. Project website: https://sangmin-git.github.io/projects/MMSI.


Summary

  • The paper introduces three new tasks, speaking target identification, pronoun coreference resolution, and mentioned player prediction, to capture fine-grained multi-party social dynamics.
  • It presents a multimodal baseline that densely aligns visual features with utterances, building on pre-trained language and vision models for richer interaction understanding.
  • The study establishes robust benchmarks on social deduction games, paving the way for more socially aware AI applications in virtual assistance and robotics.

Advancing the Understanding of Multimodal Social Interactions

Introduction to the New Challenges in Social Interactions

In the domain of artificial intelligence, and particularly in the study of multimodal social interaction, considerable progress has been achieved. Yet most efforts have focused on the behavior of isolated individuals or have not sufficiently captured the dynamics of multi-party interactions through both verbal and non-verbal cues. Recognizing this gap, the paper introduces new tasks and a novel approach centered on densely aligned language-visual representations for modeling multi-party social dynamics, particularly in the context of social deduction games.

Novel Tasks to Model Fine-Grained Social Dynamics

The paper introduces three new tasks aimed at capturing fine-grained dynamics in multimodal social interactions:

  1. Speaking target identification
  2. Pronoun coreference resolution
  3. Mentioned player prediction

These tasks are designed to push beyond conventional analyses of multi-party social interaction by requiring the aligned interpretation of verbal utterances and non-verbal cues (e.g., gestures, gaze) within social deduction games. A hypothetical example of what an annotated instance might contain is sketched below.
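
To make the task setup concrete, here is a purely illustrative sketch of the fields a single annotated instance could contain; the field names, player identifiers, and utterance are hypothetical and are not taken from the released annotations.

```python
# Hypothetical annotated instance for the three tasks; field names and values
# are illustrative only, not the dataset's actual schema.
example_instance = {
    "video_segment": "game_012_clip_004.mp4",   # clip spanning the utterance
    "speaker_id": "player_3",
    "utterance": "I saw you switch the cards, and he was watching you.",
    "labels": {
        "speaking_target": "player_5",          # whom the speaker is addressing
        "pronoun_coreference": {"you": "player_5", "he": "player_1"},
        "mentioned_players": ["player_5", "player_1"],
    },
}
```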

Methodology and Approach

To address these tasks, the authors develop a multimodal baseline built on densely aligned language-visual representations. The model analyzes spoken utterances together with their corresponding visual cues (including gestures and gaze directions), aligning each player's visual features with the utterances that refer to them. This enables a level of fine-grained analysis that holistic visual representations or single-modality approaches alone do not support.
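
As a rough illustration of the alignment idea, and not the authors' actual architecture, the sketch below uses a cross-attention layer in which each player's visual features query the utterance's token embeddings, producing a per-player score such as a speaking-target logit. The module names, dimensions, and the temporal-pooling assumption are all illustrative.

```python
# Minimal PyTorch sketch of cross-modal fusion between utterance tokens and
# per-player visual features (illustrative, not the paper's implementation).
import torch
import torch.nn as nn


class DenselyAlignedFusion(nn.Module):
    def __init__(self, text_dim=768, visual_dim=512, hidden_dim=256, num_heads=4):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        # Cross-attention: each player's visual track queries the utterance tokens.
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.scorer = nn.Linear(hidden_dim, 1)  # one logit per player

    def forward(self, token_emb, player_emb):
        # token_emb:  (batch, num_tokens, text_dim)    utterance token embeddings
        # player_emb: (batch, num_players, visual_dim) per-player visual features,
        #             assumed pooled over the utterance's video span
        q = self.visual_proj(player_emb)        # (B, P, H)
        kv = self.text_proj(token_emb)          # (B, T, H)
        fused, _ = self.cross_attn(q, kv, kv)   # (B, P, H)
        return self.scorer(fused).squeeze(-1)   # (B, P) per-player logits


# Random features stand in for real encoder outputs.
logits = DenselyAlignedFusion()(torch.randn(2, 20, 768), torch.randn(2, 6, 512))
print(logits.shape)  # torch.Size([2, 6])
```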

Through extensive data annotations within social deduction game datasets, the paper not only curates challenging new tasks but also provides the benchmarks needed to evaluate future approaches in this domain. The baseline model leverages pre-trained language models such as BERT, RoBERTa, and ELECTRA, alongside the alignment strategies and visual feature extraction methods described above.
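
On the text side, the kind of pre-trained encoding referred to here can be obtained, for example, with BERT through the Hugging Face transformers library; the checkpoint choice and the mean pooling below are illustrative assumptions rather than details from the paper.

```python
# Extract token-level utterance embeddings with a pre-trained BERT encoder
# (illustrative choice of checkpoint and pooling).
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

utterance = "I think you two were both out there last night."
inputs = tokenizer(utterance, return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

token_emb = outputs.last_hidden_state   # (1, num_tokens, 768) token embeddings
utterance_emb = token_emb.mean(dim=1)   # simple mean pooling as one option
print(token_emb.shape, utterance_emb.shape)
```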

Key Findings and Contributions

The experimental results presented in the paper underscore the efficacy of the proposed approach, showcasing substantial improvements in task performance when leveraging the densely aligned multimodal representations over traditional or unimodal approaches. Specifically, these improvements are evidenced across all three introduced tasks, highlighting the significance of integrating both verbal and non-verbal cues through alignment in understanding social interactions.

Furthermore, the paper thoroughly examines the impact of various factors including visual feature types (gesture versus gaze features), the role of conversational context, and the effectiveness of permutation learning in enhancing model performance. These examinations shed light on the complex interplay between different elements of social interactions and underscore the nuanced understanding required to model them effectively.

Future Directions and Implications

This research not only sets a new benchmark for the study of multimodal social interactions but also lays the foundation for future work on socially aware artificial intelligence. Densely aligned multimodal representations open new avenues for exploring the intricacies of human social behavior, paving the way for AI systems better able to navigate the social world. Moreover, by releasing the benchmarks and source code, the authors facilitate further exploration and innovation in this rapidly evolving field.

The proposed approach and its success in enhancing our understanding of multimodal social interactions hold promise not only for the further development of social artificial intelligence but also for practical applications in areas such as virtual assistance, social robotics, and beyond, where an in-depth understanding of human social dynamics is crucial.

In conclusion, this paper makes a significant contribution to the burgeoning field of social artificial intelligence by introducing robust tasks, a novel methodology, and offering insightful findings that collectively advance our understanding of the complexities of multimodal social interactions.
