MIPS at SemEval-2024 Task 3: Multimodal Emotion-Cause Pair Extraction in Conversations with Multimodal Language Models (2404.00511v3)

Published 31 Mar 2024 in cs.CL, cs.CV, and cs.MM

Abstract: This paper presents our winning submission to Subtask 2 of SemEval 2024 Task 3 on multimodal emotion cause analysis in conversations. We propose a novel Multimodal Emotion Recognition and Multimodal Emotion Cause Extraction (MER-MCE) framework that integrates text, audio, and visual modalities using specialized emotion encoders. Our approach sets itself apart from top-performing teams by leveraging modality-specific features for enhanced emotion understanding and causality inference. Experimental evaluation demonstrates the advantages of our multimodal approach, with our submission achieving a competitive weighted F1 score of 0.3435, ranking third with a margin of only 0.0339 behind the 1st team and 0.0025 behind the 2nd team. Project: https://github.com/MIPS-COLT/MER-MCE.git
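The abstract describes a two-stage pipeline: modality-specific emotion encoders feed a Multimodal Emotion Recognition (MER) stage, and its utterance-level emotion predictions then condition a Multimodal Emotion Cause Extraction (MCE) stage that identifies cause utterances. The sketch below illustrates that overall shape in PyTorch; the encoder dimensions, fusion strategy, and class names are assumptions for illustration, not the authors' implementation (the real code is in the linked project repository).

```python
# Minimal sketch of a two-stage MER-MCE-style pipeline (illustrative only).
# Feature dimensions, additive fusion, and class names are assumptions;
# see https://github.com/MIPS-COLT/MER-MCE.git for the authors' code.
import torch
import torch.nn as nn

class MultimodalEmotionRecognizer(nn.Module):
    """Stage 1: fuse per-utterance text/audio/visual features and predict an emotion."""
    def __init__(self, d_text=768, d_audio=512, d_vis=512, d_fused=256, n_emotions=7):
        super().__init__()
        # Project each modality-specific feature into a shared space.
        self.proj_t = nn.Linear(d_text, d_fused)
        self.proj_a = nn.Linear(d_audio, d_fused)
        self.proj_v = nn.Linear(d_vis, d_fused)
        self.classifier = nn.Linear(d_fused, n_emotions)

    def forward(self, text_feat, audio_feat, vis_feat):
        # Simple additive fusion; attention-based fusion is a common alternative.
        fused = torch.tanh(self.proj_t(text_feat)
                           + self.proj_a(audio_feat)
                           + self.proj_v(vis_feat))
        return self.classifier(fused), fused

class CausePairScorer(nn.Module):
    """Stage 2: score (emotion utterance, candidate cause utterance) pairs."""
    def __init__(self, d_fused=256, n_emotions=7):
        super().__init__()
        self.emo_embed = nn.Embedding(n_emotions, d_fused)
        self.scorer = nn.Sequential(
            nn.Linear(3 * d_fused, d_fused), nn.ReLU(), nn.Linear(d_fused, 1))

    def forward(self, emo_repr, cause_repr, emo_label):
        # Condition the pair score on the predicted emotion from stage 1.
        pair = torch.cat([emo_repr, cause_repr, self.emo_embed(emo_label)], dim=-1)
        return self.scorer(pair).squeeze(-1)  # logit: is this utterance a cause?
```

In this sketch, stage 1 returns both the emotion logits and the fused utterance representation, so stage 2 can score every candidate cause utterance against each emotion utterance conditioned on the predicted emotion label.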

Authors (6)
  1. Zebang Cheng (10 papers)
  2. Fuqiang Niu (9 papers)
  3. Yuxiang Lin (7 papers)
  4. Zhi-Qi Cheng (61 papers)
  5. Bowen Zhang (161 papers)
  6. Xiaojiang Peng (59 papers)
Citations (5)