Learning Language-guided Adaptive Hyper-modality Representation for Multimodal Sentiment Analysis (2310.05804v2)

Published 9 Oct 2023 in cs.AI, cs.CL, cs.CV, and cs.MM

Abstract: Although Multimodal Sentiment Analysis (MSA) proves effective by exploiting rich information from multiple sources (e.g., language, video, and audio), sentiment-irrelevant and conflicting information across modalities may prevent performance from improving further. To alleviate this, we present the Adaptive Language-guided Multimodal Transformer (ALMT), which incorporates an Adaptive Hyper-modality Learning (AHL) module to learn an irrelevance/conflict-suppressing representation from visual and audio features under the guidance of language features at different scales. With the obtained hyper-modality representation, the model can derive a complementary joint representation through multimodal fusion for effective MSA. In practice, ALMT achieves state-of-the-art performance on several popular datasets (e.g., MOSI, MOSEI, and CH-SIMS), and abundant ablation studies demonstrate the validity and necessity of our irrelevance/conflict suppression mechanism.
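
The abstract describes language features guiding attention over the audio and visual streams so that sentiment-irrelevant or conflicting content is suppressed before fusion. Below is a minimal, hypothetical PyTorch sketch of such language-guided cross-attention; the module names, dimensions, zero initialization, and two-layer structure are illustrative assumptions, not the authors' released implementation of AHL.

```python
# Hypothetical sketch of language-guided cross-attention in the spirit of AHL.
# All names and design choices here are assumptions for illustration only.
import torch
import torch.nn as nn


class LanguageGuidedLayer(nn.Module):
    """Refines a hyper-modality sequence using one auxiliary modality,
    with language features supplying the attention queries (the "guidance")."""

    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, hyper, language, modality):
        # Queries come from language; keys/values from the audio or visual
        # stream, so content unrelated to the language cues gets low weight.
        guided, _ = self.attn(query=language, key=modality, value=modality)
        return self.norm(hyper + guided)


class HyperModalitySketch(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        self.audio_layer = LanguageGuidedLayer(dim)
        self.visual_layer = LanguageGuidedLayer(dim)

    def forward(self, language, audio, visual):
        # Start the hyper-modality representation from zeros (an assumption);
        # the paper learns it under language guidance at different scales.
        hyper = torch.zeros_like(language)
        hyper = self.audio_layer(hyper, language, audio)
        hyper = self.visual_layer(hyper, language, visual)
        return hyper


if __name__ == "__main__":
    B, T, D = 2, 20, 128  # batch size, sequence length, feature dim (arbitrary)
    lang, aud, vis = (torch.randn(B, T, D) for _ in range(3))
    out = HyperModalitySketch(D)(lang, aud, vis)
    print(out.shape)  # torch.Size([2, 20, 128])
```

The resulting hyper-modality sequence would then be fused with the language features (e.g., by a standard Transformer encoder) to produce the joint representation used for sentiment prediction.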

Authors (6)
  1. Haoyu Zhang (95 papers)
  2. Yu Wang (939 papers)
  3. Guanghao Yin (9 papers)
  4. Kejun Liu (5 papers)
  5. Yuanyuan Liu (75 papers)
  6. Tianshu Yu (38 papers)
Citations (16)