Modality-Collaborative Transformer with Hybrid Feature Reconstruction for Robust Emotion Recognition (2312.15848v1)
Abstract: As a vital aspect of affective computing, Multimodal Emotion Recognition has been an active research area in the multimedia community. Despite recent progress, the field still faces two major challenges in real-world applications: 1) efficiently constructing joint representations from unaligned multimodal features, and 2) mitigating the performance drop caused by randomly missing modality features. In this paper, we propose a unified framework, the Modality-Collaborative Transformer with Hybrid Feature Reconstruction (MCT-HFR), to address these issues. The core component of MCT is a novel attention-based encoder that concurrently extracts and dynamically balances intra- and inter-modality relations across all modalities. With modality-wise parameter sharing, it encodes a more compact representation at lower time and space complexity. To improve the robustness of MCT, we further introduce HFR, which consists of two modules: Local Feature Imagination (LFI) and Global Feature Alignment (GFA). During training, LFI uses the complete features as supervisory signals to recover locally missing features, while GFA reduces the global semantic gap between paired complete and incomplete representations. Experiments on two popular benchmark datasets show that our method consistently outperforms strong baselines in both complete and incomplete data scenarios.
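The abstract is the only technical description available in this excerpt, so the following PyTorch sketch is only a rough illustration of the ideas it names, not the authors' implementation: (a) a single Transformer block reused across modalities, approximating the modality-wise parameter sharing in MCT, where self-attention captures intra-modality relations and cross-attention captures inter-modality relations; and (b) hypothetical loss terms in the spirit of LFI (reconstructing masked positions against the complete features) and GFA (aligning incomplete and complete joint representations). All module names, shapes, masking details, and loss choices here are assumptions.

```python
# Hypothetical sketch of the ideas named in the abstract; NOT the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedModalityBlock(nn.Module):
    """One Transformer block applied to every modality (modality-wise
    parameter sharing): self-attention models intra-modality relations,
    cross-attention over the other modalities' features models
    inter-modality relations."""
    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(d_model) for _ in range(3))

    def forward(self, x: torch.Tensor, others: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) features of one modality;
        # others: concatenated sequences of the remaining modalities.
        h, _ = self.self_attn(x, x, x)          # intra-modality relations
        x = self.norm1(x + h)
        h, _ = self.cross_attn(x, others, others)  # inter-modality relations
        x = self.norm2(x + h)
        return self.norm3(x + self.ffn(x))

def hfr_losses(recon, complete, z_incomplete, z_complete, missing_mask):
    """Hypothetical training objectives in the spirit of HFR.
    LFI: recover locally missing features, supervised by the complete view.
    GFA: shrink the global gap between incomplete and complete joint
    representations of the same sample (one-way, toward the complete view)."""
    lfi = F.l1_loss(recon[missing_mask], complete[missing_mask])
    gfa = F.mse_loss(z_incomplete, z_complete.detach())
    return lfi, gfa
```

In a training loop one would presumably add weighted `lfi` and `gfa` terms to the task loss and drop them at inference, when only the (possibly incomplete) inputs are available; the paper's actual masking scheme, reconstruction target, and alignment measure may differ.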