Joint Multimodal Transformer for Emotion Recognition in the Wild (2403.10488v3)
Abstract: Multimodal emotion recognition (MMER) systems typically outperform unimodal systems by leveraging the inter- and intra-modal relationships among, e.g., visual, textual, physiological, and auditory modalities. This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention. This framework can exploit the complementary nature of diverse modalities to improve predictive accuracy. Separate backbones capture intra-modal spatiotemporal dependencies within each modality over video sequences. Subsequently, our JMT fusion architecture integrates the individual modality embeddings, allowing the model to effectively capture inter- and intra-modal relationships. Extensive experiments on two challenging expression recognition tasks -- (1) dimensional emotion recognition on the Affwild2 dataset (with face and voice) and (2) pain estimation on the Biovid dataset (with face and biosensors) -- indicate that our JMT fusion can provide a cost-effective solution for MMER. Empirical results show that MMER systems using our proposed fusion outperform relevant baseline and state-of-the-art methods.
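The abstract describes a two-stage design: per-modality backbones produce sequence embeddings, and a transformer-style fusion module lets each modality attend to a joint representation of all modalities. The snippet below is a minimal sketch of that general idea, not the authors' implementation: it assumes PyTorch, two modalities (visual and audio), a shared embedding dimension, and a hypothetical `JointCrossAttentionFusion` module in which each modality's embeddings act as queries over concatenated (joint) keys and values before a small regression head.

```python
# Minimal sketch (not the paper's code): joint cross-attention fusion of two
# modality embeddings. Dimensions, pooling, and the output head are assumptions.
import torch
import torch.nn as nn


class JointCrossAttentionFusion(nn.Module):
    """Each modality queries a joint (concatenated) representation of all modalities."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn_v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(2 * dim, 2)  # e.g., valence/arousal outputs

    def forward(self, vis: torch.Tensor, aud: torch.Tensor) -> torch.Tensor:
        # vis, aud: (batch, seq_len, dim) embeddings from the modality backbones
        joint = torch.cat([vis, aud], dim=1)        # joint keys/values across modalities
        vis_f, _ = self.attn_v(vis, joint, joint)   # visual queries attend to the joint sequence
        aud_f, _ = self.attn_a(aud, joint, joint)   # audio queries attend to the joint sequence
        pooled = torch.cat([vis_f.mean(dim=1), aud_f.mean(dim=1)], dim=-1)
        return self.head(pooled)


if __name__ == "__main__":
    model = JointCrossAttentionFusion()
    vis = torch.randn(4, 16, 512)   # stand-in for visual backbone features
    aud = torch.randn(4, 16, 512)   # stand-in for audio backbone features
    print(model(vis, aud).shape)    # torch.Size([4, 2])
```

Concatenating the modality sequences before attention is one simple way to expose both inter- and intra-modal relationships to each query stream; the paper's actual fusion block may differ in how keys are formed and how outputs are aggregated.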
- Paul Waligora
- Osama Zeeshan
- Haseeb Aslam
- Soufiane Belharbi
- Alessandro Lameiras Koerich
- Marco Pedersoli
- Simon Bacon
- Eric Granger