Joint Multimodal Transformer for Emotion Recognition in the Wild (2403.10488v3)

Published 15 Mar 2024 in cs.CV, cs.LG, cs.SD, and eess.AS

Abstract: Multimodal emotion recognition (MMER) systems typically outperform unimodal systems by leveraging the inter- and intra-modal relationships between, e.g., visual, textual, physiological, and auditory modalities. This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention. This framework can exploit the complementary nature of diverse modalities to improve predictive accuracy. Separate backbones capture intra-modal spatiotemporal dependencies within each modality over video sequences. Subsequently, our JMT fusion architecture integrates the individual modality embeddings, allowing the model to effectively capture inter- and intra-modal relationships. Extensive experiments on two challenging expression recognition tasks -- (1) dimensional emotion recognition on the Affwild2 dataset (with face and voice) and (2) pain estimation on the Biovid dataset (with face and biosensors) -- indicate that our JMT fusion can provide a cost-effective solution for MMER. Empirical results show that MMER systems with our proposed fusion allow us to outperform relevant baseline and state-of-the-art methods.
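
The abstract describes the pipeline only at a high level: modality-specific backbones produce embeddings over video sequences, and a joint multimodal transformer fuses them with key-based cross-attention. The exact architecture (dimensions, number of fusion blocks, how keys are shared) is not specified here, so the sketch below is only an assumed illustration of cross-attention fusion of two modality embeddings, not the authors' implementation; the class name `CrossAttentionFusion`, the 512-dimensional embeddings, the single attention layer per modality, and the valence/arousal output head are all hypothetical choices.

```python
# Minimal sketch (assumptions, not the paper's code): each modality's queries
# attend to keys/values drawn from the joint (concatenated) representation,
# so intra- and inter-modal dependencies are captured in one attention step.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn_a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.attn_v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.head = nn.Linear(2 * dim, 2)  # e.g., valence and arousal

    def forward(self, feat_a, feat_v):
        # feat_a, feat_v: (batch, seq_len, dim) embeddings from the modality backbones
        joint = torch.cat([feat_a, feat_v], dim=1)    # joint multimodal sequence
        a, _ = self.attn_a(feat_a, joint, joint)      # audio queries attend to joint keys/values
        v, _ = self.attn_v(feat_v, joint, joint)      # visual queries attend to joint keys/values
        a = self.norm(a + feat_a).mean(dim=1)         # residual connection + temporal pooling
        v = self.norm(v + feat_v).mean(dim=1)
        return self.head(torch.cat([a, v], dim=-1))   # fused prediction

# Usage with dummy audio and face embeddings (batch of 4, 16 time steps, 512-d)
fusion = CrossAttentionFusion()
pred = fusion(torch.randn(4, 16, 512), torch.randn(4, 16, 512))  # shape (4, 2)
```

In this reading, letting each modality query the concatenated sequence is one way to realize "inter- and intra-modal" attention in a single block; the paper may organize queries, keys, and values differently.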

Authors (8)
  1. Paul Waligora (1 paper)
  2. Osama Zeeshan (2 papers)
  3. Haseeb Aslam (3 papers)
  4. Soufiane Belharbi (31 papers)
  5. Alessandro Lameiras Koerich (41 papers)
  6. Marco Pedersoli (81 papers)
  7. Simon Bacon (9 papers)
  8. Eric Granger (121 papers)
Citations (3)