
A vector quantized masked autoencoder for audiovisual speech emotion recognition (2305.03568v2)

Published 5 May 2023 in cs.SD, cs.LG, cs.MM, and eess.AS

Abstract: The limited availability of labeled data is a major challenge in audiovisual speech emotion recognition (SER). Self-supervised learning approaches have recently been proposed to mitigate the need for labeled data in various applications. This paper proposes the VQ-MAE-AV model, a vector quantized masked autoencoder (MAE) designed for audiovisual speech self-supervised representation learning and applied to SER. Unlike previous approaches, the proposed method employs a self-supervised paradigm based on discrete audio and visual speech representations learned by vector quantized variational autoencoders. A multimodal MAE with self- or cross-attention mechanisms is proposed to fuse the audio and visual speech modalities and to learn local and global representations of the audiovisual speech sequence, which are then used for an SER downstream task. Experimental results show that the proposed approach, which is pre-trained on the VoxCeleb2 database and fine-tuned on standard emotional audiovisual speech datasets, outperforms the state-of-the-art audiovisual SER methods. Extensive ablation experiments are also provided to assess the contribution of the different model components.
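The abstract describes two key preprocessing ideas: continuous audio/visual frame embeddings are discretized via nearest-neighbour codebook lookup (the vector quantization step of a VQ-VAE), and the resulting token sequences are randomly masked before being fed to the multimodal MAE. The sketch below illustrates those two steps only; it is not the authors' implementation, and the codebook size, embedding dimension, and mask ratio are illustrative assumptions.

```python
import numpy as np

def quantize(z, codebook):
    """Nearest-neighbour codebook lookup, as in a VQ-VAE.
    z: (T, D) continuous frame embeddings; codebook: (K, D).
    Returns discrete token indices of shape (T,)."""
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, K)
    return dists.argmin(axis=1)

def random_mask(tokens, mask_ratio, mask_id, rng):
    """Replace a random subset of discrete tokens with a mask token --
    the corruption a masked autoencoder is trained to invert."""
    tokens = tokens.copy()
    n_mask = int(round(mask_ratio * len(tokens)))
    idx = rng.choice(len(tokens), size=n_mask, replace=False)
    tokens[idx] = mask_id
    return tokens, idx

rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 16))   # K=256 codes of dim 16 (illustrative)
audio_z = rng.normal(size=(50, 16))     # 50 continuous audio-frame embeddings
audio_tokens = quantize(audio_z, codebook)          # values in [0, 255]
masked, masked_idx = random_mask(audio_tokens, 0.8, mask_id=256, rng=rng)
```

In the paper's pipeline the analogous visual token sequence would be built the same way from its own codebook, and the MAE with self- or cross-attention would then reconstruct the masked tokens of both modalities.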

Authors (3)
  1. Samir Sadok (6 papers)
  2. Simon Leglaive (24 papers)
  3. Renaud Séguier (11 papers)
Citations (5)