A multimodal dynamical variational autoencoder for audiovisual speech representation learning (2305.03582v3)

Published 5 May 2023 in cs.SD, cs.LG, cs.MM, and eess.AS

Abstract: In this paper, we present a multimodal and dynamical VAE (MDVAE) applied to unsupervised audio-visual speech representation learning. The latent space is structured to dissociate the latent dynamical factors that are shared between the modalities from those that are specific to each modality. A static latent variable is also introduced to encode the information that is constant over time within an audiovisual speech sequence. The model is trained in an unsupervised manner on an audiovisual emotional speech dataset, in two stages. In the first stage, a vector-quantized VAE (VQ-VAE) is learned independently for each modality, without temporal modeling. The second stage consists of learning the MDVAE model on the intermediate representations of the VQ-VAEs, taken before quantization. The disentanglement between static versus dynamical and modality-specific versus modality-common information occurs during this second training stage. Extensive experiments are conducted to investigate how audiovisual speech latent factors are encoded in the latent space of MDVAE. These experiments include manipulating audiovisual speech, audiovisual facial image denoising, and audiovisual speech emotion recognition. The results show that MDVAE effectively combines the audio and visual information in its latent space. They also show that the learned static representation of audiovisual speech can be used for emotion recognition with little labeled data, and with better accuracy than unimodal baselines and a state-of-the-art supervised model based on an audiovisual transformer architecture.
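To make the latent factorization in the abstract concrete, the sketch below shows one way the three kinds of latent variables could be wired up in PyTorch: a static variable w inferred once per sequence, a shared dynamical variable z_av inferred per frame from both modalities, and modality-specific dynamical variables z_a and z_v. This is a minimal illustration under assumed names, dimensions, and layer choices, not the authors' implementation: the actual MDVAE operates on pre-quantization VQ-VAE features (the stage-1 models are not shown), places temporal priors on the dynamical variables, and is trained with a structured ELBO, none of which is reproduced here.

```python
# Minimal sketch of the MDVAE latent factorization (illustrative assumptions
# throughout; not the paper's architecture or training objective).
import torch
import torch.nn as nn


class MDVAESketch(nn.Module):
    def __init__(self, dim_a=64, dim_v=64, dim_w=32,
                 dim_zav=16, dim_za=8, dim_zv=8, hidden=128):
        super().__init__()
        # Sequence-level encoder for the static variable w (constant over time).
        self.enc_w = nn.GRU(dim_a + dim_v, hidden, batch_first=True)
        self.w_head = nn.Linear(hidden, 2 * dim_w)
        # Per-frame encoders for the dynamical variables.
        self.enc_zav = nn.Linear(dim_a + dim_v + dim_w, 2 * dim_zav)  # shared
        self.enc_za = nn.Linear(dim_a + dim_w + dim_zav, 2 * dim_za)  # audio-only
        self.enc_zv = nn.Linear(dim_v + dim_w + dim_zav, 2 * dim_zv)  # visual-only
        # Each modality is reconstructed from (w, z_av, its own specific z).
        self.dec_a = nn.Sequential(
            nn.Linear(dim_w + dim_zav + dim_za, hidden), nn.Tanh(),
            nn.Linear(hidden, dim_a))
        self.dec_v = nn.Sequential(
            nn.Linear(dim_w + dim_zav + dim_zv, hidden), nn.Tanh(),
            nn.Linear(hidden, dim_v))

    @staticmethod
    def reparam(stats):
        # Standard VAE reparameterization from concatenated (mu, log-variance).
        mu, logvar = stats.chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

    def forward(self, x_a, x_v):
        # x_a, x_v: per-modality feature sequences, shape (batch, time, dim).
        x = torch.cat([x_a, x_v], dim=-1)
        _, h = self.enc_w(x)                      # summarize the whole sequence
        w = self.reparam(self.w_head(h[-1]))      # one static vector per sequence
        w_t = w.unsqueeze(1).expand(-1, x.size(1), -1)
        # Shared dynamics see both modalities; specific dynamics see one each.
        z_av = self.reparam(self.enc_zav(torch.cat([x, w_t], -1)))
        z_a = self.reparam(self.enc_za(torch.cat([x_a, w_t, z_av], -1)))
        z_v = self.reparam(self.enc_zv(torch.cat([x_v, w_t, z_av], -1)))
        rec_a = self.dec_a(torch.cat([w_t, z_av, z_a], -1))
        rec_v = self.dec_v(torch.cat([w_t, z_av, z_v], -1))
        return rec_a, rec_v


if __name__ == "__main__":
    model = MDVAESketch()
    rec_a, rec_v = model(torch.randn(2, 40, 64), torch.randn(2, 40, 64))
    print(rec_a.shape, rec_v.shape)  # torch.Size([2, 40, 64]) for both
```

Because w is sampled once per sequence, it can only carry time-invariant information, while the per-frame variables carry the dynamics; this is the structural property that the paper's manipulation and emotion-recognition experiments exploit, e.g. by swapping one sequence's static code into another's dynamical codes.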

Authors (5)
  1. Samir Sadok
  2. Simon Leglaive
  3. Laurent Girin
  4. Xavier Alameda-Pineda
  5. Renaud Séguier