
A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition (2403.04245v1)

Published 7 Mar 2024 in cs.SD, cs.CV, cs.LG, cs.MM, and eess.AS

Abstract: Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames, performing even worse than single-modality models. While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input. In this paper, we investigate this contrasting phenomenon from the perspective of modality bias and reveal that an excessive modality bias on the audio caused by dropout is the underlying reason. Moreover, we present the Modality Bias Hypothesis (MBH) to systematically describe the relationship between modality bias and robustness against missing modality in multimodal systems. Building on these findings, we propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality and to maintain performance and robustness simultaneously. Finally, to address an entirely missing modality, we adopt adapters to dynamically switch decision strategies. The effectiveness of our proposed approach is evaluated and validated through a series of comprehensive experiments using the MISP2021 and MISP2022 datasets. Our code is available at https://github.com/dalision/ModalBiasAVSR
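The two mechanisms the abstract contrasts can be sketched in a few lines: modality dropout randomly zeroes the video stream during training (which improves robustness to missing frames but biases the model toward audio), while the MDA-KD idea distills the student toward the output distribution of a teacher trained on complete audio-visual input. The sketch below is illustrative only, not the paper's implementation; the function names, the zero-masking strategy, and the plain KL distillation loss are assumptions standing in for the actual method.

```python
import numpy as np

def modality_dropout(audio, video, p_drop=0.3, rng=None):
    # With probability p_drop, replace the video features with zeros,
    # simulating missing video frames at training time (hypothetical
    # masking scheme; the paper may mask differently).
    rng = rng or np.random.default_rng()
    if rng.random() < p_drop:
        video = np.zeros_like(video)
    return audio, video

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    # KL(teacher || student) over temperature-softened distributions,
    # the standard knowledge-distillation objective. MDA-KD's target is
    # a teacher trained on complete audio-visual input, so matching its
    # distribution discourages over-reliance on the audio modality.
    def soft(x):
        e = np.exp((x - x.max(axis=-1, keepdims=True)) / temperature)
        return e / e.sum(axis=-1, keepdims=True)
    p_t, p_s = soft(teacher_logits), soft(student_logits)
    return float(np.sum(p_t * (np.log(p_t) - np.log(p_s))))
```

In this toy form, training with `modality_dropout` alone pushes the student toward audio-only behavior; adding `kd_loss` against a complete-input teacher pulls its predictions back toward the joint multimodal distribution.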
