Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition (2403.19822v1)

Published 28 Mar 2024 in cs.CL and cs.AI

Abstract: Recent advances in machine learning have demonstrated that multi-modal pre-training can improve automatic speech recognition (ASR) performance compared to randomly initialized models, even when models are fine-tuned on uni-modal tasks. Existing multi-modal pre-training methods for the ASR task have primarily focused on single-stage pre-training where a single unsupervised task is used for pre-training followed by fine-tuning on the downstream task. In this work, we introduce a novel method combining multi-modal and multi-task unsupervised pre-training with a translation-based supervised mid-training approach. We empirically demonstrate that such a multi-stage approach leads to relative word error rate (WER) improvements of up to 38.45% over baselines on both Librispeech and SUPERB. Additionally, we share several important findings for choosing pre-training methods and datasets.
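The abstract describes the pipeline only at a high level: an unsupervised multi-modal, multi-task pre-training stage, a translation-based supervised mid-training stage, and a final ASR fine-tuning stage. As a rough illustration of that three-stage schedule, here is a minimal PyTorch-style sketch. The encoder, projection heads, and loss choices (contrastive audio-video alignment, masked-frame regression, and CTC objectives for both mid-training and fine-tuning) are hypothetical stand-ins chosen for brevity, not the authors' architecture or objectives.

```python
# Hypothetical sketch of the three-stage schedule from the abstract:
# (1) multi-modal, multi-task unsupervised pre-training,
# (2) translation-based supervised mid-training,
# (3) downstream ASR fine-tuning.
# All modules and losses are illustrative stand-ins, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechEncoder(nn.Module):
    """Toy shared encoder reused across all three stages (a Conformer-style model in practice)."""
    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True)

    def forward(self, x):                      # x: (B, T, feat_dim)
        out, _ = self.rnn(x)
        return out                             # (B, T, hidden)

class Heads(nn.Module):
    """Stage-specific projection heads attached to the shared encoder."""
    def __init__(self, hidden=256, vid_dim=512, trans_vocab=1000, asr_vocab=100):
        super().__init__()
        self.to_video = nn.Linear(hidden, vid_dim)            # stage 1: align audio with video
        self.to_masked = nn.Linear(hidden, 80)                 # stage 1: frame reconstruction
        self.to_translation = nn.Linear(hidden, trans_vocab)   # stage 2: translation targets
        self.to_asr = nn.Linear(hidden, asr_vocab)             # stage 3: ASR characters

def stage1_loss(enc, heads, feats, video_emb):
    """Unsupervised multi-task loss: contrastive audio-video alignment + frame regression."""
    h = enc(feats)                                             # (B, T, H)
    pooled = F.normalize(heads.to_video(h.mean(1)), dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = pooled @ v.t() / 0.07                             # audio-video similarity matrix
    contrastive = F.cross_entropy(logits, torch.arange(feats.size(0)))
    recon = F.mse_loss(heads.to_masked(h), feats)              # stand-in for masked prediction
    return contrastive + recon

def stage2_loss(enc, heads, feats, translation_tokens):
    """Supervised mid-training on speech translation; CTC here is a simple stand-in
    for whatever translation objective the paper actually uses."""
    logp = heads.to_translation(enc(feats)).log_softmax(-1).transpose(0, 1)
    in_len = torch.full((feats.size(0),), feats.size(1), dtype=torch.long)
    tgt_len = torch.full((feats.size(0),), translation_tokens.size(1), dtype=torch.long)
    return F.ctc_loss(logp, translation_tokens, in_len, tgt_len)

def stage3_loss(enc, heads, feats, transcripts):
    """Downstream ASR fine-tuning with a CTC objective."""
    logp = heads.to_asr(enc(feats)).log_softmax(-1).transpose(0, 1)
    in_len = torch.full((feats.size(0),), feats.size(1), dtype=torch.long)
    tgt_len = torch.full((feats.size(0),), transcripts.size(1), dtype=torch.long)
    return F.ctc_loss(logp, transcripts, in_len, tgt_len)

if __name__ == "__main__":
    enc, heads = SpeechEncoder(), Heads()
    opt = torch.optim.Adam(list(enc.parameters()) + list(heads.parameters()), lr=1e-4)
    B, T = 4, 50
    feats = torch.randn(B, T, 80)               # toy log-mel features
    video = torch.randn(B, 512)                 # toy pre-extracted video embeddings
    trans = torch.randint(1, 1000, (B, 20))     # toy translation token targets
    text = torch.randint(1, 100, (B, 20))       # toy ASR transcript targets
    # One toy optimization step per stage; in practice each stage runs to convergence
    # on its own dataset before the next stage begins.
    for loss_fn, batch in [(stage1_loss, (feats, video)),
                           (stage2_loss, (feats, trans)),
                           (stage3_loss, (feats, text))]:
        opt.zero_grad()
        loss = loss_fn(enc, heads, *batch)
        loss.backward()
        opt.step()
        print(loss_fn.__name__, float(loss))
```

The point of the sketch is only the staging: the same encoder parameters are carried from the unsupervised stage through the supervised translation mid-training stage into ASR fine-tuning, which is the structural difference from single-stage pre-training the abstract highlights.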
