Multi-Stage Multi-Modal Pre-Training for Automatic Speech Recognition (2403.19822v1)
Abstract: Recent advances in machine learning have demonstrated that multi-modal pre-training can improve automatic speech recognition (ASR) performance compared to randomly initialized models, even when models are fine-tuned on uni-modal tasks. Existing multi-modal pre-training methods for the ASR task have primarily focused on single-stage pre-training where a single unsupervised task is used for pre-training followed by fine-tuning on the downstream task. In this work, we introduce a novel method combining multi-modal and multi-task unsupervised pre-training with a translation-based supervised mid-training approach. We empirically demonstrate that such a multi-stage approach leads to relative word error rate (WER) improvements of up to 38.45% over baselines on both Librispeech and SUPERB. Additionally, we share several important findings for choosing pre-training methods and datasets.
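The recipe described in the abstract amounts to three consecutive training phases applied to the same encoder: multi-modal, multi-task unsupervised pre-training; translation-based supervised mid-training; and uni-modal ASR fine-tuning, after which WER is compared against a baseline. The sketch below is a minimal illustration of that staging and of how a relative WER improvement such as the reported 38.45% is computed; the stage names, placeholder callables, and the 10.0% baseline WER are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, Dict, List, Tuple

def relative_wer_improvement(wer_baseline: float, wer_model: float) -> float:
    """Relative WER improvement in percent: 100 * (baseline - model) / baseline."""
    return 100.0 * (wer_baseline - wer_model) / wer_baseline

def run_multi_stage_training(
    stages: List[Tuple[str, Callable[[Dict[str, bool]], Dict[str, bool]]]],
    model_state: Dict[str, bool],
) -> Dict[str, bool]:
    """Run each training stage in order, carrying the encoder state forward.

    The stage order mirrors the abstract: unsupervised multi-modal/multi-task
    pre-training, then supervised translation-based mid-training, then ASR
    fine-tuning. The callables here are placeholders, not real training loops.
    """
    for name, stage_fn in stages:
        print(f"running stage: {name}")
        model_state = stage_fn(model_state)
    return model_state

if __name__ == "__main__":
    stages = [
        ("multi-modal multi-task unsupervised pre-training",
         lambda s: {**s, "pretrained": True}),
        ("translation-based supervised mid-training",
         lambda s: {**s, "mid_trained": True}),
        ("ASR fine-tuning",
         lambda s: {**s, "fine_tuned": True}),
    ]
    print(run_multi_stage_training(stages, model_state={}))

    # A 38.45% relative improvement over an assumed 10.0% baseline WER
    # corresponds to an absolute WER of roughly 6.16%.
    baseline_wer = 10.0  # assumed baseline, for illustration only
    improved_wer = baseline_wer * (1 - 0.3845)
    print(f"{relative_wer_improvement(baseline_wer, improved_wer):.2f}% relative WER improvement")
```

Whether the mid-training stage helps in a given setup still has to be checked empirically, which is what the paper does by measuring WER on Librispeech and SUPERB.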