SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data (2209.15329v3)
Abstract: How to boost speech pre-training with textual data is an unsolved problem, because speech and text are disparate modalities with distinct characteristics. In this paper, we propose a cross-modal Speech and Language Model (SpeechLM) that explicitly aligns speech and text pre-training through a pre-defined unified discrete representation. Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities, a phoneme-unit tokenizer and a hidden-unit tokenizer, which can be trained with a small amount of paired speech-text data. Based on the trained tokenizers, we convert the unlabeled speech and text data into tokens of phoneme units or hidden units. The pre-training objective is designed to unify speech and text into the same discrete semantic space with a unified Transformer network. We evaluate SpeechLM on various spoken language processing tasks, including speech recognition, speech translation, and the universal representation evaluation framework SUPERB, demonstrating significant improvements on content-related tasks. Code and models are available at https://aka.ms/SpeechLM.
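To make the idea concrete, below is a minimal sketch, not the authors' released implementation, of the core recipe the abstract describes: speech and text are first mapped into one shared discrete unit space (by a hidden-unit or phoneme-unit tokenizer), and a single Transformer is then pre-trained with masked unit prediction over both streams. The class name `UnifiedEncoder`, the vocabulary size, the masking rate, and the toy tokenized batches are illustrative assumptions only.

```python
# Sketch of unified discrete-unit pre-training (assumptions, not the SpeechLM code).
import torch
import torch.nn as nn

NUM_UNITS = 512      # shared discrete vocabulary (phoneme units or hidden units); assumed size
MASK_ID = NUM_UNITS  # extra id reserved for [MASK]
D_MODEL = 256

class UnifiedEncoder(nn.Module):
    """One Transformer shared by both modalities after tokenization."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(NUM_UNITS + 1, D_MODEL)  # +1 for [MASK]
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.unit_head = nn.Linear(D_MODEL, NUM_UNITS)      # predicts the masked units

    def forward(self, unit_ids):
        h = self.encoder(self.embed(unit_ids))
        return self.unit_head(h)

def mask_units(units, p=0.15):
    """Replace a random fraction of unit ids with [MASK]; build prediction targets."""
    corrupted = units.clone()
    mask = torch.rand_like(units, dtype=torch.float) < p
    corrupted[mask] = MASK_ID
    targets = units.masked_fill(~mask, -100)  # loss is computed on masked positions only
    return corrupted, targets

# Toy "tokenized" batches. In the paper these ids come from the phoneme-unit or
# hidden-unit tokenizer trained on a small amount of paired speech-text data;
# here they are random placeholders.
speech_units = torch.randint(0, NUM_UNITS, (8, 100))  # e.g. discretized speech frames
text_units   = torch.randint(0, NUM_UNITS, (8, 40))   # e.g. phonemized text

model = UnifiedEncoder()
criterion = nn.CrossEntropyLoss(ignore_index=-100)

loss = torch.zeros(())
for units in (speech_units, text_units):   # both modalities share the same network
    corrupted, targets = mask_units(units)
    logits = model(corrupted)
    loss = loss + criterion(logits.view(-1, NUM_UNITS), targets.view(-1))
print(float(loss))
```

The design choice the sketch highlights is that, once both modalities are expressed as tokens from the same unit vocabulary, unpaired text can be fed through exactly the same masked-prediction objective as unlabeled speech, which is how the textual data aids speech pre-training.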
Authors: Ziqiang Zhang, Sanyuan Chen, Long Zhou, Yu Wu, Shuo Ren, Shujie Liu, Zhuoyuan Yao, Xun Gong, Lirong Dai, Jinyu Li, Furu Wei