SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data (2209.15329v3)

Published 30 Sep 2022 in cs.CL, cs.AI, and eess.AS

Abstract: How to boost speech pre-training with textual data is an unsolved problem because speech and text are very different modalities with distinct characteristics. In this paper, we propose a cross-modal Speech and Language Model (SpeechLM) to explicitly align speech and text pre-training with a pre-defined unified discrete representation. Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities, a phoneme-unit tokenizer and a hidden-unit tokenizer, both of which can be trained with a small amount of paired speech-text data. Using the trained tokenizers, we convert unlabeled speech and text data into tokens of phoneme units or hidden units. The pre-training objective is designed to unify speech and text into the same discrete semantic space with a unified Transformer network. We evaluate SpeechLM on various spoken language processing tasks, including speech recognition, speech translation, and the universal representation evaluation framework SUPERB, demonstrating significant improvements on content-related tasks. Code and models are available at https://aka.ms/SpeechLM.
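The abstract describes one mechanism worth making concrete: both modalities are reduced to IDs in a single shared unit vocabulary (phoneme units or hidden units), and one Transformer is trained over that vocabulary with a masked-prediction objective. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch, not the authors' implementation; the vocabulary size, model dimensions, masking rate, and the assumption that inputs arrive pre-tokenized into unit IDs are all simplifications made for demonstration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical sketch of SpeechLM's shared-discrete-space idea (not the
# authors' code): speech and text are first converted to IDs in one unit
# vocabulary, then a single Transformer is trained with masked prediction.

UNIT_VOCAB = 512        # assumed size of the shared unit vocabulary
MASK_ID = UNIT_VOCAB    # extra ID reserved for the [MASK] token


class UnifiedEncoder(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(UNIT_VOCAB + 1, d_model)  # +1 for [MASK]
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, UNIT_VOCAB)  # predict original units

    def forward(self, unit_ids):  # unit_ids: (batch, time) int64
        return self.head(self.encoder(self.embed(unit_ids)))


def masked_unit_loss(model, unit_ids, mask_prob=0.15):
    """BERT-style masked prediction over unit IDs. The same loss applies
    whether the units came from speech (hidden-unit tokenizer) or from
    text (phoneme-unit tokenizer), since both live in one vocabulary."""
    mask = torch.rand(unit_ids.shape) < mask_prob
    logits = model(unit_ids.masked_fill(mask, MASK_ID))
    return F.cross_entropy(logits[mask], unit_ids[mask])


# Toy usage: unpaired speech and text batches, already tokenized into units.
model = UnifiedEncoder()
speech_units = torch.randint(0, UNIT_VOCAB, (2, 100))  # e.g. hidden units
text_units = torch.randint(0, UNIT_VOCAB, (2, 40))     # e.g. phoneme units
loss = masked_unit_loss(model, speech_units) + masked_unit_loss(model, text_units)
loss.backward()
```

Because the embedding table, encoder, and prediction head are shared across modalities, unpaired text updates the same parameters the speech units pass through, which is the alignment effect the paper's pre-training objective aims for.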

Authors (11)
  1. Ziqiang Zhang (11 papers)
  2. Sanyuan Chen (28 papers)
  3. Long Zhou (57 papers)
  4. Yu Wu (196 papers)
  5. Shuo Ren (22 papers)
  6. Shujie Liu (101 papers)
  7. Zhuoyuan Yao (9 papers)
  8. Xun Gong (44 papers)
  9. Lirong Dai (31 papers)
  10. Jinyu Li (164 papers)
  11. Furu Wei (291 papers)
Citations (52)