Scaling Properties of Speech Language Models (2404.00685v2)

Published 31 Mar 2024 in eess.AS, cs.AI, cs.CL, and cs.NE

Abstract: Speech language models (SLMs) aim to learn language from raw audio, without textual resources. Despite significant advances, our current models exhibit weak syntactic and semantic abilities. However, if the scaling properties of neural language models hold for the speech modality, these abilities will improve as the amount of compute used for training increases. In this paper, we use models of this scaling behavior to estimate the scale at which our current methods will yield an SLM with the English proficiency of text-based LLMs. We establish a strong correlation between pre-training loss and downstream syntactic and semantic performance in SLMs and LLMs, which results in predictable scaling of linguistic performance. We show that the linguistic performance of SLMs scales up to three orders of magnitude more slowly than that of text-based LLMs. Additionally, we study the benefits of synthetic data designed to boost semantic understanding and the effects of coarser speech tokenization.
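The extrapolation described in the abstract relies on the standard power-law form of neural scaling laws. Below is a minimal sketch (not taken from the paper) of how such a fit and extrapolation can be done, assuming the Kaplan-style form L(C) = (C_c / C)^alpha; all numerical values are purely illustrative placeholders, not results from the paper.

```python
# Sketch: fit a power-law scaling curve L(C) = (C_c / C) ** alpha to
# (training compute, pre-training loss) pairs, then extrapolate the compute
# needed to reach a target loss. All numbers below are illustrative.
import numpy as np

# Hypothetical (compute in FLOPs, pre-training loss) measurements.
compute = np.array([1e18, 1e19, 1e20, 1e21])
loss = np.array([4.1, 3.6, 3.2, 2.9])

# In log space the power law is linear: log L = -alpha * log C + alpha * log C_c.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), deg=1)
alpha = -slope               # scaling exponent
log_Cc = intercept / alpha   # log of the compute constant C_c

def predicted_loss(c):
    """Power-law prediction L(C) = (C_c / C) ** alpha."""
    return np.exp(alpha * (log_Cc - np.log(c)))

# Extrapolate the compute required to reach a target (e.g. LLM-level) loss:
# log C = log C_c - (log L_target) / alpha.
target_loss = 2.5
required_compute = np.exp(log_Cc - np.log(target_loss) / alpha)
print(f"alpha = {alpha:.3f}; compute to reach loss {target_loss}: {required_compute:.2e} FLOPs")
```

The same fit-then-extrapolate pattern can be applied to downstream syntactic or semantic scores once they are shown to correlate with pre-training loss, which is the linkage the paper establishes.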

Authors (2)
  1. Santiago Cuervo
  2. Ricard Marxer