Generative Spoken Language Model based on continuous word-sized audio tokens (2310.05224v1)

Published 8 Oct 2023 in cs.CL and cs.LG

Abstract: In NLP, text language models based on words or subwords are known to outperform their character-based counterparts. Yet, in the speech community, the standard inputs of spoken LMs are 20 ms- or 40 ms-long discrete units (shorter than a phoneme). Taking inspiration from word-based LMs, we introduce a Generative Spoken Language Model (GSLM) based on word-size continuous-valued audio embeddings that can generate diverse and expressive language output. This is obtained by replacing the lookup table for lexical types with a Lexical Embedding function, the cross-entropy loss with a contrastive loss, and multinomial sampling with k-NN sampling. The resulting model is the first generative language model based on word-size continuous embeddings. Its performance is on par with discrete-unit GSLMs in terms of generation quality as measured by automatic metrics and subjective human judgements. Moreover, it is five times more memory efficient thanks to its large 200 ms units. In addition, the embeddings before and after the Lexical Embedder are phonetically and semantically interpretable.
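
The abstract's three substitutions (a Lexical Embedding function in place of a lookup table, a contrastive loss in place of cross-entropy, and k-NN sampling in place of multinomial sampling) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the function names, the use of cosine similarity, and the negative-sampling scheme are assumptions made for illustration only.

```python
import torch
import torch.nn.functional as F

def contrastive_lm_loss(pred, target, negatives, temperature=0.1):
    """NCE-style stand-in for cross-entropy: pull the LM's predicted embedding
    towards the true next-word embedding and away from sampled negatives.

    pred:      (B, d) embeddings predicted by the language model.
    target:    (B, d) ground-truth next-word embeddings.
    negatives: (B, N, d) embeddings drawn from elsewhere in the batch/lexicon.
    """
    pos = F.cosine_similarity(pred, target, dim=-1).unsqueeze(-1)        # (B, 1)
    neg = F.cosine_similarity(pred.unsqueeze(1), negatives, dim=-1)      # (B, N)
    logits = torch.cat([pos, neg], dim=-1) / temperature                 # (B, 1+N)
    labels = torch.zeros(pred.size(0), dtype=torch.long, device=pred.device)
    return F.cross_entropy(logits, labels)                               # positive sits at index 0

def knn_sample(pred, lexicon, k=10, temperature=1.0):
    """k-NN sampling: instead of a multinomial over a discrete vocabulary,
    sample among the k lexicon embeddings closest to the prediction.

    pred:    (d,) continuous embedding predicted by the language model.
    lexicon: (V, d) bank of word-sized continuous audio embeddings.
    """
    sims = F.cosine_similarity(pred.unsqueeze(0), lexicon, dim=-1)       # (V,)
    top_sims, top_idx = sims.topk(k)
    probs = F.softmax(top_sims / temperature, dim=-1)
    choice = torch.multinomial(probs, num_samples=1)
    return lexicon[top_idx[choice]].squeeze(0)                           # (d,)
```

In this sketch the lexicon is simply a (V, d) tensor of word-sized audio embeddings; at generation time the model's continuous prediction is matched against it and one of the k closest entries is drawn, which is how sampling over a discrete vocabulary is replaced when the units are continuous-valued.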

Authors (7)
  1. Robin Algayres (14 papers)
  2. Yossi Adi (96 papers)
  3. Tu Anh Nguyen (12 papers)
  4. Jade Copet (26 papers)
  5. Gabriel Synnaeve (97 papers)
  6. Emmanuel Dupoux (81 papers)
  7. Benoit Sagot (9 papers)
Citations (11)