
Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM (2305.15255v4)

Published 24 May 2023 in cs.CL, cs.LG, cs.SD, and eess.AS

Abstract: We present Spectron, a novel approach to adapting pre-trained LLMs to perform spoken question answering (QA) and speech continuation. By endowing the LLM with a pre-trained speech encoder, our model can take speech inputs and generate speech outputs. The entire system is trained end-to-end and operates directly on spectrograms, simplifying our architecture. Key to our approach is a training objective that jointly supervises speech recognition, text continuation, and speech synthesis using only paired speech-text data, enabling a 'cross-modal' chain-of-thought within a single decoding pass. Our method surpasses existing spoken LLMs in speaker preservation and semantic coherence. Furthermore, the proposed model improves upon direct initialization in retaining the knowledge of the original LLM, as demonstrated through spoken QA datasets. We release our audio samples (https://michelleramanovich.github.io/spectron/spectron) and spoken QA dataset (https://github.com/google-research-datasets/LLAMA1-Test-Set).
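
The joint objective described in the abstract can be pictured as a single decoder pass that emits the input transcript, its text continuation, and the continuation's spectrogram frames, with one loss term per sub-task. The sketch below is a rough illustration only, not the authors' implementation: the `model` interface, the tensor names, and the use of an L1 regression term for synthesis are assumptions made for the example.

```python
# Minimal sketch of a Spectron-style joint training objective (assumed, not the paper's code).
# A single decoding pass, conditioned on the encoded speech, is assumed to return:
#   - logits for the input transcript   (speech recognition term)
#   - logits for the text continuation  (language modeling term)
#   - predicted spectrogram frames      (speech synthesis term)
import torch
import torch.nn.functional as F

def joint_loss(model, speech_spectrogram, transcript_ids, continuation_ids, target_frames):
    """Return L = L_asr + L_lm + L_synth for one paired speech-text example."""
    # Hypothetical model call: shapes assumed to be
    #   asr_logits: (B, T_asr, vocab), lm_logits: (B, T_lm, vocab), pred_frames: (B, T_spec, n_mels)
    asr_logits, lm_logits, pred_frames = model(
        speech_spectrogram, transcript_ids, continuation_ids
    )

    # Speech recognition: cross-entropy against the reference transcript tokens.
    l_asr = F.cross_entropy(asr_logits.transpose(1, 2), transcript_ids)
    # Text continuation: standard LM cross-entropy on the continuation tokens.
    l_lm = F.cross_entropy(lm_logits.transpose(1, 2), continuation_ids)
    # Speech synthesis: regression of predicted frames onto the ground-truth spectrogram.
    l_synth = F.l1_loss(pred_frames, target_frames)

    return l_asr + l_lm + l_synth
```

Because the transcript tokens are decoded before the continuation and its frames, the text acts as an intermediate "reasoning" step for the speech output, which is what the abstract calls a cross-modal chain-of-thought.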

Authors (9)
  1. Eliya Nachmani (38 papers)
  2. Alon Levkovitch (4 papers)
  3. Roy Hirsch (7 papers)
  4. Julian Salazar (17 papers)
  5. Chulayuth Asawaroengchai (5 papers)
  6. Soroosh Mariooryad (11 papers)
  7. Ehud Rivlin (29 papers)
  8. RJ Skerry-Ryan (21 papers)
  9. Michelle Tadmor Ramanovich (7 papers)
Citations (19)