AudioPaLM: A Large Language Model That Can Speak and Listen (2306.12925v1)

Published 22 Jun 2023 in cs.CL, cs.AI, cs.SD, eess.AS, and stat.ML

Abstract: We introduce AudioPaLM, an LLM for speech understanding and generation. AudioPaLM fuses text-based and speech-based LLMs, PaLM-2 [Anil et al., 2023] and AudioLM [Borsos et al., 2022], into a unified multimodal architecture that can process and generate text and speech with applications including speech recognition and speech-to-speech translation. AudioPaLM inherits the capability to preserve paralinguistic information such as speaker identity and intonation from AudioLM and the linguistic knowledge present only in text LLMs such as PaLM-2. We demonstrate that initializing AudioPaLM with the weights of a text-only LLM improves speech processing, successfully leveraging the larger quantity of text training data used in pretraining to assist with the speech tasks. The resulting model significantly outperforms existing systems for speech translation tasks and has the ability to perform zero-shot speech-to-text translation for many languages for which input/target language combinations were not seen in training. AudioPaLM also demonstrates features of audio LLMs, such as transferring a voice across languages based on a short spoken prompt. We release examples of our method at https://google-research.github.io/seanet/audiopalm/examples

An Overview of "AudioPaLM: A Large Language Model That Can Speak and Listen"

The paper introduces AudioPaLM, an LLM that fuses text-based language modeling with audio processing. Designed as a unified multimodal architecture, AudioPaLM combines the strengths of the text-based PaLM-2 and the audio-capable AudioLM into a single model that can both speak and listen. This extends LLMs to tasks requiring sophisticated speech understanding and generation, such as speech recognition and speech-to-text translation.

Model and Methodology

AudioPaLM jointly models speech and text with a single shared vocabulary of discrete tokens, in contrast to traditional systems that keep the two domains separate. The model consumes and produces arbitrarily interleaved sequences of text and audio tokens, setting it apart from predecessors that used distinct token sets and distinct decoders for the two modalities. Built on a transformer-based architecture, it handles all tasks from a single setup, which simplifies training across Automatic Speech Recognition (ASR), Automatic Speech Translation (AST), and Speech-to-Speech Translation (S2ST) without requiring task-specific models.
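The shared vocabulary can be pictured as a text tokenizer whose ID space is extended with extra IDs for discretized audio. The sketch below illustrates the idea; the vocabulary sizes, the "[ASR English]" prefix, and the helper names are assumptions for illustration, not values from the paper.

```python
# Minimal sketch of a shared text+audio token space, assuming a text tokenizer
# with vocab size TEXT_VOCAB_SIZE and an audio tokenizer producing discrete
# token IDs in [0, NUM_AUDIO_TOKENS). Sizes and names are illustrative.

TEXT_VOCAB_SIZE = 32_000     # e.g. a SentencePiece vocabulary
NUM_AUDIO_TOKENS = 1_024     # e.g. discrete tokens from a speech encoder

def audio_to_model_ids(audio_tokens):
    """Map audio token IDs into the combined vocabulary by offsetting them
    past the text vocabulary, so text and audio share one embedding table."""
    return [TEXT_VOCAB_SIZE + t for t in audio_tokens]

def build_input(task_prefix_ids, audio_tokens):
    """Concatenate a tokenized text prefix (e.g. '[ASR English]') with the
    remapped audio tokens to form one sequence over the joint vocabulary."""
    return task_prefix_ids + audio_to_model_ids(audio_tokens)

# Example: an ASR-style input with a hypothetical prefix and dummy audio tokens.
prefix_ids = [5, 871, 12]         # stands in for the tokenized "[ASR English]"
audio_ids = [3, 977, 14, 203]     # stands in for encoder-derived audio tokens
model_input = build_input(prefix_ids, audio_ids)
print(model_input)                # [5, 871, 12, 32003, 32977, 32014, 32203]
```

Because text and audio tokens share one ID space, a single output softmax lets the decoder emit text, audio, or interleaved sequences of both.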

Critically, the work leverages text-only pretraining by initializing AudioPaLM with the weights of an existing text LLM, so the model inherits linguistic knowledge before being trained on audio. Transferring the substantial linguistic and common-sense knowledge encoded in text-based models in this way measurably improves performance on speech tasks.
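Concretely, warm-starting from a text-only checkpoint amounts to reusing the pretrained embedding rows (and output projection) for the text tokens while appending freshly initialized rows for the new audio tokens. A minimal NumPy sketch follows; the shapes and initialization scale are assumptions, not the paper's values.

```python
import numpy as np

def expand_embeddings(text_embeddings: np.ndarray, num_audio_tokens: int,
                      rng: np.random.Generator) -> np.ndarray:
    """Keep the pretrained rows for text tokens and append freshly
    initialized rows for the new audio tokens."""
    d_model = text_embeddings.shape[1]
    audio_rows = rng.normal(scale=0.02, size=(num_audio_tokens, d_model))
    return np.concatenate([text_embeddings, audio_rows], axis=0)

rng = np.random.default_rng(0)
text_emb = rng.normal(size=(32_000, 512))   # stands in for pretrained text embeddings
combined = expand_embeddings(text_emb, 1_024, rng)
print(combined.shape)                        # (33024, 512)
```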

Results and Analysis

The experimental results show a clear advantage for AudioPaLM over competing models across several benchmarks, particularly in AST and S2ST. It outperforms systems such as Whisper Large-v2 and mSLAM-CTC 2B. Notably, AudioPaLM achieves a BLEU score of 37.8 on the CoVoST2 AST task, surpassing the previous best of 30.7 (USM-M). In S2ST, AudioPaLM reaches 32.5, well above the 25.6 achieved by Translatotron 2. Its zero-shot AST performance across many languages further demonstrates capabilities that extend beyond the supervised training data.
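The AST and S2ST figures above are corpus-level BLEU scores. As a point of reference, corpus BLEU can be computed with the sacrebleu library (Post, 2018; reference 51 below); the toy example here uses invented sentences, not CoVoST2 data.

```python
import sacrebleu  # pip install sacrebleu

# Toy corpus-level BLEU computation; sentences are invented for illustration.
hypotheses = ["the cat sat on the mat", "it is raining today"]
references = [["the cat is sitting on the mat", "it rains today"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")
```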

Joint Text and Audio Modeling

Exploring joint text and audio token vocabularies in LLMs is crucial to advancing models that handle modalities beyond written language. By carrying both linguistic content and paralinguistic information (such as speaker identity and intonation) in a single token stream, rather than isolating the text domain from audio, AudioPaLM broadens the potential applications of LLMs in multilingual settings.

Implications and Future Directions

The paper's proposals could drive significant shifts in multimodal LLMs, yielding models that operate across the language spectrum whether the input is recorded speech or written text. AudioPaLM's ability to generalize to zero-shot AST across languages marks a substantial step toward widely accessible speech technology, and suggests that similar techniques could be applied to other modalities, such as vision, broadening the scope of multimodal understanding.

Future work may focus on making the multimodal integration more efficient, particularly by improving the alignment between rich text pretraining and comparatively scarce, diverse audio data. With this proof of concept in place, continued efforts could address remaining hurdles around audio tokenization and the establishment of broad benchmarks for multimodal tasks, and could optimize generation and translation for real-time applications.

In summary, AudioPaLM presents a robust framework that marks a significant shift toward unified handling of speech and text within LLMs, opening pathways for further research and practical applications in multilingual and cross-modal tasks. The broader implication is improved human-computer interaction, moving toward more natural and fluid conversational interfaces.

References (77)
  1. Musiclm: Generating music from text. arXiv preprint arXiv:2301.11325, 2023.
  2. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
  3. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023.
  4. Common voice: A massively-multilingual speech corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4218–4222, Marseille, France, May 2020. European Language Resources Association. ISBN 979-10-95546-34-4. URL https://aclanthology.org/2020.lrec-1.520.
  5. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 33:12449–12460, 2020.
  6. mslam: Massively multilingual joint pre-training for speech and text. arXiv preprint arXiv:2202.01374, 2022.
  7. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61. Association for Computational Linguistics, 2019. URL https://aclanthology.org/W19-5301.
  8. Findings of the 2020 conference on machine translation (WMT20). In Proceedings of the Fifth Conference on Machine Translation, pages 1–55. Association for Computational Linguistics, 2020. URL https://aclanthology.org/2020.wmt-1.1.
  9. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1–44. Association for Computational Linguistics, 2013. URL https://aclanthology.org/W13-2201.
  10. Findings of the 2015 workshop on statistical machine translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1–46. Association for Computational Linguistics, 2015. URL https://aclanthology.org/W15-3001.
  11. Findings of the 2017 conference on machine translation (WMT17). In Proceedings of the Second Conference on Machine Translation, pages 169–214. Association for Computational Linguistics, 2017. URL https://aclanthology.org/W17-4717.
  12. Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 272–303. Association for Computational Linguistics, 2018. URL https://aclanthology.org/W18-6401.
  13. AudioLM: a language modeling approach to audio generation. arXiv preprint arXiv:2209.03143, 2022.
  14. Soundstorm: Efficient parallel audio generation. arXiv preprint arXiv:2305.09636, 2023.
  15. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.
  16. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE J. Sel. Top. Signal Process., 16(6):1505–1518, 2022a. doi: 10.1109/JSTSP.2022.3188113. URL https://doi.org/10.1109/JSTSP.2022.3188113.
  17. PaLI: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794, 2022b.
  18. Uniter: Universal image-text representation learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX, pages 104–120. Springer, 2020.
  19. Maestro: Matched speech text representations through modality matching. arXiv preprint arXiv:2204.03409, 2022c.
  20. Self-supervised learning with random-projection quantizer for speech recognition. In International Conference on Machine Learning, pages 3915–3924. PMLR, 2022.
  21. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  22. W2V-Bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training. In ASRU, 2021.
  23. Fleurs: Few-shot learning evaluation of universal representations of speech. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 798–805. IEEE, 2023.
  24. High fidelity neural audio compression. CoRR, abs/2210.13438, 2022. doi: 10.48550/arXiv.2210.13438. URL https://doi.org/10.48550/arXiv.2210.13438.
  25. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  26. Singsong: Generating musical accompaniments from singing. CoRR, abs/2301.12662, 2023. doi: 10.48550/arXiv.2301.12662. URL https://doi.org/10.48550/arXiv.2301.12662.
  27. Violet: End-to-end video-language transformers with masked visual-token modeling. arXiv preprint arXiv:2111.12681, 2021.
  28. Low-resource speech recognition and keyword-spotting. In Speech and Computer: 19th International Conference, SPECOM 2017, Hatfield, UK, September 12-16, 2017, Proceedings 19, pages 3–19. Springer, 2017.
  29. Large-scale adversarial training for vision-and-language representation learning. Advances in Neural Information Processing Systems, 33:6616–6628, 2020.
  30. Textually pretrained speech language models. arXiv preprint arXiv:2305.13009, 2023.
  31. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460, 2021.
  32. Leveraging weakly supervised data to improve end-to-end speech-to-text translation. In Proc. ICASSP, pages 7180–7184, 2019a.
  33. Direct speech-to-speech translation with a sequence-to-sequence model. In INTERSPEECH, 2019b.
  34. Png bert: Augmented bert on phonemes and graphemes for neural tts. Proc. Interspeech 2021, pages 151–155, 2021.
  35. Leveraging unsupervised and weakly-supervised data to improve direct speech-to-speech translation. arXiv preprint arXiv:2203.13339, 2022a.
  36. Translatotron 2: High-quality direct speech-to-speech translation with voice preservation. In International Conference on Machine Learning, pages 10120–10134. PMLR, 2022b.
  37. Cvss corpus and massively multilingual speech-to-speech translation. arXiv preprint arXiv:2201.03713, 2022c.
  38. Transformer-based direct speech-to-speech translation with transcoder. In Proc. IEEE SLT, pages 958–965, 2021.
  39. Speak, read and prompt: High-fidelity text-to-speech with minimal supervision. arXiv preprint arXiv:2302.03540, 2023.
  40. Audiogen: Textually guided audio generation. CoRR, abs/2209.15352, 2022. doi: 10.48550/arXiv.2209.15352. URL https://doi.org/10.48550/arXiv.2209.15352.
  41. T. Kudo and J. Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226, 2018a.
  42. T. Kudo and J. Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In E. Blanco and W. Lu, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018: System Demonstrations, Brussels, Belgium, October 31 - November 4, 2018, pages 66–71. Association for Computational Linguistics, 2018b. doi: 10.18653/v1/d18-2012. URL https://doi.org/10.18653/v1/d18-2012.
  43. On generative spoken language modeling from raw audio. Transactions of the Association for Computational Linguistics, 9:1336–1354, 2021.
  44. JANUS-III: Speech-to-speech translation in multiple languages. In ICASSP, 1997.
  45. Textless speech-to-speech translation on real data. arXiv preprint arXiv:2112.08352, 2021.
  46. Direct speech-to-speech translation with discrete units. In ACL, 2022.
  47. Direct simultaneous speech to speech translation. arXiv preprint arXiv:2110.08250, 2021.
  48. The ATR multilingual speech-to-speech translation system. IEEE Transactions on Audio, Speech, and Language Processing, 2006.
  49. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
  50. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318, 2002.
  51. M. Post. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, pages 186–191, Belgium, Brussels, Oct. 2018. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/W18-6319.
  52. Mls: A large-scale multilingual dataset for speech research. arXiv preprint arXiv:2012.03411, 2020.
  53. When and why are pre-trained word embeddings useful for neural machine translation? In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 529–535. Association for Computational Linguistics, 2018. URL https://aclanthology.org/N18-2084.
  54. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  55. Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356, 2022.
  56. Scaling language models: Methods, analysis & insights from training gopher. CoRR, abs/2112.11446, 2021. URL https://arxiv.org/abs/2112.11446.
  57. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  58. Dnsmos: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.
  59. Scaling up models and data with t5x and seqio, 2022.
  60. Improving neural machine translation models with monolingual data. In ACL, 2016.
  61. Learning audio-visual speech representation by masked multimodal cluster prediction, 2022.
  62. Speech-to-speech translation between untranscribed unknown languages. In Proc. IEEE ASRU, pages 593–600, 2019.
  63. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), pages 5998–6008, 2017.
  64. W. Wahlster. Verbmobil: Foundations of speech-to-speech translation. Springer, 2000.
  65. Covost 2: A massively multilingual speech-to-text translation corpus. CoRR, abs/2007.10310, 2020. URL https://arxiv.org/abs/2007.10310.
  66. Voxpopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation, 2021.
  67. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111, 2023.
  68. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, pages 23318–23340. PMLR, 2022.
  69. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022a.
  70. Joint pre-training with speech and bilingual text for direct speech to speech translation. arXiv:2210.17027, 2022b.
  71. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022a.
  72. Scaling autoregressive models for content-rich text-to-image generation. arXiv:2206.10789, 2022b. doi: 10.48550/arXiv.2206.10789.
  73. Florence: A new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.
  74. SoundStream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:495–507, 2021.
  75. UWSpeech: Speech to speech translation for unwritten languages. In AAAI, 2021.
  76. Google USM: Scaling automatic speech recognition beyond 100 languages. arXiv preprint arXiv:2303.01037, 2023a.
  77. Speak foreign languages with your own voice: Cross-lingual neural codec language modeling. arXiv preprint arXiv:2303.03926, 2023b.
Authors (30)
  1. Paul K. Rubenstein (13 papers)
  2. Chulayuth Asawaroengchai (5 papers)
  3. Duc Dung Nguyen (8 papers)
  4. Ankur Bapna (53 papers)
  5. Zalán Borsos (18 papers)
  6. Félix de Chaumont Quitry (8 papers)
  7. Peter Chen (9 papers)
  8. Dalia El Badawy (5 papers)
  9. Wei Han (202 papers)
  10. Eugene Kharitonov (25 papers)
  11. Hannah Muckenhirn (4 papers)
  12. Dirk Padfield (7 papers)
  13. James Qin (20 papers)
  14. Danny Rozenberg (1 paper)
  15. Tara Sainath (19 papers)
  16. Johan Schalkwyk (7 papers)
  17. Matt Sharifi (9 papers)
  18. Michelle Tadmor Ramanovich (7 papers)
  19. Marco Tagliasacchi (37 papers)
  20. Alexandru Tudor (1 paper)
Citations (214)