An Overview of "AudioPaLM: A LLM That Can Speak and Listen"
The paper introduces AudioPaLM, a large language model (LLM) that fuses text-based LLMs with audio processing capabilities. Designed as a unified multimodal architecture, AudioPaLM combines the strengths of the text-based PaLM-2 and the audio-generation model AudioLM in a single model that can both speak and listen. This extends the reach of LLMs to applications requiring sophisticated speech understanding and generation, such as speech recognition and speech-to-text translation.
Model and Methodology
AudioPaLM's defining feature is joint modelling of speech and text with a single shared vocabulary of discrete tokens: the text vocabulary is combined with a set of discretized audio tokens, so the model consumes and produces arbitrarily interleaved sequences of speech and text. This sets it apart from predecessor systems that kept separate models or token sets for the audio and text domains. Built on a decoder-only Transformer, a single AudioPaLM model is trained on a mixture of tasks, with each training example prefixed by a tag specifying the task and the languages involved. This design simplifies training across Automatic Speech Recognition (ASR), Automatic Speech Translation (AST), and Speech-to-Speech Translation (S2ST) without requiring task-specific models.
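To make the shared-vocabulary idea concrete, here is a minimal Python sketch of how audio tokens can be appended after a text vocabulary and assembled into a task-tagged training sequence. The vocabulary sizes, the example tag, and the make_example helper are illustrative assumptions, not details taken from the paper.

```python
# Sketch of a combined text + audio token space with task-tag prompts.
# Vocabulary sizes and the example tag are illustrative, not from the paper.

TEXT_VOCAB_SIZE = 32_000   # assumed size of the pretrained text vocabulary
AUDIO_VOCAB_SIZE = 1_024   # assumed number of discrete audio tokens

def audio_token_id(audio_code: int) -> int:
    """Map a discrete audio code to an ID placed after the text vocabulary."""
    assert 0 <= audio_code < AUDIO_VOCAB_SIZE
    return TEXT_VOCAB_SIZE + audio_code

def make_example(tag_ids, source_audio_codes, target_ids):
    """Build one training sequence: [task tag][source audio tokens][target tokens].

    For ASR/AST the target is text token IDs; for S2ST it would be audio
    token IDs produced by the same audio_token_id mapping.
    """
    return list(tag_ids) + [audio_token_id(c) for c in source_audio_codes] + list(target_ids)

# Toy usage: tag_ids would come from the text tokenizer applied to a tag
# such as "[ASR French]"; here all IDs are made-up integers.
sequence = make_example(tag_ids=[5, 17], source_audio_codes=[3, 3, 40, 7], target_ids=[901, 42])
print(sequence)  # [5, 17, 32003, 32003, 32040, 32007, 901, 42]
```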
Critically, the work leverages the text-only pretraining of LLMs by initializing AudioPaLM from a pretrained text checkpoint: the existing text token embeddings are retained, and the vocabulary is extended with new embeddings for the audio tokens. This transfer of the substantial linguistic and common-sense knowledge present in text-based models markedly improves performance on audio tasks.
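A minimal sketch of this initialization step, assuming the common recipe of copying the pretrained text embedding rows and randomly initializing rows for the new audio tokens; the shapes, init scale, and function name below are placeholders rather than the paper's exact procedure.

```python
import numpy as np

def extend_embeddings(text_embeddings: np.ndarray,
                      num_audio_tokens: int,
                      init_scale: float = 0.02,
                      seed: int = 0) -> np.ndarray:
    """Return a [text_vocab + num_audio_tokens, d_model] embedding matrix.

    Rows for the original text tokens are copied unchanged from the
    pretrained checkpoint; rows for the new audio tokens are randomly
    initialized and learned during finetuning.
    """
    rng = np.random.default_rng(seed)
    d_model = text_embeddings.shape[1]
    audio_rows = rng.normal(0.0, init_scale, size=(num_audio_tokens, d_model))
    return np.concatenate([text_embeddings, audio_rows.astype(text_embeddings.dtype)], axis=0)

# Toy demo: a 32k-token text embedding table with d_model=8, extended by 1,024 audio tokens.
text_emb = np.zeros((32_000, 8), dtype=np.float32)
combined = extend_embeddings(text_emb, num_audio_tokens=1_024)
print(combined.shape)  # (33024, 8)
```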
Results and Analysis
The experimental results show a clear advantage for AudioPaLM over competing models across a range of benchmarks, particularly in AST and S2ST, where it outperforms strong baselines such as Whisper Large-v2 and mSLAM-CTC 2B. Notably, AudioPaLM achieved a BLEU score of 37.8 on the CoVoST2 AST task, surpassing the previous best of 30.7 (USM-M). In S2ST, AudioPaLM reached 32.5, significantly higher than the 25.6 achieved by Translatotron 2. Furthermore, its zero-shot AST performance on many languages without supervised translation data underlines capabilities that extend beyond the training benchmarks.
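For readers unfamiliar with how such benchmark figures are computed, the sketch below shows corpus-level BLEU scoring with the sacrebleu library on toy sentence pairs (not CoVoST2 data). S2ST results in this literature are typically reported as ASR-BLEU: the generated speech is first transcribed with an ASR system, and the transcript is then scored the same way.

```python
import sacrebleu

# Toy hypothesis/reference pairs standing in for translated test segments.
hypotheses = ["the cat sits on the mat", "he bought a new car yesterday"]
references = ["the cat is sitting on the mat", "he bought a new car yesterday"]

# corpus_bleu takes the list of hypotheses and a list of reference streams.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.1f}")
```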
Joint Text and Audio Vocabularies
Exploring joint text and audio token vocabularies in LLMs is crucial to advancing models that handle diverse modalities beyond written natural language. Because the audio tokens carry paralinguistic information (such as speaker identity and intonation) alongside linguistic content, and because the text domain is never isolated from the audio domain, AudioPaLM's approach enriches the potential applications of LLMs in multilingual and speech-centric settings.
Implications and Future Directions
The paper's proposals could drive significant shifts in multimodal LLMs, producing models that are competent across the language spectrum whether the input is recorded speech or written text. AudioPaLM's ability to generalize to zero-shot AST across languages marks a significant step toward widely accessible speech technology. Similar techniques could plausibly be extended to other modalities, such as vision, further broadening the scope of multimodal understanding.
Future work may focus on making the multimodal integration more efficient, particularly by improving the alignment between rich text pretraining and comparatively scarce, diverse audio data. As the paper provides a clear proof of concept, continued efforts could address the open questions it raises around audio tokenization and the lack of established benchmarks and metrics for generative multimodal tasks, which would also benefit real-time speech generation and translation applications. One common tokenization recipe is sketched below.
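As a purely illustrative example of how continuous speech representations are often turned into the discrete audio tokens discussed above, the following sketch applies k-means quantization to frame-level encoder features. The feature dimension, codebook size, and random toy features are assumptions for the sketch, not the paper's actual tokenizer.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_audio_codebook(frame_features: np.ndarray, num_tokens: int) -> KMeans:
    """Fit a k-means codebook over frame-level encoder features
    of shape [num_frames, feature_dim]."""
    return KMeans(n_clusters=num_tokens, random_state=0).fit(frame_features)

def tokenize(frame_features: np.ndarray, codebook: KMeans) -> np.ndarray:
    """Assign each frame to its nearest centroid, yielding one discrete token per frame."""
    return codebook.predict(frame_features)

# Toy demo with random "features"; in practice these would come from a
# pretrained speech encoder, and the codebook would be much larger.
features = np.random.default_rng(0).normal(size=(2_000, 64)).astype(np.float32)
codebook = fit_audio_codebook(features, num_tokens=32)
print(tokenize(features[:10], codebook))  # ten integer token IDs in [0, 32)
```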
In summary, AudioPaLM presents a robust framework that marks a significant shift toward unified handling of speech and text within LLMs, opening pathways for further research and practical applications spanning multilingual and cross-modal tasks. The most immediate implications are for human-computer interaction, where such models can make spoken exchanges with machines more natural and fluid.