AudioPaLM: Unified Text & Speech Model
- AudioPaLM is a unified multimodal model that integrates text and speech capabilities through a shared decoder-only Transformer architecture.
- It combines SentencePiece text tokens with discrete audio tokens from a hierarchical audio tokenization pipeline in a single vocabulary, supporting speech recognition, translation, and generation.
- The model preserves paralinguistic features like speaker identity and prosody, enabling high-fidelity voice transfer and competitive zero-shot cross-lingual performance.
AudioPaLM is a unified LLM integrating text-based and speech-based capabilities to perform a wide range of speech understanding and generation tasks, including automatic speech recognition (ASR), automatic speech-to-text translation (AST), text-to-speech (TTS), and direct speech-to-speech translation (S2ST). It fuses a pretrained text-only LLM, PaLM-2, with the discrete audio tokenization pipeline of AudioLM, resulting in a decoder-only Transformer that can ingest and output both text and audio tokens within a single model architecture. AudioPaLM preserves and transfers paralinguistic information such as speaker identity and prosody, enabling high-fidelity cross-lingual voice transfer and state-of-the-art performance on multilingual benchmarks (Rubenstein et al., 2023).
1. Model Architecture
AudioPaLM leverages a decoder-only Transformer structure, with the architecture nearly identical to PaLM-2 except for an expanded input/output vocabulary. The unified token vocabulary comprises:
- SentencePiece text tokens as in PaLM-2
- Discrete audio (semantic) tokens produced by AudioLM’s tokenization pipeline; the acoustic SoundStream tokens used for waveform reconstruction are handled by a separate detokenization stage rather than by the model’s vocabulary
The token embedding matrix is extended from the original text-only $\mathbf{E} \in \mathbb{R}^{t \times m}$ (where $t$ is the size of the text vocabulary and $m$ is the model dimension) to a joint embedding $\mathbf{E}' \in \mathbb{R}^{(t+a) \times m}$, formed by stacking a randomly initialized $\mathbf{E}_a \in \mathbb{R}^{a \times m}$ below $\mathbf{E}$, where $a$ is the number of audio tokens. The final linear softmax layer shares parameters with the input embedding, reusing $\mathbf{E}'$ as its weight matrix.
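A minimal PyTorch sketch of this vocabulary extension and output weight tying (the vocabulary size, model dimension, and initialization scale below are hypothetical, not values from the paper):

```python
import torch
import torch.nn as nn

def extend_embeddings(text_embedding: nn.Embedding, num_audio_tokens: int) -> nn.Embedding:
    """Stack `num_audio_tokens` randomly initialized rows below the pretrained text embedding."""
    t, m = text_embedding.weight.shape                   # t = text vocab size, m = model dim
    joint = nn.Embedding(t + num_audio_tokens, m)
    with torch.no_grad():
        joint.weight[:t] = text_embedding.weight         # reuse pretrained text rows
        nn.init.normal_(joint.weight[t:], std=0.02)      # new audio rows; init scale is an assumption
    return joint

# Hypothetical sizes for illustration only.
joint_emb = extend_embeddings(nn.Embedding(32_000, 4096), num_audio_tokens=1024)

# Weight tying: the output softmax projection reuses the joint embedding matrix.
lm_head = nn.Linear(joint_emb.embedding_dim, joint_emb.num_embeddings, bias=False)
lm_head.weight = joint_emb.weight
```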
Audio tokenization is hierarchical. Raw waveforms are first transformed to self-supervised embeddings (e.g., w2v-BERT or USM variants) at 25 Hz. K-means quantization produces semantic tokens (vocab size 1024), which condition an autoregressive model to predict SoundStream tokens for coarse and fine waveform reconstruction. The model only generates semantic tokens, with subsequent waveform synthesis handled by a separately trained AudioLM or non-autoregressive SoundStorm decoder.
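The semantic-token step amounts to a nearest-centroid lookup against the k-means codebook; a sketch of that quantization, assuming the self-supervised encoder has already produced 25 Hz frame embeddings:

```python
import numpy as np

def quantize_to_semantic_tokens(frame_embeddings: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Map 25 Hz self-supervised frame embeddings (frames, dim) to nearest k-means centroid ids.

    `centroids` is a (1024, dim) codebook; the returned ids are the semantic tokens that
    AudioPaLM reads and writes. Reconstructing a waveform from semantic tokens is delegated
    to a separately trained AudioLM or SoundStorm stage, which is not shown here.
    """
    distances = np.linalg.norm(frame_embeddings[:, None, :] - centroids[None, :, :], axis=-1)
    return distances.argmin(axis=-1)
```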
Within the Transformer, token modality is not explicitly distinguished; audio and text tokens are fully integrated by the shared attention and feedforward sublayers, supporting seamless multimodal interactions (Rubenstein et al., 2023).
2. Training Regimen and Task Mixtures
AudioPaLM is initialized from a pretrained PaLM-2 checkpoint (8B parameters, trained on text tokens; cross-entropy objective).
Subsequent finetuning adapts the model to speech and translation tasks using a unified cross-entropy objective. The loss is computed only on the portion of the token sequence to be predicted, as determined by the task setup, with no need for separate CTC or alignment losses. Optimizer settings include Adafactor, a dropout rate of 0.1, and mixed precision. Two principal finetuning mixtures are described:
- AST/T2T mixture: ASR (VoxPopuli-ASR, CommonVoice, YouTube-ASR), AST (CoVoST2, Conversational EsEn), MT (WMT/TED), and composite ASR-then-AST tasks.
- S2ST mixture: Includes all AST/T2T tasks plus TTS (CVSS, VoxPopuli), S2ST (VoxPopuli S2ST, CVSS, WMT/TED synthetic), and chained pipelines (ASR→AST→S2ST).
Tasks are signaled by prefix text tokens, e.g., “[S2ST English French]...audio tokens...”.
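A sketch of how a task-prefixed training example and the unified next-token cross-entropy loss could be assembled (PyTorch; the tokenizer interface and loss-masking convention are assumptions rather than the paper’s exact recipe):

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100  # standard PyTorch convention for positions excluded from the loss

def make_example(tokenizer, task_prefix: str, source_tokens: list[int], target_tokens: list[int]):
    """Concatenate [task prefix][source][target]; only target positions contribute to the loss."""
    prefix_ids = tokenizer.encode(task_prefix)            # e.g. "[S2ST English French]"
    input_ids = prefix_ids + source_tokens + target_tokens
    labels = [IGNORE_INDEX] * (len(prefix_ids) + len(source_tokens)) + target_tokens
    return torch.tensor(input_ids), torch.tensor(labels)

def lm_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy: predict position t+1 from positions <= t."""
    return F.cross_entropy(logits[:-1], labels[1:], ignore_index=IGNORE_INDEX)
```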
3. Preservation and Transfer of Paralinguistic Information
AudioPaLM inherits AudioLM’s mechanism for encoding and transferring paralinguistic features. Speaker identity, intonation, and acoustic conditions are preserved implicitly via the discrete audio token representation. During S2ST, generated semantic tokens are passed to a conditioned decoder which attends to a three-second segment of source speaker SoundStream codes, thus transferring speaker characteristics (timbre, prosody) to the translated output.
No explicit speaker-embedding vectors are learned; paralinguistics are carried by the sequence of discrete audio tokens. Voice transfer functionality is achieved by prompting the model with a short sample of the source speaker’s audio tokens, facilitating accurate cross-lingual and cross-speaker speech generation (Rubenstein et al., 2023).
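A sketch of the resulting voice-transfer inference flow, with all components passed in as hypothetical callables standing in for the semantic tokenizer, AudioPaLM, the SoundStream encoder, and the AudioLM/SoundStorm detokenizer:

```python
def translate_with_voice_transfer(audiopalm, semantic_tokenizer, soundstream_encoder, detokenizer,
                                  source_waveform, sample_rate=16_000,
                                  task_prefix="[S2ST English French]"):
    """Generate translated semantic tokens, then re-synthesize them in the source speaker's voice.

    All arguments are hypothetical stand-ins for the components described in the text.
    """
    source_semantic = semantic_tokenizer(source_waveform)
    translated_semantic = audiopalm.generate(task_prefix, source_semantic)

    # Condition the detokenizer on ~3 s of the source speaker's SoundStream codes,
    # which carries timbre and prosody over to the translated output.
    speaker_prompt = soundstream_encoder(source_waveform[: 3 * sample_rate])
    return detokenizer(translated_semantic, speaker_prompt=speaker_prompt)
```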
4. Evaluation Metrics and Benchmark Performance
Evaluation tasks span ASR (VoxPopuli-ASR, CoVoST2 [CER]), AST (CoVoST2, FLEURS zero-shot), and S2ST (CVSS-T, via ASR-BLEU). Key metrics:
- ASR: Word Error Rate (WER), Character Error Rate (CER)
- AST/S2ST: BLEU, ASR-BLEU
- Voice Quality: DNSMOS for audio quality; cosine similarity (objective) and SMOS (subjective) for voice similarity and acoustic consistency (see the scoring sketch below)
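For reference, the headline metrics can be computed with standard open-source scorers such as jiwer and sacrebleu (a sketch; these are common choices, not necessarily the exact tooling used in the paper). ASR-BLEU additionally transcribes the generated speech with an off-the-shelf ASR system before BLEU scoring:

```python
import jiwer          # pip install jiwer
import sacrebleu      # pip install sacrebleu

references = ["the cat sat on the mat"]
hypotheses = ["the cat sat on a mat"]

wer = jiwer.wer(references, hypotheses)                       # ASR: word error rate
cer = jiwer.cer(references, hypotheses)                       # ASR: character error rate
bleu = sacrebleu.corpus_bleu(hypotheses, [references]).score  # AST: corpus-level BLEU

print(f"WER={wer:.3f}  CER={cer:.3f}  BLEU={bleu:.1f}")
```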
A summary of key results for 8B AudioPaLM variants:
| Model | CoVoST2 AST BLEU | CVSS S2ST ASR-BLEU | VoxPopuli ASR WER |
|---|---|---|---|
| Whisper Large-v2 (1.5B) | 29.1 | – | 13.6 |
| mSLAM-CTC (2B) | 25.2 | – | 9.1 |
| USM-M (2B) | 30.7 | – | – |
| Translatotron 2 + TTS aug | – | 25.6 | – |
| AudioPaLM 8B AST | 35.4 | – | 11.1 |
| AudioPaLM 8B S2ST | 36.2 | 32.5 | 16.0 |
| AudioPaLM-2 8B AST | 37.8 | – | 9.8 |
On FLEURS zero-shot AST, AudioPaLM-2 8B AST achieved BLEU = 28.6 on the 29 languages observed in AST training and BLEU = 20.7 on the 26 languages observed only in ASR training, outperforming Whisper Large-v2 despite the latter’s far larger AST/ASR training data. On CVSS-T S2ST, AudioPaLM matches or exceeds baselines on both objective and subjective measures of audio quality and voice similarity, with an acoustic-consistency (cosine) score of 0.81 versus 0.54 for the ground truth (Rubenstein et al., 2023).
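Acoustic-consistency and voice-similarity scores of this kind are typically the cosine similarity between speaker embeddings of two utterances; a sketch assuming a hypothetical `speaker_encoder` model:

```python
import numpy as np

def voice_cosine_similarity(speaker_encoder, waveform_a: np.ndarray, waveform_b: np.ndarray) -> float:
    """Cosine similarity between speaker embeddings of two utterances (higher = more similar voice)."""
    emb_a = speaker_encoder(waveform_a)   # hypothetical model returning a 1-D speaker embedding
    emb_b = speaker_encoder(waveform_b)
    return float(np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))
```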
5. Case Studies: Zero-Shot Transfer and Cross-Speaker Translation
AudioPaLM demonstrates robust zero-shot AST into languages with no parallel AST training data, yielding BLEU > 15 on languages such as Maltese, Occitan, and Pashto. The model also performs high-fidelity voice transfer, preserving voice characteristics across language boundaries and closely matching ground-truth TTS in subjective ratings.
6. Limitations and Future Directions
AudioPaLM’s performance depends critically on the quality of the audio tokenizer; ablation studies show substantial gains when moving from w2v-BERT to USM-v2 tokenizers. The finetuning procedure updates the entire model: freezing PaLM-2 (as in adapter-based multimodal approaches) disrupts the integration of audio tokens, while full finetuning carries a risk of catastrophic forgetting of text-only capabilities. Model behavior reflects training-data biases, and underperformance is possible for underrepresented accents, languages, or noisy conditions.
Future work includes systematic audio token studies (redundancy, fidelity, rate), expanding generative evaluation beyond ASR/AST (e.g., open-domain spoken dialog), improving fairness and robustness through balanced corpora or debiasing, and potential integration with vision and other modalities to build fully multimodal "listen, see, and speak" agents (Rubenstein et al., 2023).
7. Significance and Implications
AudioPaLM demonstrates that a large, unified, decoder-only model initialized from extensive text pretraining can, by introducing discrete audio tokenization and finetuning on mixed speech-text tasks, achieve state-of-the-art or competitive performance on multilingual speech recognition, translation, and voice conversion. The architecture offers a scalable paradigm capable of zero-shot transfer across languages and speakers, and provides a modular foundation for further multimodal extensions.