- The paper introduces Voila, a groundbreaking model that integrates voice and language for real-time, full-duplex, and emotionally rich interactions.
- It leverages a hierarchical multi-scale Transformer with a novel voice tokenizer and structured interleaved alignment to enhance audio-text synchronization.
- The model supports diverse tasks like ASR, TTS, and speech translation, offers customizable voices, and outperforms existing systems on benchmarks.
The paper introduces Voila, a family of large voice-language foundation models designed to enable real-time, autonomous, and emotionally expressive voice interactions. The authors aim to move beyond traditional pipeline-based voice AI systems (like Siri, Alexa, and Google Assistant), which suffer from high latency, loss of vocal nuances due to the use of text as an intermediate representation, and reactive, turn-based interactions. The vision is an autonomous AI that continuously listens, reasons, and responds proactively in a full-duplex manner, akin to natural human conversation.
The Voila family includes Voila-e2e for low-latency, end-to-end voice conversation and Voila-autonomous, which extends this to full-duplex, simultaneous interaction. Both models are built on a hierarchical multi-scale Transformer architecture that integrates an LLM backbone for semantic reasoning with an audio Transformer for acoustic modeling.
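A minimal PyTorch sketch of this two-level design is shown below: a backbone Transformer runs at the semantic step rate, and a small audio Transformer expands each backbone state into a handful of RVQ codebook tokens. All module names, sizes, layer counts, and the four-codebook depth are illustrative assumptions, not the paper's actual configuration.

```python
# Toy hierarchical multi-scale model: a large backbone Transformer produces one
# hidden state per (interleaved text/audio) step, and a small audio Transformer
# expands each state into n_codebooks RVQ tokens. Dimensions are illustrative.
import torch
import torch.nn as nn

class HierarchicalVoiceLM(nn.Module):
    def __init__(self, vocab_size=32000, audio_vocab=1024, n_codebooks=4,
                 d_model=512, d_audio=256):
        super().__init__()
        self.backbone_emb = nn.Embedding(vocab_size, d_model)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2)
        self.audio_emb = nn.Embedding(audio_vocab, d_audio)
        self.to_audio = nn.Linear(d_model, d_audio)
        self.audio_decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_audio, nhead=4, batch_first=True),
            num_layers=2)
        self.audio_head = nn.Linear(d_audio, audio_vocab)
        self.n_codebooks = n_codebooks

    def forward(self, token_ids, audio_codes):
        # token_ids:   (B, T)               interleaved text/audio token ids
        # audio_codes: (B, T, n_codebooks)  RVQ codes to predict at each step
        h = self.backbone(self.backbone_emb(token_ids))        # (B, T, d_model)
        cond = self.to_audio(h).unsqueeze(2)                   # (B, T, 1, d_audio)
        codes = self.audio_emb(audio_codes)                    # (B, T, K, d_audio)
        x = torch.cat([cond, codes[:, :, :-1]], dim=2)         # teacher forcing
        B, T, K, D = x.shape
        x = self.audio_decoder(x.reshape(B * T, K, D))
        return self.audio_head(x).reshape(B, T, K, -1)         # logits per codebook

model = HierarchicalVoiceLM()
logits = model(torch.randint(0, 32000, (1, 8)),
               torch.randint(0, 1024, (1, 8, 4)))
print(logits.shape)  # torch.Size([1, 8, 4, 1024])
```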
Key technical contributions and features of Voila include:
- Effective Integration of Voice and Language Capabilities: The model combines the strengths of pre-trained LLMs with learned voice modeling. Users can provide text instructions to define a speaker's persona and steer voice responses, leveraging the LLM's knowledge and reasoning.
- Voice Tokenizer: A neural audio codec based on residual vector quantization (RVQ), similar to SpeechTokenizer. It distills semantic information into the first RVQ layer, leaving acoustic detail to the subsequent layers, and converts continuous audio into discrete tokens suitable for Transformer models (see the RVQ sketch after this list).
- Text and Audio Alignment: The model integrates discrete audio tokens into the LLM's vocabulary. Training is conducted on multiple tasks (ASR, TTS, instruction following) in a chat-style format using next-token prediction. A key innovation is the structured interleaved alignment of text and audio tokens, where each semantic unit of text is explicitly paired with its corresponding audio tokens; in contrast to prior methods with looser coupling, this aims to improve fine-grained synchronization and expressiveness (see the interleaving sketch after this list). For Voila-autonomous, a two-stream input processes user and system audio simultaneously, fusing their embeddings for full-duplex operation.
- Voice Customization: Voila allows easy customization of voices via a learnable special token conditioned on a voice embedding extracted with Wespeaker from a reference audio clip as short as a few seconds, letting users plug in new voices on the fly (see the voice-conditioning sketch after this list). The authors leverage this capability to offer over one million pre-built voices and efficient creation of new ones.
- Unified Model for Audio Tasks: Voila supports multiple tasks like ASR and TTS within the same model architecture and can be easily adapted for tasks like speech translation through fine-tuning. It is trained on large multilingual data and supports six languages: English, Chinese, French, German, Japanese, and Korean.
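As a toy illustration of the RVQ idea behind the voice tokenizer, the sketch below quantizes each audio frame with a stack of codebooks, where every codebook encodes the residual left over by the previous ones. The semantic distillation into the first layer is not modeled here, and the codebook sizes and dimensions are assumptions.

```python
# Residual vector quantization sketch: each codebook quantizes what the
# previous codebooks missed, turning continuous frames into discrete tokens.
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    def __init__(self, n_codebooks=4, codebook_size=1024, dim=256):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(n_codebooks))

    def forward(self, x):
        # x: (B, T, dim) continuous audio frames from an upstream encoder
        residual, codes, quantized = x, [], torch.zeros_like(x)
        for cb in self.codebooks:
            # pick the nearest codebook entry for the current residual
            dist = torch.cdist(residual, cb.weight.unsqueeze(0))  # (B, T, codebook_size)
            idx = dist.argmin(dim=-1)                             # (B, T) discrete tokens
            q = cb(idx)
            codes.append(idx)
            quantized = quantized + q
            residual = residual - q                               # next layer refines the error
        return torch.stack(codes, dim=-1), quantized              # tokens and reconstruction

rvq = ResidualVQ()
tokens, recon = rvq(torch.randn(1, 50, 256))
print(tokens.shape)  # torch.Size([1, 50, 4])
```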
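The structured interleaved alignment can be pictured as follows: each semantic text unit is immediately followed by the audio tokens that realize it, rather than placing the full text and full audio side by side. The word-level split and `<audio_*>` markers below are illustrative assumptions about the format, not the paper's exact scheme.

```python
# Interleaving sketch: pair every text unit with its aligned audio tokens.
def interleave(words, audio_tokens_per_word):
    """words: list[str]; audio_tokens_per_word: list[list[int]] aligned to words."""
    seq = []
    for word, audio_tokens in zip(words, audio_tokens_per_word):
        seq.append(word)                                  # semantic unit (text)
        seq.extend(f"<audio_{t}>" for t in audio_tokens)  # its aligned audio tokens
    return seq

print(interleave(["hello", "world"], [[12, 7, 93], [44, 5]]))
# ['hello', '<audio_12>', '<audio_7>', '<audio_93>', '<audio_44>', '<audio_5>']
```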
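Voice conditioning can be sketched as swapping the embedding of a learnable special token for a projected speaker embedding obtained from a short reference clip (e.g., by a speaker model such as Wespeaker). The projection layer, dimensions, and `<voice>` slot below are assumptions for illustration, not the paper's exact mechanism.

```python
# Voice-conditioning sketch: project a speaker embedding into the model's
# embedding space and place it at the position of a special <voice> token.
import torch
import torch.nn as nn

class VoiceConditioner(nn.Module):
    def __init__(self, speaker_dim=256, model_dim=512):
        super().__init__()
        self.proj = nn.Linear(speaker_dim, model_dim)

    def forward(self, prompt_embeds, voice_token_pos, speaker_embedding):
        # prompt_embeds:     (B, T, model_dim) embedded prompt containing a <voice> slot
        # speaker_embedding: (B, speaker_dim)  from a few seconds of reference audio
        prompt_embeds = prompt_embeds.clone()
        prompt_embeds[:, voice_token_pos] = self.proj(speaker_embedding)
        return prompt_embeds

cond = VoiceConditioner()
out = cond(torch.randn(2, 16, 512), voice_token_pos=3,
           speaker_embedding=torch.randn(2, 256))
print(out.shape)  # torch.Size([2, 16, 512])
```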
To evaluate these models, the authors introduce the Voila Benchmark, an audio-language evaluation suite created by converting samples from five standard LLM evaluation datasets (MMLU, MATH, HumanEval, NQ-Open, GSM8K) into speech using off-the-shelf TTS systems. This benchmark assesses the model's ability to handle reasoning and knowledge-based questions from audio inputs.
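A rough sketch of how such a benchmark could be assembled: questions from an existing text evaluation set are synthesized to speech with an off-the-shelf TTS system and paired with their reference answers for later scoring. The Hugging Face `datasets` call loads the public GSM8K copy, but `synthesize_to_wav` is a hypothetical placeholder for whatever TTS backend is used, and the construction details are assumptions rather than the paper's exact pipeline.

```python
# Build a small spoken benchmark from a text QA dataset.
from datasets import load_dataset

def synthesize_to_wav(text: str, path: str) -> None:
    # Placeholder: plug in an off-the-shelf TTS system here.
    raise NotImplementedError

def build_audio_benchmark(n_samples=10):
    ds = load_dataset("gsm8k", "main", split=f"test[:{n_samples}]")
    records = []
    for i, ex in enumerate(ds):
        wav_path = f"gsm8k_{i}.wav"
        synthesize_to_wav(ex["question"], wav_path)  # spoken question
        records.append({"audio": wav_path, "answer": ex["answer"]})
    return records
```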
Experimental results show that Voila significantly outperforms prior open-source audio-LLMs such as SpeechGPT and Moshi on the Voila Benchmark, scoring 30.56 versus 13.29 and 11.45, respectively. The gains are particularly notable in the math and code domains, indicating effective integration of the LLM's reasoning capabilities. Voila also delivers competitive performance on standard ASR (LibriSpeech test-clean WER of 4.8% without LibriSpeech training data, 2.7% with it) and TTS (LibriSpeech test-clean WER of 3.2% without LibriSpeech training data, 2.8% with it) compared to state-of-the-art and baseline models.
The authors release the models (Voila-base, Voila-chat, Voila-autonomous-preview, Voila-Tokenizer, Voila Voice Library) and code openly, including the Voila Benchmark, to foster further research and development towards autonomous voice AI agents.