
VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient Large Speech-Language Model (2505.03739v1)

Published 6 May 2025 in cs.CL and cs.AI

Abstract: With the growing requirement for natural human-computer interaction, speech-based systems receive increasing attention as speech is one of the most common forms of daily communication. However, the existing speech models still experience high latency when generating the first audio token during streaming, which poses a significant bottleneck for deployment. To address this issue, we propose VITA-Audio, an end-to-end large speech model with fast audio-text token generation. Specifically, we introduce a lightweight Multiple Cross-modal Token Prediction (MCTP) module that efficiently generates multiple audio tokens within a single model forward pass, which not only accelerates the inference but also significantly reduces the latency for generating the first audio in streaming scenarios. In addition, a four-stage progressive training strategy is explored to achieve model acceleration with minimal loss of speech quality. To our knowledge, VITA-Audio is the first multi-modal LLM capable of generating audio output during the first forward pass, enabling real-time conversational capabilities with minimal latency. VITA-Audio is fully reproducible and is trained on open-source data only. Experimental results demonstrate that our model not only achieves an inference speedup of 3~5x at the 7B parameter scale, but also significantly outperforms open-source models of similar model size on multiple benchmarks for automatic speech recognition (ASR), text-to-speech (TTS), and spoken question answering (SQA) tasks.

Summary

Analyzing VITA-Audio: An End-to-End Speech Model for Real-Time Human-Computer Interaction

The paper under review introduces VITA-Audio, a multi-modal LLM designed for real-time audio output and interaction. To address the latency bottleneck of existing speech-based systems, VITA-Audio is built around a central innovation: the Multiple Cross-modal Token Prediction (MCTP) module, which enables rapid generation of audio tokens. The design has notable implications for real-time applications while maintaining high fidelity in audio-text interactions.

Key Features of VITA-Audio

VITA-Audio directly targets the first-token latency that bottlenecks existing speech models. The proposed MCTP module predicts multiple audio tokens within a single forward pass, allowing the model to move from generating the first text token straight to producing the first audio chunk; a minimal sketch of such a module follows. This is supported by a four-stage progressive training strategy that couples speed with quality.
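To make the mechanism concrete, here is a minimal PyTorch sketch of an MCTP-style prediction head, assuming a decoder-only LLM exposes its last-layer hidden states. The class name, layer structure, vocabulary sizes, and number of predicted tokens are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of a Multiple Cross-modal Token Prediction (MCTP) head.
# Sizes, names, and the number of predicted tokens are assumptions for
# illustration, not the authors' released architecture.
import torch
import torch.nn as nn

class MCTPHead(nn.Module):
    """Predicts several audio tokens from a single LLM hidden state."""
    def __init__(self, hidden_size: int, audio_vocab: int, n_tokens: int = 4):
        super().__init__()
        self.n_tokens = n_tokens
        # One lightweight projection per predicted audio token.
        self.proj = nn.ModuleList(
            [nn.Linear(hidden_size, audio_vocab) for _ in range(n_tokens)]
        )

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, hidden_size) -- state at the current position.
        # Returns logits of shape (batch, n_tokens, audio_vocab).
        return torch.stack([p(hidden) for p in self.proj], dim=1)

# Usage: after one LLM forward pass, decode multiple audio tokens at once.
head = MCTPHead(hidden_size=4096, audio_vocab=8192, n_tokens=4)
hidden = torch.randn(1, 4096)               # stand-in for an LLM hidden state
audio_tokens = head(hidden).argmax(dim=-1)  # (1, 4) audio token ids per pass
```

The key point the sketch conveys is that the extra heads are cheap relative to a full LLM forward pass, so emitting several audio tokens per pass adds little compute.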

  1. First Forward-Pass Audio Generation: This capability positions VITA-Audio as the first multi-modal LLM able to emit audio during its very first forward pass, which is crucial for real-time speech applications. The MCTP module's architectural simplicity lets it decode audio tokens directly from LLM hidden states, sidestepping the complex semantic modeling that conventional systems typically require.
  2. Accelerated Inference and Reduced Latency: VITA-Audio demonstrates substantial inference acceleration, reportedly 3 to 5 times faster than models of similar scale (see the back-of-the-envelope latency calculation after this list). Performance assessments show significant improvements on benchmarks for automatic speech recognition (ASR), text-to-speech (TTS), and spoken question answering (SQA).
  3. Robust Evaluation Across Benchmarks: Comprehensive evaluations show VITA-Audio outperforming open-source models of similar parameter scale on ASR, TTS, and SQA tasks. The model's credibility is further reinforced by its training on entirely open-source datasets, which aims to democratize advanced AI speech modeling capabilities.
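As a rough illustration of why emitting multiple audio tokens per pass shrinks time-to-first-audio, consider the following back-of-the-envelope calculation. The per-pass latency, chunk size, and tokens-per-pass figures are invented for the example and are not measurements from the paper.

```python
# Back-of-the-envelope time-to-first-audio under assumed numbers:
# 50 ms per LLM forward pass, a playable audio chunk of 10 tokens.
forward_pass_ms = 50
chunk_tokens = 10

# Baseline: one audio token per forward pass -> 10 passes before playback.
baseline_ms = chunk_tokens * forward_pass_ms                     # 500 ms

# With an MCTP-style head emitting, say, 10 audio tokens per pass,
# a single forward pass already yields a playable chunk.
tokens_per_pass = 10
mctp_ms = -(-chunk_tokens // tokens_per_pass) * forward_pass_ms  # 50 ms

print(f"baseline: {baseline_ms} ms, MCTP: {mctp_ms} ms "
      f"({baseline_ms / mctp_ms:.0f}x faster to first audio)")
```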

Methodological Approach

The development of VITA-Audio follows an end-to-end pipeline that integrates audio encoders and decoders with the LLM core. The design stays deliberately simple: empirical findings show that text and audio modalities align roughly monotonically, and the MCTP modules exploit this structured alignment. As a result, VITA-Audio can stream audio in real time without the first-pass delays common in traditional systems; a sketch of such an interleaved decode loop follows.
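The following is a minimal sketch of what an interleaved, streaming decode loop could look like. The three components (llm_step, mctp_step, vocoder_stream) are hypothetical stand-ins whose names and interfaces are assumptions, not the paper's actual API.

```python
# Sketch of an interleaved text/audio streaming decode loop.
# All three components below are stand-ins for illustration only.
from typing import List, Tuple
import random

def llm_step(context: List[int]) -> Tuple[int, List[float]]:
    """Stand-in for one LLM forward pass: next text token + hidden state."""
    return random.randrange(32000), [0.0] * 4096

def mctp_step(hidden: List[float], n: int = 4) -> List[int]:
    """Stand-in MCTP head: n audio tokens decoded from one hidden state."""
    return [random.randrange(8192) for _ in range(n)]

def vocoder_stream(audio_tokens: List[int]) -> None:
    """Stand-in audio decoder: would synthesize and play a waveform chunk."""
    print(f"play {len(audio_tokens)} audio tokens")

context: List[int] = [1]        # e.g. encoded user speech + a BOS token
for _ in range(3):              # a few interleaved decode steps
    text_token, hidden = llm_step(context)
    audio_tokens = mctp_step(hidden)   # audio emitted in the SAME pass
    vocoder_stream(audio_tokens)       # playback can start immediately
    context.append(text_token)         # the text stream continues in parallel
```

The monotonic text-audio alignment is what makes this interleaving safe: the audio tokens decoded at each step correspond to text the model has already committed to.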

Training follows a structured four-stage paradigm that optimizes the model at each level. The stages cover audio-text alignment, progressive training of the MCTP modules, and supervised fine-tuning, culminating in a model that maintains accuracy across diverse tasks; one plausible decomposition of the schedule is sketched below.
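As a hypothetical sketch of such a schedule, the configuration below splits the progressive MCTP training across two stages to reach four in total, consistent with the abstract's four-stage strategy; the stage boundaries, component names, and freezing choices are assumptions rather than the authors' released configuration.

```python
# Hypothetical four-stage progressive training schedule.
# Component names and trainable/frozen splits are illustrative assumptions.
STAGES = [
    {"name": "audio-text alignment",
     "trainable": ["audio_adapter"], "frozen": ["llm", "mctp"]},
    {"name": "single MCTP module training",
     "trainable": ["mctp (first module)"], "frozen": ["llm", "audio_adapter"]},
    {"name": "progressive multi-MCTP training",
     "trainable": ["mctp (all modules)"], "frozen": ["llm", "audio_adapter"]},
    {"name": "supervised fine-tuning",
     "trainable": ["llm", "mctp", "audio_adapter"], "frozen": []},
]

for i, stage in enumerate(STAGES, start=1):
    print(f"stage {i}: {stage['name']:32s} trainable={stage['trainable']}")
```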

Implications and Future Prospects

The introduction of the MCTP module positions VITA-Audio as a significant development within the field of real-time speech interaction models. Its demonstrated reduction in latency can redefine how LLMs are deployed in speech-centric applications, enabling more seamless interactions in domains such as customer service, assistive technology, and virtual assistant applications.

Future research avenues may explore the scalability of VITA-Audio’s approach, investigating whether similar architectural frameworks can be adapted for other multi-modal interactions beyond speech. Consideration could be given to its integration with more extensive datasets or expanding its multi-lingual capabilities, further broadening the scope of its application.

In summation, VITA-Audio presents a notable advancement in speech processing for AI applications. Its ability to deliver both quality and speed in real-time interactions is a testament to the careful design choices and modular innovations employed by the researchers. As speech systems continue to evolve, models like VITA-Audio may set the standard for future developments in interactive AI systems.
