VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction (2501.01957v1)

Published 3 Jan 2025 in cs.CV, cs.SD, and eess.AS

Abstract: Recent Multimodal LLMs (MLLMs) have typically focused on integrating visual and textual modalities, with less emphasis placed on the role of speech in enhancing interaction. However, speech plays a crucial role in multimodal dialogue systems, and achieving high performance in both vision and speech tasks remains a significant challenge due to fundamental modality differences. In this paper, we propose a carefully designed multi-stage training methodology that progressively trains the LLM to understand both visual and speech information, ultimately enabling fluent vision and speech interaction. Our approach not only preserves strong vision-language capacity but also enables efficient speech-to-speech dialogue capabilities without separate ASR and TTS modules, significantly accelerating multimodal end-to-end response speed. By comparing our method against state-of-the-art counterparts across benchmarks for image, video, and speech tasks, we demonstrate that our model is equipped with strong visual and speech capabilities, enabling near real-time vision and speech interaction.

VITA-1.5: Advancements in Multimodal LLMs with Vision and Speech Integration

This paper presents "VITA-1.5," a Multimodal LLM (MLLM) designed to integrate visual and speech modalities for seamless real-time interaction. The work addresses the challenge of effectively combining visual and speech information in multimodal dialogue systems, an area where prior research has emphasized visual-textual modalities. The authors propose a multi-stage training methodology that lets the model accommodate the distinct characteristics of vision and speech data while keeping end-to-end processing efficient.

Key Contributions

  1. Three-stage Training Methodology:
    • Stage 1 (Vision-Language Training): This stage integrates visual data into the LLM through vision alignment, vision understanding, and vision supervised fine-tuning (SFT).
    • Stage 2 (Audio Input Tuning): The model undergoes audio alignment to bridge the gap between speech and language, and a subset of the vision-language fine-tuning data is reused (with questions delivered as speech) so the model learns to understand and respond to audio input.
    • Stage 3 (Audio Output Tuning): This final stage introduces speech output capabilities, removing the need for external text-to-speech (TTS) modules and enhancing the user experience through end-to-end speech generation.
  2. Model Architecture: VITA-1.5 couples vision and audio encoders with adapters, an LLM backbone, and a non-autoregressive speech decoder. Components such as InternViT and TiCodec help the model handle multimodal inputs and outputs efficiently; a minimal sketch of how these pieces fit together across the training stages follows this list.
  3. Data Utilization: The model is trained on a comprehensive dataset covering multiple modalities, including images, videos, speech-transcription pairs, and text-speech pairs, sourced from diverse benchmarks to support the staged training strategy.
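
To make the staged recipe above concrete, the following is a minimal sketch, assuming placeholder dimensions and stand-in modules (simple Linear layers in place of the real InternViT encoder, speech encoder, LLM backbone, and codec-based speech decoder), of how such a model could be wired and how each stage might freeze or unfreeze its parts. The module names, sizes, and the freeze/unfreeze mapping in set_stage() are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class VitaLikeModel(nn.Module):
    """Toy stand-in for a VITA-1.5-style multimodal model (illustrative only)."""

    def __init__(self, d_model=1024, vision_dim=768, audio_dim=512, n_speech_tokens=1024):
        super().__init__()
        # Placeholders for the real components (e.g. an InternViT vision encoder,
        # a speech encoder, an LLM backbone, and a codec-based speech decoder).
        self.vision_encoder = nn.Linear(vision_dim, vision_dim)
        self.vision_adapter = nn.Linear(vision_dim, d_model)   # projects patch features into LLM space
        self.audio_encoder = nn.Linear(audio_dim, audio_dim)
        self.audio_adapter = nn.Linear(audio_dim, d_model)     # projects speech frames into LLM space
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2
        )
        self.speech_decoder = nn.Linear(d_model, n_speech_tokens)  # emits speech-token logits

    def forward(self, image_feats, audio_feats):
        # Encode each modality and project it into the shared LLM embedding space.
        v = self.vision_adapter(self.vision_encoder(image_feats))
        a = self.audio_adapter(self.audio_encoder(audio_feats))
        h = self.llm(torch.cat([v, a], dim=1))
        # A text head would sit alongside this; only the speech-token head is shown.
        return self.speech_decoder(h)

    def set_stage(self, stage):
        """Freeze/unfreeze modules to mimic a three-stage recipe (assumed mapping)."""
        trainable = {
            1: {"vision_encoder", "vision_adapter", "llm"},  # vision-language training
            2: {"audio_encoder", "audio_adapter", "llm"},    # audio input tuning
            3: {"speech_decoder"},                           # audio output tuning
        }[stage]
        for name, module in self.named_children():
            for p in module.parameters():
                p.requires_grad = name in trainable


model = VitaLikeModel()
model.set_stage(2)  # e.g. audio input tuning
logits = model(torch.randn(1, 16, 768), torch.randn(1, 32, 512))
print(logits.shape)  # torch.Size([1, 48, 1024]): speech-token logits in this toy setup
```

In practice each stage would also switch to a different data mixture (vision-language data in Stage 1, speech-transcription pairs in Stage 2, text-speech pairs in Stage 3), which this sketch omits for brevity.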

Evaluation and Results

VITA-1.5 undergoes extensive evaluation across a broad array of benchmarks:

  • Vision-Language Capabilities: The model performs on par with leading open-source MLLMs and outperforms some proprietary models on image understanding and reasoning tasks, and it retains these vision-language strengths after both speech-tuning stages.
  • Video Understanding: Results are comparable to those of open-source models, though a gap remains relative to the top proprietary models.
  • Speech Recognition (ASR): VITA-1.5 shows strong results on both Mandarin and English ASR tasks, outperforming specialized speech models and confirming the robustness of its multimodal integration.

Implications and Future Directions

The research outlined in this paper has both practical and theoretical implications. Practically, VITA-1.5 is a substantial step toward more capable and efficient interactive multimodal dialogue systems: by eliminating separate ASR and TTS components, it reduces end-to-end response latency.
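
As a rough illustration of where the latency savings come from, here is a self-contained sketch contrasting a cascaded spoken-dialogue pipeline with the end-to-end route described above. The stub functions are hypothetical stand-ins rather than real APIs; the point is only that the cascaded design serializes three separate models, while the end-to-end design keeps speech understanding and speech generation inside a single forward pass.

```python
def run_asr(audio: bytes) -> str:
    return "what is shown in this image"      # stand-in for a separate ASR model

def run_llm(prompt: str) -> str:
    return "a cat sitting on a laptop"        # stand-in for a text-only LLM

def run_tts(text: str) -> bytes:
    return text.encode()                      # stand-in for a separate TTS model

def cascaded_reply(audio_in: bytes) -> bytes:
    # Three serialized models: every hop adds latency and drops paralinguistic cues.
    return run_tts(run_llm(run_asr(audio_in)))

def end_to_end_reply(audio_in: bytes, model) -> bytes:
    # A single multimodal model maps input speech directly to output speech,
    # removing the two inter-module hand-offs above.
    return model(audio_in)

print(cascaded_reply(b"raw audio"))
print(end_to_end_reply(b"raw audio", model=lambda a: b"speech tokens"))
```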

Theoretically, the training methodology demonstrates a viable framework for future work on multimodal integration. Subsequent systems can build on this approach to tighten the alignment between disparate modalities, enabling further advances in real-time human-computer interaction.

Overall, VITA-1.5 represents a significant contribution to the field of multimodal LLMs, providing a flexible and efficient architecture and training strategy that balances and optimizes both vision-language and speech interaction without compromising on performance in any individual domain.

Authors (15)
  1. Chaoyou Fu
  2. Haojia Lin
  3. Xiong Wang
  4. Yi-Fan Zhang
  5. Yunhang Shen
  6. Xiaoyu Liu
  7. Yangze Li
  8. Zuwei Long
  9. Heting Gao
  10. Ke Li
  11. Xiawu Zheng
  12. Rongrong Ji
  13. Xing Sun
  14. Caifeng Shan
  15. Ran He