Overview of "Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant"
The paper "Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant" introduces a unified framework for integrating the speech and text modalities. The authors propose Ichigo, a mixed-modal model built on a tokenized early-fusion strategy that processes interleaved speech and text sequences in a single stream.
Model Architecture and Methodology
Ichigo employs a uniform transformer-based architecture in which both speech and text are represented as discrete tokens, eliminating the need for separate encoders or adapters per modality and facilitating cross-modal reasoning and generation. Speech inputs are quantized with WhisperVQ into discrete tokens, much like text, and mapped into a shared representational space. The architecture extends a pre-trained LLM with new modality-specific tokens so that both speech and text can be processed autoregressively.
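The following is a minimal sketch of this tokenized early-fusion idea, assuming a Hugging Face causal LM and a WhisperVQ-style codebook; the base checkpoint name, the codebook size of 512, and the token naming scheme are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: extend an LLM's vocabulary with discrete speech tokens so quantized
# audio can be interleaved with text and processed autoregressively.
from transformers import AutoTokenizer, AutoModelForCausalLM

BASE_MODEL = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder base LLM
NUM_SPEECH_CODES = 512                           # assumed VQ codebook size

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

# One new token per codebook entry, plus boundary markers so speech spans can
# be interleaved with ordinary text in a single sequence.
speech_tokens = [f"<|sound_{i:04d}|>" for i in range(NUM_SPEECH_CODES)]
tokenizer.add_tokens(speech_tokens + ["<|sound_start|>", "<|sound_end|>"])
model.resize_token_embeddings(len(tokenizer))

# A quantized utterance (dummy codes here) becomes ordinary tokens that the
# LLM consumes like any other prompt.
codes = [17, 402, 3, 255]
prompt = ("<|sound_start|>"
          + "".join(f"<|sound_{c:04d}|>" for c in codes)
          + "<|sound_end|>")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```

Because the speech codes live in the same embedding table as text tokens, no separate audio encoder or adapter is needed at inference time.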
Training proceeds in two stages: pre-training on multilingual ASR datasets, followed by fine-tuning on a specially curated speech instruction dataset. This recipe builds multimodal representations without substantially degrading the base model's original language capabilities, and it yields clear gains over traditional cascaded systems.
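A hedged sketch of that two-stage schedule is shown below; the dataset names, epoch counts, learning rates, and the run_stage helper are illustrative placeholders rather than the paper's actual settings.

```python
# Sketch: two-stage training schedule (ASR pre-training, then instruction
# fine-tuning), expressed as a simple list of stage configs.
STAGES = [
    {
        "name": "asr_pretrain",            # stage 1: multilingual ASR data
        "datasets": ["multilingual_asr"],  # speech tokens -> transcript pairs
        "epochs": 1,
        "learning_rate": 2e-5,
    },
    {
        "name": "instruction_finetune",    # stage 2: curated instruction data
        "datasets": ["speech_instructions", "text_instructions"],
        "epochs": 3,                       # mixing in text data helps preserve
        "learning_rate": 1e-5,             # the base model's language skills
    },
]

def run_stage(model, stage):
    """Placeholder for a standard causal-LM training loop over the stage's data."""
    print(f"training stage '{stage['name']}' on {stage['datasets']}")

for stage in STAGES:
    run_stage(model=None, stage=stage)
```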
Evaluation and Results
Ichigo was evaluated against existing speech LLMs using the AudioBench framework. It outperformed other open-source models on speech question-answering benchmarks, scoring 67.8 on OpenHermes-Audio and 67.2 on ALPACA-Audio. Its latency to first token is 111 ms, lower than comparable systems, which supports its suitability for real-time applications.
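The snippet below is a rough illustration of how time-to-first-token could be measured, reusing the model and tokenizer objects from the earlier sketch; it is an assumed harness, not the paper's benchmarking setup.

```python
# Sketch: measure time-to-first-token (TTFT) by generating exactly one token.
import time

def time_to_first_token(model, tokenizer, prompt: str) -> float:
    """Return milliseconds elapsed until the first new token is produced."""
    inputs = tokenizer(prompt, return_tensors="pt")
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=1)  # stop after the first token
    return (time.perf_counter() - start) * 1000.0

# Example usage (requires the model/tokenizer loaded in the earlier sketch):
# print(f"TTFT: {time_to_first_token(model, tokenizer, prompt):.1f} ms")
```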
Implications and Future Directions
This research has substantial implications for multimodal AI. Ichigo's architecture shows how speech and text can be integrated and processed efficiently in a unified model, with lower latency and stronger performance than existing solutions. The work also provides a viable framework for smaller research teams to build competitive open-source speech LLMs without extensive resources.
Future work may focus on current limitations, such as improving emotional comprehension and extending context length to handle richer audio and multi-turn dialogue. Further study of training stability with acoustic tokens may unlock even more advanced multimodal capabilities.
Conclusion
Overall, Ichigo represents a significant step toward the seamless integration of speech and text processing within LLMs. Its tokenized early-fusion approach improves performance while reducing system complexity, making capable AI voice assistants more accessible. The paper contributes a robust framework that may inspire further innovations and applications in multimodal AI systems.