Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant (2410.15316v1)

Published 20 Oct 2024 in cs.CL, cs.SD, and eess.AS

Abstract: LLMs have revolutionized natural language processing, but their application to speech-based tasks remains challenging due to the complexities of integrating audio and text modalities. This paper introduces Ichigo, a mixed-modal model that seamlessly processes interleaved sequences of speech and text. Utilizing a tokenized early-fusion approach, Ichigo quantizes speech into discrete tokens and employs a uniform transformer-based architecture for both speech and text modalities. This method enables joint reasoning and generation across modalities without the need for separate adapters. We present a comprehensive training methodology, including pre-training on multilingual speech recognition datasets and fine-tuning on a curated instruction dataset. Ichigo demonstrates state-of-the-art performance on speech question-answering benchmarks, outperforming existing open-source speech LLMs and achieving comparable results to cascaded systems. Notably, Ichigo exhibits a latency of just 111 ms to first token generation, significantly lower than current models. Our approach not only advances the field of multimodal AI but also provides a framework for smaller research teams to contribute effectively to open-source speech-LLMs.

Authors (3)
  1. Alan Dao (7 papers)
  2. Dinh Bach Vu (5 papers)
  3. Huy Hoang Ha (3 papers)
Citations (1)

Summary

Overview of "Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant"

The research paper titled "Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant" introduces a novel approach to integrating speech and text modalities within a unified framework. The authors propose Ichigo, a mixed-modal model that utilizes an innovative tokenized early-fusion strategy, allowing seamless processing of interleaved speech and text sequences.

Model Architecture and Methodology

Ichigo employs a uniform transformer-based architecture in which both speech and text are represented as discrete tokens. This design eliminates the need for separate encoders or adapters per modality, facilitating cross-modal reasoning and generation. Speech inputs are quantized with WhisperVQ into discrete tokens, analogous to text tokens, and integrated into a shared representational space. The architecture extends a pre-existing LLM's vocabulary with new modality-specific tokens, so that both speech and text can be processed autoregressively within the same model.
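
To make the early-fusion mechanics concrete, the sketch below shows one way the vocabulary-expansion step could look in practice: quantized speech codes are rendered as new sound tokens added to an existing LLM's tokenizer, and the resulting span is interleaved with text in a single autoregressive sequence. The codebook size, token names, base checkpoint, and quantizer output are illustrative assumptions, not the paper's exact interface, and the new embeddings would only become meaningful after the training stages described next.

```python
# Minimal sketch of tokenized early fusion: speech codes become new vocabulary
# entries, then mix freely with text tokens in one autoregressive sequence.
from transformers import AutoModelForCausalLM, AutoTokenizer

CODEBOOK_SIZE = 512  # assumed quantizer codebook size, not the paper's exact value
SOUND_TOKENS = [f"<|sound_{i}|>" for i in range(CODEBOOK_SIZE)]
SPECIAL_TOKENS = ["<|sound_start|>", "<|sound_end|>"]

# Illustrative base checkpoint; the name is only a placeholder for this sketch.
BASE = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(BASE)

# Expand the vocabulary with modality-specific tokens and resize the embeddings.
tokenizer.add_tokens(SOUND_TOKENS + SPECIAL_TOKENS)
model.resize_token_embeddings(len(tokenizer))

def speech_to_token_span(codes: list[int]) -> str:
    """Render quantized speech codes as a delimited span of sound tokens."""
    body = "".join(f"<|sound_{c}|>" for c in codes)
    return f"<|sound_start|>{body}<|sound_end|>"

# Interleave a (placeholder) quantized utterance with text in one sequence.
codes = [17, 3, 402, 77]  # stand-in for the output of a VQ speech encoder
prompt = speech_to_token_span(codes) + "\nAssistant:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=False))
```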

The model leverages a comprehensive training process, involving pre-training on multilingual ASR datasets followed by fine-tuning on a specially curated instruction dataset. This methodology builds multimodal representations without substantially degrading the model's original language capabilities, outperforming existing open-source speech LLMs and achieving results comparable to traditional cascaded systems.
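
As a rough illustration of how the two training stages could format their data, the snippet below renders an ASR pair (stage one) and a speech-instruction pair (stage two) as plain token sequences. The prompt templates are assumptions for this sketch rather than the paper's actual format; `speech_span` is the delimited sound-token string produced in the previous sketch.

```python
# Illustrative training-example formatting for the two stages described above.

def format_asr_example(speech_span: str, transcript: str) -> str:
    """Stage 1 (pre-training): pair quantized speech with its transcript."""
    return f"{speech_span}\nTranscribe the speech above.\n{transcript}"

def format_instruction_example(speech_span: str, response: str) -> str:
    """Stage 2 (fine-tuning): pair a spoken instruction with a text answer."""
    return f"User: {speech_span}\nAssistant: {response}"
```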

Evaluation and Results

Ichigo was evaluated against existing speech LLMs using the AudioBench framework. It outperformed other open-source models on speech question-answering benchmarks, scoring 67.8 on OpenHermes-Audio and 67.2 on ALPACA-Audio. Its latency to first token generation is 111 ms, lower than that of comparable systems, underscoring its suitability for real-time applications.

Implications and Future Directions

This research offers substantial implications for the field of multimodal AI. Ichigo's architecture sets a benchmark for efficiently integrating and processing speech and text in a unified model, with lower latency and enhanced performance over existing solutions. The work provides a viable framework for smaller research teams to develop competitive open-source speech-LLMs without the need for extensive resources.

Future developments may focus on overcoming the current limitations, such as improving emotional comprehension and extending context length for handling richer audio data and multi-turn dialogues. Additionally, further exploration into training stability with acoustic tokens may unlock even more advanced multimodal capabilities.

Conclusion

Overall, Ichigo represents a significant advancement in the seamless integration of speech and text processing within LLMs. Its innovative tokenized early-fusion approach not only enhances performance but also reduces complexity, paving the way for more accessible breakthroughs in the field of AI voice assistants. The paper contributes a robust framework that may inspire further innovations and applications in multimodal AI systems.
