- The paper introduces Stream-Omni, a framework that purposefully aligns the language, vision, and speech modalities to make multimodal integration more efficient.
- It employs CTC-based speech-text mapping and surfaces intermediate text outputs to enrich user interactions.
- Experimental results demonstrate robust performance on visual understanding, speech interaction, and vision-grounded speech interaction tasks using only 23,000 hours of speech data.
Stream-Omni: Enhancing Multimodal Interactions with Large Language-Vision-Speech Models
The paper presents Stream-Omni, a large language-vision-speech model (LMM) built for simultaneous multimodal interaction. The model integrates text, vision, and speech in a manner that supports flexible combinations of inputs and outputs while keeping modality alignment efficient and adaptable.
Framework and Methodology
Existing multimodal models, typified by GPT-4o, face challenges in modality alignment, particularly in balancing performance against a reliance on large training corpora. Conventional approaches concatenate the representations of all modalities along the sequence dimension, which requires substantial data to learn the alignments between them. Stream-Omni instead models the alignments between modalities deliberately, according to how each modality relates to text, to improve efficiency and adaptability.
Key features of Stream-Omni include:
- Multimodal Integration: Stream-Omni employs an LLM as the backbone and aligns the other modalities to it according to their roles and relationships. The vision modality is fused by sequence-dimension concatenation because its semantics complement text, whereas the speech modality uses layer-dimension mapping because of its higher semantic consistency with text (see the first sketch after this list).
- CTC-based Mapping: To align speech with text, Stream-Omni uses Connectionist Temporal Classification (CTC), which enables effective speech-text mapping and lets the LLM's text capabilities transfer to the speech modality with lower data requirements (second sketch below).
- Intermediate Text Outputs: During speech interaction, Stream-Omni also produces intermediate text outputs, such as the ASR transcription of the user's speech and the text form of its own response, giving users a more transparent and comprehensive multimodal experience (third sketch below).
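The contrast between the two alignment strategies can be made concrete with a short PyTorch sketch. This is not the paper's implementation; the module names, dimensions, and the choice of injecting speech at a single layer are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SequenceDimVisionFusion(nn.Module):
    """Vision-text alignment: project vision features and concatenate them
    with the text embeddings along the sequence dimension."""
    def __init__(self, vision_dim=768, hidden=1024):
        super().__init__()
        self.proj = nn.Linear(vision_dim, hidden)

    def forward(self, vision_feats, text_embeds):
        # vision_feats: (B, N_img, vision_dim); text_embeds: (B, N_txt, hidden)
        return torch.cat([self.proj(vision_feats), text_embeds], dim=1)

class LayerDimSpeechMapping(nn.Module):
    """Speech-text alignment: map speech states onto the hidden states of an
    LLM layer instead of appending them to the sequence, reflecting the
    tighter semantic correspondence between speech and text."""
    def __init__(self, speech_dim=512, hidden=1024):
        super().__init__()
        self.proj = nn.Linear(speech_dim, hidden)

    def forward(self, speech_states, layer_hidden):
        # speech_states: (B, T, speech_dim), assumed already aligned to the
        # text positions (e.g. via the CTC mapping); layer_hidden: (B, T, hidden)
        return layer_hidden + self.proj(speech_states)

# Toy shapes only, to show how the two fusions differ.
vision = torch.randn(1, 16, 768)      # 16 vision tokens
text = torch.randn(1, 8, 1024)        # 8 text embeddings
fused_seq = SequenceDimVisionFusion()(vision, text)    # (1, 24, 1024): longer sequence
speech = torch.randn(1, 8, 512)       # speech states aligned to the 8 text positions
fused_layer = LayerDimSpeechMapping()(speech, text)    # (1, 8, 1024): same length
```

The key difference the sketch highlights is that sequence-dimension fusion grows the context the LLM must attend over, while layer-dimension mapping keeps the sequence length fixed and relies on the close speech-text correspondence instead.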
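The CTC alignment can be illustrated with torch.nn.CTCLoss. The vocabulary size, encoder width, and blank index below are placeholder assumptions, not Stream-Omni's actual configuration.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 32000   # text vocabulary, with index 0 reserved for the CTC blank (assumed)
SPEECH_DIM = 512     # width of the speech states (assumed)

ctc_head = nn.Linear(SPEECH_DIM, VOCAB_SIZE)
ctc_criterion = nn.CTCLoss(blank=0, zero_infinity=True)

def speech_text_ctc_loss(speech_states, speech_lengths, text_targets, text_lengths):
    """speech_states: (B, T, SPEECH_DIM) from the speech layers;
    text_targets: (B, S) text token ids. CTC lets every speech frame emit a
    text token or a blank, aligning the long speech sequence to the shorter
    text sequence without frame-level labels."""
    log_probs = ctc_head(speech_states).log_softmax(dim=-1)  # (B, T, VOCAB_SIZE)
    log_probs = log_probs.transpose(0, 1)                    # (T, B, VOCAB_SIZE), as CTCLoss expects
    return ctc_criterion(log_probs, text_targets, speech_lengths, text_lengths)

# Toy batch: 2 utterances of 50 speech frames mapping to 10 text tokens each.
states = torch.randn(2, 50, SPEECH_DIM)
targets = torch.randint(1, VOCAB_SIZE, (2, 10))
loss = speech_text_ctc_loss(states, torch.full((2,), 50), targets, torch.full((2,), 10))
```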
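Finally, the role of intermediate text can be sketched as a streaming loop. The event structure and helper functions below are stubs invented for illustration; they only show how an ASR transcription and a streamed text response can be surfaced alongside speech output.

```python
from dataclasses import dataclass
from typing import Iterator, List

@dataclass
class InteractionEvent:
    kind: str       # "asr_text", "response_text", or "response_speech"
    payload: str

# Stub helpers standing in for the real speech-to-text, text-generation, and
# speech-synthesis components; they exist only to make the flow runnable.
def transcribe(chunks: List[str]) -> str:
    return " ".join(chunks)

def generate_text(prompt: str) -> Iterator[str]:
    yield from ["Sure,", " here", " is", " an", " answer."]

def synthesize(token: str) -> str:
    return f"<unit:{token.strip()}>"

def stream_interaction(user_speech_chunks: List[str]) -> Iterator[InteractionEvent]:
    # 1) The speech-to-text mapping first surfaces a transcription of the input.
    transcript = transcribe(user_speech_chunks)
    yield InteractionEvent("asr_text", transcript)
    # 2) The backbone streams its text response token by token, and
    # 3) speech units for the same content are emitted alongside the text.
    for token in generate_text(transcript):
        yield InteractionEvent("response_text", token)
        yield InteractionEvent("response_speech", synthesize(token))

for event in stream_interaction(["what", "is", "stream", "omni"]):
    print(event.kind, event.payload)
```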
Experimental Validation
Stream-Omni's performance has been validated across a range of benchmarks, showing strong visual understanding, speech interaction, and vision-grounded speech interaction capabilities. Notably, with only 23,000 hours of speech data, it achieves results on speech-language tasks that typically require far larger corpora. The architecture also allows simultaneous generation and interaction across modalities, enhancing the versatility of LMMs.
Implications and Future Research Directions
The implications for AI development are significant. Stream-Omni sets a precedent for combining modalities in a resource-efficient manner while offering adaptive interaction capabilities. This could enable new applications in areas that demand real-time processing of diverse data streams, such as augmented reality, interactive media, and sophisticated user interfaces.
Future research could extend these principles to frameworks that further reduce data dependency or improve processing speed and latency. Exploring more dynamic adaptation strategies that preserve robustness across varying data scales and environments could also make LMMs more effective in real-world applications.
Conclusion
Stream-Omni contributes a novel approach to integrated multimodal interaction, extending the capabilities of LMMs beyond traditional data-intensive models by aligning modalities according to their semantic roles and relationships. The framework advances practical applications and opens avenues for further exploration of efficiency and integration in AI-driven systems.