
Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities (2410.11190v3)

Published 15 Oct 2024 in eess.AS, cs.AI, cs.CV, cs.LG, and cs.SD

Abstract: GPT-4o, an all-encompassing model, represents a milestone in the development of multi-modal LLMs. It can understand visual, auditory, and textual modalities, directly output audio, and support flexible duplex interaction. Models from the open-source community often achieve some functionalities of GPT-4o, such as visual understanding and voice chat. Nevertheless, training a unified model that incorporates all modalities is challenging due to the complexities of multi-modal data, intricate model architectures, and training processes. In this paper, we introduce Mini-Omni2, a visual-audio assistant capable of providing real-time, end-to-end voice responses to vision and audio queries. By integrating pretrained visual and auditory encoders, Mini-Omni2 maintains performance in individual modalities. We propose a three-stage training process to align modalities, allowing the LLM to handle multi-modal inputs and outputs after training on a limited dataset. For interaction, we introduce a command-based interruption mechanism, enabling more flexible interaction with users. To the best of our knowledge, Mini-Omni2 is one of the closest reproductions of GPT-4o, offering a similar form of functionality, and we hope it can offer valuable insights for subsequent research.

Citations (4)

Summary

  • The paper introduces Mini-Omni2, a unified model that leverages pre-trained visual and speech encoders to replicate key GPT-4o functionalities.
  • The model employs a three-stage training process—modality adaptation, alignment, and post-training—to enhance mixed-modality integration and real-time streaming outputs.
  • The research provides open-source synthetic datasets and training scripts, setting a foundation for future advancements in multi-modal AI and human-computer interaction.

Overview of Mini-Omni2: Towards an Open-Source Multi-Modal Model

The paper "Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities" introduces Mini-Omni2, a multi-modal visual-audio assistant designed to emulate the capabilities of the multi-modal model GPT-4o. The authors present Mini-Omni2 as a response to the challenges of integrating multiple modalities—vision, speech, and text—into a single cohesive model. This model aspires to extend GPT-4o's functionalities in the open-source domain, maintaining distinct modality performances, while supporting end-to-end streaming responses and duplex interactions.

Methodology

The development of Mini-Omni2 employs a unified architecture that fuses pre-trained visual and speech encoders to ensure robust performance across tasks involving different modalities. The model uses the CLIP visual encoder and the Whisper audio encoder, chosen for their strong pre-trained performance on their respective tasks. Language capabilities come from the Qwen2-0.5B base model, built on the Llama architecture and further adapted to process mixed-modality inputs and produce real-time streaming outputs.
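The paper does not spell out this exact wiring here, but the composition described above can be sketched roughly as follows. This is a minimal, hedged sketch rather than the authors' implementation: the Hugging Face checkpoint names (`openai/clip-vit-base-patch32`, `openai/whisper-small`, `Qwen/Qwen2-0.5B`), the single linear adapters, and the sequence-level concatenation of modality embeddings are illustrative assumptions.

```python
# Hedged sketch: a CLIP vision encoder and a Whisper audio encoder feeding a
# Qwen2-0.5B language model through simple linear adapters. Checkpoint names,
# adapter shapes, and the fusion strategy are assumptions, not the paper's code.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, CLIPVisionModel, WhisperModel


class MiniOmni2Sketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
        self.audio = WhisperModel.from_pretrained("openai/whisper-small").encoder
        self.lm = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B")
        hidden = self.lm.config.hidden_size
        # Linear adapters project each encoder's features into the LM embedding space.
        self.vision_adapter = nn.Linear(self.vision.config.hidden_size, hidden)
        self.audio_adapter = nn.Linear(self.audio.config.d_model, hidden)

    def forward(self, pixel_values, audio_features, input_ids):
        # Encode each modality and map it into the LM embedding space.
        v = self.vision_adapter(self.vision(pixel_values=pixel_values).last_hidden_state)
        a = self.audio_adapter(self.audio(input_features=audio_features).last_hidden_state)
        t = self.lm.get_input_embeddings()(input_ids)
        # Concatenate modality embeddings along the sequence axis and decode.
        fused = torch.cat([v, a, t], dim=1)
        return self.lm(inputs_embeds=fused)
```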

To train Mini-Omni2 effectively, a three-stage process is implemented (a sketch of the stage-wise schedule follows the list):

  1. Multimodal Encoder Adaptation: This initial phase focuses on adapting pre-trained encoders to align with the LLM’s embedding space. The aim is to minimize the number of parameter adjustments needed when incorporating multi-modal data.
  2. Modality Alignment: Here, the model's focus shifts to enhancing the logical reasoning capabilities across modalities without altering the adapter layers. Task-specific knowledge from text-based question-answering is transferred to visual and audio modalities.
  3. Post-training and Expansion: During this stage, full modality integration is refined with the inclusion of audio outputs, and special attention is given to developing a flexible interruption mechanism enhancing duplex communication.
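As a rough illustration of how such a staged schedule might be wired up, the sketch below freezes and unfreezes parameter groups per stage. The attribute names (`vision_adapter`, `audio_adapter`, `lm`) follow the architecture sketch above; the per-stage choices are assumptions, not the paper's exact recipe.

```python
# Hedged sketch of a stage-wise parameter schedule; the exact parameter groups
# and data mixes used per stage in the paper may differ.
def set_trainable(module, flag: bool):
    for p in module.parameters():
        p.requires_grad = flag


def configure_stage(model, stage: int):
    if stage == 1:
        # Stage 1 - multimodal encoder adaptation: train only the adapters so
        # encoder outputs align with the LLM's embedding space.
        set_trainable(model, False)
        set_trainable(model.vision_adapter, True)
        set_trainable(model.audio_adapter, True)
    elif stage == 2:
        # Stage 2 - modality alignment: keep the adapters fixed and train the
        # LLM on question answering transferred to vision and audio inputs.
        set_trainable(model, False)
        set_trainable(model.lm, True)
    else:
        # Stage 3 - post-training: unfreeze everything and add audio-output and
        # interruption (duplex) training data.
        set_trainable(model, True)
```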

Contributions

The researchers claim several key contributions in the paper:

  • An open-source multi-modal LLM that replicates GPT-4o's core functionalities in vision, speech, and text, with duplex interaction features.
  • A command-based interruption mechanism demonstrating the model's ability to use semantic cues to manage its output streams.
  • Open-sourced synthetic datasets and training scripts that give the research community resources for further exploration of multi-modal models.

Results and Implications

Mini-Omni2's performance evaluation focuses primarily on traditional tasks such as image captioning and speech recognition. In these experiments the model maintains performance comparable to its base models, such as Whisper, and shows improved robustness in speech recognition under diverse conditions, as reflected in its metrics on the LibriSpeech test and dev sets.

The innovative interruption mechanism based on semantic cues lends an adaptive and natural interaction capability to the model, pointing towards potential enhancements in human-computer interface systems. However, stability in full-duplex capabilities remains an area flagged for ongoing refinement.
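A command-based interruption loop of the kind described here might look roughly like the following. `generate_stream` and `detect_interrupt_command` are hypothetical stand-ins for the model's streaming decoder and semantic-cue detector, not functions from the released code.

```python
# Hedged sketch of command-based interruption during duplex interaction.
# generate_stream() and detect_interrupt_command() are hypothetical placeholders.
def duplex_response(model, query, mic_stream):
    for chunk in model.generate_stream(query):  # streamed response tokens/audio
        # Continuously monitor incoming audio; stop speaking if the user's
        # speech is classified as an interruption command ("stop", etc.).
        if detect_interrupt_command(model, mic_stream.latest_window()):
            break
        yield chunk
```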

Future Directions

The paper identifies several avenues for further research and development:

  • Scaling model size and dataset size to raise performance.
  • Improving the diversity and controllability of audio output characteristics, including emotional tone and stylistic variance.
  • Refining the semantic interruption mechanism to improve the stability and responsiveness of user interaction.

Conclusion

The Mini-Omni2 project represents a significant step towards developing a versatile, open-source multi-modal model that approaches the functionality offered by GPT-4o. By addressing the technical demands of modality integration and interaction fluidity, Mini-Omni2 stands as both a tangible tool for current applications and a foundation for future research in multi-modal AI systems. The release of the associated datasets and methodologies aims to spur continued advancements within the research community, facilitating future explorations into the dynamic capabilities of open-source LLMs.
