- The paper introduces an innovative multimodal training schema that combines omni-modal data construction, dedicated alignment, and multitask fine-tuning to enhance cross-modal understanding.
- It demonstrates strong performance across benchmarks, outperforming comparable models in text, image, video, and audio understanding tasks.
- The model sets a new baseline for open-source multimodal LLMs, offering practical insights for real-time, interactive multi-task processing and future AI research directions.
Technical Overview of Baichuan-Omni
The paper "Baichuan-Omni Technical Report" introduces Baichuan-Omni, a pioneering open-source 7B Multimodal LLM (MLLM). This model is noteworthy for its ability to process and analyze various modalities—text, image, video, and audio—simultaneously, enhancing multimodal interactive experiences. The key contribution of this work lies in the proposed multimodal training schema, which combines multimodal alignment and multitask fine-tuning to enable comprehensive multimodal understanding and interaction.
Model Architecture and Training
Baichuan-Omni is built upon a structured multimodal training pipeline. Starting from a 7B LLM backbone, the model is enhanced through:
- Omni-Modal Data Construction: The dataset incorporates a mixture of open-source, synthetic, and internally annotated data across multiple modalities. This includes a blend of image captions, OCR data, audio for Automatic Speech Recognition (ASR), and video data. The data is selected and synthesized to cover over 200 tasks with approximately 600,000 instances.
- Multimodal Alignment: Distinct branches (image-language, video-language, audio-language) undergo dedicated pre-training followed by alignment on high-quality data. The vision-language branches are trained on image-text pairs, while audio-language training draws on ASR datasets. This multi-stage pre-training refines how the model handles modality-specific representations.
- Multitask Fine-tuning: In this phase, the model undergoes further fine-tuning with cross-modal interaction data. This step ensures that the model can handle a wide range of tasks using complex, realistic interactions between modalities.
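A minimal sketch of this three-stage schedule is shown below. The stage names and data descriptions are paraphrased from the summary above; the dictionary layout and the `run_schedule` helper are assumptions made for illustration.

```python
# Illustrative outline of the three training stages described above; the
# structure and helper function are assumptions, not the paper's code.
STAGES = [
    {
        "name": "omni_modal_data_construction",
        "inputs": ["open-source corpora", "synthetic data", "internal annotations"],
        "goal": "assemble image captions, OCR, ASR audio, and video data "
                "covering 200+ tasks (~600K instances)",
    },
    {
        "name": "multimodal_alignment",
        "inputs": ["image-text pairs", "video-text data", "ASR datasets"],
        "goal": "pre-train and align the image-, video-, and audio-language "
                "branches with the LLM backbone",
    },
    {
        "name": "multitask_finetuning",
        "inputs": ["cross-modal interaction data"],
        "goal": "fine-tune on complex, realistic interactions across modalities",
    },
]

def run_schedule(stages):
    """Walk the stages in order; a real pipeline would hand each stage to its
    own data loader and training loop."""
    for i, stage in enumerate(stages, start=1):
        print(f"Stage {i}: {stage['name']}")
        print(f"  inputs: {', '.join(stage['inputs'])}")
        print(f"  goal:   {stage['goal']}")

if __name__ == "__main__":
    run_schedule(STAGES)
```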
Key Findings and Performance
Baichuan-Omni demonstrates strong performance across multiple benchmarks:
- Text Understanding: Achieves competitive results on benchmarks like MMLU and CMMLU, outperforming other open-source models such as MAP-Neo and VITA in comprehensive Chinese benchmark tests.
- Image and Video Understanding: Demonstrates superior performance in complex visual question answering tasks and general video understanding, exemplified by results on benchmarks such as MVBench and VideoMME.
- Audio Processing: Outperforms other models in ASR tasks across datasets like Fleurs and KeSpeech, with significant advantages in transcription accuracy and robustness across various dialects and task complexities.
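For context on the ASR comparisons, transcription accuracy on corpora such as Fleurs and KeSpeech is typically reported as word error rate (WER): the edit distance between reference and hypothesis transcripts, normalized by reference length. The snippet below is a minimal, dependency-free WER computation for illustration; it is not the paper's evaluation code.

```python
# Minimal word error rate (WER) via edit distance; standard metric behind
# ASR comparisons, shown here only as an illustrative reference.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

if __name__ == "__main__":
    # 1 substitution + 1 deletion over 6 reference words -> WER ~ 0.333
    print(wer("the cat sat on the mat", "the cat sit on mat"))
```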
Implications and Future Directions
Baichuan-Omni sets a precedent for open-source models by providing a robust baseline for further exploration in multimodal AI. Its capabilities point to practical applications that require real-time, comprehensive multimodal interaction.
Future research could explore:
- Enhancing text extraction and understanding from complex images such as documents and charts.
- Extending video understanding capabilities to process longer sequences more effectively.
- Integrating text-to-speech (TTS) systems more tightly with LLMs so that speech can be synthesized directly from generated text.
- Augmenting non-verbal sound understanding, which includes environmental audio cues alongside human speech.
Baichuan-Omni thus represents a significant step towards achieving a more generalized AI capable of understanding and interacting with the world through all prominent data modalities. This paper contributes valuable insights into the effective training of MLLMs and addresses core challenges in comprehensive multimodal AI.