- The paper introduces Baichuan-Omni-1.5, a model that integrates end-to-end audio generation with robust visual and textual processing using a multi-stage training strategy.
- It employs a custom Baichuan-Audio-Tokenizer, which captures both semantic and acoustic details, together with a flow-matching-based decoder for speech generation, enhancing multimodal interactions.
- It achieves superior performance on benchmarks like OpenMM-Medical (83.8%) and MMLU (72.2%), demonstrating its competitive edge over similar omni-modal models.
The paper introduces Baichuan-Omni-1.5, a novel omni-modal model featuring end-to-end audio generation capabilities. The model leverages approximately 500B tokens of multimodal data, an audio tokenizer (Baichuan-Audio-Tokenizer), and a multi-stage training strategy to achieve seamless, high-quality interaction across modalities without compromising single-modality performance. Baichuan-Omni-1.5 exhibits competitive performance, rivaling models such as Qwen2-VL-72B, particularly on multimodal medical benchmarks.
The key components and contributions include:
- A comprehensive data cleaning and synthesis pipeline for multimodal data.
- Baichuan-Audio-Tokenizer to capture both semantic and acoustic information from audio, enhancing compatibility with MLLMs.
- A multi-stage training strategy for effective synergy across all modalities.
- OpenAudioBench, an open-source audio understanding and generation benchmark for evaluating end-to-end audio capabilities.
- OpenMM-Medical, a comprehensive medical understanding benchmark, on which the model achieves SOTA performance with a 7B LLM backbone, scoring 83.8% and surpassing Qwen2-VL-72B's 80.7%.
The architecture of Baichuan-Omni-1.5 comprises a visual branch, an audio branch, and a pre-trained LLM backbone. The visual branch employs NaViT, similar to Qwen2-VL, for processing image and video inputs, along with a two-layer MLP visual projector. The audio branch incorporates the Baichuan-Audio-Tokenizer and a flow matching-based decoder for end-to-end speech processing. The Baichuan-Audio-Tokenizer is based on Residual Vector Quantization (RVQ) with a frame rate of 12.5 Hz.
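To make the branch wiring concrete, below is a minimal sketch of how the described components could be composed in PyTorch. The module and attribute names (the NaViT-style `visual_encoder`, `audio_tokenizer`, `visual_projector`, `audio_embed`, `audio_head`) and all shapes are assumptions for illustration; the paper does not publish this interface.

```python
import torch
import torch.nn as nn

class BaichuanOmniSketch(nn.Module):
    """Illustrative wiring of the visual branch, audio branch, and LLM backbone.

    Module names, attributes, and shapes are assumptions for exposition,
    not the released implementation.
    """

    def __init__(self, llm, visual_encoder, audio_tokenizer, hidden_size=4096):
        super().__init__()
        self.llm = llm                          # pre-trained LLM backbone (assumed to return last hidden states)
        self.visual_encoder = visual_encoder    # NaViT-style encoder for image/video inputs
        self.audio_tokenizer = audio_tokenizer  # RVQ-based tokenizer, ~12.5 Hz frame rate
        # Two-layer MLP projector mapping visual features into the LLM embedding space
        self.visual_projector = nn.Sequential(
            nn.Linear(visual_encoder.out_dim, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, hidden_size),
        )
        # Discrete audio tokens get their own embedding table and output head
        self.audio_embed = nn.Embedding(audio_tokenizer.codebook_size, hidden_size)
        self.audio_head = nn.Linear(hidden_size, audio_tokenizer.codebook_size)

    def forward(self, text_embeds, images=None, audio_wave=None):
        # Assemble one interleaved embedding sequence of shape (batch, seq_len, hidden_size).
        parts = [text_embeds]
        if images is not None:
            parts.append(self.visual_projector(self.visual_encoder(images)))
        if audio_wave is not None:
            audio_ids = self.audio_tokenizer.encode(audio_wave)  # (batch, audio_frames) token ids
            parts.append(self.audio_embed(audio_ids))
        hidden = self.llm(inputs_embeds=torch.cat(parts, dim=1))
        # The audio head predicts discrete audio tokens; a flow-matching decoder
        # (not shown) would turn those tokens back into a waveform.
        return self.audio_head(hidden)
```

As a rough sense of scale, the stated 12.5 Hz frame rate means ten seconds of speech would yield roughly 125 audio frames per codebook before entering the LLM.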
The training strategy involves a multi-stage approach, sketched as a stage schedule after this list:
- Image-Text Pretrain: Extends an LLM to process and understand visual input.
- Image-Audio-Text Pretrain: Expands the LLM to understand audio data in an end-to-end manner, incorporating the Baichuan-Audio-Tokenizer, a newly introduced audio embedding layer, and an independent audio head.
- Omni-Modal Pretrain: Trains all parameters using high-quality cross-modal interaction datasets, extending the maximum sequence length to 64k.
- Multimodal Supervised Fine-Tuning (SFT): Enhances the model's instruction-following capabilities across a range of tasks, utilizing a dataset of approximately 17 million data pairs across various modalities.
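As a rough illustration of the schedule above, the stages could be written out as a configuration list like the one below. Apart from the stage names, the 64k sequence length, and the ~17M SFT pairs, everything here (data mixes, trainable-module lists) is a placeholder assumption, not the paper's actual recipe.

```python
# Illustrative stage schedule; "trainable" and "data" entries are assumptions,
# except where the paper states them (64k context in omni-modal pretrain,
# ~17M instruction pairs in SFT).
TRAINING_STAGES = [
    {
        "name": "image_text_pretrain",
        "data": ["image-text pairs"],
        "trainable": ["visual_projector", "llm"],            # assumption
    },
    {
        "name": "image_audio_text_pretrain",
        "data": ["speech-text pairs", "image-text pairs"],
        "trainable": ["audio_embed", "audio_head", "llm"],    # newly added audio modules
    },
    {
        "name": "omni_modal_pretrain",
        "data": ["high-quality cross-modal interaction data"],
        "trainable": ["all"],
        "max_seq_len": 64 * 1024,                             # extended to 64k tokens
    },
    {
        "name": "multimodal_sft",
        "data": ["~17M instruction pairs across modalities"],
        "trainable": ["all"],                                 # assumption
    },
]
```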
The model was evaluated against proprietary models such as GPT-4o mini and GPT-4o, open-source general models such as MAP-Neo, Qwen1.5-Chat, Llama3-Instruct, and OLMo, and open-source omni-modal models such as VITA-1.0, VITA-1.5, Baichuan-Omni, and MiniCPM-o 2.6, across text, image, video, audio, medical, and omni benchmarks.
The text evaluation covers MMLU, CMMLU, AGIEval, C-Eval, and GAOKAO-Bench. Baichuan-Omni-1.5 demonstrates strong performance on pure-text benchmarks; for example, on MMLU it reaches 72.2%, compared with 67.1% for Llama3-Instruct.
The image evaluation covers MMBench-EN, MMBench-CN, SEEDBench, RealWorldQA, MMMU, MathVista, TextVQA, OCRBench, ChartQA, and HallusionBench. The model outperforms the latest open-source omni-modal models, VITA-1.5 and MiniCPM-o 2.6, on most of these benchmarks.
The video evaluation covers Perception-Test, MVBench, VideoMME, EgoSchema, ActivityNet-QA, and MSVD-QA. Baichuan-Omni-1.5 performs comparably to proprietary models on benchmarks such as EgoSchema and VideoMME, and achieves strong results among open-source multimodal models.
The audio evaluation covers OpenAudioBench, which includes Reasoning QA, Spoken Llama Questions, Web Questions, TriviaQA, and AlpacaEval. In the speech-to-text (s→t) setting, Baichuan-Omni-1.5 significantly outperforms models of the same size on Reasoning QA and AlpacaEval, scoring 50 and 7.79, respectively.
The omni-modal evaluation uses the OmniBench benchmark with four setups: 1) Image + Audio, 2) Image Caption + Audio, 3) Image + Audio Transcript, and 4) Image Caption + Audio Transcript. Baichuan-Omni-1.5 outperforms the omni-modal model MiniCPM-o 2.6 in three of the four settings.
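For clarity, the four setups can be read as the cross product of a visual input (the image or its caption) and an audio input (the audio or its transcript). The sketch below simply enumerates them with placeholder field names; this is not OmniBench's actual data schema.

```python
# Four OmniBench setups: each pairs a visual input with an audio input.
# Field names are placeholders for illustration only.
OMNIBENCH_SETUPS = [
    {"visual": "image",         "audio": "audio"},             # 1) Image + Audio
    {"visual": "image_caption", "audio": "audio"},             # 2) Image Caption + Audio
    {"visual": "image",         "audio": "audio_transcript"},  # 3) Image + Audio Transcript
    {"visual": "image_caption", "audio": "audio_transcript"},  # 4) Image Caption + Audio Transcript
]
```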
The medical evaluation covers GMAI-MMBench and OpenMM-Medical. Baichuan-Omni-1.5 achieves the highest performance on both. On OpenMM-Medical, MiniCPM-o 2.6 scores 73.6%, while Baichuan-Omni-1.5 scores 83.8%.