MIO: A Foundation Model on Multimodal Tokens
The paper "MIO: A Foundation Model on Multimodal Tokens" introduces MIO, a comprehensive foundation model capable of understanding and generating content across multiple modalities, specifically text, speech, images, and videos. The objective of this model is to overcome the existing limitations in multimodal LLMs (MM-LLMs) and establish a framework for any-to-any generation, which implies that the model can process input and generate output across diverse formats seamlessly.
Key Contributions
The MIO model is developed to address the gap left by existing models like GPT-4o, which, despite showcasing the promise of any-to-any LLMs for real-world applications, is closed-source and does not support multimodal interleaved sequence generation. The key contributions of the MIO model can be summarized as follows:
- Multimodal Tokenization: Utilizes discrete tokens for speech, text, images, and video frames through specialized tokenizers.
- Causal Multimodal Modeling: Employs a unified causal modeling framework across all modalities, integrating discrete tokens into the LLM's autoregressive training paradigm.
- Four-Stage Training Process: Incorporates a detailed multi-stage training regimen:
  - Alignment Pre-training
  - Interleaved Pre-training
  - Speech-enhanced Pre-training
  - Comprehensive Supervised Fine-tuning
- Competitive Performance: Demonstrates outstanding performance across multiple benchmarks, often surpassing various dual-modal and modality-specific baseline models.
Methodology
Multimodal Tokenization and Causal Multimodal Modeling
The MIO model utilizes discrete tokenization strategies for different modalities:
- Images and Videos: Tokenized with SEED-Tokenizer, which uses a ViT-based encoder and a causal Q-Former to encode images (and video frames) into discrete tokens.
- Speech: Tokenized with SpeechTokenizer, which applies an 8-layer residual vector quantizer (RVQ) to decompose speech into content tokens and timbre tokens (a minimal RVQ sketch follows this list).
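As a rough illustration of how residual vector quantization yields layered discrete codes, the sketch below quantizes a single frame embedding with eight codebooks applied residually and then splits the resulting ids into content and timbre groups. The codebook sizes, embedding dimension, and the exact layer split are illustrative assumptions, not SpeechTokenizer's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative settings (not SpeechTokenizer's real configuration):
# 8 quantizer layers, each with a 1024-entry codebook over 64-dim frame embeddings.
NUM_LAYERS, CODEBOOK_SIZE, DIM = 8, 1024, 64
codebooks = rng.normal(size=(NUM_LAYERS, CODEBOOK_SIZE, DIM))

def rvq_encode(frame_embedding):
    """Residually quantize one frame embedding into 8 discrete token ids."""
    residual = frame_embedding.copy()
    token_ids = []
    for layer in range(NUM_LAYERS):
        # Pick the codebook entry closest to the current residual.
        distances = np.linalg.norm(codebooks[layer] - residual, axis=1)
        idx = int(np.argmin(distances))
        token_ids.append(idx)
        # Subtract the chosen code; later layers quantize what remains.
        residual -= codebooks[layer][idx]
    return token_ids

frame = rng.normal(size=DIM)  # stand-in for one encoded speech frame
tokens = rvq_encode(frame)
# Assumed split for illustration: first layer ~ content, remaining layers ~ timbre.
content_tokens, timbre_tokens = tokens[:1], tokens[1:]
print(content_tokens, timbre_tokens)
```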
This consistent tokenization aligns non-textual modalities with the text space, letting the model treat them as "foreign languages" within a textual sequence. Unified causal modeling then handles all modalities homogeneously, optimizing a cross-entropy loss for next-token prediction regardless of which modality the next token belongs to.
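To make the "foreign language" framing concrete, here is a minimal PyTorch sketch of interleaving discrete tokens from several modalities into one causal sequence and training it with the standard next-token cross-entropy objective. The vocabulary layout and boundary tokens are assumptions for illustration, not MIO's actual token scheme, and the embedding-plus-linear "model" merely stands in for the LLM backbone.

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary layout: text ids first, then image and speech codes
# shifted into the same id space, plus four boundary markers (not MIO's scheme).
TEXT_VOCAB, IMG_CODES, SPEECH_CODES, BOUNDARY_TOKENS = 32000, 8192, 1024, 4
VOCAB = TEXT_VOCAB + IMG_CODES + SPEECH_CODES + BOUNDARY_TOKENS

def to_unified_ids(text_ids, image_codes, speech_codes):
    """Interleave text, image, and speech tokens into one causal sequence."""
    IMG_OPEN, IMG_CLOSE, SP_OPEN, SP_CLOSE = VOCAB - 4, VOCAB - 3, VOCAB - 2, VOCAB - 1
    image_ids = [TEXT_VOCAB + c for c in image_codes]                 # shift into image range
    speech_ids = [TEXT_VOCAB + IMG_CODES + c for c in speech_codes]   # shift into speech range
    return torch.tensor(
        text_ids + [IMG_OPEN] + image_ids + [IMG_CLOSE]
        + [SP_OPEN] + speech_ids + [SP_CLOSE]
    )

# Toy "model": an embedding plus linear head standing in for the LLM backbone.
embed = nn.Embedding(VOCAB, 128)
lm_head = nn.Linear(128, VOCAB)

seq = to_unified_ids(text_ids=[1, 5, 9], image_codes=[3, 7], speech_codes=[2, 4])
logits = lm_head(embed(seq[:-1]))                    # predict each next token
loss = nn.functional.cross_entropy(logits, seq[1:])  # same objective for every modality
print(float(loss))
```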
Multi-stage Training
The multi-stage training strategy progressively incorporates and aligns multimodal data (an illustrative data-mixture sketch follows the list):
- Alignment Pre-training: Ensures initial alignment of multimodal tokens with text data.
- Interleaved Pre-training: Introduces richer contextual semantics via interleaved training on multimodal data.
- Speech-enhanced Pre-training: Gradually increases the proportion of speech data, compensating for the imbalance in token counts across modalities.
- Comprehensive Supervised Fine-tuning: Encompasses 16 diverse tasks across 34 datasets, refining the model’s understanding and generation capabilities across all supported modalities.
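The sketch below illustrates how such a staged curriculum might shift modality mixtures between stages, for example raising the share of speech during speech-enhanced pre-training. The stage names follow the paper, but the sampling weights are placeholder values, not MIO's reported data ratios.

```python
import random

# Placeholder per-stage sampling weights over modalities; the stage names follow
# the paper, but these ratios are illustrative, not MIO's reported mixtures.
STAGE_MIXTURES = {
    "alignment_pretraining":       {"text": 0.40, "image": 0.30, "speech": 0.20, "video": 0.10},
    "interleaved_pretraining":     {"text": 0.30, "image": 0.30, "speech": 0.20, "video": 0.20},
    "speech_enhanced_pretraining": {"text": 0.20, "image": 0.20, "speech": 0.50, "video": 0.10},
    "supervised_finetuning":       {"text": 0.25, "image": 0.25, "speech": 0.25, "video": 0.25},
}

rng = random.Random(0)

def sample_modality(stage):
    """Draw the modality of the next training example for a given stage."""
    modalities, probs = zip(*STAGE_MIXTURES[stage].items())
    return rng.choices(modalities, weights=probs, k=1)[0]

# Empirical check: sampled frequencies track the per-stage weights.
for stage in STAGE_MIXTURES:
    draws = [sample_modality(stage) for _ in range(1000)]
    print(stage, {m: round(draws.count(m) / len(draws), 2) for m in sorted(set(draws))})
```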
Experimental Evaluation
The evaluation covers multiple domains, emphasizing MIO's broad capabilities:
- Image Understanding and Generation: Benchmarks include MS-COCO, VQAv2, OK-VQA, VizWiz, and SEED-Bench, where MIO shows competitive or superior performance. Image generation is evaluated with CLIP-I scores (cosine similarity between CLIP image embeddings of generated and reference images; a sketch of this metric follows the list), and MIO demonstrates substantial capability in both descriptive and context-aware image generation.
- Speech-related Tasks: Evaluates automatic speech recognition (ASR) and text-to-speech (TTS), with word error rates (WER) competitive with speech-specific baselines such as Wav2vec and Whisper.
- Video Understanding and Generation: Assesses performance on MSVD-QA and MSRVTT-QA, where MIO achieves the highest scores among the compared baselines.
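For reference, CLIP-I is commonly computed as the cosine similarity between CLIP image embeddings of a generated image and its reference. The sketch below uses the Hugging Face transformers CLIP API as one possible implementation; the checkpoint choice and the file names are assumptions, and the paper's exact CLIP backbone may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# One possible CLIP backbone; the paper may use a different checkpoint.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_i_score(generated_path: str, reference_path: str) -> float:
    """Cosine similarity between CLIP image embeddings of two images."""
    images = [Image.open(generated_path).convert("RGB"),
              Image.open(reference_path).convert("RGB")]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)       # shape: [2, embed_dim]
    feats = feats / feats.norm(dim=-1, keepdim=True)     # L2-normalize
    return float((feats[0] * feats[1]).sum())

# Hypothetical file names, for illustration only.
print(clip_i_score("generated.png", "coco_reference.png"))
```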
Advanced Capabilities
MIO showcases emergent abilities enabled by its any-to-any and multimodal interleaved sequence generation capabilities. Examples include tasks such as:
- Interleaved video-text generation
- Chain-of-visual-thought reasoning
- Visual storytelling
- Multimodal in-context learning
- Instructional image editing
These capabilities are illustrated through qualitative demonstrations that underscore the model's proficiency at intricate multimodal interactions beyond what conventional benchmarks measure.
Implications and Future Developments
The practical implications of MIO are substantial: its ability to generate context-aware multimodal content positions it as a versatile tool for AI research and for real-world applications such as content creation, interactive AI systems, and multimodal human-computer interaction (HCI).
Its theoretical contribution lies in the unified, token-based framework for multimodal integration. Future developments could focus on:
- Enhancing the resolution and detail fidelity of generated content
- Addressing fine-grained control over speech timbre
- Exploring raw continuous video and highly detailed image generation
Overall, MIO sets a robust precedent for future multimodal LLM development, combining practicality with theoretical advancements in AI-driven multimodal understanding and generation.