MIO: A Foundation Model on Multimodal Tokens
The paper "MIO: A Foundation Model on Multimodal Tokens" introduces MIO, a comprehensive foundation model capable of understanding and generating content across multiple modalities, specifically text, speech, images, and videos. The objective of this model is to overcome the existing limitations in multimodal LLMs (MM-LLMs) and establish a framework for any-to-any generation, which implies that the model can process input and generate output across diverse formats seamlessly.
Key Contributions
The MIO model is developed to address the gap left by existing models like GPT-4o, which, despite showcasing the promise of any-to-any LLMs for real-world applications, is closed-source and does not support multimodal interleaved sequence generation. The key contributions of the MIO model can be summarized as follows:
- Multimodal Tokenization: Utilizes discrete tokens for speech, text, images, and video frames through specialized tokenizers.
- Causal Multimodal Modeling: Employs a unified causal modeling framework across all modalities, integrating discrete tokens into the LLM's autoregressive training paradigm.
- Four-Stage Training Process: Incorporates a detailed multi-stage training regimen:
  - Alignment Pre-training
  - Interleaved Pre-training
  - Speech-enhanced Pre-training
  - Comprehensive Supervised Fine-tuning
- Competitive Performance: Demonstrates outstanding performance across multiple benchmarks, often surpassing various dual-modal and modality-specific baseline models.
Methodology
Multimodal Tokenization and Causal Multimodal Modeling
The MIO model utilizes discrete tokenization strategies for different modalities:
- Images and Videos: Tokenized with SEED-Tokenizer, which uses a ViT-based encoder and a causal Q-Former to encode images (and video frames) into discrete tokens.
- Speech: Tokenized with SpeechTokenizer, which applies an 8-layer residual vector quantizer (RVQ) to decompose speech into content tokens and timbre tokens (a minimal RVQ sketch follows this list).
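As a rough illustration of how residual vector quantization yields layered discrete codes, the sketch below quantizes a single frame embedding with eight codebooks applied residually and then splits the resulting ids into content and timbre groups. The codebook sizes, embedding dimension, and the exact layer split are illustrative assumptions, not SpeechTokenizer's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative settings (not SpeechTokenizer's real configuration):
# 8 quantizer layers, each with a 1024-entry codebook over 64-dim frame embeddings.
NUM_LAYERS, CODEBOOK_SIZE, DIM = 8, 1024, 64
codebooks = rng.normal(size=(NUM_LAYERS, CODEBOOK_SIZE, DIM))

def rvq_encode(frame_embedding):
    """Residually quantize one frame embedding into 8 discrete token ids."""
    residual = frame_embedding.copy()
    token_ids = []
    for layer in range(NUM_LAYERS):
        # Pick the codebook entry closest to the current residual.
        distances = np.linalg.norm(codebooks[layer] - residual, axis=1)
        idx = int(np.argmin(distances))
        token_ids.append(idx)
        # Subtract the chosen code; later layers quantize what remains.
        residual -= codebooks[layer][idx]
    return token_ids

frame = rng.normal(size=DIM)  # stand-in for one encoded speech frame
tokens = rvq_encode(frame)
# Assumed split for illustration: first layer ~ content, remaining layers ~ timbre.
content_tokens, timbre_tokens = tokens[:1], tokens[1:]
print(content_tokens, timbre_tokens)
```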
This consistent tokenization aligns non-textual modalities with the text space, letting the model treat them as "foreign languages" within a textual sequence. Unified causal modeling then handles all modalities homogeneously, optimizing a cross-entropy loss for next-token prediction regardless of which modality the next token belongs to.
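To make the "foreign language" framing concrete, here is a minimal PyTorch sketch of interleaving discrete tokens from several modalities into one causal sequence and training it with the standard next-token cross-entropy objective. The vocabulary layout and boundary tokens are assumptions for illustration, not MIO's actual token scheme, and the embedding-plus-linear "model" merely stands in for the LLM backbone.

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary layout: text ids first, then image and speech codes
# shifted into the same id space, plus four boundary markers (not MIO's scheme).
TEXT_VOCAB, IMG_CODES, SPEECH_CODES, BOUNDARY_TOKENS = 32000, 8192, 1024, 4
VOCAB = TEXT_VOCAB + IMG_CODES + SPEECH_CODES + BOUNDARY_TOKENS

def to_unified_ids(text_ids, image_codes, speech_codes):
    """Interleave text, image, and speech tokens into one causal sequence."""
    IMG_OPEN, IMG_CLOSE, SP_OPEN, SP_CLOSE = VOCAB - 4, VOCAB - 3, VOCAB - 2, VOCAB - 1
    image_ids = [TEXT_VOCAB + c for c in image_codes]                 # shift into image range
    speech_ids = [TEXT_VOCAB + IMG_CODES + c for c in speech_codes]   # shift into speech range
    return torch.tensor(
        text_ids + [IMG_OPEN] + image_ids + [IMG_CLOSE]
        + [SP_OPEN] + speech_ids + [SP_CLOSE]
    )

# Toy "model": an embedding plus linear head standing in for the LLM backbone.
embed = nn.Embedding(VOCAB, 128)
lm_head = nn.Linear(128, VOCAB)

seq = to_unified_ids(text_ids=[1, 5, 9], image_codes=[3, 7], speech_codes=[2, 4])
logits = lm_head(embed(seq[:-1]))                    # predict each next token
loss = nn.functional.cross_entropy(logits, seq[1:])  # same objective for every modality
print(float(loss))
```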
Multi-stage Training
The multi-stage training strategy progressively incorporates and aligns multimodal data (an illustrative data-mixture sketch follows the list):
- Alignment Pre-training: Ensures initial alignment of multimodal tokens with text data.
- Interleaved Pre-training: Introduces richer contextual semantics via interleaved training on multimodal data.
- Speech-enhanced Pre-training: Gradually increases the proportion of speech data, compensating for the imbalance in token counts across modalities.
- Comprehensive Supervised Fine-tuning: Encompasses 16 diverse tasks across 34 datasets, refining the model’s understanding and generation capabilities across all supported modalities.
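The sketch below illustrates how such a staged curriculum might shift modality mixtures between stages, for example raising the share of speech during speech-enhanced pre-training. The stage names follow the paper, but the sampling weights are placeholder values, not MIO's reported data ratios.

```python
import random

# Placeholder per-stage sampling weights over modalities; the stage names follow
# the paper, but these ratios are illustrative, not MIO's reported mixtures.
STAGE_MIXTURES = {
    "alignment_pretraining":       {"text": 0.40, "image": 0.30, "speech": 0.20, "video": 0.10},
    "interleaved_pretraining":     {"text": 0.30, "image": 0.30, "speech": 0.20, "video": 0.20},
    "speech_enhanced_pretraining": {"text": 0.20, "image": 0.20, "speech": 0.50, "video": 0.10},
    "supervised_finetuning":       {"text": 0.25, "image": 0.25, "speech": 0.25, "video": 0.25},
}

rng = random.Random(0)

def sample_modality(stage):
    """Draw the modality of the next training example for a given stage."""
    modalities, probs = zip(*STAGE_MIXTURES[stage].items())
    return rng.choices(modalities, weights=probs, k=1)[0]

# Empirical check: sampled frequencies track the per-stage weights.
for stage in STAGE_MIXTURES:
    draws = [sample_modality(stage) for _ in range(1000)]
    print(stage, {m: round(draws.count(m) / len(draws), 2) for m in sorted(set(draws))})
```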
Experimental Evaluation
The evaluation covers multiple domains, emphasizing MIO's broad capabilities:
- Image Understanding and Generation: Benchmarks include MS-COCO, VQAv2, OK-VQA, VizWiz, and SEED-Bench, where MIO shows competitive or superior performance. Image generation is evaluated with CLIP-I scores (cosine similarity between CLIP image embeddings of generated and reference images; a sketch of this metric follows the list), and MIO demonstrates substantial capability in both descriptive and context-aware image generation.
- Speech-related Tasks: Evaluates automatic speech recognition (ASR) and text-to-speech (TTS), with word error rates (WER) competitive with speech-specific baselines such as Wav2vec and Whisper.
- Video Understanding and Generation: Assesses performance on MSVD-QA and MSRVTT-QA, where MIO achieves the highest scores among the compared baselines.
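For reference, CLIP-I is commonly computed as the cosine similarity between CLIP image embeddings of a generated image and its reference. The sketch below uses the Hugging Face transformers CLIP API as one possible implementation; the checkpoint choice and the file names are assumptions, and the paper's exact CLIP backbone may differ.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# One possible CLIP backbone; the paper may use a different checkpoint.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_i_score(generated_path: str, reference_path: str) -> float:
    """Cosine similarity between CLIP image embeddings of two images."""
    images = [Image.open(generated_path).convert("RGB"),
              Image.open(reference_path).convert("RGB")]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)       # shape: [2, embed_dim]
    feats = feats / feats.norm(dim=-1, keepdim=True)     # L2-normalize
    return float((feats[0] * feats[1]).sum())

# Hypothetical file names, for illustration only.
print(clip_i_score("generated.png", "coco_reference.png"))
```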
Advanced Capabilities
MIO showcases emergent abilities enabled by its any-to-any and multimodal interleaved sequence generation capabilities. Examples include tasks such as:
- Interleaved video-text generation
- Chain-of-visual-thought reasoning
- Visual storytelling
- Multimodal in-context learning
- Instructional image editing
These capabilities are illustrated through qualitative demonstrations that underscore the model's proficiency at intricate multimodal interactions beyond what conventional benchmarks measure.
Implications and Future Developments
The practical implications of MIO are substantial: its ability to generate context-aware multimodal content positions it as a versatile tool for AI research and for real-world applications such as content creation, interactive AI systems, and multimodal human-computer interaction (HCI).
Its theoretical contribution lies in the unified, token-based framework for multimodal integration. Future developments could focus on:
- Enhancing the resolution and detail fidelity of generated content
- Addressing fine-grained control over speech timbre
- Exploring raw continuous video and highly detailed image generation
Overall, MIO sets a robust precedent for future multimodal LLM development, combining practicality with theoretical advancements in AI-driven multimodal understanding and generation.