The paper "MM1: Methods, Analysis and Insights from Multimodal LLM Pre-training" presents a comprehensive paper on building high-performance Multimodal LLMs (MLLMs). The work emphasizes the significance of various architectural components and data choices through extensive ablations. The authors formulate design lessons intended to guide future research in the field.
The paper's contributions can be summarized as:
- Ablations on small-scale models to analyze the impact of model architecture and pre-training data choices.
- Identification of key trends related to image resolution, visual encoder loss, and visual encoder pre-training data.
- Demonstration of the importance of interleaved image-text and text-only training data for few-shot performance, and caption data for zero-shot performance.
- Scaling up the model to larger LLMs of 3B, 7B, and 30B parameters, including mixture-of-experts (MoE) models.
- Achieving state-of-the-art (SOTA) performance on pre-training metrics and competitive results on multimodal benchmarks after supervised fine-tuning (SFT).
The paper explores three main areas of design decisions: architecture, data, and training procedure.
Architecture Ablations
The paper analyzes components that enable an LLM to process visual data, focusing on visual encoder pre-training and bridging visual features to the LLM.
- Image Encoder Pre-training: The authors investigate the impact of image resolution and of the image encoder pre-training objective, comparing contrastive (CLIP) and reconstructive (AIM) losses. Image resolution has the largest impact, followed by model size and training data composition; increasing the image resolution from 224 to 336 yields an approximately 3% boost across metrics.
- Vision-Language Connector and Image Resolution: The research explores different ways to map visual representations into the LLM's embedding space, considering the number of tokens representing the image ($64$ or $144$), the image resolution ($224$ or $336$), and architectural options such as average pooling, attention pooling, and convolutional mapping. The number of visual tokens and the image resolution matter most, while the specific vision-language connector architecture has comparatively little impact.
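To make the connector design concrete, here is a minimal PyTorch sketch (not the authors' implementation) of the average-pooling variant: it reduces a grid of ViT patch features to a fixed number of visual tokens and projects them into the LLM embedding space. All class names and dimensions are illustrative placeholders.

```python
import math
import torch
import torch.nn as nn

class AvgPoolConnector(nn.Module):
    """Illustrative vision-language connector: average-pools ViT patch features
    down to a fixed number of visual tokens and projects them to the LLM
    hidden size. Dimensions are placeholders, not MM1's actual sizes."""

    def __init__(self, vit_dim=1280, llm_dim=4096, num_tokens=144):
        super().__init__()
        self.grid = int(math.sqrt(num_tokens))       # e.g. 144 tokens -> 12x12 grid
        self.pool = nn.AdaptiveAvgPool2d(self.grid)  # pool the patch grid spatially
        self.proj = nn.Linear(vit_dim, llm_dim)      # map into the LLM embedding space

    def forward(self, patch_feats):                  # (B, N_patches, vit_dim)
        b, n, d = patch_feats.shape
        side = int(math.sqrt(n))                     # assume a square patch grid
        x = patch_feats.transpose(1, 2).reshape(b, d, side, side)
        x = self.pool(x)                             # (B, d, grid, grid)
        x = x.flatten(2).transpose(1, 2)             # (B, num_tokens, vit_dim)
        return self.proj(x)                          # (B, num_tokens, llm_dim)

# Example: a 336px image with 14px patches gives 24x24 = 576 patches -> 144 visual tokens.
tokens = AvgPoolConnector()(torch.randn(2, 576, 1280))
print(tokens.shape)  # torch.Size([2, 144, 4096])
```

Attention pooling or a convolutional abstractor (such as the C-Abstractor used in the final recipe) would replace the pooling step; per the ablation, the choice among these matters less than the token count and resolution.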
Pre-training Data Ablation
The paper emphasizes the importance of large-scale, task-appropriate data for training high-performance models and discusses data choices for the pre-training stage, including captioning data, interleaved image-text documents, and text-only data. The paper uses a simplified model setup for ablations and evaluates zero-shot and few-shot performance on captioning and visual question answering (VQA) tasks.
Key data lessons include:
- Interleaved data is crucial for few-shot and text-only performance, whereas captioning data improves zero-shot performance.
- Text-only data enhances few-shot and text-only performance.
- Careful mixing of image and text data leads to optimal multimodal performance while retaining strong text performance.
- Synthetic data boosts few-shot learning.
Final Model and Training Recipe
Based on the ablation results, the authors define the final recipe for MM1 multimodal pre-training:
- A ViT-H image encoder with 378×378 resolution, pre-trained with a CLIP objective on DFN-5B.
- A vision-language connector with 144 tokens, using the C-Abstractor architecture.
- A data mix of 45% interleaved image-text documents, 45% image-text pair documents, and 10% text-only documents.
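As a rough illustration of how such a mixture can be realized, the sketch below samples a data source per training example according to the 45/45/10 weights. The source names and the per-example sampling scheme are assumptions for illustration, not the authors' data pipeline.

```python
import random

# Illustrative pre-training mixture weights taken from the final MM1 recipe.
MIXTURE = {
    "interleaved_image_text": 0.45,
    "image_text_pairs": 0.45,
    "text_only": 0.10,
}

def sample_source(rng: random.Random) -> str:
    """Pick the data source for the next training example according to the mixture."""
    names, weights = zip(*MIXTURE.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in MIXTURE}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # roughly 4500 / 4500 / 1000
```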
The paper scales up the LLM size to 3B, 7B, and 30B parameters, initializing the image encoder and LLM decoder weights from in-house pre-trained models. Multimodal pre-training is then performed on the data mix for 200k steps.
Based on learning-rate sweeps at smaller scales, the authors fit a simple closed-form expression for the optimal peak learning rate as a function of model size, where:
- $\eta$ is the optimal peak learning rate
- $N$ is the number of non-embedding parameters.
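The fitted coefficients are not reproduced here; as a sketch of the general approach, one can fit the logarithm of the best peak learning rate found at small scales against the logarithm of the non-embedding parameter count and extrapolate to larger models. The sweep results and the resulting fit below are hypothetical numbers, purely for illustration.

```python
import numpy as np

# Hypothetical (N_non_embedding_params, best_peak_lr) pairs from small-scale sweeps.
# These numbers are illustrative only, not values from the paper.
n_params = np.array([0.6e9, 1.2e9, 3.0e9])
best_lr  = np.array([3.0e-4, 2.0e-4, 1.2e-4])

# Fit log(lr) as a linear function of log(N) and extrapolate to larger models.
slope, intercept = np.polyfit(np.log(n_params), np.log(best_lr), deg=1)

def predicted_peak_lr(n: float) -> float:
    """Predicted optimal peak learning rate for n non-embedding parameters."""
    return float(np.exp(slope * np.log(n) + intercept))

for n in (7e9, 30e9):
    print(f"{n:.0e} params -> peak lr ~ {predicted_peak_lr(n):.2e}")
```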
Supervised Fine-Tuning
The paper details supervised fine-tuning (SFT) experiments on top of the pre-trained models, using a mixture of instruction-response pairs generated by GPT-4, academic task-oriented vision-language (VL) datasets, and text-only SFT data. During SFT, both the image encoder and the LLM backbone remain unfrozen. High-resolution SFT is supported through positional-embedding interpolation and sub-image decomposition.
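The positional-embedding interpolation step can be sketched as follows: the ViT's learned patch-position grid is resized to the larger grid implied by the higher input resolution. This is a generic illustration under assumed shapes, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_grid: int) -> torch.Tensor:
    """Bicubically resize a learned (1, old_grid*old_grid, dim) positional
    embedding to a new_grid x new_grid patch grid for higher-resolution input."""
    _, n, dim = pos_embed.shape
    old_grid = int(n ** 0.5)
    grid = pos_embed.reshape(1, old_grid, old_grid, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_grid, new_grid), mode="bicubic",
                         align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, new_grid * new_grid, dim)

# Example: a 24x24 patch grid (336px input, 14px patches) -> 48x48 grid (672px input).
pos = torch.randn(1, 24 * 24, 1280)
print(interpolate_pos_embed(pos, 48).shape)  # torch.Size([1, 2304, 1280])
```

Sub-image decomposition would complement this by splitting a high-resolution image into crops that are encoded separately, with each crop handled at the encoder's native resolution.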
The results indicate that MM1 models achieve state-of-the-art performance compared to other models of the same size, particularly on VQAv2 (Visual Question Answering v2), TextVQA, ScienceQA, and MMMU. MoE (mixture-of-experts) models outperform their dense counterparts, demonstrating the potential of MoE for further scaling. The impact of image resolution and of pre-training on SFT performance is also highlighted.
The authors conclude that the design lessons identified in this work can aid the community in building robust models beyond specific architectures or data strategies.