Overview of MM1.5: Methods, Analysis, and Insights from Multimodal LLM Fine-tuning
The paper "MM1.5: Methods, Analysis and Insights from Multimodal LLM Fine-tuning" presents an in-depth exploration of the development and enhancement strategies for MM1.5, a family of multimodal LLMs (MLLMs). These models are designed to significantly improve the understanding of text-rich images, visual referring and grounding, and multi-image reasoning.
Core Contributions
Building on the MM1 architecture, MM1.5 takes a data-centric approach to model training, systematically exploring diverse data mixtures across the entire training lifecycle. Key innovations in MM1.5 include:
- Continual Pre-training:
- Data Selection: Integration of high-quality optical character recognition (OCR) data and synthetic captions.
- Resolution Impact: Demonstrated the importance of high-resolution images (up to 4 megapixels) for maximizing the model's understanding of text-rich images.
- Performance Boost: Continual pre-training using OCR data significantly enhances text comprehension and general performance.
- Supervised Fine-Tuning (SFT):
- Data Mixture Optimization: Careful selection and mixing of diverse datasets, categorized into single-image (e.g., text-rich, general, refer-and-ground), multi-image, and text-only groups.
- Balanced Capabilities: Extensive empirical studies identified mixture ratios that balance the model's performance across its core capabilities.
- Dynamic Image Splitting:
- High-Resolution Image Encoding: Adopting an any-resolution approach that dynamically splits images into sub-images for efficient processing, enabling support for arbitrary image aspect ratios and resolutions (see the sketch after this list).
- Performance Gains: This strategy shows significant improvements, especially in text-rich tasks and visual referring and grounding.
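To make the any-resolution idea concrete, here is a minimal Python sketch of dynamic image splitting: a grid is chosen to match the input aspect ratio, the image is tiled into fixed-size sub-images, and a low-resolution overview is kept for global context. The grid configurations, tile size, and overview handling below are illustrative assumptions, not MM1.5's exact settings.

```python
from PIL import Image

# Hypothetical grid configurations (cols, rows) and tile size;
# MM1.5's actual values may differ.
GRID_CONFIGS = [(1, 1), (1, 2), (2, 1), (2, 2), (1, 3), (3, 1), (2, 3), (3, 2)]
TILE_SIZE = 378  # side length, in pixels, of each sub-image fed to the vision encoder


def select_grid(width: int, height: int) -> tuple[int, int]:
    """Pick the grid whose aspect ratio best matches the input image."""
    image_ratio = width / height
    return min(GRID_CONFIGS, key=lambda g: abs((g[0] / g[1]) - image_ratio))


def split_image(image: Image.Image) -> list[Image.Image]:
    """Split an arbitrary-resolution image into fixed-size sub-images,
    plus a downscaled overview of the full image."""
    cols, rows = select_grid(*image.size)
    resized = image.resize((cols * TILE_SIZE, rows * TILE_SIZE))

    tiles = []
    for r in range(rows):
        for c in range(cols):
            box = (c * TILE_SIZE, r * TILE_SIZE,
                   (c + 1) * TILE_SIZE, (r + 1) * TILE_SIZE)
            tiles.append(resized.crop(box))

    # A low-resolution overview preserves global context alongside the tiles.
    overview = image.resize((TILE_SIZE, TILE_SIZE))
    return [overview] + tiles
```

Because the grid adapts to the aspect ratio, tall receipts, wide documents, and square photos all map onto sub-images without severe distortion, which is why this scheme helps most on text-rich inputs.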
Model Variants
MM1.5 encompasses a range of models from 1B to 30B parameters, including both dense and Mixture-of-Experts (MoE) variants:
- Dense Models: Offer competitive performance even at smaller scales (1B and 3B), making them suitable for deployment on mobile devices.
- MoE Models: Scale total model capacity while keeping the number of activated parameters per token constant during inference, providing an effective option for boosting performance without increasing inference cost.
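The sparse-expert idea can be illustrated with a minimal top-k routing layer. Note that the expert count, hidden size, and routing scheme here are generic placeholders for the general technique, not MM1.5's actual MoE configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoELayer(nn.Module):
    """Minimal top-k mixture-of-experts feed-forward layer.

    Only `top_k` experts run per token, so the number of *activated*
    parameters stays constant even as `num_experts` (total capacity) grows.
    """

    def __init__(self, dim: int, hidden: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (tokens, dim)
        # Route each token to its top-k experts and normalize their weights.
        weights, indices = self.router(x).topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out
```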
Specialized Variants: MM1.5-Video and MM1.5-UI
Beyond the general-purpose models, the paper introduces two specialized variants to cater to specific applications:
- MM1.5-Video: Focused on video understanding, evaluated in a training-free setup as well as fine-tuned with video-specific data, offering robust performance across multiple benchmarks.
- MM1.5-UI: Tailored for mobile UI understanding, demonstrating state-of-the-art results in various UI-related benchmarks.
Empirical Evaluation
The empirical validation of MM1.5 includes comprehensive ablation studies focusing on:
- Data Mixes in SFT: Demonstrates the impact of different dataset categories on individual capabilities, with the results guiding an optimal combination for balanced performance (see the sampling sketch after this list).
- Impact of Image Resolution: Highlights the necessity of high-resolution images in OCR-based continual pre-training for improving text-rich image understanding.
- Dynamic Splitting Efficiency: Shows that dynamic image splitting outperforms static splitting, particularly on text-rich tasks, without significant computational overhead.
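As a rough illustration of how category-level mixture ratios might be applied during SFT, the sketch below samples a batch by first drawing a category according to mixture weights and then drawing an example within that category. The category names and weights are placeholders, not the tuned ratios reported in the paper.

```python
import random

# Illustrative category weights; the paper's tuned mixture ratios will differ.
MIXTURE_WEIGHTS = {
    "single_image_text_rich": 0.3,
    "single_image_general": 0.3,
    "single_image_refer_ground": 0.2,
    "multi_image": 0.1,
    "text_only": 0.1,
}


def sample_sft_batch(datasets: dict[str, list], batch_size: int) -> list:
    """Draw a batch by first sampling a category according to the mixture
    weights, then sampling an example uniformly within that category."""
    categories = list(MIXTURE_WEIGHTS)
    weights = [MIXTURE_WEIGHTS[c] for c in categories]

    batch = []
    for _ in range(batch_size):
        category = random.choices(categories, weights=weights, k=1)[0]
        batch.append(random.choice(datasets[category]))
    return batch
```

Treating the mixture weights as tunable hyperparameters is what allows the kind of ablation described above: varying the weights and measuring how each capability responds.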
Implications and Future Directions
MM1.5 sets a new benchmark for multimodal capabilities, demonstrating strong performance across a wide range of tasks while remaining efficient to deploy. The extensive empirical studies and careful data curation offer valuable insights for future research in MLLM development. In particular, the findings emphasize the importance of high-quality data and dynamic training strategies in enhancing model capabilities.
Future developments may focus on further improving the integration of video and UI understanding into the core model, exploring more refined synthetic data generation methods, and scaling the models with more diverse datasets to enhance their generalization abilities. The ongoing advancements in this domain hold significant promise for practical applications in AI, particularly in areas requiring robust multimodal understanding and reasoning.