Mini-InternVL: A Resource-Efficient Approach to Multimodal Learning
The paper introduces Mini-InternVL, a series of compact multimodal large language models (MLLMs) designed to address the challenges of training and deploying large-scale models on consumer-grade GPUs. The key result is that the series achieves roughly 90% of the performance of its much larger counterparts with only about 5% of the parameters, significantly reducing computational demands.
Overview
Mini-InternVL leverages InternViT-300M, a lightweight vision encoder initialized from CLIP and refined through knowledge distillation with InternViT-6B as the teacher. Each model pairs this vision encoder with a pre-trained language model of 0.5B to roughly 4B parameters (Qwen2-0.5B, InternLM2-1.8B, or Phi-3-Mini), yielding the 1B, 2B, and 4B variants of the series.
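As a rough illustration of how a compact vision encoder can be wired to a small language model in this kind of ViT-MLP-LLM design, the sketch below projects patch features into the language model's embedding space. The class name, dimensions, and two-layer MLP are assumptions for illustration, not the released Mini-InternVL implementation.

```python
import torch
import torch.nn as nn

class VisionLanguageProjector(nn.Module):
    """Illustrative MLP connector: maps vision-encoder patch features into the
    language model's token-embedding space so they can be fed to the LLM
    alongside text tokens. Dimensions below are placeholders."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.LayerNorm(vision_dim),
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_visual_tokens, vision_dim)
        return self.proj(patch_features)  # (batch, num_visual_tokens, llm_dim)

# Toy usage: 256 visual tokens projected into a 2048-dim LLM embedding space.
projector = VisionLanguageProjector()
visual_tokens = projector(torch.randn(1, 256, 1024))
print(visual_tokens.shape)  # torch.Size([1, 256, 2048])
```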
The models demonstrate robust performance across general multimodal benchmarks such as MMBench and ChartQA while delivering substantial efficiency gains. Remarkably, Mini-InternVL-4B retains about 90% of the performance of InternVL2-76B despite its drastically smaller parameter count.
Methodology
A unified adaptation framework is proposed to facilitate transfer across varied domains, including autonomous driving, medical imaging, and remote sensing. This paradigm standardizes the model architecture and data format, supporting efficient deployment in domain-specific applications.
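To make the idea of a standardized data format concrete, here is a hypothetical conversation-style sample of the kind such a framework could use uniformly across domains; the field names, file paths, and answers are invented for illustration and are not taken from the paper's datasets.

```python
# Hypothetical, uniformly formatted samples from two different target domains.
# Field names, paths, and answer text are illustrative only.
driving_sample = {
    "image": "nuscenes/front_cam_0001.jpg",
    "conversations": [
        {"from": "human",
         "value": "<image>\nWhat should the ego vehicle do at this intersection?"},
        {"from": "gpt",
         "value": "Slow down and yield to the pedestrian crossing from the right."},
    ],
}

remote_sensing_sample = {
    "image": "rsvqa/tile_1187.png",
    "conversations": [
        {"from": "human", "value": "<image>\nHow many buildings are visible in this tile?"},
        {"from": "gpt", "value": "There are three buildings visible."},
    ],
}
```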
InternViT-300M, the core visual component, is developed by distilling knowledge from InternViT-6B. This process lets the small encoder inherit and consolidate the visual knowledge of its much larger teacher, reducing reliance on extensive pre-training data.
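A minimal sketch of feature-level distillation, assuming the student's patch features are projected to the teacher's width and pulled toward the frozen teacher's outputs with a cosine objective; the exact loss and dimensions used for InternViT-300M may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def feature_distillation_loss(student_feats, teacher_feats, proj):
    """Negative-cosine distillation between student and (frozen) teacher features.
    student_feats: (B, N, d_student), teacher_feats: (B, N, d_teacher)."""
    student_proj = proj(student_feats)          # map student width to teacher width
    teacher_feats = teacher_feats.detach()      # teacher provides fixed targets
    return 1.0 - F.cosine_similarity(student_proj, teacher_feats, dim=-1).mean()

# Toy widths: 1024 for the student, 3200 for the teacher (placeholders).
proj = nn.Linear(1024, 3200)
loss = feature_distillation_loss(torch.randn(2, 256, 1024),
                                 torch.randn(2, 256, 3200), proj)
print(loss.item())
```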
The training strategy involves two stages: language-image alignment, which connects the vision encoder's outputs to the language model's embedding space, and visual instruction tuning, which teaches the model to follow multimodal instructions. Together, these stages allow the model to handle a wide variety of visual and linguistic inputs.
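The stage-wise recipe can be summarized with a small configuration sketch. Which modules are frozen at each stage is an assumption for illustration (a common pattern in this model family), not a verbatim copy of the paper's settings.

```python
# Illustrative two-stage training schedule; module names and freezing choices
# are assumptions, not the exact Mini-InternVL configuration.
TRAINING_STAGES = {
    "stage1_language_image_alignment": {
        "trainable": ["mlp_projector"],                 # learn to map visual tokens into LLM space
        "frozen": ["vision_encoder", "llm"],
        "data": "large-scale image-text pairs",
    },
    "stage2_visual_instruction_tuning": {
        "trainable": ["vision_encoder", "mlp_projector", "llm"],  # full-model fine-tuning
        "frozen": [],
        "data": "multimodal instruction-following conversations",
    },
}

for stage, cfg in TRAINING_STAGES.items():
    print(f"{stage}: train {', '.join(cfg['trainable'])} on {cfg['data']}")
```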
Results
The paper presents comprehensive evaluations showcasing Mini-InternVL's ability to maintain high performance with far fewer parameters. On autonomous driving datasets such as DriveLM-nuScenes, Mini-InternVL achieves scores competitive with models roughly ten times its size.
In the medical imaging and remote sensing domains, the model significantly outperforms existing specialized frameworks. Notably, adaptation to these domains preserves the model's general multimodal capabilities.
Implications
Mini-InternVL sets a precedent for developing resource-efficient MLLMs that can be applied across numerous domains under tight computational budgets. The use of knowledge distillation to build a powerful yet compact vision encoder is particularly impactful.
This approach promotes the widespread adoption of MLLMs by lowering hardware requirements, enabling practical deployment even on edge devices.
Future Directions
Future research could focus on refining the knowledge distillation process to further enhance model capabilities while investigating additional domains for application. Moreover, exploring the balance between model size, performance, and domain-specific adaptation may yield further optimization insights.
Overall, Mini-InternVL represents a substantial advance in the development of efficient multimodal models, paving the way for broader accessibility and application of AI-driven solutions.