Mini-InternVL: A Resource-Efficient Approach to Multimodal Learning
The paper introduces Mini-InternVL, a series of compact multimodal large language models (MLLMs) designed to address the challenges of training and deploying large-scale models on consumer-grade GPUs. The key result is that the series achieves roughly 90% of the performance of its much larger counterparts with only about 5% of the parameters, significantly reducing computational demands.
Overview
Mini-InternVL leverages InternViT-300M, a lightweight vision encoder initialized from CLIP and refined through knowledge distillation with InternViT-6B as the teacher. Each model pairs this vision encoder with a pre-trained language model of 0.5B to roughly 4B parameters (Qwen2-0.5B, InternLM2-1.8B, or Phi-3-Mini), yielding the 1B, 2B, and 4B variants of the series.
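As a rough illustration of how a compact vision encoder can be wired to a small language model in this kind of ViT-MLP-LLM design, the sketch below projects patch features into the language model's embedding space. The class name, dimensions, and two-layer MLP are assumptions for illustration, not the released Mini-InternVL implementation.

```python
import torch
import torch.nn as nn

class VisionLanguageProjector(nn.Module):
    """Illustrative MLP connector: maps vision-encoder patch features into the
    language model's token-embedding space so they can be fed to the LLM
    alongside text tokens. Dimensions below are placeholders."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.LayerNorm(vision_dim),
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_visual_tokens, vision_dim)
        return self.proj(patch_features)  # (batch, num_visual_tokens, llm_dim)

# Toy usage: 256 visual tokens projected into a 2048-dim LLM embedding space.
projector = VisionLanguageProjector()
visual_tokens = projector(torch.randn(1, 256, 1024))
print(visual_tokens.shape)  # torch.Size([1, 256, 2048])
```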
The models demonstrate robust performance across general multimodal benchmarks such as MMBench and ChartQA while delivering substantial efficiency gains. Remarkably, Mini-InternVL-4B retains about 90% of the performance of InternVL2-76B despite its drastically smaller parameter count.
Methodology
A unified adaptation framework is proposed to facilitate transfer across varied domains, including autonomous driving, medical imaging, and remote sensing. This paradigm standardizes the model architecture and data format, supporting efficient deployment in domain-specific applications.
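To make the idea of a standardized data format concrete, here is a hypothetical conversation-style sample of the kind such a framework could use uniformly across domains; the field names, file paths, and answers are invented for illustration and are not taken from the paper's datasets.

```python
# Hypothetical, uniformly formatted samples from two different target domains.
# Field names, paths, and answer text are illustrative only.
driving_sample = {
    "image": "nuscenes/front_cam_0001.jpg",
    "conversations": [
        {"from": "human",
         "value": "<image>\nWhat should the ego vehicle do at this intersection?"},
        {"from": "gpt",
         "value": "Slow down and yield to the pedestrian crossing from the right."},
    ],
}

remote_sensing_sample = {
    "image": "rsvqa/tile_1187.png",
    "conversations": [
        {"from": "human", "value": "<image>\nHow many buildings are visible in this tile?"},
        {"from": "gpt", "value": "There are three buildings visible."},
    ],
}
```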
InternViT-300M, the core visual component, is developed by distilling knowledge from InternViT-6B. This process lets the small encoder inherit and consolidate the visual knowledge of its much larger teacher, reducing reliance on extensive pre-training data.
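A minimal sketch of feature-level distillation, assuming the student's patch features are projected to the teacher's width and pulled toward the frozen teacher's outputs with a cosine objective; the exact loss and dimensions used for InternViT-300M may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def feature_distillation_loss(student_feats, teacher_feats, proj):
    """Negative-cosine distillation between student and (frozen) teacher features.
    student_feats: (B, N, d_student), teacher_feats: (B, N, d_teacher)."""
    student_proj = proj(student_feats)          # map student width to teacher width
    teacher_feats = teacher_feats.detach()      # teacher provides fixed targets
    return 1.0 - F.cosine_similarity(student_proj, teacher_feats, dim=-1).mean()

# Toy widths: 1024 for the student, 3200 for the teacher (placeholders).
proj = nn.Linear(1024, 3200)
loss = feature_distillation_loss(torch.randn(2, 256, 1024),
                                 torch.randn(2, 256, 3200), proj)
print(loss.item())
```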
The training strategy involves two stages: language-image alignment, which connects the vision encoder's outputs to the language model's embedding space, and visual instruction tuning, which teaches the model to follow multimodal instructions. Together, these stages allow the model to handle a wide variety of visual and linguistic inputs.
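The stage-wise recipe can be summarized with a small configuration sketch. Which modules are frozen at each stage is an assumption for illustration (a common pattern in this model family), not a verbatim copy of the paper's settings.

```python
# Illustrative two-stage training schedule; module names and freezing choices
# are assumptions, not the exact Mini-InternVL configuration.
TRAINING_STAGES = {
    "stage1_language_image_alignment": {
        "trainable": ["mlp_projector"],                 # learn to map visual tokens into LLM space
        "frozen": ["vision_encoder", "llm"],
        "data": "large-scale image-text pairs",
    },
    "stage2_visual_instruction_tuning": {
        "trainable": ["vision_encoder", "mlp_projector", "llm"],  # full-model fine-tuning
        "frozen": [],
        "data": "multimodal instruction-following conversations",
    },
}

for stage, cfg in TRAINING_STAGES.items():
    print(f"{stage}: train {', '.join(cfg['trainable'])} on {cfg['data']}")
```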
Results
The paper presents comprehensive evaluations showcasing Mini-InternVL's ability to maintain high performance with far fewer parameters. On autonomous driving datasets such as DriveLM-nuScenes, Mini-InternVL achieves scores competitive with models roughly ten times its size.
In the medical imaging and remote sensing domains, the model significantly outperforms existing specialized frameworks. Notably, adaptation to these domains preserves the model's general multimodal capabilities.
Implications
Mini-InternVL sets a precedent for developing resource-efficient MLLMs that can be applied across numerous domains under tight computational budgets. The use of knowledge distillation to build a powerful yet compact vision encoder is particularly impactful.
This approach promotes the widespread adoption of MLLMs by lowering hardware requirements, enabling practical deployment even on edge devices.
Future Directions
Future research could focus on refining the knowledge distillation process to further enhance model capabilities while investigating additional domains for application. Moreover, exploring the balance between model size, performance, and domain-specific adaptation may yield further optimization insights.
Overall, Mini-InternVL represents a substantial advance in the development of efficient multimodal models, paving the way for broader accessibility and application of AI-driven solutions.