MiniCPM-V: A GPT-4V Level MLLM on Your Phone
The paper "MiniCPM-V: A GPT-4V Level MLLM on Your Phone" presents the MiniCPM-V series, a family of efficient multimodal large language models (MLLMs) designed for deployment on end-side devices such as smartphones. The work addresses a core practicality issue: current MLLMs generally demand computational resources far beyond what mobile devices can provide.
Key Contributions
The primary contributions of the MiniCPM-V series are multi-faceted:
- Model Efficiency: The MiniCPM-V models demonstrate that GPT-4V-level performance is achievable with a much smaller model size (8B parameters for MiniCPM-Llama3-V 2.5).
- OCR Capabilities: The models deliver strong OCR performance on high-resolution images of varying aspect ratios, with high accuracy on real-world text recognition tasks.
- Trustworthiness: The integration of techniques like RLAIF-V helps reduce hallucination rates, making the models more reliable.
- Multilingual Support: The models support over 30 languages, expanding their usability in global contexts.
- End-Side Deployment: Emphasis on efficient deployment on mobile devices, incorporating quantization, memory optimization, and NPU acceleration.
Model Architecture and Techniques
The architecture of the MiniCPM-V series integrates a visual encoder, compression layer, and LLM. A notable feature is the adaptive visual encoding strategy that effectively handles high-resolution images with any aspect ratio. This consists of image partitioning, slice encoding with interpolated position embeddings, token compression, and spatial schema integration.
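The image-partitioning step can be sketched as choosing the slice grid whose aspect ratio best matches the input image. This is a simplified illustration, not the paper's exact scoring rule: the real strategy also weighs how close the slice count is to an ideal budget derived from the image area, and the slice budget used below is an assumed value.

```python
import math

def best_partition(width, height, max_slices=9):
    """Pick a (rows, cols) slice grid whose aspect ratio best matches
    the image, so each slice stays close to the encoder's native shape.

    Simplified sketch: scores candidates only by log-space distance
    between the image aspect ratio and the grid aspect ratio.
    max_slices=9 is an illustrative budget.
    """
    best, best_score = (1, 1), float("inf")
    for rows in range(1, max_slices + 1):
        for cols in range(1, max_slices + 1):
            if rows * cols > max_slices:
                continue
            # distance between image and grid aspect ratios, in log space
            score = abs(math.log((width / height) / (cols / rows)))
            if score < best_score:
                best, best_score = (rows, cols), score
    return best
```

For example, a wide 1344x448 image (aspect ratio 3:1) maps to a 1x3 grid, so each slice is roughly square before being resized and fed to the visual encoder with interpolated position embeddings.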
The training pipeline includes three stages: initial warm-up of the compression layer, extension of visual encoder input resolution, and high-resolution pre-training with a focus on OCR. The supervised fine-tuning phase leverages high-quality datasets to enhance model capabilities further, while the RLAIF-V method is used to align model behavior based on AI/human feedback, ensuring trustworthy responses.
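RLAIF-V optimizes the model on AI-ranked response pairs with a direct-preference-style objective. A minimal sketch of such a DPO-form loss follows; the `beta` value and the log-probability inputs are illustrative assumptions, not values from the paper.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO-style preference loss over one (chosen, rejected) pair.

    logp_* are the policy's sequence log-probabilities; ref_* are the
    frozen reference model's. beta=0.1 is an illustrative strength.
    """
    # reward margin: how much more the policy prefers the chosen response,
    # relative to the reference model
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    # loss = -log(sigmoid(margin)); small when the margin is large
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy matches the reference (zero margin), the loss is log 2; widening the preference margin toward the chosen response drives it toward zero.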
Experimental Results
On various popular benchmarks, the MiniCPM-Llama3-V 2.5 model notably outperforms a range of both open-source and proprietary models in tasks involving multimodal understanding, reasoning, and OCR capabilities:
- OCRBench: A score of 725 surpasses both strong open-source models and proprietary systems.
- Object Hallucination: Achieves lower hallucination rates than powerful models like GPT-4V.
These results indicate that MiniCPM-V can strike a balance between performance and efficiency, crucial for applications on resource-constrained devices.
End-Side Deployment
On the deployment side, the research covers both basic practices, such as quantization, and advanced ones, such as memory-usage and compilation optimizations. Offloading visual encoding to the NPU further reduces latency, revealing significant potential for these models in real-world end-side applications.
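The basic quantization step can be sketched as symmetric, group-wise 4-bit weight compression. This is a simplified, pure-Python illustration of the general technique, not the exact format the released models ship in, and the group size of 64 is an assumed value.

```python
def quantize_int4_symmetric(weights, group_size=64):
    """Symmetric 4-bit group-wise quantization of a flat weight list.

    Each group shares one float scale; values map to the int4 range
    [-8, 7]. A sketch of the kind of weight compression used for
    on-device deployment; group_size=64 is illustrative.
    """
    quants, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        scale = max(abs(w) for w in group) / 7.0 or 1.0  # avoid zero scale
        quants.append([max(-8, min(7, round(w / scale))) for w in group])
        scales.append(scale)
    return quants, scales

def dequantize(quants, scales):
    """Reconstruct approximate float weights from int4 groups."""
    return [v * s for grp, s in zip(quants, scales) for v in grp]
```

Storing 4-bit integers plus one scale per group cuts weight memory to roughly a quarter of fp16, at the cost of a bounded per-group rounding error.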
Implications and Future Directions
The implications of this research are profound both theoretically and practically:
- Theoretical advancements include demonstrating that high-level AI performance is feasible with smaller, efficient models, suggesting a Moore's-Law-like trajectory for MLLMs.
- Practical advancements involve the direct application of such models in mobile technology, expanding the reach and utility of AI.
Looking forward, continued research could focus on improving multimodal understanding capabilities and extending model functionalities to other modalities like video and audio. Moreover, optimizations in hardware (specifically for MLLMs) and deployment frameworks could further enhance the applicability of these models.
In conclusion, the MiniCPM-V series exemplifies a promising trend towards the miniaturization of state-of-the-art AI technologies, making advanced AI accessible on ubiquitous mobile devices. The balance between performance and usability marks a significant step forward in deploying capable AI systems in everyday technology, potentially transforming a wide array of applications.