MobileVLM V2: Faster and Stronger Baseline for Vision Language Model (2402.03766v1)

Published 6 Feb 2024 in cs.CV and cs.AI

Abstract: We introduce MobileVLM V2, a family of significantly improved vision LLMs upon MobileVLM, which proves that a delicate orchestration of novel architectural design, an improved training scheme tailored for mobile VLMs, and rich high-quality dataset curation can substantially benefit VLMs' performance. Specifically, MobileVLM V2 1.7B achieves better or on-par performance on standard VLM benchmarks compared with much larger VLMs at the 3B scale. Notably, our 3B model outperforms a large variety of VLMs at the 7B+ scale. Our models will be released at https://github.com/Meituan-AutoML/MobileVLM .

Authors (11)
  1. Xiangxiang Chu (62 papers)
  2. Limeng Qiao (11 papers)
  3. Xinyu Zhang (296 papers)
  4. Shuang Xu (59 papers)
  5. Fei Wei (35 papers)
  6. Yang Yang (884 papers)
  7. Xiaofei Sun (36 papers)
  8. Yiming Hu (28 papers)
  9. Xinyang Lin (5 papers)
  10. Bo Zhang (633 papers)
  11. Chunhua Shen (404 papers)
Citations (75)

Summary

Introduction

Research on vision LLMs (VLMs) is gaining momentum, driven by the integration of visual perception with LLMs, and new iterations continue to push the boundaries of efficiency and performance. MobileVLM V2 emerges from this line of work: building upon its predecessor, MobileVLM, it proposes a significantly improved VLM tailored for devices with constrained computational resources, such as mobile phones and embedded systems.

Related Work

A look at the lineage of VLMs reveals a strong emphasis on balancing model accuracy and efficiency. Incorporating Mixture-of-Experts methods into VLMs, as in MoE-LLaVA, yields compelling results but complicates deployment, especially on edge devices. MobileVLM, though recent, introduced a hardware-friendly architecture optimized for on-device deployment. The field is trending toward versatile models that streamline architectures and refine datasets for efficient training and deployment, and recent work such as ShareGPT4V highlights the impact of precise vision-language feature alignment.

Method

MobileVLM V2 gains its improvements from three core enhancements: expanded high-quality training data, a stronger training strategy, and a new lightweight projector, LDPv2. The training data now includes 1.2 million high-quality image-text pairs plus additional academic datasets that broaden the model's versatility. The training strategy updates both the projector and the LLM parameters, improving the model's ability to handle diverse tasks. The new projector reduces the number of image tokens while enriching positional information without hurting performance; a sketch of this design is given below.
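
To make the token-reduction and positional-enhancement ideas concrete, here is a minimal PyTorch sketch of what an LDPv2-style projector could look like. The dimensions, the GELU activation, the 2x2 pooling, the depthwise-convolution positional step, and the class name are illustrative assumptions for this summary, not the authors' released implementation.

```python
# Sketch of an LDPv2-style lightweight projector, assuming a ViT vision
# encoder that emits 576 patch tokens (a 24x24 grid). Sizes and layer
# choices are assumptions, not the official MobileVLM V2 code.
import torch
import torch.nn as nn


class LDPv2Sketch(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=2048, pool=2):
        super().__init__()
        # Feature transformation: point-wise layers that map vision
        # features into the LLM embedding space.
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        # Token reduction: 2x2 average pooling shrinks the token count 4x
        # (e.g., 576 -> 144), cutting the LLM's visual sequence length.
        self.pool = nn.AvgPool2d(pool)
        # Positional information: a depthwise convolution serves as a
        # simple positional encoding, added back via a skip connection.
        self.peg = nn.Conv2d(llm_dim, llm_dim, kernel_size=3, padding=1,
                             groups=llm_dim)

    def forward(self, x):
        # x: (batch, num_tokens, vision_dim) from the vision encoder
        b, n, _ = x.shape
        h = w = int(n ** 0.5)                # assume a square patch grid
        x = self.mlp(x)                      # (b, n, llm_dim)
        x = x.transpose(1, 2).reshape(b, -1, h, w)
        x = self.pool(x)                     # (b, llm_dim, h/2, w/2)
        x = x + self.peg(x)                  # enrich with positional cues
        return x.flatten(2).transpose(1, 2)  # (b, n/4, llm_dim)


tokens = torch.randn(1, 576, 1024)           # dummy ViT patch features
print(LDPv2Sketch()(tokens).shape)           # torch.Size([1, 144, 2048])
```

The token reduction is what keeps the visual prompt short for the LLM, which matters most for on-device latency; the depthwise-convolution skip connection is a cheap way to restore spatial cues lost by pooling.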

Experiment

MobileVLM V2's central contribution is a state-of-the-art trade-off between performance and inference speed on standard benchmarks, with variants scaling up to 7B parameters and surpassing previous best models. Even at the 3B scale, it outperforms many larger VLMs, highlighting the efficiency of its architecture. In practical terms, MobileVLM V2 runs up to 1.65× faster than models of similar scale without sacrificing accuracy, and scaling to 7B parameters widens the performance gap further. The token reduction component can also be removed without a notable effect on inference latency, consistent with observations around ShareGPT4V and underscoring the flexibility of the architecture.

Conclusion

MobileVLM V2 significantly broadens the feasibility of deploying advanced AI models on mobile and edge devices. It demonstrates robust improvements over its predecessor through better data utilization, an improved training scheme, and a more capable projector. Its combination of strong benchmark performance and inference efficiency supports the continued advancement of multimodal AI research toward accessible, real-world applications.
