Introduction
The exploration of vision language models (VLMs) is gaining strong momentum in AI research, driven by the capabilities that emerge when visual perception is integrated with LLMs. New iterations appear constantly, pushing the boundaries of efficiency and performance. MobileVLM V2 arises from this ongoing innovation: building upon its predecessor, MobileVLM, it proposes a significantly improved VLM tailored for devices with constrained computational resources, such as mobile phones and embedded systems.
Related Work
A look at the lineage of VLMs reveals a strong emphasis on balancing model accuracy and efficiency. The incorporation of Mixture-of-Experts methods into VLMs, as demonstrated by MoE-LLaVA, offers compelling results yet complicates deployment, especially on edge devices. The nascent but impactful MobileVLM introduced a hardware-friendly architecture optimized for on-device performance. The field is trending towards versatile models that streamline architectures and refine datasets for efficient training and deployment. Recent efforts such as ShareGPT4V highlight the strides made in precisely aligning vision and language features.
Method
MobileVLM V2 achieves substantial improvements through three core enhancements: expanded high-quality training datasets, stronger training strategies, and a new lightweight projector, LDPv2. The datasets now include 1.2 million high-quality image-text pairs plus additional academic datasets to bolster model versatility. The training strategy keeps both the projector and the LLM parameters trainable, enhancing the model's ability to handle diverse tasks. Additionally, the new projector design reduces the number of image tokens while enriching positional information, without degrading performance.
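A minimal sketch of this projector idea, written in PyTorch, is shown below. It is illustrative only: the pooling factor, hidden widths, and the depthwise-convolution positional module are assumptions chosen to convey the concept (project vision features to the LLM width, condense the token grid, re-inject positional information), not the exact LDPv2 configuration.

```python
# Hedged sketch of a lightweight projector in the spirit of LDPv2.
# The layer sizes and the depthwise-conv positional module are assumptions.
import torch
import torch.nn as nn


class LightweightProjectorSketch(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 2048, pool: int = 2):
        super().__init__()
        # Per-token projection from the vision encoder width to the LLM width.
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        # Average pooling over the 2D token grid condenses tokens by pool**2.
        self.pool = nn.AvgPool2d(kernel_size=pool, stride=pool)
        # Depthwise convolution serves as a simple positional-information module,
        # added back through a residual connection.
        self.peg = nn.Conv2d(llm_dim, llm_dim, kernel_size=3, padding=1, groups=llm_dim)

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, num_tokens, vision_dim), num_tokens = h * w
        b, n, _ = vision_tokens.shape
        h = w = int(n ** 0.5)
        x = self.mlp(vision_tokens)                 # (b, n, llm_dim)
        x = x.transpose(1, 2).reshape(b, -1, h, w)  # to (b, llm_dim, h, w)
        x = self.pool(x)                            # token reduction
        x = x + self.peg(x)                         # enrich positional info
        return x.flatten(2).transpose(1, 2)         # back to (b, n', llm_dim)


# Example: 576 ViT tokens (a 24x24 grid) become 144 tokens at the LLM width.
tokens = torch.randn(1, 576, 1024)
out = LightweightProjectorSketch()(tokens)
print(out.shape)  # torch.Size([1, 144, 2048])
```

Feeding fewer, positionally enriched image tokens into the LLM is what allows the projector to cut decoding cost on constrained hardware while preserving the visual information the language model needs.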
Experiment
The model's main contribution is a state-of-the-art trade-off between performance and inference speed across standard benchmarks, with MobileVLM V2 scaling up to 7B parameters and surpassing previous best models. Even at the smaller 3B scale, it outperforms many larger VLMs, highlighting the efficiency of its architecture. In practical terms, MobileVLM V2 runs up to 1.65× faster than other models at similar scales, without sacrificing accuracy, and scaling the model to 7B parameters further widens the performance gap. Ablations also indicate that the token-reduction component can be removed, aligning the design with ShareGPT4V, without a notable latency penalty, which reinforces the flexibility of the architecture.
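As an illustration only, speed comparisons of this kind are typically produced with a simple timing harness like the hypothetical sketch below: repeated timed generation after a warm-up pass, reported as average tokens per second. The `generate_fn` callables are placeholders, not the MobileVLM V2 evaluation code.

```python
# Hypothetical tokens-per-second harness; placeholder callables stand in for VLMs.
import time
from typing import Callable


def tokens_per_second(generate_fn: Callable[[], int], warmup: int = 2, runs: int = 10) -> float:
    """Time `generate_fn`, which returns the number of tokens it produced."""
    for _ in range(warmup):  # warm-up runs exclude one-off setup costs
        generate_fn()
    total_tokens, total_time = 0, 0.0
    for _ in range(runs):
        start = time.perf_counter()
        total_tokens += generate_fn()
        total_time += time.perf_counter() - start
    return total_tokens / total_time


# Example comparison (dummy generators standing in for two models):
# speedup = tokens_per_second(run_model_a) / tokens_per_second(run_model_b)
```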
Conclusion
MobileVLM V2 significantly broadens the feasibility of deploying advanced AI models on mobile and edge devices. It demonstrates robust improvements over its predecessor through better data utilization, improved training, and a more capable projector. Its strong performance, coupled with an edge in inference efficiency, lends support to the continued advancement of multimodal AI research toward accessible, real-world applications.