MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices (2312.16886v2)

Published 28 Dec 2023 in cs.CV

Abstract: We present MobileVLM, a competent multimodal vision language model (MMVLM) targeted to run on mobile devices. It is an amalgamation of a myriad of architectural designs and techniques that are mobile-oriented, which comprises a set of LLMs at the scale of 1.4B and 2.7B parameters, trained from scratch, a multimodal vision model that is pre-trained in the CLIP fashion, cross-modality interaction via an efficient projector. We evaluate MobileVLM on several typical VLM benchmarks. Our models demonstrate on par performance compared with a few much larger models. More importantly, we measure the inference speed on both a Qualcomm Snapdragon 888 CPU and an NVIDIA Jetson Orin GPU, and we obtain state-of-the-art performance of 21.5 tokens and 65.3 tokens per second, respectively. Our code will be made available at: https://github.com/Meituan-AutoML/MobileVLM.

Insights into MobileVLM: A Vision LLM for Mobile Devices

The paper "MobileVLM: A Fast, Strong, and Open Vision Language Assistant for Mobile Devices" presents a groundbreaking approach for deploying multimodal vision LLMs (VLMs) on resource-constrained platforms. MobileVLM is crafted to balance high performance with efficient resource utilization, making it suitable for mobile and IoT devices.

Key Contributions and Design

MobileVLM distinguishes itself by integrating lightweight yet powerful components, optimized for mobile environments. It includes:

  1. Efficient Vision Encoder: A CLIP-pretrained ViT-L/14 serves as the visual backbone; its natural-language supervision yields robust visual features for tasks such as visual question answering and image captioning.
  2. Mobile-tailored LLMs: Dubbed MobileLLaMA, these are downscaled LLaMA-style models with 1.4B and 2.7B parameters, trained from scratch for mobile deployment. They adopt efficient architectural choices, including RoPE for positional encoding and RMSNorm for stable training, which contribute to fast inference.
  3. Lightweight Downsample Projector (LDP): This novel component aligns visual features with the word embedding space while reducing the number of visual tokens, and hence the computational load, without significant performance loss (a minimal sketch of the idea follows this list).
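
To make the projector idea concrete, the following is a minimal PyTorch sketch, not the paper's exact LDP. It assumes 576 CLIP ViT-L/14 patch features of width 1024 and a hypothetical LLM embedding width of 2048, and it uses pointwise convolutions plus a stride-2 depthwise convolution to cut the visual token count by 4x.

import torch
import torch.nn as nn

class DownsampleProjectorSketch(nn.Module):
    # Illustrative sketch only (not the paper's exact LDP): pointwise convs project
    # CLIP features to the LLM embedding width, and a stride-2 depthwise conv reduces
    # the visual token count by 4x (24x24 = 576 patches -> 12x12 = 144 tokens).
    def __init__(self, vision_dim=1024, llm_dim=2048, grid=24):
        super().__init__()
        self.grid = grid
        self.project = nn.Sequential(          # channel projection, token count unchanged
            nn.Conv2d(vision_dim, llm_dim, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(llm_dim, llm_dim, kernel_size=1),
        )
        self.downsample = nn.Sequential(       # stride-2 depthwise conv halves each spatial dim
            nn.Conv2d(llm_dim, llm_dim, kernel_size=3, stride=2, padding=1, groups=llm_dim),
            nn.Conv2d(llm_dim, llm_dim, kernel_size=1),
        )

    def forward(self, vis_tokens):             # vis_tokens: (B, 576, vision_dim)
        b, n, c = vis_tokens.shape
        x = vis_tokens.transpose(1, 2).reshape(b, c, self.grid, self.grid)
        x = self.downsample(self.project(x))   # (B, llm_dim, 12, 12)
        return x.flatten(2).transpose(1, 2)    # (B, 144, llm_dim), fed to the language model

dummy = torch.randn(1, 576, 1024)              # stand-in for CLIP ViT-L/14 patch features
print(DownsampleProjectorSketch()(dummy).shape)  # torch.Size([1, 144, 2048])

Reducing 576 visual tokens to 144 shortens the sequence the LLM must process for every image, which is where most of the projector's latency savings come from.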

Performance and Evaluation

MobileVLM exhibits competitive results on various VLM benchmarks despite its reduced computational footprint. Notably, it achieves inference speeds of 21.5 tokens/s on a Qualcomm Snapdragon 888 CPU and 65.3 tokens/s on an NVIDIA Jetson Orin GPU. The model performs on par with several much larger models on tasks such as general question answering and visual reasoning.

In the latency analysis, MobileVLM demonstrates strong performance on both mobile and IoT devices compared with peers built on backbones such as OpenLLaMA and TinyLLaMA, supporting its suitability for real-world applications. A simple way to reproduce this kind of throughput number is sketched below.
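
The sketch below shows the basic tokens-per-second calculation behind such latency figures; generate_fn is a hypothetical callable wrapping whatever on-device runtime is used (the paper's own measurements build on llama.cpp), so it would need to be adapted to a specific deployment.

import time

def measure_tokens_per_second(generate_fn, prompt, max_new_tokens=256):
    # generate_fn is a hypothetical wrapper around an on-device decoder that
    # returns the list of generated token ids for the given prompt.
    start = time.perf_counter()
    token_ids = generate_fn(prompt, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    return len(token_ids) / elapsed

# Example (hypothetical): average over a few runs to smooth out warm-up effects.
# speeds = [measure_tokens_per_second(my_generate, "Describe the image.") for _ in range(5)]
# print(sum(speeds) / len(speeds), "tokens/sec")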

Future Directions and Implications

The design decisions in MobileVLM indicate a shift towards deploying sophisticated AI models in resource-limited scenarios, expanding the applicability of AI in mobile and edge computing environments. This work prompts further exploration into model compression and efficiency techniques, potentially influencing future research in mobile AI deployment.

Researchers might further investigate optimizing neural architecture search for LLMs, exploring more efficient training paradigms, and expanding the use of high-quality datasets for better alignment of multimodal tasks.

Conclusion

MobileVLM lowers the barrier to deploying VLMs on mobile and low-power devices. By maintaining a balance between performance and efficiency, the work contributes to extending the reach of intelligent systems into everyday mobile applications and is poised to advance vision-language capabilities in diverse, real-world scenarios.

Authors (11)
  1. Xiangxiang Chu
  2. Limeng Qiao
  3. Xinyang Lin
  4. Shuang Xu
  5. Yang Yang
  6. Yiming Hu
  7. Fei Wei
  8. Xinyu Zhang
  9. Bo Zhang
  10. Xiaolin Wei
  11. Chunhua Shen