MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices (2312.16886v2)

Published 28 Dec 2023 in cs.CV

Abstract: We present MobileVLM, a competent multimodal vision language model (MMVLM) targeted to run on mobile devices. It is an amalgamation of a myriad of architectural designs and techniques that are mobile-oriented, which comprises a set of LLMs at the scale of 1.4B and 2.7B parameters, trained from scratch, a multimodal vision model that is pre-trained in the CLIP fashion, cross-modality interaction via an efficient projector. We evaluate MobileVLM on several typical VLM benchmarks. Our models demonstrate on par performance compared with a few much larger models. More importantly, we measure the inference speed on both a Qualcomm Snapdragon 888 CPU and an NVIDIA Jetson Orin GPU, and we obtain state-of-the-art performance of 21.5 tokens and 65.3 tokens per second, respectively. Our code will be made available at: https://github.com/Meituan-AutoML/MobileVLM.

Insights into MobileVLM: A Vision LLM for Mobile Devices

The paper "MobileVLM: A Fast, Strong, and Open Vision Language Assistant for Mobile Devices" presents a groundbreaking approach for deploying multimodal vision LLMs (VLMs) on resource-constrained platforms. MobileVLM is crafted to balance high performance with efficient resource utilization, making it suitable for mobile and IoT devices.

Key Contributions and Design

MobileVLM distinguishes itself by integrating lightweight yet powerful components, optimized for mobile environments. It includes:

  1. Efficient Vision Encoder: A CLIP-pretrained ViT-L/14 serves as the visual backbone; its natural-language supervision yields robust visual features for tasks such as visual question answering and image captioning.
  2. Mobile-tailored LLMs: Dubbed MobileLLaMA, these are downscaled LLaMA-style models with 1.4B and 2.7B parameters, trained from scratch for mobile deployment. They adopt efficient architectural choices, including RoPE for positional encoding and RMSNorm for stable training, which contribute to fast inference.
  3. Lightweight Downsample Projector (LDP): This novel component aligns visual features with the word embedding space while reducing the number of visual tokens, and hence the computational load, without significant performance loss (a minimal sketch of the idea follows this list).
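
To make the projector idea concrete, the following is a minimal PyTorch sketch, not the paper's exact LDP. It assumes 576 CLIP ViT-L/14 patch features of width 1024 and a hypothetical LLM embedding width of 2048, and it uses pointwise convolutions plus a stride-2 depthwise convolution to cut the visual token count by 4x.

import torch
import torch.nn as nn

class DownsampleProjectorSketch(nn.Module):
    # Illustrative sketch only (not the paper's exact LDP): pointwise convs project
    # CLIP features to the LLM embedding width, and a stride-2 depthwise conv reduces
    # the visual token count by 4x (24x24 = 576 patches -> 12x12 = 144 tokens).
    def __init__(self, vision_dim=1024, llm_dim=2048, grid=24):
        super().__init__()
        self.grid = grid
        self.project = nn.Sequential(          # channel projection, token count unchanged
            nn.Conv2d(vision_dim, llm_dim, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(llm_dim, llm_dim, kernel_size=1),
        )
        self.downsample = nn.Sequential(       # stride-2 depthwise conv halves each spatial dim
            nn.Conv2d(llm_dim, llm_dim, kernel_size=3, stride=2, padding=1, groups=llm_dim),
            nn.Conv2d(llm_dim, llm_dim, kernel_size=1),
        )

    def forward(self, vis_tokens):             # vis_tokens: (B, 576, vision_dim)
        b, n, c = vis_tokens.shape
        x = vis_tokens.transpose(1, 2).reshape(b, c, self.grid, self.grid)
        x = self.downsample(self.project(x))   # (B, llm_dim, 12, 12)
        return x.flatten(2).transpose(1, 2)    # (B, 144, llm_dim), fed to the language model

dummy = torch.randn(1, 576, 1024)              # stand-in for CLIP ViT-L/14 patch features
print(DownsampleProjectorSketch()(dummy).shape)  # torch.Size([1, 144, 2048])

Reducing 576 visual tokens to 144 shortens the sequence the LLM must process for every image, which is where most of the projector's latency savings come from.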

Performance and Evaluation

MobileVLM exhibits competitive results on various VLM benchmarks despite its reduced computational footprint. Notably, it achieves inference speeds of 21.5 tokens/s on a Qualcomm Snapdragon 888 CPU and 65.3 tokens/s on an NVIDIA Jetson Orin GPU. The model performs on par with several much larger models on tasks such as general question answering and visual reasoning.

In the latency analysis, MobileVLM demonstrates strong performance on both mobile and IoT devices compared with peers built on backbones such as OpenLLaMA and TinyLLaMA, supporting its suitability for real-world applications. A simple way to reproduce this kind of throughput number is sketched below.
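
The sketch below shows the basic tokens-per-second calculation behind such latency figures; generate_fn is a hypothetical callable wrapping whatever on-device runtime is used (the paper's own measurements build on llama.cpp), so it would need to be adapted to a specific deployment.

import time

def measure_tokens_per_second(generate_fn, prompt, max_new_tokens=256):
    # generate_fn is a hypothetical wrapper around an on-device decoder that
    # returns the list of generated token ids for the given prompt.
    start = time.perf_counter()
    token_ids = generate_fn(prompt, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    return len(token_ids) / elapsed

# Example (hypothetical): average over a few runs to smooth out warm-up effects.
# speeds = [measure_tokens_per_second(my_generate, "Describe the image.") for _ in range(5)]
# print(sum(speeds) / len(speeds), "tokens/sec")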

Future Directions and Implications

The design decisions in MobileVLM indicate a shift towards deploying sophisticated AI models in resource-limited scenarios, expanding the applicability of AI in mobile and edge computing environments. This work prompts further exploration into model compression and efficiency techniques, potentially influencing future research in mobile AI deployment.

Researchers might further investigate optimizing neural architecture search for LLMs, exploring more efficient training paradigms, and expanding the use of high-quality datasets for better alignment of multimodal tasks.

Conclusion

MobileVLM lowers the barrier to deploying VLMs on mobile and low-power devices. By maintaining a balance between performance and efficiency, the work contributes to extending the reach of intelligent systems into everyday mobile applications and is poised to advance vision-language capabilities in diverse, real-world scenarios.

Authors (11)
  1. Xiangxiang Chu
  2. Limeng Qiao
  3. Xinyang Lin
  4. Shuang Xu
  5. Yang Yang
  6. Yiming Hu
  7. Fei Wei
  8. Xinyu Zhang
  9. Bo Zhang
  10. Xiaolin Wei
  11. Chunhua Shen