Imp: Highly Capable Large Multimodal Models for Mobile Devices (2405.12107v2)

Published 20 May 2024 in cs.CV and cs.CL

Abstract: By harnessing the capabilities of LLMs, recent large multimodal models (LMMs) have shown remarkable versatility in open-world multimodal understanding. Nevertheless, they are usually parameter-heavy and computation-intensive, thus hindering their applicability in resource-constrained scenarios. To this end, several lightweight LMMs have been proposed successively to maximize the capabilities under constrained scale (e.g., 3B). Despite the encouraging results achieved by these methods, most of them only focus on one or two aspects of the design space, and the key design choices that influence model capability have not yet been thoroughly investigated. In this paper, we conduct a systematic study for lightweight LMMs from the aspects of model architecture, training strategy, and training data. Based on our findings, we obtain Imp -- a family of highly capable LMMs at the 2B-4B scales. Notably, our Imp-3B model steadily outperforms all the existing lightweight LMMs of similar size, and even surpasses the state-of-the-art LMMs at the 13B scale. With low-bit quantization and resolution reduction techniques, our Imp model can be deployed on a Qualcomm Snapdragon 8Gen3 mobile chip with a high inference speed of about 13 tokens/s.

Journey Towards Lightweight Large Multimodal Models (LMMs)

Introduction to Lightweight LMMs

Model size and compute requirements are often the main obstacles to building and deploying complex AI systems. LLMs such as GPT-4 and Gemini-1.5 have pushed the boundaries of AI capability, yet they are computationally intensive. Researchers are increasingly turning to large multimodal models (LMMs), which combine multiple data types (such as text and images) to handle more complex tasks. However, these LMMs typically carry heavy computational demands of their own.

The paper introduces a new family of LMMs called Imp, designed to be both effective and lightweight. These models aim to strike a balance between maintaining high performance and reducing computational overhead, making them feasible for deployment on everyday devices like mobile phones.

Key Design Choices

The key to building these lightweight models lies in careful design choices across model architecture, training strategy, and training data. Here's a breakdown of how these choices come together.

Model Architecture

Choice of LLM:

  • The Imp models are built on smaller but capable LLMs, such as Phi-2 (2.7B parameters) and MobileLLaMA (2.7B parameters).
  • Phi-2 outperformed MobileLLaMA significantly, primarily because of its high-quality training dataset.

Choice of Visual Encoder:

  • Most LMMs build on visual encoders such as CLIP. The authors experimented with several alternatives, and SigLIP performed best, thanks to its extensive training on image-text pairs.
  • With the SigLIP visual encoder, the Imp LMMs achieve superior performance at a smaller computational scale than larger counterparts; the resulting architecture is sketched below.
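
To make the architecture concrete, here is a minimal sketch of a LLaVA-style lightweight LMM: a SigLIP visual encoder, a small MLP projector, and a compact LLM (Phi-2 here). The checkpoint names, projector shape, and dimensions are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a LLaVA-style lightweight LMM: SigLIP encoder + MLP
# projector + compact LLM. Checkpoint names and shapes are assumptions.
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, SiglipVisionModel

class TinyLMM(nn.Module):
    def __init__(self,
                 llm_name="microsoft/phi-2",
                 vit_name="google/siglip-so400m-patch14-384"):
        super().__init__()
        self.vision = SiglipVisionModel.from_pretrained(vit_name)
        self.llm = AutoModelForCausalLM.from_pretrained(llm_name)
        v_dim = self.vision.config.hidden_size   # visual token width
        t_dim = self.llm.config.hidden_size      # LLM embedding width
        # Two-layer MLP that maps visual tokens into the LLM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(v_dim, t_dim), nn.GELU(), nn.Linear(t_dim, t_dim))

    def forward(self, pixel_values, input_ids):
        vis = self.vision(pixel_values=pixel_values).last_hidden_state  # (B, N_img, v_dim)
        img_embeds = self.projector(vis)                                # (B, N_img, t_dim)
        txt_embeds = self.llm.get_input_embeddings()(input_ids)         # (B, N_txt, t_dim)
        # Prepend projected image tokens to the text embeddings and decode.
        inputs_embeds = torch.cat([img_embeds, txt_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```

In a LLaVA-style recipe, the projector is typically trained first with both backbones frozen, and the LLM is then finetuned (here via LoRA, as discussed next) in a second stage.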

Training Strategy

Finetuning Mechanism:

  • The researchers found that LoRA finetuning outperformed traditional full-parameter finetuning. Specifically, a LoRA rank of 256 offered the best balance between model capability and resource efficiency.
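
For reference, a minimal sketch of rank-256 LoRA finetuning with the peft library is shown below; the target module names, alpha, and dropout values are assumptions that depend on the chosen backbone, not the paper's exact recipe.

```python
# Hedged sketch: rank-256 LoRA finetuning with peft.
# Target modules and hyperparameters are illustrative assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained("microsoft/phi-2")
lora_cfg = LoraConfig(
    r=256,                          # rank reported to balance capability and efficiency
    lora_alpha=512,                 # scaling factor (illustrative choice)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "dense"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
llm = get_peft_model(llm, lora_cfg)
llm.print_trainable_parameters()    # LoRA trains only a small fraction of the weights
```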

Training Epochs:

  • Training for just one epoch often left the model under-optimized. Instead, training for two epochs provided a notable boost in performance without a significant increase in computational requirements.
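
As a small configuration sketch, the two-epoch setting could be expressed with the Hugging Face Trainer API as follows; the batch size, learning rate, and other values are illustrative assumptions.

```python
# Hedged sketch: two epochs of instruction tuning via TrainingArguments.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="imp-sft",
    num_train_epochs=2,             # two epochs instead of one
    per_device_train_batch_size=8,  # assumed batch size
    learning_rate=2e-5,             # assumed learning rate
    bf16=True,
)
```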

Enhanced Training Data

OCR and Chart Understanding:

  • Adding data from datasets such as DVQA and ChartQA, which target OCR (Optical Character Recognition) and chart understanding, markedly improved the model's ability on tasks that require reading text within images; these sources also appear in the data-mixture sketch below.

GPT-4V Annotated Data:

  • Incorporating GPT-4V-annotated datasets improved the model's instruction-following and conversational abilities, significantly boosting its overall performance.
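
To make the data recipe concrete, here is an illustrative sketch of how such a mixed instruction-tuning set could be assembled. The file paths and sampling weights are hypothetical, not the released mixture; the dataset names mirror the sources discussed above.

```python
# Illustrative sketch of a mixed visual-instruction-tuning set.
# Paths and weights are hypothetical placeholders, not the paper's mixture.
import random

mixture = {
    "llava_instruct": {"path": "data/llava_665k.json",  "weight": 1.0},
    "dvqa":           {"path": "data/dvqa.json",        "weight": 0.3},  # chart/OCR
    "chartqa":        {"path": "data/chartqa.json",     "weight": 0.3},  # chart reasoning
    "sharegpt4v":     {"path": "data/sharegpt4v.json",  "weight": 0.5},  # GPT-4V captions
    "allava":         {"path": "data/allava.json",      "weight": 0.5},  # GPT-4V instructions
}

def sample_source(rng: random.Random) -> str:
    """Pick the dataset for the next training example, proportional to its weight."""
    names, weights = zip(*[(k, v["weight"]) for k, v in mixture.items()])
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_source(rng) for _ in range(5)])
```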

Results and Comparisons

The paper showcases various Imp models (Imp-2B, Imp-3B, and Imp-4B). Let’s delve into some notable results:

  • The Imp-3B model outperforms many existing 7B and even 13B parameter models across several benchmarks.
  • Imp-2B stands out in multilingual understanding, showing robust performance on Chinese text despite being trained primarily on English data.
  • The Imp-4B model combines all of the above improvements and delivers strong results across a wide range of benchmarks, demonstrating the viability of small yet capable LMMs.

Deployment on Mobile Devices

One of the major advantages of these lightweight Imp models is that they can be deployed on mobile devices. Using techniques such as low-bit quantization and resolution reduction, the researchers optimized Imp-3B to run efficiently on a Snapdragon 8Gen3-powered device; a simplified quantization sketch follows the points below.

  • Performance and Speed:
    • On mobile devices, the model reached high inference speeds (about 13 tokens/s, per the abstract), making real-time applications feasible.
    • Reducing the input image resolution did not significantly degrade accuracy, giving a good balance between latency and model capability.
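
Below is a minimal sketch of group-wise symmetric int4 weight quantization, the kind of low-bit compression this deployment relies on. The group size, the simple round-to-nearest scheme, and the absence of nibble packing are simplifying assumptions rather than the paper's actual mobile inference stack.

```python
# Hedged sketch: group-wise symmetric int4 weight quantization.
# Group size and scheme are illustrative assumptions.
import torch

def quantize_int4(weight: torch.Tensor, group_size: int = 64):
    """Quantize a (out, in) weight matrix to int4 values with one scale per group."""
    out_f, in_f = weight.shape
    w = weight.reshape(out_f, in_f // group_size, group_size)
    scale = w.abs().amax(dim=-1, keepdim=True) / 7.0         # int4 range: [-8, 7]
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    return q, scale

def dequantize_int4(q: torch.Tensor, scale: torch.Tensor, shape):
    """Recover an approximate float weight matrix from int4 values and scales."""
    return (q.float() * scale).reshape(shape)

w = torch.randn(256, 256)
q, s = quantize_int4(w)
w_hat = dequantize_int4(q, s, w.shape)
print("mean abs error:", (w - w_hat).abs().mean().item())
```

In a real deployment the int4 values would additionally be packed two per byte and executed by an optimized mobile inference runtime.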

Practical Implications and Future Work

The Imp models lay down a promising path for deploying high-performance AI in resource-constrained environments such as mobile devices and edge computing. This makes advanced AI accessible to a broader range of applications, including personal assistants, real-time translation services, and more.

Looking Forward

Future improvements could involve:

  1. Introducing more diverse and high-quality datasets to further refine model capabilities.
  2. Implementing advanced training strategies like knowledge distillation.
  3. Exploring more efficient model compression techniques.
  4. Extending support for additional input modalities such as audio and 3D data.

The researchers are also focusing on practical deployments and have developed ImpChat, a multi-platform assistant leveraging these lightweight models. This ensures that you can have a robust AI assistant across various devices without the need for extensive resources.

As we move forward, continued efforts to refine these lightweight yet powerful models could lead to a broader, more inclusive application of AI technologies.

Authors (8)
  1. Zhenwei Shao
  2. Zhou Yu
  3. Jun Yu
  4. Xuecheng Ouyang
  5. Lihao Zheng
  6. Zhenbiao Gai
  7. Mingyang Wang
  8. Jiajun Ding