Mipha: A Comprehensive Overhaul of Multimodal Assistant with Small Language Models (2403.06199v4)

Published 10 Mar 2024 in cs.CV and cs.CL

Abstract: Multimodal Large Language Models (MLLMs) have showcased impressive skills in tasks related to visual understanding and reasoning. Yet, their widespread application faces obstacles due to the high computational demands during both the training and inference phases, restricting their use to a limited audience within the research and user communities. In this paper, we investigate the design aspects of Multimodal Small Language Models (MSLMs) and propose an efficient multimodal assistant named Mipha, which is designed to create synergy among various aspects: visual representation, language models, and optimization strategies. We show that without increasing the volume of training data, our Mipha-3B outperforms the state-of-the-art large MLLMs, especially LLaVA-1.5-13B, on multiple benchmarks. Through detailed discussion, we provide insights and guidelines for developing strong MSLMs that rival the capabilities of MLLMs. Our code is available at https://github.com/zhuyiche/llava-phi.
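As context for the design aspects the abstract names (visual representation, a language-model backbone, and an optimization strategy), the sketch below outlines the LLaVA-style layout that small multimodal assistants of this kind generally follow: a vision encoder produces patch features, a lightweight projector maps them into the language model's embedding space, and a compact decoder consumes the concatenated visual and text tokens. The class name ToyMSLM, the layer sizes, and the use of a generic Transformer block as a stand-in for the language model are illustrative assumptions, not the authors' implementation (which is in the linked repository).

```python
# Minimal sketch of a LLaVA-style multimodal small language model (MSLM).
# All dimensions are toy values; real vision features and LM hidden sizes
# are on the order of ~1k-3k dimensions, and the backbone is a pretrained
# small LM rather than the randomly initialized stand-in used here.
import torch
import torch.nn as nn

class ToyMSLM(nn.Module):
    def __init__(self, vis_dim=64, lm_dim=128, vocab_size=1000):
        super().__init__()
        # Stand-in for a pretrained ViT-style vision encoder (e.g. CLIP/SigLIP features).
        self.vision_encoder = nn.Linear(vis_dim, vis_dim)
        # Two-layer MLP projector bridging vision features into the LM embedding space.
        self.projector = nn.Sequential(
            nn.Linear(vis_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim)
        )
        self.text_embed = nn.Embedding(vocab_size, lm_dim)
        # Stand-in for a small decoder-only language model backbone.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=lm_dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(lm_dim, vocab_size)

    def forward(self, image_feats, text_ids):
        # image_feats: (batch, num_patches, vis_dim); text_ids: (batch, seq_len)
        vis_tokens = self.projector(self.vision_encoder(image_feats))
        txt_tokens = self.text_embed(text_ids)
        # Concatenate visual and text tokens, then predict next-token logits.
        hidden = self.backbone(torch.cat([vis_tokens, txt_tokens], dim=1))
        return self.lm_head(hidden)

model = ToyMSLM()
logits = model(torch.randn(2, 16, 64), torch.randint(0, 1000, (2, 8)))
print(logits.shape)  # torch.Size([2, 24, 1000])
```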

Authors (10)
  1. Minjie Zhu (14 papers)
  2. Yichen Zhu (51 papers)
  3. Xin Liu (820 papers)
  4. Ning Liu (199 papers)
  5. Zhiyuan Xu (47 papers)
  6. Chaomin Shen (25 papers)
  7. Yaxin Peng (22 papers)
  8. Zhicai Ou (5 papers)
  9. Feifei Feng (23 papers)
  10. Jian Tang (326 papers)
Citations (11)