
LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model (2401.02330v4)

Published 4 Jan 2024 in cs.CV and cs.CL

Abstract: In this paper, we introduce LLaVA-$\phi$ (LLaVA-Phi), an efficient multi-modal assistant that harnesses the power of the recently advanced small LLM, Phi-2, to facilitate multi-modal dialogues. LLaVA-Phi marks a notable advancement in the realm of compact multi-modal models. It demonstrates that even smaller LLMs, with as few as 2.7B parameters, can effectively engage in intricate dialogues that integrate both textual and visual elements, provided they are trained with high-quality corpora. Our model delivers commendable performance on publicly available benchmarks that encompass visual comprehension, reasoning, and knowledge-based perception. Beyond its remarkable performance in multi-modal dialogue tasks, our model opens new avenues for applications in time-sensitive environments and systems that require real-time interaction, such as embodied agents. It highlights the potential of smaller LLMs to achieve sophisticated levels of understanding and interaction, while maintaining greater resource efficiency. The project is available at https://github.com/zhuyiche/llava-phi.

References (36)
  1. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
  2. OpenFlamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023.
  3. Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
  4. Shikra: Unleashing multimodal LLM's referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023.
  5. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
  6. MobileVLM: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886, 2023.
  7. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  8. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.
  9. QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314, 2023.
  10. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
  11. LLaMA-Adapter V2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023.
  12. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.
  13. Textbooks are all you need. arXiv preprint arXiv:2306.11644, 2023.
  14. VizWiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617, 2018.
  15. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  16. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019.
  17. OBELISC: An open web-scale filtered dataset of interleaved image-text documents. arXiv preprint arXiv:2306.16527, 2023.
  18. SEED-Bench: Benchmarking multimodal LLMs with generative comprehension. arXiv preprint arXiv:2307.16125, 2023a.
  19. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023b.
  20. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023c.
  21. Textbooks are all you need II: phi-1.5 technical report. arXiv preprint arXiv:2309.05463, 2023d.
  22. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023e.
  23. TinyGSM: Achieving >80% on GSM8K with small language models. arXiv preprint arXiv:2312.09241, 2023a.
  24. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023b.
  25. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023c.
  26. LLaVA-Plus: Learning to use tools for creating multimodal agents. arXiv preprint arXiv:2311.05437, 2023d.
  27. MMBench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023e.
  28. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022.
  29. CodeGen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474, 2022.
  30. OpenAI. GPT-4 technical report. arXiv preprint, 2023.
  31. Towards VQA models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019.
  32. Aligning large multimodal models with factually augmented rlhf. arXiv preprint arXiv:2309.14525, 2023.
  33. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  34. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  35. MM-Vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
  36. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
Authors (6)
  1. Yichen Zhu (51 papers)
  2. Minjie Zhu (14 papers)
  3. Ning Liu (199 papers)
  4. Zhicai Ou (5 papers)
  5. Xiaofeng Mou (7 papers)
  6. Jian Tang (327 papers)
Citations (65)

Summary

Introduction to Multi-Modal Assistants

Multi-modal assistants are increasingly important tools for human-machine interaction, blending textual and visual processing to offer richer user experiences. Compact models in this space, such as LLaVA-Phi, leverage small LLMs to achieve strong performance. Traditionally, large-scale models with billions of parameters have dominated this landscape; however, smaller models such as Phi-2, with only 2.7 billion parameters, are now showing that they can hold their own, particularly when trained on high-quality data. This matters for applications where speed and resource efficiency are critical.

Advancements and Comparisons

Prior leading models in multi-modal interaction, from Gemini to small-model efforts such as TinyGSM, have set the stage by showcasing the benefits of integrating compact language models with visual encoders. Nonetheless, many of these pioneering models are not open source, which limits academic inquiry and application development. LLaVA-Phi pushes the envelope by building on the open-source, efficient Phi-2 small LLM while delivering impressive results across various benchmarks, a feat previously presumed to require much larger models.

Technical Aspects and Training

Delving into the technicalities, LLaVA-Phi employs a two-stage training pipeline: the small LLM is first fine-tuned on high-quality data, followed by an instruction-tuning phase. Paired with an efficient projector that maps visual features into the LLM's embedding space, this multi-phase approach equips the model with strong visual comprehension and interaction capabilities despite its relatively small size. Its performance is a testament to advances in pre-training strategy and to the careful curation of datasets such as LLaVA-Instruct-150K, which make training such compact, specialized assistants feasible.
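To make the pipeline concrete, below is a minimal PyTorch-style sketch of the kind of architecture and staged schedule described above: a frozen CLIP-like vision encoder, a small MLP projector that maps image patches into the LLM's embedding space, and a compact language backbone such as Phi-2. All class, function, and dimension names here are illustrative assumptions, not code from the official repository.

```python
# Illustrative sketch of a LLaVA-style compact multi-modal model (not official code).
import torch
import torch.nn as nn


class TinyMultiModalAssistant(nn.Module):
    """Vision encoder -> MLP projector -> small causal LLM (e.g. Phi-2, ~2.7B params)."""

    def __init__(self, vision_encoder, language_model, vision_dim=1024, llm_dim=2560):
        super().__init__()
        self.vision_encoder = vision_encoder        # e.g. a CLIP ViT, kept frozen
        self.projector = nn.Sequential(             # maps patch features into LLM space
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.language_model = language_model        # HF-style causal LM that accepts inputs_embeds

    def forward(self, images, text_embeds):
        with torch.no_grad():                       # the vision tower is not updated
            patch_feats = self.vision_encoder(images)         # (B, N, vision_dim)
        visual_tokens = self.projector(patch_feats)           # (B, N, llm_dim)
        inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs_embeds)


def set_trainable(model: TinyMultiModalAssistant, stage: int) -> None:
    """Stage 1: align only the projector; stage 2: instruction-tune projector + LLM."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.projector.parameters():
        p.requires_grad = True
    if stage == 2:
        for p in model.language_model.parameters():
            p.requires_grad = True
```

The staged split mirrors the LLaVA-style recipe the summary refers to: a cheap alignment phase in which only the projector is updated, followed by instruction tuning on conversational data such as LLaVA-Instruct-150K.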

Performance and Implications

LLaVA-Phi's capability is exemplified by its performance on complex tasks such as interpreting memes and generating code from visual inputs. Against considerably larger models it is not just comparable but in some cases superior. On ScienceQA in particular it achieves top performance, likely owing to Phi-2's strength in code generation and mathematical content.
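For readers who want to see how such a model would be queried on a task like the ones above, the following is a hedged inference sketch built on the hypothetical TinyMultiModalAssistant from the previous section; the tokenizer, image preprocessor, and prompt handling are assumptions and may differ from the released code.

```python
# Greedy-decoding sketch for image + question -> answer (illustrative only).
import torch
from PIL import Image


@torch.no_grad()
def answer(model, tokenizer, preprocess, image_path, question, max_new_tokens=64):
    """`model` is a TinyMultiModalAssistant; `tokenizer` and `preprocess` are assumed
    to match the underlying Phi-2 and CLIP components respectively."""
    image = preprocess(Image.open(image_path)).unsqueeze(0)       # (1, 3, H, W)
    ids = tokenizer(question, return_tensors="pt").input_ids      # (1, T)
    embed = model.language_model.get_input_embeddings()

    generated = ids
    for _ in range(max_new_tokens):
        logits = model(image, embed(generated)).logits
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy choice
        generated = torch.cat([generated, next_id], dim=1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(generated[0, ids.shape[1]:], skip_special_tokens=True)
```

In practice one would reuse the backbone's key-value cache rather than re-running the full sequence at every step; the loop above trades efficiency for clarity.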

LLaVA-Phi's success suggests a substantial shift in how multi-modal models are developed. It shows that small-scale models can adapt and excel at tasks once thought exclusive to far larger models, a step toward deploying multi-modal assistants on a wide range of devices, including those with limited computational resources, and toward making such tools accessible and prevalent across real-world applications.
