
LLaVA-Phi: Efficient Multi-Modal Assistant with Small Language Model (2401.02330v4)

Published 4 Jan 2024 in cs.CV and cs.CL

Abstract: In this paper, we introduce LLaVA-$\phi$ (LLaVA-Phi), an efficient multi-modal assistant that harnesses the power of the recently advanced small LLM, Phi-2, to facilitate multi-modal dialogues. LLaVA-Phi marks a notable advancement in the realm of compact multi-modal models. It demonstrates that even smaller LLMs, with as few as 2.7B parameters, can effectively engage in intricate dialogues that integrate both textual and visual elements, provided they are trained with high-quality corpora. Our model delivers commendable performance on publicly available benchmarks that encompass visual comprehension, reasoning, and knowledge-based perception. Beyond its remarkable performance in multi-modal dialogue tasks, our model opens new avenues for applications in time-sensitive environments and systems that require real-time interaction, such as embodied agents. It highlights the potential of smaller LLMs to achieve sophisticated levels of understanding and interaction, while maintaining greater resource efficiency. The project is available at https://github.com/zhuyiche/llava-phi.

References (36)
  1. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
  2. OpenFlamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023.
  3. Qwen-VL: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
  4. Shikra: Unleashing multimodal LLM's referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023.
  5. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
  6. MobileVLM: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886, 2023.
  7. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  8. InstructBLIP: Towards general-purpose vision-language models with instruction tuning. arXiv preprint arXiv:2305.06500, 2023.
  9. QLoRA: Efficient finetuning of quantized LLMs. arXiv preprint arXiv:2305.14314, 2023.
  10. MME: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
  11. LLaMA-Adapter V2: Parameter-efficient visual instruction model. arXiv preprint arXiv:2304.15010, 2023.
  12. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.
  13. Textbooks are all you need. arXiv preprint arXiv:2306.11644, 2023.
  14. VizWiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617, 2018.
  15. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  16. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019.
  17. OBELISC: An open web-scale filtered dataset of interleaved image-text documents. arXiv preprint arXiv:2306.16527, 2023.
  18. SEED-Bench: Benchmarking multimodal LLMs with generative comprehension. arXiv preprint arXiv:2307.16125, 2023a.
  19. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023b.
  20. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023c.
  21. Textbooks are all you need II: phi-1.5 technical report. arXiv preprint arXiv:2309.05463, 2023d.
  22. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023e.
  23. TinyGSM: Achieving >80% on GSM8K with small language models. arXiv preprint arXiv:2312.09241, 2023a.
  24. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023b.
  25. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023c.
  26. LLaVA-Plus: Learning to use tools for creating multimodal agents. arXiv preprint arXiv:2311.05437, 2023d.
  27. MMBench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023e.
  28. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022.
  29. CodeGen: An open large language model for code with multi-turn program synthesis. arXiv preprint arXiv:2203.13474, 2022.
  30. OpenAI. GPT-4 technical report. arXiv preprint, 2023.
  31. Towards VQA models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019.
  32. Aligning large multimodal models with factually augmented rlhf. arXiv preprint arXiv:2309.14525, 2023.
  33. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  34. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  35. MM-Vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023.
  36. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
Authors (6)
  1. Yichen Zhu (51 papers)
  2. Minjie Zhu (14 papers)
  3. Ning Liu (199 papers)
  4. Zhicai Ou (5 papers)
  5. Xiaofeng Mou (7 papers)
  6. Jian Tang (327 papers)
Citations (65)

Summary

Introduction to Multi-Modal Assistants

Multi-modal assistants are increasingly important tools for human-machine interaction, blending textual and visual processing to offer richer user experiences. Compact models in this space, such as LLaVA-Phi, leverage small LLMs to achieve strong performance. Traditionally, large-scale models with billions of parameters have dominated this landscape; however, smaller models such as Phi-2, with only 2.7 billion parameters, are now showing that they can hold their own, particularly when trained on high-quality data. This matters for applications where speed and resource efficiency are critical.

Advancements and Comparisons

Prior leading models in multi-modal interaction, from Gemini to small-model efforts such as TinyGSM, have set the stage by showcasing the benefits of integrating compact language models with visual encoders. Nonetheless, many of these pioneering models are not open source, which limits academic inquiry and application development. LLaVA-Phi pushes the envelope by building on the open-source, efficient Phi-2 small LLM while delivering impressive results across various benchmarks, a feat previously presumed to require much larger models.

Technical Aspects and Training

Delving into the technicalities, LLaVA-Phi employs a two-stage training pipeline: the small LLM is first fine-tuned on high-quality data, followed by an instruction-tuning phase. Paired with an efficient projector that maps visual features into the LLM's embedding space, this multi-phase approach equips the model with strong visual comprehension and interaction capabilities despite its relatively small size. Its performance is a testament to advances in pre-training strategy and to the careful curation of datasets such as LLaVA-Instruct-150K, which make training such compact, specialized assistants feasible.
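To make the pipeline concrete, below is a minimal PyTorch-style sketch of the kind of architecture and staged schedule described above: a frozen CLIP-like vision encoder, a small MLP projector that maps image patches into the LLM's embedding space, and a compact language backbone such as Phi-2. All class, function, and dimension names here are illustrative assumptions, not code from the official repository.

```python
# Illustrative sketch of a LLaVA-style compact multi-modal model (not official code).
import torch
import torch.nn as nn


class TinyMultiModalAssistant(nn.Module):
    """Vision encoder -> MLP projector -> small causal LLM (e.g. Phi-2, ~2.7B params)."""

    def __init__(self, vision_encoder, language_model, vision_dim=1024, llm_dim=2560):
        super().__init__()
        self.vision_encoder = vision_encoder        # e.g. a CLIP ViT, kept frozen
        self.projector = nn.Sequential(             # maps patch features into LLM space
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.language_model = language_model        # HF-style causal LM that accepts inputs_embeds

    def forward(self, images, text_embeds):
        with torch.no_grad():                       # the vision tower is not updated
            patch_feats = self.vision_encoder(images)         # (B, N, vision_dim)
        visual_tokens = self.projector(patch_feats)           # (B, N, llm_dim)
        inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.language_model(inputs_embeds=inputs_embeds)


def set_trainable(model: TinyMultiModalAssistant, stage: int) -> None:
    """Stage 1: align only the projector; stage 2: instruction-tune projector + LLM."""
    for p in model.parameters():
        p.requires_grad = False
    for p in model.projector.parameters():
        p.requires_grad = True
    if stage == 2:
        for p in model.language_model.parameters():
            p.requires_grad = True
```

The staged split mirrors the LLaVA-style recipe the summary refers to: a cheap alignment phase in which only the projector is updated, followed by instruction tuning on conversational data such as LLaVA-Instruct-150K.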

Performance and Implications

LLaVA-Phi's capability is exemplified by its performance on complex tasks such as interpreting memes and generating code from visual inputs. Against considerably larger models it is not just comparable but in some cases superior. On ScienceQA in particular it achieves top performance, likely owing to Phi-2's strength in code generation and mathematical content.
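For readers who want to see how such a model would be queried on a task like the ones above, the following is a hedged inference sketch built on the hypothetical TinyMultiModalAssistant from the previous section; the tokenizer, image preprocessor, and prompt handling are assumptions and may differ from the released code.

```python
# Greedy-decoding sketch for image + question -> answer (illustrative only).
import torch
from PIL import Image


@torch.no_grad()
def answer(model, tokenizer, preprocess, image_path, question, max_new_tokens=64):
    """`model` is a TinyMultiModalAssistant; `tokenizer` and `preprocess` are assumed
    to match the underlying Phi-2 and CLIP components respectively."""
    image = preprocess(Image.open(image_path)).unsqueeze(0)       # (1, 3, H, W)
    ids = tokenizer(question, return_tensors="pt").input_ids      # (1, T)
    embed = model.language_model.get_input_embeddings()

    generated = ids
    for _ in range(max_new_tokens):
        logits = model(image, embed(generated)).logits
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # greedy choice
        generated = torch.cat([generated, next_id], dim=1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(generated[0, ids.shape[1]:], skip_special_tokens=True)
```

In practice one would reuse the backbone's key-value cache rather than re-running the full sequence at every step; the loop above trades efficiency for clarity.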

LLaVA-Phi's success suggests a substantial shift in how multi-modal models are developed. It shows that small-scale models can adapt and excel at tasks once thought exclusive to far larger models, a step toward deploying multi-modal assistants on a wide range of devices, including those with limited computational resources, and toward making such tools accessible and prevalent across real-world applications.
