LLaVaOLMoBitnet1B: Ternary LLM goes Multimodal! (2408.13402v2)

Published 23 Aug 2024 in cs.LG

Abstract: Multimodal LLMs (MM-LLMs) have seen significant advancements in the last year, demonstrating impressive performance across tasks. However, to truly democratize AI, models must exhibit strong capabilities and be able to run efficiently on small compute footprints accessible by most. As part of this quest, we introduce LLaVaOLMoBitnet1B - the first Ternary Multimodal LLM capable of accepting Image(s)+Text inputs to produce coherent textual responses. The model is fully open-sourced along with training scripts to encourage further research in this space. This accompanying technical report highlights the training process, evaluation details, challenges associated with ternary models and future opportunities. Link to the model: https://huggingface.co/IntelLabs/LlavaOLMoBitnet1B

Summary

  • The paper introduces a multimodal LLM that applies ternary quantization to achieve efficient performance in low-resource environments.
  • It presents a three-part architecture using a CLIP image encoder, MLP connector, and a 1.1B parameter ternary LLM core.
  • The study details a two-phase training process and outlines future directions for accessible, efficient multimodal AI.

LLaVaOLMoBitnet1B: Ternary LLM goes Multimodal!

Introduction

"LLaVaOLMoBitnet1B: Ternary LLM goes Multimodal!" by Jainaveen Sundaram and Ravishankar Iyer presents a pioneering effort in extending ternary quantization to multimodal LLMs (MM-LLMs). With the significant advances in MM-LLMs like LLaVa, there remains a pressing need for these models to operate efficiently within constrained compute environments. This work aims to bridge this gap by leveraging ternary quantization, pushing the boundaries of what can be achieved with limited computational resources.

Model Architecture and Training

The architecture of LLaVaOLMoBitnet1B comprises three fundamental components: a CLIP ViT-L/14 image encoder, an MLP connector, and a ternary LLM core. The image encoder splits input images into patches and encodes them into visual features, which the MLP connector projects into a higher-dimensional space to align with the LLM's embedding space. The LLM itself is a ternary variant with 1.1 billion parameters built on BitLinear158 layers, giving a compact yet efficient design.
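The composition below is a minimal PyTorch-style sketch of this three-part design. The module names, hidden dimensions, and the ternary LLM interface are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class TernaryMultimodalSketch(nn.Module):
    """Illustrative three-part composition: vision encoder -> MLP connector -> ternary LLM.
    Names and dimensions are assumptions for illustration only."""

    def __init__(self, vision_encoder, ternary_llm, vision_dim=1024, llm_dim=2048):
        super().__init__()
        self.vision_encoder = vision_encoder      # e.g. a CLIP ViT-L/14 returning patch features
        # MLP connector projects patch features into the LLM's embedding space
        self.connector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = ternary_llm                    # 1.1B-parameter LLM built on BitLinear-style layers

    def forward(self, images, text_embeds):
        patch_feats = self.vision_encoder(images)     # (batch, num_patches, vision_dim)
        image_tokens = self.connector(patch_feats)    # (batch, num_patches, llm_dim)
        # Prepend the projected image tokens to the text token embeddings
        inputs = torch.cat([image_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)
```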

The training approach follows the methodology of LLaVa1.5 and consists of two phases: pre-training for feature alignment, followed by end-to-end instruction fine-tuning. The phases use different datasets (a Conceptual Captions subset for pre-training and LLaVa-Instruct-150K for fine-tuning), with multi-GPU training orchestrated through the DeepSpeed library.
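A minimal sketch of how such a two-phase recipe is often wired up is shown below, reusing the attribute names from the architecture sketch above. The helper name, the choice of which modules are frozen in each phase, and the data-loader names are assumptions in the spirit of the LLaVa-1.5 recipe, not the paper's released training scripts.

```python
def configure_phase(model, phase):
    """Hypothetical helper for a LLaVa-1.5-style two-phase schedule:
    phase 1 trains only the MLP connector for feature alignment,
    phase 2 also unfreezes the LLM for instruction fine-tuning.
    The vision encoder stays frozen throughout (an assumption here)."""
    for p in model.vision_encoder.parameters():
        p.requires_grad = False                 # CLIP encoder frozen in both phases
    for p in model.connector.parameters():
        p.requires_grad = True                  # connector trained in both phases
    for p in model.llm.parameters():
        p.requires_grad = (phase == 2)          # LLM updated only during fine-tuning

# Phase 1: feature alignment on captioning data (e.g. a Conceptual Captions subset)
#   configure_phase(model, phase=1); train_one_stage(model, caption_loader)
# Phase 2: end-to-end instruction tuning on LLaVa-Instruct-150K
#   configure_phase(model, phase=2); train_one_stage(model, instruct_loader)
```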

Results

The research compares the performance of LLaVaOLMoBitnet1B with its peers, both in terms of base LLM capabilities and multimodal proficiency.

Base LLM Performance

Though the ternary base LLM, OLMoBitNet1B, scores lower on benchmarks than its counterparts, it is also pre-trained on significantly fewer tokens. Whereas OLMo 1b and BitNet b1.58 were trained on 2T and 100B tokens respectively, OLMoBitNet1B saw only 60B tokens of pre-training, placing it at a disadvantage in direct comparisons.

| Model | Open/Closed Source | Params | Pretrained Tokens | ARC-Easy | OpenBookQA | PIQA | SciQ |
|---|---|---|---|---|---|---|---|
| Gemma 2b | Closed | 2B | 6T | 80.1 | - | 77.8 | - |
| OLMo 1b | Open Source | 1B | 2T | 58.07 | 46.4 | 73.7 | 88.1 |
| BitNet b1.58 | Closed | 1.3B | 100B | 54.9 | 19.6 | 68.8 | - |
| OLMoBitNet1B | Open Source | 1B | 60B | 49.93 | 30.4 | 67.25 | 74.3 |

Multimodal Performance

LLaVaOLMoBitnet1B's multimodal performance is compared against similarly sized models. As the smallest model in the comparison, it trails its peers on benchmarks such as POPE, VQAv2, and SQA-I, which is consistent with its compact size and limited pre-training token exposure. Nevertheless, the model demonstrates the feasibility of ternary quantization in multimodal settings, establishing a foundation for future improvements.

| Model | Params | Training Method | POPE | VQAv2 | SQA-I |
|---|---|---|---|---|---|
| TinyLLaVa 3B | 3B | LLaVa recipe | 86.8 | 77.6 | 69.6 |
| MM1 3B | 3B | MM1 recipe | 87.4 | 82 | 69.4 |
| LLaVaGemma 2B | 2B | LLaVa recipe | 85 | 70.7 | 56.4 |
| LLaVaOLMoBitNet1B | 1B | LLaVa recipe | 66.92 | 68.41 | 34.2 |

Implications and Future Directions

The introduction of LLaVaOLMoBitnet1B holds significant implications for the future of efficient, multimodal AI. It emphasizes the potential of weight quantization, specifically ternary quantization, in producing robust models that operate within limited computational resources. This could pave the way for democratizing AI by making advanced models accessible to a broader audience with constrained hardware capabilities.

Future research is warranted on post-training quantization techniques that can transition existing pre-trained models into the ternary domain effectively. Additionally, addressing challenges around biases, model uncertainty, and efficient mapping of ternary operations onto hardware will be crucial. The insights gained from LLaVaOLMoBitnet1B lay the groundwork for developing more capable ternary multimodal models, enhancing both their performance and accessibility.
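As a concrete point of reference for what "ternary" means here, the sketch below shows absmean weight quantization in the style popularized by BitNet b1.58, where weights are scaled by their mean absolute value and then rounded and clipped to {-1, 0, +1}. This is an illustrative sketch under that assumption, not the paper's released code, and the function name is hypothetical.

```python
import torch

def absmean_ternary_quantize(weight, eps=1e-6):
    """BitNet b1.58-style absmean quantization (illustrative):
    scale weights by their mean absolute value, then round and
    clip to the ternary set {-1, 0, +1}."""
    scale = weight.abs().mean().clamp(min=eps)
    w_ternary = (weight / scale).round().clamp(-1, 1)
    return w_ternary, scale

w = torch.randn(4, 8)
w_q, s = absmean_ternary_quantize(w)
# A BitLinear-style layer would use w_q * s in the forward pass while
# keeping full-precision weights as the master copy for gradient updates.
```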
