TinyLLaVA: A Framework of Small-scale Large Multimodal Models (2402.14289v1)

Published 22 Feb 2024 in cs.LG and cs.CL

Abstract: We present the TinyLLaVA framework, which provides a unified perspective for designing and analyzing small-scale Large Multimodal Models (LMMs). We empirically study the effects of different vision encoders, connection modules, LLMs, training data, and training recipes. Our extensive experiments show that, with better-quality data combined with better training recipes, smaller LMMs can consistently achieve performance on par with bigger LMMs. Under our framework, we train a family of small-scale LMMs. Our best model, TinyLLaVA-3.1B, achieves better overall performance than existing 7B models such as LLaVA-1.5 and Qwen-VL. We hope our findings can serve as baselines for future research in terms of data scaling, training setups, and model selection. Our model weights and code will be made public.

An Overview of TinyLLaVA: A Framework of Small-scale Large Multimodal Models

The paper introduces TinyLLaVA, a comprehensive framework designed for the development and examination of small-scale large multimodal models (LMMs). This research focuses on evaluating the effects of various components, such as vision encoders, connection modules, LLMs, training data, and recipes, within multimodal architectures. TinyLLaVA seeks to demonstrate that smaller LMMs can perform comparably to their larger counterparts when optimized with high-quality data and appropriate training techniques.

Model Architecture and Experimental Setup

TinyLLaVA incorporates three main components: a small-scale LLM, a vision encoder, and a connector. The researchers selected three representative small-scale LLMs: TinyLlama, StableLM-2, and Phi-2, with parameter counts ranging from 1.1B to 2.7B. For vision encoders, CLIP-Large and SigLIP were chosen, paired with a two-layer MLP with GELU activation as the connector. These choices span a diverse set of implementations within the framework, which helps isolate how each component contributes to the performance of different configurations.
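
To make the data flow concrete, here is a minimal PyTorch-style sketch of the architecture described above: patch features from the vision encoder are projected by a two-layer GELU MLP connector and prepended to the text embeddings before being fed to the LLM. The module and function names (Connector, multimodal_forward, etc.) are illustrative placeholders, not the released TinyLLaVA API.

```python
import torch
import torch.nn as nn

class Connector(nn.Module):
    """Two-layer MLP with GELU that maps vision features into the LLM embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        return self.mlp(vision_feats)

def multimodal_forward(vision_encoder, connector, llm, pixel_values, text_embeds):
    """Sketch of the forward pass: encode the image, project patch tokens, run the LLM."""
    vision_feats = vision_encoder(pixel_values)      # (B, num_patches, vision_dim)
    vision_tokens = connector(vision_feats)          # (B, num_patches, llm_dim)
    # Prepend the projected visual tokens to the text token embeddings.
    inputs = torch.cat([vision_tokens, text_embeds], dim=1)
    return llm(inputs_embeds=inputs)                 # assumes an HF-style causal LM interface
```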

Extensive experiments were conducted with two primary training datasets, LLaVA-1.5 and ShareGPT4V, each offering distinct benefits and challenges. The training recipes, termed base and share, differ in their pre-training and supervised fine-tuning setups, including which parameters are trainable at each stage.
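
As a rough illustration, such recipes can be expressed as a schedule of trainable modules per training stage, as in the sketch below. The specific module lists shown here are assumptions for illustration and may not match the paper's released configurations.

```python
# Illustrative freezing schedules for the two training recipes; the exact
# per-stage module lists are assumptions, not the paper's exact configs.
RECIPES = {
    # "base": LLaVA-1.5-style recipe -- pre-train only the connector, then
    # fine-tune the connector and the LLM with the vision encoder frozen.
    "base": {
        "pretrain": ["connector"],
        "finetune": ["connector", "llm"],
    },
    # "share": ShareGPT4V-style recipe -- additionally unfreeze part of the
    # vision encoder (see Key Finding 2 below).
    "share": {
        "pretrain": ["connector", "llm"],
        "finetune": ["connector", "llm", "vision_encoder_top_blocks"],
    },
}

def trainable_modules(recipe: str, stage: str) -> list[str]:
    """Return the names of the modules whose parameters receive gradients."""
    return RECIPES[recipe][stage]
```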

Key Findings

  1. Impact of Model Architecture: The experiments revealed that larger LLMs tend to enhance overall performance. Phi-2-based models, despite being more parameter-intensive, consistently outperformed the others, especially on benchmarks requiring rich comprehension such as ScienceQA-IMG. SigLIP, when used as the vision encoder, provided substantial improvements over CLIP, attributed to its higher input resolution and larger number of visual tokens.
  2. Role of Training Data and Recipes: A significant point of analysis was the effect of the training dataset and the training recipe. Training on the ShareGPT4V dataset, which is larger and more diverse, generally led to superior performance. Partially fine-tuning the vision encoder (sketched after this list) improved outcomes for models with smaller LLMs, but introduced more hallucinations for models with larger LLMs, suggesting a nuanced trade-off between trainability and performance fidelity.
  3. Comparison Against State-of-the-art Models: TinyLLaVA-3.1B demonstrated noteworthy results on benchmark tests, outperforming existing 7B models like LLaVA-1.5 and Qwen-VL in several categories. This achievement underscores the framework's potential in designing efficient and powerful small-scale models.
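
The partial fine-tuning mentioned in the second finding can be sketched as follows for a ViT-style encoder: freeze all parameters, then re-enable gradients for only the last few transformer blocks. The attribute name blocks and the default depth are assumptions for illustration, not the paper's exact procedure.

```python
import torch.nn as nn

def partially_unfreeze_vision_encoder(vision_encoder: nn.Module,
                                      num_trainable_blocks: int = 4) -> None:
    """Freeze the whole encoder, then unfreeze only its last few transformer blocks."""
    for p in vision_encoder.parameters():
        p.requires_grad = False
    # Assumes the encoder exposes its transformer layers as `vision_encoder.blocks`.
    for block in vision_encoder.blocks[-num_trainable_blocks:]:
        for p in block.parameters():
            p.requires_grad = True
```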

Implications and Future Directions

TinyLLaVA's experimental results contribute valuable insights for the design of resource-efficient LMMs, opening avenues for wider accessibility in research environments with limited computational capacity. The findings suggest that optimization strategies focusing on data quality and selective fine-tuning can mitigate the parameter disadvantage of smaller LMMs.

Future research directions could explore the integration of dynamic training strategies that adaptively allocate resources based on model architecture and data properties. Another intriguing direction would be enhanced connector designs that further improve the interplay between vision and language modalities, possibly incorporating more sophisticated architectures or learning techniques.

In conclusion, the TinyLLaVA framework offers a robust baseline for future work on scaling down multimodal models without compromising performance, contributing meaningfully to the fields of AI and machine learning.

Authors (8)
  1. Baichuan Zhou
  2. Ying Hu
  3. Xi Weng
  4. Junlong Jia
  5. Jie Luo
  6. Xien Liu
  7. Ji Wu
  8. Lei Huang
Citations (71)