
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models (2401.15947v5)

Published 29 Jan 2024 in cs.CV

Abstract: Recent advances demonstrate that scaling Large Vision-Language Models (LVLMs) effectively improves downstream task performances. However, existing scaling methods enable all model parameters to be active for each token in the calculation, which brings massive training and inferring costs. In this work, we propose a simple yet effective training strategy MoE-Tuning for LVLMs. This strategy innovatively addresses the common issue of performance degradation in multi-modal sparsity learning, consequently constructing a sparse model with an outrageous number of parameters but a constant computational cost. Furthermore, we present the MoE-LLaVA, a MoE-based sparse LVLM architecture, which uniquely activates only the top-k experts through routers during deployment, keeping the remaining experts inactive. Extensive experiments show the significant performance of MoE-LLaVA in a variety of visual understanding and object hallucination benchmarks. Remarkably, with only approximately 3B sparsely activated parameters, MoE-LLaVA demonstrates performance comparable to the LLaVA-1.5-7B on various visual understanding datasets and even surpasses the LLaVA-1.5-13B in object hallucination benchmark. Through MoE-LLaVA, we aim to establish a baseline for sparse LVLMs and provide valuable insights for future research in developing more efficient and effective multi-modal learning systems. Code is released at https://github.com/PKU-YuanGroup/MoE-LLaVA.

Introduction

In the landscape of Large Vision-Language Models (LVLMs), expanding model parameters is a common way to augment model capabilities, but it comes with an increased computational burden during training and deployment. Dense models, where every token computation engages all model parameters, exacerbate this issue. In contrast, the Mixture of Experts (MoE) approach has proven successful at scaling model capacity under a fixed computational cost, particularly in NLP.

Methodology: MoE-LLaVA and MoE-Tuning

The paper introduces MoE-LLaVA, a framework for sparse LVLMs that leverages an MoE architecture with carefully engineered routers to selectively activate only the top-k experts for each token. This configuration keeps the computational cost constant while significantly expanding the model's parameter count. The framework consists of a vision encoder, a visual projection layer, a word embedding layer, LLM blocks, and sparse MoE blocks. The MoE-Tuning strategy employs a three-stage training process to adapt MoE to LVLMs without the performance degradation typically caused by model sparsity.
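
To make the routing mechanism concrete, here is a minimal PyTorch sketch of a sparse MoE feed-forward block with top-k routing in the spirit of this description: a linear router scores the experts per token, only the top-k experts run, and their outputs are combined with renormalized router weights. The layer sizes, expert count, k, and the soft re-weighting are illustrative assumptions, not the authors' exact implementation.

```python
# Illustrative sparse MoE block: per-token top-k expert routing.
# Sizes, expert count, and k are assumptions for demonstration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoEBlock(nn.Module):
    def __init__(self, dim: int, hidden: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)  # per-token routing logits
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
             for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.shape[-1])                # (num_tokens, dim)
        weights = F.softmax(self.router(tokens), dim=-1)   # (num_tokens, num_experts)
        top_w, top_idx = weights.topk(self.top_k, dim=-1)  # keep only top-k experts
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)    # renormalize kept weights

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (top_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue  # this expert stays inactive for the whole batch
            out[token_ids] += top_w[token_ids, slot].unsqueeze(-1) * expert(tokens[token_ids])
        return out.reshape_as(x)


# Example: 2x4 tokens of width 64 pass through 4 experts, 2 active per token.
moe = SparseMoEBlock(dim=64, hidden=256, num_experts=4, top_k=2)
print(moe(torch.randn(2, 4, 64)).shape)  # torch.Size([2, 4, 64])
```

Only the selected experts perform computation for a given token, which is why the active parameter count (and hence the per-token compute) stays roughly constant even as more experts are added.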

Experimental Results

Extensive experimentation validates the efficacy of MoE-LLaVA. Benchmarked on multiple visual understanding datasets, MoE-LLaVA, with only about 3 billion sparsely activated parameters, rivaled LLaVA models with up to 7 billion parameters. The authors show that MoE-LLaVA delivers performance comparable to dense LVLMs while requiring fewer computational resources, marking a significant step toward efficient multi-modal learning.

Contributions and Implications

The primary contributions are threefold:

  1. The MoE-Tuning methodology for adapting MoE to LVLMs, which prevents the degradation usually caused by sparsity (a parameter-freezing sketch of this staged schedule follows the list).
  2. MoE-LLaVA, a pioneering framework for sparse LVLMs that substantially grows model size without a proportional increase in computational demands.
  3. Experimental evidence that MoE-LLaVA achieves strong multi-modal understanding and markedly reduces object hallucination, surpassing 13-billion-parameter models while using only about 3 billion sparsely activated parameters.
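
For the staged MoE-Tuning schedule referenced in the first contribution, the following hedged sketch shows how such a schedule could be expressed as parameter freezing. The mapping of stages to modules (Stage 1: projector; Stage 2: projector plus LLM; Stage 3: the sparse MoE layers, with experts initialized from the dense FFN) reflects one reading of the paper, and all module names (projector, llm, moe_layers, ToyLVLM) are hypothetical placeholders rather than the released code's API.

```python
# Hedged sketch of three-stage training as parameter freezing.
# Module names and the stage-to-module mapping are assumptions, not the official code.
import torch.nn as nn


def set_trainable(module: nn.Module, flag: bool) -> None:
    """Toggle requires_grad for every parameter in the given module."""
    for p in module.parameters():
        p.requires_grad = flag


def configure_stage(model: nn.Module, stage: int) -> None:
    """Freeze everything, then unfreeze only the parts this stage trains."""
    set_trainable(model, False)
    if stage == 1:                                 # align visual tokens with the LLM
        set_trainable(model.projector, True)
    elif stage == 2:                               # multi-modal instruction tuning
        set_trainable(model.projector, True)
        set_trainable(model.llm, True)
    elif stage == 3:                               # train only the sparse MoE layers
        set_trainable(model.moe_layers, True)


# Toy stand-in for the LVLM; the real modules would be the vision encoder,
# projection layer, LLM blocks, and sparse MoE blocks described above.
class ToyLVLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(8, 8)
        self.projector = nn.Linear(8, 8)
        self.llm = nn.Linear(8, 8)
        self.moe_layers = nn.Linear(8, 8)


model = ToyLVLM()
configure_stage(model, stage=3)
print([n for n, p in model.named_parameters() if p.requires_grad])
# ['moe_layers.weight', 'moe_layers.bias']
```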

MoE-LLaVA sets a precedent for developing scalable and efficient LVLMs. The results indicate that the paper's contributions could reshape model-scaling paradigms, offering a model that navigates the trade-off between size, performance, and computational cost, which remains a critical challenge in AI research. Future research could extend these findings to a wider array of multi-modal tasks and to larger MoE-based LVLMs, provided adequate data pipelines are established.

Authors (10)
  1. Bin Lin (33 papers)
  2. Zhenyu Tang (39 papers)
  3. Yang Ye (34 papers)
  4. Peng Jin (91 papers)
  5. Junwu Zhang (13 papers)
  6. Munan Ning (19 papers)
  7. Li Yuan (141 papers)
  8. Jinfa Huang (25 papers)
  9. Yatian Pang (13 papers)
  10. Jiebo Luo (355 papers)
Citations (111)