
Scaling Vision-Language Models with Sparse Mixture of Experts (2303.07226v1)

Published 13 Mar 2023 in cs.CV and cs.CL

Abstract: The field of natural language processing (NLP) has made significant strides in recent years, particularly in the development of large-scale vision-language models (VLMs). These models aim to bridge the gap between text and visual information, enabling a more comprehensive understanding of multimedia data. However, as these models become larger and more complex, they also become more challenging to train and deploy. One approach to addressing this challenge is the use of sparsely-gated mixture-of-experts (MoE) techniques, which divide the model into smaller, specialized sub-models that can jointly solve a task. In this paper, we explore the effectiveness of MoE in scaling vision-language models, demonstrating its potential to achieve state-of-the-art performance on a range of benchmarks over dense models of equivalent computational cost. Our research offers valuable insights into stabilizing the training of MoE models, understanding the impact of MoE on model interpretability, and balancing the trade-offs between compute cost and model performance when scaling VLMs. We hope our work will inspire further research into the use of MoE for scaling large-scale vision-language models and other multimodal machine learning applications.
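
The sparsely-gated MoE technique described in the abstract replaces a dense feed-forward block with several expert sub-networks and a gate that routes each token to only its top-k experts, so per-token compute stays roughly constant as the parameter count grows. The sketch below illustrates this general routing pattern in PyTorch; the module names, expert sizes, and top-k choice are illustrative assumptions, not the architecture or hyperparameters used in the paper.

```python
# Minimal sketch of a sparsely-gated mixture-of-experts (MoE) layer with
# top-k token routing. Sizes and structure are illustrative assumptions,
# not the paper's actual VLM architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoE(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Each "expert" is a small, specialized feed-forward sub-model.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(num_experts)
        ])
        # The gate scores each token against every expert.
        self.gate = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> flatten to individual tokens
        tokens = x.reshape(-1, x.size(-1))
        logits = self.gate(tokens)                           # (tokens, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)   # keep only the top-k experts
        weights = F.softmax(weights, dim=-1)                 # renormalize over the chosen experts

        out = torch.zeros_like(tokens)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, k] == e
                if mask.any():
                    # Route only the selected tokens through this expert,
                    # weighted by their renormalized gate scores.
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape_as(x)


if __name__ == "__main__":
    layer = SparseMoE(d_model=64, d_hidden=256)
    y = layer(torch.randn(2, 10, 64))  # (batch=2, seq=10, d_model=64)
    print(y.shape)                     # torch.Size([2, 10, 64])
```

In this sketch each token activates only 2 of the 8 experts, which is what keeps the activated compute comparable to a dense layer of the same width even as total parameters scale with the number of experts.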

Authors (6)
  1. Sheng Shen (68 papers)
  2. Zhewei Yao (64 papers)
  3. Chunyuan Li (122 papers)
  4. Trevor Darrell (324 papers)
  5. Kurt Keutzer (199 papers)
  6. Yuxiong He (59 papers)
Citations (49)