
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models (2401.15947v5)

Published 29 Jan 2024 in cs.CV

Abstract: Recent advances demonstrate that scaling Large Vision-Language Models (LVLMs) effectively improves downstream task performances. However, existing scaling methods enable all model parameters to be active for each token in the calculation, which brings massive training and inferring costs. In this work, we propose a simple yet effective training strategy MoE-Tuning for LVLMs. This strategy innovatively addresses the common issue of performance degradation in multi-modal sparsity learning, consequently constructing a sparse model with an outrageous number of parameters but a constant computational cost. Furthermore, we present the MoE-LLaVA, a MoE-based sparse LVLM architecture, which uniquely activates only the top-k experts through routers during deployment, keeping the remaining experts inactive. Extensive experiments show the significant performance of MoE-LLaVA in a variety of visual understanding and object hallucination benchmarks. Remarkably, with only approximately 3B sparsely activated parameters, MoE-LLaVA demonstrates performance comparable to the LLaVA-1.5-7B on various visual understanding datasets and even surpasses the LLaVA-1.5-13B in object hallucination benchmark. Through MoE-LLaVA, we aim to establish a baseline for sparse LVLMs and provide valuable insights for future research in developing more efficient and effective multi-modal learning systems. Code is released at https://github.com/PKU-YuanGroup/MoE-LLaVA.

Introduction

In the landscape of Large Vision-Language Models (LVLMs), expanding model parameters is a common way to augment model capabilities, but it comes with an increased computational burden during training and deployment. Dense models, where every token computation engages all model parameters, exacerbate this issue. In contrast, the Mixture of Experts (MoE) approach has proven successful at scaling model capacity under a fixed computational cost, particularly in NLP.

Methodology: MoE-LLaVA and MoE-Tuning

The paper introduces MoE-LLaVA, a framework for sparse LVLMs that leverages an MoE architecture with carefully engineered routers to selectively activate only the top-k experts for each token. This configuration keeps the computational cost constant while significantly expanding the model's parameter count. The framework consists of a vision encoder, a visual projection layer, a word embedding layer, LLM blocks, and sparse MoE blocks. The MoE-Tuning strategy employs a three-stage training process to adapt MoE to LVLMs without the performance degradation typically caused by model sparsity.
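
To make the routing mechanism concrete, here is a minimal PyTorch sketch of a sparse MoE feed-forward block with top-k routing in the spirit of this description: a linear router scores the experts per token, only the top-k experts run, and their outputs are combined with renormalized router weights. The layer sizes, expert count, k, and the soft re-weighting are illustrative assumptions, not the authors' exact implementation.

```python
# Illustrative sparse MoE block: per-token top-k expert routing.
# Sizes, expert count, and k are assumptions for demonstration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoEBlock(nn.Module):
    def __init__(self, dim: int, hidden: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts)  # per-token routing logits
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
             for _ in range(num_experts)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = x.reshape(-1, x.shape[-1])                # (num_tokens, dim)
        weights = F.softmax(self.router(tokens), dim=-1)   # (num_tokens, num_experts)
        top_w, top_idx = weights.topk(self.top_k, dim=-1)  # keep only top-k experts
        top_w = top_w / top_w.sum(dim=-1, keepdim=True)    # renormalize kept weights

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            token_ids, slot = (top_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue  # this expert stays inactive for the whole batch
            out[token_ids] += top_w[token_ids, slot].unsqueeze(-1) * expert(tokens[token_ids])
        return out.reshape_as(x)


# Example: 2x4 tokens of width 64 pass through 4 experts, 2 active per token.
moe = SparseMoEBlock(dim=64, hidden=256, num_experts=4, top_k=2)
print(moe(torch.randn(2, 4, 64)).shape)  # torch.Size([2, 4, 64])
```

Only the selected experts perform computation for a given token, which is why the active parameter count (and hence the per-token compute) stays roughly constant even as more experts are added.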

Experimental Results

Extensive experimentation validates the efficacy of MoE-LLaVA. Benchmarked on multiple visual understanding datasets, MoE-LLaVA, with only about 3 billion sparsely activated parameters, rivaled LLaVA models with up to 7 billion parameters. The authors show that MoE-LLaVA delivers performance comparable to dense LVLMs while requiring fewer computational resources, marking a significant step toward efficient multi-modal learning.

Contributions and Implications

The primary contributions are threefold:

  1. The MoE-Tuning methodology for adapting MoE to LVLMs, which prevents the degradation usually caused by sparsity (a parameter-freezing sketch of this staged schedule follows the list).
  2. MoE-LLaVA, a pioneering framework for sparse LVLMs that substantially grows model size without a proportional increase in computational demands.
  3. Experimental evidence that MoE-LLaVA achieves strong multi-modal understanding and markedly reduces object hallucination, surpassing 13-billion-parameter models while using only about 3 billion sparsely activated parameters.
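
For the staged MoE-Tuning schedule referenced in the first contribution, the following hedged sketch shows how such a schedule could be expressed as parameter freezing. The mapping of stages to modules (Stage 1: projector; Stage 2: projector plus LLM; Stage 3: the sparse MoE layers, with experts initialized from the dense FFN) reflects one reading of the paper, and all module names (projector, llm, moe_layers, ToyLVLM) are hypothetical placeholders rather than the released code's API.

```python
# Hedged sketch of three-stage training as parameter freezing.
# Module names and the stage-to-module mapping are assumptions, not the official code.
import torch.nn as nn


def set_trainable(module: nn.Module, flag: bool) -> None:
    """Toggle requires_grad for every parameter in the given module."""
    for p in module.parameters():
        p.requires_grad = flag


def configure_stage(model: nn.Module, stage: int) -> None:
    """Freeze everything, then unfreeze only the parts this stage trains."""
    set_trainable(model, False)
    if stage == 1:                                 # align visual tokens with the LLM
        set_trainable(model.projector, True)
    elif stage == 2:                               # multi-modal instruction tuning
        set_trainable(model.projector, True)
        set_trainable(model.llm, True)
    elif stage == 3:                               # train only the sparse MoE layers
        set_trainable(model.moe_layers, True)


# Toy stand-in for the LVLM; the real modules would be the vision encoder,
# projection layer, LLM blocks, and sparse MoE blocks described above.
class ToyLVLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.vision_encoder = nn.Linear(8, 8)
        self.projector = nn.Linear(8, 8)
        self.llm = nn.Linear(8, 8)
        self.moe_layers = nn.Linear(8, 8)


model = ToyLVLM()
configure_stage(model, stage=3)
print([n for n, p in model.named_parameters() if p.requires_grad])
# ['moe_layers.weight', 'moe_layers.bias']
```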

MoE-LLaVA sets a precedent for developing scalable and efficient LVLMs. The results indicate that the paper's contributions could reshape model-scaling paradigms, offering a model that navigates the trade-off between size, performance, and computational cost, which remains a critical challenge in AI research. Future research could extend these findings to a wider array of multi-modal tasks and to larger MoE-based LVLMs, provided adequate data pipelines are established.

Authors (10)
  1. Bin Lin (33 papers)
  2. Zhenyu Tang (39 papers)
  3. Yang Ye (34 papers)
  4. Peng Jin (91 papers)
  5. Junwu Zhang (13 papers)
  6. Munan Ning (19 papers)
  7. Li Yuan (141 papers)
  8. Jinfa Huang (25 papers)
  9. Yatian Pang (13 papers)
  10. Jiebo Luo (355 papers)
Citations (111)