Overview of LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
The paper "LLaVA-MoD: Making LLaVA Tiny via MoE-Knowledge Distillation" introduces an innovative approach to effectively reduce the scale of Multimodal LLMs (MLLMs) while maintaining or enhancing their performance through a novel framework called LLaVA-MoD (LLaVA Mixture-of-Expert Knowledge Distillation). This approach aims to distill the extensive knowledge embedded in large-scale models (l-MLLMs) into smaller and more efficient models (s-MLLMs) by using a combination of a Mixture of Experts (MoE) architecture and progressive knowledge transfer strategies.
Key Concepts and Methodology
The paper identifies two fundamental challenges in the distillation of MLLMs:
- Optimizing the Network Structure of the s-MLLM:
  - A sparse Mixture-of-Experts (MoE) architecture is integrated into the LLM to balance computational efficiency with model expressiveness. The sparse MoE uses multiple feedforward networks (FFNs) as experts and a linear gate that dynamically routes each token to the most suitable experts, capturing knowledge efficiently without excessive complexity (see the gating sketch after this list).
- Progressive Knowledge Transfer:
  - A progressive distillation strategy ensures comprehensive knowledge migration from the l-MLLM to the s-MLLM. It comprises two core components:
    - Mimic Distillation: aligns the output distribution of the student model with that of the teacher using the Kullback-Leibler (KL) divergence. This stage proceeds in two steps: dense-to-dense distillation for basic knowledge alignment, followed by dense-to-sparse distillation, which transitions the student to its sparse MoE configuration (a loss sketch follows this list).
    - Preference Distillation: employs Direct Preference Optimization (DPO) to strengthen the student model's ability to distinguish superior from inferior responses, improving performance and, in particular, reducing hallucinations (a DPO sketch also follows this list).
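To make the sparse MoE idea concrete, here is a minimal sketch of a gated FFN-expert layer with top-k routing, written in PyTorch. This is not the authors' implementation; the class name, expert count, and `top_k` value are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    """Illustrative sparse MoE feed-forward layer: a linear gate routes each
    token to its top-k FFN experts, and only those experts are evaluated."""

    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> flatten tokens for routing
        tokens = x.reshape(-1, x.size(-1))
        logits = self.gate(tokens)                          # (num_tokens, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1)  # keep top-k experts per token
        weights = F.softmax(weights, dim=-1)                # normalize the kept gate scores

        out = torch.zeros_like(tokens)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[:, slot] == e                # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape_as(x)
```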
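For mimic distillation, the loss below sketches the standard token-level KL divergence between the teacher's and student's next-token distributions. The temperature argument and function name are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def mimic_distillation_loss(student_logits: torch.Tensor,
                            teacher_logits: torch.Tensor,
                            temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over the vocabulary, averaged across tokens.

    student_logits, teacher_logits: (batch, seq_len, vocab_size)
    """
    vocab = student_logits.size(-1)
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1).reshape(-1, vocab)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1).reshape(-1, vocab)
    # kl_div expects log-probabilities for the student and probabilities for the teacher
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```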
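For preference distillation, the following sketch shows the usual DPO objective over chosen/rejected response log-probabilities relative to a frozen reference model; `beta` and the variable names are illustrative assumptions rather than values from the paper.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             reference_chosen_logps: torch.Tensor,
             reference_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss: push the student to prefer 'chosen' over 'rejected'
    responses, measured relative to a frozen reference model."""
    chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps)
    # maximize the margin between chosen and rejected implicit rewards
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```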
Performance and Numerical Results
The paper provides comprehensive experimental results demonstrating LLaVA-MoD's effectiveness across various multimodal benchmarks. Notably, with only 2 billion activated parameters, LLaVA-MoD outperforms the Qwen-VL-Chat-7B model by an average of 8.8% across multiple benchmarks while using merely 0.3% of the training data and 23% of the trainable parameters. These results underscore the efficacy of LLaVA-MoD in compressing and transferring knowledge from large to small models.
In hallucination benchmarks, LLaVA-MoD significantly reduces hallucination rates, outperforming even some Reinforcement Learning from Human Feedback (RLHF)-based models. For instance, LLaVA-MoD-2B surpasses the RLHF-V model by 8.2% in response-level hallucination rate and by 21.3% in mention-level hallucination rate on the Object HalBench.
Implications and Future Directions
The LLaVA-MoD framework has both practical and theoretical implications:
Practical Implications:
- Resource Efficiency: By maintaining low computational and memory demands, LLaVA-MoD enables the deployment of powerful MLLMs on resource-constrained devices, such as mobile phones and edge devices.
- Scalability: Because only a small fraction of the parameters is activated per token, inference is faster and deployment costs are lower, making the framework easier to scale.
Theoretical Implications:
- Model Compression: The successful application of MoE in MLLMs opens new avenues for model compression techniques, potentially applicable to other domains beyond multimodal learning.
- Knowledge Transfer: The progressive distillation approach, particularly the integration of preference-based optimization, provides a new paradigm for training smaller models to outperform their larger counterparts, especially in aspects like reducing hallucinations.
Future Directions:
- Heterogeneous Model Families: Extending the distillation techniques to heterogeneous model families could address the limitations of requiring the teacher and student models to belong to the same LLM family.
- Memory Efficiency: Further optimizations could focus on reducing memory consumption during the distillation process by leveraging efficient computational techniques or pre-extracting teacher model outputs.
In conclusion, LLaVA-MoD represents a significant advance in efficient multimodal learning, providing a robust framework for distilling large MLLMs into smaller, resource-efficient models without sacrificing performance. This research paves the way for broader applications and further innovations in model compression and knowledge distillation.