LLAVADI: What Matters For Multimodal Large Language Models Distillation (2407.19409v1)

Published 28 Jul 2024 in cs.CL and cs.CV

Abstract: The recent surge in Multimodal LLMs (MLLMs) has showcased their remarkable potential for achieving generalized intelligence by integrating visual understanding into LLMs. Nevertheless, the sheer model size of MLLMs leads to substantial memory and computational demands that hinder their widespread deployment. In this work, we do not propose a new efficient model structure or train small-scale MLLMs from scratch. Instead, we focus on what matters for training small-scale MLLMs through knowledge distillation, which is the first step from the multimodal distillation perspective. Our extensive studies involve training strategies, model choices, and distillation algorithms in the knowledge distillation process. These results show that joint alignment for both tokens and logit alignment plays critical roles in teacher-student frameworks. In addition, we draw a series of intriguing observations from this study. By evaluating different benchmarks and proper strategy, even a 2.7B small-scale model can perform on par with larger models with 7B or 13B parameters. Our code and models will be publicly available for further research.

Overview of LLAVADI: Distillation in Multimodal LLMs

In the paper titled "LLAVADI: What Matters For Multimodal Large Language Models Distillation," the authors investigate the factors critical for the effective distillation of multimodal LLMs (MLLMs). The primary focus is on understanding how smaller MLLMs can be trained effectively through knowledge distillation to achieve performance comparable to their larger counterparts. This research is significant given the increasing demand for computationally efficient models that maintain robust cross-modality capabilities.

Key Contributions

The paper's contributions are structured around four main areas of investigation in MLLM knowledge distillation:

  1. Feature Embedding Distillation: Analyzing the distillation of intermediate hidden states between teacher and student models to align feature embeddings at various transformer layers.
  2. Logit-Level Distillation: Evaluating methods for aligning the classification logits between models using different types of KL divergences and MSE losses.
  3. Affinity-Aware Distillation: Investigating the potential for distilling the relationship between visual and language tokens to enhance the student's ability to emulate the teacher's understanding of visual-language associations (a sketch of this idea follows this list).
  4. Data-Driven Knowledge Distillation: Exploring the impact of generating new instruction-tuning data leveraging either the teacher or student model to improve distillation outcomes.
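
To make the affinity-aware idea in item 3 concrete, below is a minimal sketch, assuming a cosine-similarity affinity between visual and language token hidden states that the student is trained to match with an MSE loss. The function names and the exact matching objective are illustrative assumptions, not the paper's released implementation. One convenient property of matching affinities rather than raw features is that teacher and student hidden sizes do not need to agree.

```python
import torch
import torch.nn.functional as F

def affinity_matrix(visual: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity affinity between visual and language tokens.

    visual: (B, Nv, D) hidden states of visual tokens
    text:   (B, Nt, D) hidden states of language tokens
    returns: (B, Nv, Nt) affinity scores in [-1, 1]
    """
    v = F.normalize(visual, dim=-1)
    t = F.normalize(text, dim=-1)
    return v @ t.transpose(1, 2)

def affinity_distill_loss(t_visual, t_text, s_visual, s_text):
    """Match the student's visual-language affinity to the teacher's.

    Because only the (Nv, Nt) affinity matrices are compared, teacher and
    student may use different hidden dimensions.
    """
    with torch.no_grad():                       # teacher is frozen
        a_teacher = affinity_matrix(t_visual, t_text)
    a_student = affinity_matrix(s_visual, s_text)
    return F.mse_loss(a_student, a_teacher)
```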

Methodology

The proposed distillation framework, termed LLAVADI, integrates multiple distillation strategies to train smaller MLLMs efficiently. The LLAVADI framework leverages a powerful teacher model (e.g., LLaVA-v1.5-13B) to guide a smaller student model (e.g., MobileLLaMA with 2.7B parameters) using the following approaches:

  • Feature Embedding Distillation: Aligning features with a cosine or MSE loss, focusing particularly on the final transformer layers (see the sketch after this list).
  • Logit-Level Distillation: Emphasizing KL divergences, especially Jensen-Shannon divergence and forward/reverse KL, to balance knowledge transfer between models.
  • Data Enhancement: Utilizing regenerated instruction-tuning data from the teacher model to address potential distributional discrepancies in training datasets.
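
As a concrete illustration of the feature-embedding alignment above, the following sketch aligns the last-layer hidden states of teacher and student. It is a minimal sketch under our own assumptions: the learned linear projection bridging the student's and teacher's hidden sizes, the example dimensions, and the class name are hypothetical and are not taken from the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistiller(nn.Module):
    """Align the student's last-layer hidden states with the teacher's."""

    def __init__(self, student_dim: int = 2560, teacher_dim: int = 5120):
        super().__init__()
        # Learned projection bridging the dimensionality gap (e.g., 2.7B -> 13B).
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_hidden, teacher_hidden, use_cosine: bool = True):
        # student_hidden: (B, T, student_dim); teacher_hidden: (B, T, teacher_dim)
        s = self.proj(student_hidden)
        t = teacher_hidden.detach()            # no gradients through the teacher
        if use_cosine:
            # 1 - cosine similarity per token, averaged over batch and sequence
            return (1.0 - F.cosine_similarity(s, t, dim=-1)).mean()
        return F.mse_loss(s, t)
```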

Experimental Insights

The paper presents extensive empirical evaluations on several benchmarks such as VQA, SQA, POPE, and MME. The findings suggest:

  • Feature Embeddings: Aligning the last layer’s hidden embeddings improves model performance, while excessive layer alignment may degrade results.
  • Logit-Level: KL divergence methods yield better alignment of logits than MSE loss, with a preference for KL variants that accommodate large vocabulary sizes (see the sketch after this list).
  • Affinity Distillation: This approach proved less effective due to its incompatibility with autoregressive training paradigms in MLLMs, unlike its efficacy in contrastive loss settings for models like CLIP.
  • Data-Driven Strategies: Regenerated data via the teacher model can enhance performance, although care must be taken concerning computational costs.
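
To make the logit-level comparison concrete, the sketch below implements the divergences discussed above (forward KL, reverse KL, and a Jensen-Shannon mixture) over the vocabulary dimension. It assumes teacher and student share a tokenizer and vocabulary; the temperature parameter and the function names are our own illustrative choices rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def _kl(p, log_p, log_q):
    """KL(p || q) per token, averaged over batch and sequence."""
    return (p * (log_p - log_q)).sum(dim=-1).mean()

def logit_distill_loss(student_logits, teacher_logits, mode="js", tau=1.0):
    """Divergence between student and teacher next-token distributions.

    student_logits, teacher_logits: (B, T, V) over a shared vocabulary.
    mode: "forward" = KL(teacher || student), "reverse" = KL(student || teacher),
          "js" = Jensen-Shannon divergence (symmetric mixture of the two).
    """
    log_p_s = F.log_softmax(student_logits / tau, dim=-1)
    log_p_t = F.log_softmax(teacher_logits.detach() / tau, dim=-1)
    p_s, p_t = log_p_s.exp(), log_p_t.exp()

    if mode == "forward":
        return _kl(p_t, log_p_t, log_p_s)    # mode-covering
    if mode == "reverse":
        return _kl(p_s, log_p_s, log_p_t)    # mode-seeking
    # Jensen-Shannon: average KL of teacher and student to their mixture
    log_m = (0.5 * (p_s + p_t)).clamp_min(1e-8).log()
    return 0.5 * (_kl(p_t, log_p_t, log_m) + _kl(p_s, log_p_s, log_m))
```

In a training loop, a term like this would typically be added, with a weighting coefficient, to the student's standard next-token cross-entropy loss.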

Implications and Future Work

The approaches demonstrated in this paper reveal robust strategies for effectively distilling large MLLMs into smaller, more deployable models without significantly sacrificing performance. This has practical implications for applications that demand efficient models due to limited computational resources. The paper lays the groundwork for future advancements in MLLM distillation techniques, promising more scalable and efficient integration of visual and textual data processing.

In sum, this paper highlights the importance of targeted distillation strategies across different parts of the MLLM architecture, advancing the deployment of powerful yet resource-efficient models in multimodal domains. Future research may explore optimizing student model architectures and refining data-alignment strategies to close the remaining performance gaps.

Authors (6)
  1. Shilin Xu (17 papers)
  2. Xiangtai Li (128 papers)
  3. Haobo Yuan (22 papers)
  4. Lu Qi (93 papers)
  5. Yunhai Tong (69 papers)
  6. Ming-Hsuan Yang (376 papers)