The paper "MoExtend: Tuning New Experts for Modality and Task Extension" addresses a crucial challenge in expanding the capabilities of LLMs to include vision-language understanding. Traditional LLMs are primarily trained on text data, which limits their application to purely textual tasks. The integration of multimodal data — combining text and vision — enhances the versatility of LLMs but poses significant training challenges due to high costs and complexity.
Existing methods typically connect a pretrained CLIP vision encoder to an LLM and then fully fine-tune the combined model. This approach suffers from catastrophic forgetting, where the model loses previously acquired knowledge while adapting to new tasks or modalities, and its training cost grows as further tasks and modalities are added.
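To make the prior setup concrete, the sketch below shows a typical CLIP-to-LLM bridge in which vision features are projected into the LLM's embedding space and the whole stack is updated during full fine-tuning. Module names, dimensions, and the projection design are illustrative assumptions, not code from the paper or any specific implementation.

```python
import torch
import torch.nn as nn

class VisionLanguageModel(nn.Module):
    """Minimal sketch of a CLIP-encoder-plus-LLM pipeline (assumed structure)."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder             # e.g. a pretrained CLIP ViT
        self.projector = nn.Linear(vision_dim, llm_dim)  # maps image features to LLM token space
        self.llm = llm                                   # pretrained language model

    def forward(self, images: torch.Tensor, text_embeds: torch.Tensor):
        image_feats = self.vision_encoder(images)               # (B, N_img, vision_dim)
        image_tokens = self.projector(image_feats)              # (B, N_img, llm_dim)
        inputs = torch.cat([image_tokens, text_embeds], dim=1)  # prepend visual tokens
        return self.llm(inputs)                                 # LLM consumes the mixed sequence

# Full fine-tuning updates every parameter, including the pretrained LLM,
# which is where forgetting of text-only knowledge can creep in:
# optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
```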
The authors propose MoExtend, a framework for modality adaptation and extension in Mixture-of-Experts (MoE) models. MoExtend integrates new experts into a pretrained MoE model, introducing new knowledge without fine-tuning the existing model or the vision encoder. This reduces the risk of catastrophic forgetting and substantially lowers training cost.
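One way to picture the extension step is adding experts to each MoE layer and widening its router while keeping all pretrained weights frozen. The sketch below is an assumption-laden illustration of that idea: the attribute names (`experts`, `gate`) and the copy-based initialisation are stand-ins, not MoExtend's exact procedure.

```python
import copy
import torch
import torch.nn as nn

def extend_moe_layer(moe_layer: nn.Module, num_new_experts: int = 1) -> nn.Module:
    """Add trainable experts to a pretrained MoE layer without touching existing weights."""
    # Freeze the pretrained experts so their knowledge stays intact.
    for expert in moe_layer.experts:
        for p in expert.parameters():
            p.requires_grad = False

    # Add new experts, here initialised by cloning an existing one (an assumption).
    for _ in range(num_new_experts):
        new_expert = copy.deepcopy(moe_layer.experts[0])
        for p in new_expert.parameters():
            p.requires_grad = True              # only the new experts are trained
        moe_layer.experts.append(new_expert)

    # Widen the router (gating network) so tokens can be routed to the new experts.
    old_gate = moe_layer.gate                   # Linear(hidden_dim, num_old_experts)
    new_gate = nn.Linear(old_gate.in_features,
                         old_gate.out_features + num_new_experts, bias=False)
    with torch.no_grad():
        new_gate.weight[:old_gate.out_features] = old_gate.weight  # keep old routing
    moe_layer.gate = new_gate
    return moe_layer
```

In this scheme the optimizer would be given only the parameters with `requires_grad=True`, so the pretrained experts and the vision encoder receive no gradient updates.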
MoExtend thus lets LLMs take on new modalities or tasks efficiently, equipping them with multimodal capabilities and contributing to progress in multimodal AI research. The experiments in the paper show that MoExtend improves the multimodal performance of LLMs, making it a promising way to broaden their application scope without extensive retraining.