Parrot: Multilingual Visual Instruction Tuning (2406.02539v2)

Published 4 Jun 2024 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: The rapid development of Multimodal LLMs (MLLMs) like GPT-4V has marked a significant step towards artificial general intelligence. Existing methods mainly focus on aligning vision encoders with LLMs through supervised fine-tuning (SFT) to endow LLMs with multimodal abilities, making MLLMs' inherent ability to react to multiple languages progressively deteriorate as the training process evolves. We empirically find that the imbalanced SFT datasets, primarily composed of English-centric image-text pairs, lead to significantly reduced performance in non-English languages. This is due to the failure of aligning the vision encoder and LLM with multilingual tokens during the SFT process. In this paper, we introduce Parrot, a novel method that utilizes textual guidance to drive visual token alignment at the language level. Parrot makes the visual tokens condition on diverse language inputs and uses Mixture-of-Experts (MoE) to promote the alignment of multilingual tokens. Specifically, to enhance non-English visual tokens alignment, we compute the cross-attention using the initial visual features and textual embeddings, the result of which is then fed into the MoE router to select the most relevant experts. The selected experts subsequently convert the initial visual tokens into language-specific visual tokens. Moreover, considering the current lack of benchmarks for evaluating multilingual capabilities within the field, we collect and make available a Massive Multilingual Multimodal Benchmark which includes 6 languages, 15 categories, and 12,000 questions, named as MMMB. Our method not only demonstrates state-of-the-art performance on multilingual MMBench and MMMB, but also excels across a broad range of multimodal tasks. Both the source code and the training dataset of Parrot will be made publicly available. Code is available at: https://github.com/AIDC-AI/Parrot.

The paper "Parrot: Multilingual Visual Instruction Tuning" addresses a pertinent issue in the development of Multimodal LLMs (MLLMs), namely the deterioration of multilingual capabilities due to imbalanced training datasets predominantly featuring English-centric image-text pairs. Through a novel approach named Parrot, the authors propose a method to maintain and enhance the multilingual abilities of these models, ensuring consistent performance across a wider range of languages.

The central issue identified by the authors is the phenomenon of "multilingual erosion," where the ability of MLLMs to process and generate outputs in non-English languages diminishes as the supervised fine-tuning (SFT) process progresses. This degradation is largely attributed to the insufficient alignment of vision encoders with multilingual tokens during training. The authors argue that existing datasets, focused mainly on English image-text pairs, significantly limit the performance of the models in non-English environments.

Parrot uses textual guidance at the language level to drive visual token alignment. The authors implement a Mixture-of-Experts (MoE) module that conditions visual tokens on diverse language inputs. The module computes cross-attention between the initial visual features and the textual embeddings, then feeds the result into an MoE router that selects the most relevant experts. The selected experts convert the English-biased visual tokens into language-specific visual tokens, mitigating the performance drop in non-English contexts. In the reported experiments, Parrot achieves state-of-the-art performance on multilingual MMBench and on the new Massive Multilingual Multimodal Benchmark (MMMB), which covers six languages.
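The routing idea described above can be illustrated with a small sketch. This is not the authors' implementation: the mean pooling, the top-k gating, the single-matrix "experts", the toy dimensions, and every function and variable name below are assumptions made for illustration, and plain Python lists are used only to keep the example dependency-free.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def cross_attention(visual, text):
    """Each visual token (query) attends over the text embeddings (keys/values)."""
    dim = len(visual[0])
    attended = []
    for q in visual:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in text]
        weights = softmax(scores)
        attended.append([sum(w * t[j] for w, t in zip(weights, text)) for j in range(dim)])
    return attended


def route(attended, router_weights, top_k):
    """Mean-pool the attended tokens, score the experts, keep the top_k gates.

    Pooling and top-k renormalisation are assumptions, not taken from the paper.
    """
    dim = len(attended[0])
    pooled = [sum(tok[j] for tok in attended) / len(attended) for j in range(dim)]
    probs = softmax(matvec(router_weights, pooled))
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def language_specific_tokens(visual, text, router_weights, experts, top_k=2):
    """Mix the selected experts' outputs for every initial visual token."""
    gates = route(cross_attention(visual, text), router_weights, top_k)
    out = []
    for tok in visual:
        mixed = [0.0] * len(tok)
        for i, gate in gates.items():
            for j, v in enumerate(matvec(experts[i], tok)):
                mixed[j] += gate * v
        out.append(mixed)
    return out, gates


def rand_matrix(rows, cols):
    """Random stand-in for learned weights or embeddings."""
    return [[random.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Toy sizes chosen for readability only (hypothetical, not the paper's settings).
random.seed(0)
DIM, N_VISUAL, N_TEXT, N_EXPERTS = 4, 3, 5, 6
visual = rand_matrix(N_VISUAL, DIM)          # initial visual tokens
text = rand_matrix(N_TEXT, DIM)              # textual embeddings of the prompt
router_weights = rand_matrix(N_EXPERTS, DIM)
experts = [rand_matrix(DIM, DIM) for _ in range(N_EXPERTS)]

tokens, gates = language_specific_tokens(visual, text, router_weights, experts)
print(len(tokens), "tokens; selected experts:", sorted(gates))
```

In this sketch the prompt's text embeddings influence the visual tokens only through the router: a prompt in a different language changes the attended features, which changes which experts are selected and how their outputs are weighted.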

Furthermore, Parrot performed well across a broad range of multimodal tasks, highlighting its versatility and robustness. The method does not require large amounts of non-English data: it aligns visual representations across multiple languages with minimal data resources, a practical advantage in low-resource settings.

In developing MMMB, the authors point to the limitations of existing benchmarks, which are often outdated, unstandardized, or limited in language diversity. MMMB comprises 12,000 questions across 15 categories in six languages: English, Chinese, Portuguese, Arabic, Turkish, and Russian. The benchmark keeps content consistent across languages, offering a fair evaluation of MLLMs' multilingual capabilities.
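Because the benchmark content is parallel across languages, a natural way to read results is accuracy per language. The sketch below shows that bookkeeping on toy records. The record fields (`lang`, `answer`, `prediction`) and the data are hypothetical and do not reflect MMMB's actual file format or any reported results.

```python
from collections import defaultdict


def per_language_accuracy(records):
    """Return {language code: fraction of predictions matching the answer}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["lang"]] += 1
        correct[rec["lang"]] += int(rec["prediction"] == rec["answer"])
    return {lang: correct[lang] / total[lang] for lang in total}


# Toy records for illustration only; not real benchmark items or model outputs.
toy_records = [
    {"lang": "en", "answer": "A", "prediction": "A"},
    {"lang": "en", "answer": "B", "prediction": "B"},
    {"lang": "tr", "answer": "A", "prediction": "A"},
    {"lang": "tr", "answer": "B", "prediction": "C"},
]

scores = per_language_accuracy(toy_records)
print(scores)
```

With parallel questions, a gap between two languages' scores points to the model's handling of the language rather than to differences in question difficulty.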

The implications of this research are significant. Practically, it paves the way for more equitable access to advanced AI capabilities across different linguistic contexts, addressing digital divide issues. Theoretically, it contributes to the understanding of multimodal alignments at the intersection of vision and language processing, promoting further exploration into the integration of multilingual data in model pre-training and fine-tuning phases.

Parrot's methodology could extend to domains that require multilingual and multimodal processing, from content moderation to automated translation systems. Further work on multilingual training datasets and more sophisticated MoE designs could strengthen these capabilities.

In summary, the Parrot framework is a valuable contribution to the MLLM field, addressing multilingual erosion with a method that requires minimal additional resources. It could help models not only maintain their multilingual capabilities during training but also expand them, facilitating the development of more globally accessible AI systems.