An Overview of MoAI: Mixture of All Intelligence for Large Language and Vision Models
In recent years, large language models (LLMs) have seen extensive development and application across many domains. This expansion now includes large language and vision models (LLVMs), which integrate vision with language to improve understanding and task performance. The paper "MoAI: Mixture of All Intelligence for Large Language and Vision Models" presents a novel LLVM, MoAI, that bridges an existing gap by incorporating auxiliary visual information from detailed real-world scene understanding. The authors propose a robust, efficient architecture that enhances vision-language (VL) tasks without increasing model size or curating additional datasets.
MoAI distinguishes itself by forgoing larger models or extra datasets and instead using the outputs of external computer vision (CV) models for segmentation, detection, scene graph generation (SGG), and optical character recognition (OCR) as auxiliary information. This auxiliary visual data significantly improves MoAI's visual perception capabilities. The architecture is built around two main components, the MoAI-Compressor and the MoAI-Mixer, which process and integrate the additional information.
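To make this concrete, the minimal Python sketch below shows one way the outputs of detection, OCR, and SGG models could be verbalized into auxiliary text before being tokenized. The function name and input fields are illustrative assumptions, not the paper's actual interface.

```python
# A minimal sketch (not the authors' code) of how outputs from external CV
# models might be "verbalized" into auxiliary text before tokenization.
# The dictionaries below are hypothetical stand-ins for real model outputs.

def verbalize_auxiliary_outputs(detections, ocr_results, scene_graph_triples):
    """Turn raw CV outputs into natural-language auxiliary sentences."""
    lines = []
    # Detection / segmentation: object classes and their bounding boxes.
    for det in detections:
        x1, y1, x2, y2 = det["box"]
        lines.append(f"There is a {det['label']} at [{x1}, {y1}, {x2}, {y2}].")
    # OCR: scene text found in the image.
    for item in ocr_results:
        lines.append(f"The text '{item['text']}' appears in the image.")
    # Scene graph generation: subject-predicate-object relations.
    for subj, pred, obj in scene_graph_triples:
        lines.append(f"The {subj} is {pred} the {obj}.")
    return " ".join(lines)


aux_text = verbalize_auxiliary_outputs(
    detections=[{"label": "dog", "box": (12, 40, 220, 310)}],
    ocr_results=[{"text": "EXIT"}],
    scene_graph_triples=[("dog", "next to", "door")],
)
print(aux_text)
```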
Components of MoAI
- MoAI-Compressor: This module aligns and condenses the verbalized outputs of the external CV models into a fixed set of compressed auxiliary tokens that are then used for VL tasks. Keeping this auxiliary information in a concise form preserves computational efficiency while retaining the visual detail the model needs (see the compressor sketch after this list).
- MoAI-Mixer: Built on the Mixture of Experts idea, the MoAI-Mixer harmonizes three types of intelligence: visual features, auxiliary features from the external CV models, and language features. Cross- and self-attention expert modules blend these sources, and gating networks learn how to weight each expert's output, allowing MoAI to improve performance without increasing model scale (see the mixer sketch after this list).
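As a rough illustration of the compressor idea, the sketch below assumes a learnable-query cross-attention resampler: a fixed set of query tokens attends over the variable-length embeddings of the verbalized CV outputs and yields a small, fixed number of compressed tokens. The class name, dimensions, and single-layer design are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

# A minimal sketch, assuming the compressor behaves like a learnable-query
# cross-attention resampler. Sizes and the single-layer design are
# illustrative choices, not the paper's exact architecture.

class AuxCompressor(nn.Module):
    def __init__(self, dim=1024, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, aux_embeds):  # aux_embeds: (batch, seq_len, dim)
        # Fixed query tokens attend over the embedded, verbalized CV outputs.
        q = self.queries.unsqueeze(0).expand(aux_embeds.size(0), -1, -1)
        attn_out, _ = self.cross_attn(q, aux_embeds, aux_embeds)
        x = self.norm1(q + attn_out)
        x = self.norm2(x + self.ffn(x))
        return x  # (batch, num_queries, dim): fixed-size compressed aux tokens


compressor = AuxCompressor()
aux_embeds = torch.randn(2, 300, 1024)  # embedded verbalized CV outputs
compressed = compressor(aux_embeds)     # torch.Size([2, 64, 1024])
```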
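Similarly, the following sketch conveys the gated mixture-of-experts idea behind the MoAI-Mixer in simplified form: cross-attention experts let language tokens attend to visual and auxiliary tokens, a third expert attends over the language tokens themselves, and a gating network learns per-token weights over the experts. The paper's mixer operates inside the LLM's layers and may differ in the number and placement of experts; this standalone module is only a simplified approximation.

```python
import torch
import torch.nn as nn

# A simplified, assumption-laden sketch of gated expert mixing over three
# sources of intelligence: visual, auxiliary (from CV models), and language.

class AttnExpert(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, lang, context):
        # Cross-attention when context differs from lang; self-attention otherwise.
        out, _ = self.attn(lang, context, context)
        return self.norm(lang + out)


class MoAIMixerSketch(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.visual_expert = AttnExpert(dim, num_heads)  # lang attends to visual tokens
        self.aux_expert = AttnExpert(dim, num_heads)     # lang attends to compressed aux tokens
        self.lang_expert = AttnExpert(dim, num_heads)    # self-attention over lang tokens
        self.gate = nn.Linear(dim, 3)                    # per-token weights for the 3 experts

    def forward(self, lang, visual, aux):
        expert_outs = torch.stack(
            [
                self.visual_expert(lang, visual),
                self.aux_expert(lang, aux),
                self.lang_expert(lang, lang),
            ],
            dim=-1,
        )  # (batch, seq, dim, 3)
        weights = torch.softmax(self.gate(lang), dim=-1)          # (batch, seq, 3)
        return (expert_outs * weights.unsqueeze(2)).sum(dim=-1)   # weighted expert mixture


mixer = MoAIMixerSketch(dim=1024)
lang = torch.randn(2, 32, 1024)     # language token features
visual = torch.randn(2, 256, 1024)  # visual encoder features
aux = torch.randn(2, 64, 1024)      # compressed auxiliary tokens from the compressor
mixed = mixer(lang, visual, aux)    # torch.Size([2, 32, 1024])
```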
Experimental Results
MoAI’s performance has been evaluated against both open-source and closed-source LLVMs. The results show that MoAI excels in zero-shot vision-language tasks, particularly those that hinge on real-world scene understanding such as object existence and scene-text recognition. MoAI outperforms its counterparts on rigorous VL benchmarks such as MME, SEED, and MMBench, underscoring its enhanced visual perception capabilities.
The reported numbers are strong: MoAI achieves high accuracy across these benchmarks without requiring additional visual instruction tuning datasets or a larger model. This efficiency illustrates the practical value of folding diverse auxiliary visual information into an LLVM to improve real-world scene understanding.
Implications and Future Directions
The implications of MoAI’s development are twofold. Practically, MoAI provides an efficient and powerful tool suitable for diverse applications requiring scene understanding, such as autonomous driving, robotics, and complex user-interface interactions. Theoretically, MoAI’s design offers a reference framework for future studies aiming to streamline and enhance multi-modal model architectures without cumbersome scaling.
Looking ahead, the set of external CV models used within MoAI could be expanded, for example with additional low-level vision capabilities, broader perception beyond objects, and advanced problem-solving abilities. Covering further vision-language aspects in this way could consolidate LLVM efficacy and guide both academic and industrial advances.
In conclusion, MoAI sets a new standard for integrating rich visual and language information without the traditional burden of scaling, marking a notable advance in LLVM capabilities.