An Examination of NExT-GPT: An Any-to-Any Multimodal LLM
The paper introduces NExT-GPT, a Multimodal LLM (MM-LLM) that targets a key limitation of existing models: most handle multimodal understanding only on the input side and cannot generate outputs across different modalities. NExT-GPT addresses this with an end-to-end system offering any-to-any multimodal functionality, a step toward AI that perceives and communicates across modalities much as humans do.
Core Components and Architecture
The architecture of NExT-GPT is organized into three tiers:
- Multimodal Encoding Stage: This tier relies on the established ImageBind encoder to encode inputs from different modalities, including text, images, videos, and audio. A projection layer then maps the encodings into language-like representations the LLM can consume.
- LLM Understanding and Reasoning Stage: At the core, the open-source Vicuna LLM processes the projected input representations, performing semantic understanding and reasoning. It produces textual responses directly and emits modality signal tokens that tell the decoding stage which modalities to generate.
- Multimodal Generation Stage: Conditioned on the LLM's signal tokens, this stage uses diffusion decoders, namely Stable Diffusion for images, Zeroscope for videos, and AudioLDM for audio, to generate outputs in the requested modalities. A schematic sketch of the full pipeline follows this list.
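The following is a minimal, purely illustrative sketch of how the three tiers could be wired together; all class and attribute names (AnyToAnyPipeline, in_proj, signal_tokens, and so on) are hypothetical placeholders and do not reflect the actual NExT-GPT codebase.

```python
import torch.nn as nn

class AnyToAnyPipeline(nn.Module):
    """Hypothetical wiring of the three tiers described above."""

    def __init__(self, encoder, in_proj, llm, out_proj, decoders):
        super().__init__()
        self.encoder = encoder              # frozen ImageBind-style multimodal encoder
        self.in_proj = in_proj              # trainable projection into the LLM embedding space
        self.llm = llm                      # frozen Vicuna-style LLM
        self.out_proj = out_proj            # trainable projection of modality signal tokens
        self.decoders = nn.ModuleDict(decoders)  # frozen diffusion decoders keyed by modality

    def forward(self, text_tokens, media_inputs):
        # 1) Multimodal encoding: map non-text inputs to language-like embeddings.
        media_emb = self.in_proj(self.encoder(media_inputs))
        # 2) LLM understanding/reasoning: consume text plus projected embeddings,
        #    producing a textual answer and per-modality signal-token representations.
        llm_out = self.llm(text_tokens, extra_embeddings=media_emb)
        # 3) Multimodal generation: route each signal to its diffusion decoder.
        outputs = {"text": llm_out.text}
        for modality, signal in llm_out.signal_tokens.items():
            condition = self.out_proj(signal)
            outputs[modality] = self.decoders[modality](condition)
        return outputs
```

The design choice mirrored here is that the LLM sits in the middle and talks to the decoders only through projected signal tokens, which is what lets the surrounding encoders and decoders stay frozen.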
Because the encoders, LLM, and diffusion decoders are reused off the shelf, NExT-GPT trains with notable parameter efficiency: only about 1% of its parameters, concentrated in the input and output projection layers, are updated, which keeps training scalable and low-cost. A sketch of this freezing scheme follows.
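Below is a minimal sketch of the freezing idea, assuming a module laid out like the hypothetical AnyToAnyPipeline above; the function and attribute names are illustrative, not the paper's code.

```python
# Freeze everything, then re-enable gradients only for the projection layers.
# Assumes `pipeline` exposes `in_proj` and `out_proj` as in the sketch above.
def freeze_all_but_projections(pipeline):
    for p in pipeline.parameters():
        p.requires_grad = False
    for module in (pipeline.in_proj, pipeline.out_proj):
        for p in module.parameters():
            p.requires_grad = True

    # Report the fraction of trainable parameters (roughly 1% in the paper's setup).
    trainable = sum(p.numel() for p in pipeline.parameters() if p.requires_grad)
    total = sum(p.numel() for p in pipeline.parameters())
    print(f"Trainable parameters: {trainable:,} of {total:,} ({trainable / total:.2%})")
```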
Instruction Tuning and Dataset Preparation
To enhance the model's ability to interpret multifaceted user instructions, the authors introduce Modality-switching Instruction Tuning (MosIT) together with a newly curated dataset. Covering versatile multimodal interactions across varied domains, the dataset teaches the model to follow complex instructions that demand reasoning across mixed modalities; an illustrative record is sketched below.
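To make the idea concrete, here is a hypothetical example of what a modality-switching instruction-tuning record might look like; the field names and placeholder tokens are assumptions for illustration, not the actual MosIT schema.

```python
# Hypothetical MosIT-style record: a dialogue that mixes modalities on both
# the input side (<image_0>) and the output side (an audio generation request).
mosit_example = {
    "conversation": [
        {
            "role": "user",
            "content": "Here is a photo of my garden. <image_0> "
                       "Turn it into a short ambient soundscape and describe the mood.",
        },
        {
            "role": "assistant",
            "content": "The garden feels calm at dusk. <audio_gen> "
                       "I generated a soft soundscape with birdsong and light wind.",
            # Annotation guiding the audio decoder during training.
            "signal_tokens": [
                {"modality": "audio",
                 "caption": "calm garden at dusk, birdsong, light wind"}
            ],
        },
    ],
    "media": {"image_0": "garden.jpg"},
}
```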
Performance Assessment
Quantitative evaluations show that NExT-GPT performs competitively with state-of-the-art systems on tasks such as text-to-image, text-to-audio, and text-to-video generation. Its strengths are most apparent in generation tasks that demand nuanced semantic understanding, which its LLM-centric processing provides.
Its reasoning over broader instructional contexts, exercised by the MosIT interactions, is another notable strength. Limitations remain, however, particularly in tasks that require high-quality cross-modal editing, which will need future improvement.
Implications and Future Directions
The implications of NExT-GPT are both theoretical and practical. Theoretically, it demonstrates a viable path toward AI models that transition and adapt across multiple information modalities, much as human cognition does. Practically, any-to-any MM-LLMs such as NExT-GPT open new avenues for applications, from richer human-computer interaction to accessibility technologies.
Promising directions for future work include supporting additional modalities, incorporating more LLM variants, strengthening multimodal generation with retrieval-based complementarity, and further enriching the dataset, all of which could broaden the capability and reach of multimodal AI systems.