
NExT-GPT Multimodal LLM System

Updated 4 February 2026
  • NExT-GPT is an end-to-end multimodal LLM that fuses text, image, video, and audio inputs using frozen encoders and lightweight projection layers.
  • It introduces Modality-Switching Instruction Tuning to enable complex, multi-turn dialogues and robust cross-modal reasoning for tasks like captioning and editing.
  • The model achieves state-of-the-art results with only 1% tunable parameters, ensuring efficient training and rapid modality extension.

NExT-GPT is an end-to-end, general-purpose, any-to-any multimodal LLM system designed to perform both perception and generation across arbitrary combinations of text, images, video, and audio. Its architecture leverages frozen high-performance encoders and diffusion-based decoders, while confining all learning to lightweight projection and adaptation layers. This parameter-efficient approach enables rapid extension to new modalities, supports complex cross-modal reasoning, and delivers state-of-the-art results in both cross-modal understanding and generative tasks (Wu et al., 2023).

1. System Architecture and Components

NExT-GPT employs a unified, three-tier design comprising:

  • Multimodal Encoding: Non-text inputs (image, video, audio) are processed by a shared, frozen ImageBind encoder, generating modality-specific feature vectors x_i. Each x_i is linearly projected into the Vicuna-7B LLM embedding space via a lightweight adapter:

h_i = W_i x_i + b_i, \quad W_i \in \mathbb{R}^{d_{\mathrm{model}} \times d_i}

where d_{\mathrm{model}} = 4096 and d_i is the output dimension of the encoder for modality i. Text inputs are embedded natively by the LLM.

  • LLM Reasoning: All projected multimodal features {h_i} and tokenized text are fed to a frozen Vicuna-7B LLM, which jointly attends to these inputs. The LLM produces conventional text tokens and/or discrete "signal" tokens such as ⟨IMG_k⟩, ⟨VID_k⟩, and ⟨AUD_k⟩ to request corresponding output modalities.
  • Multimodal Decoding: Signal tokens are embedded by small transformer-based output adapters:

u_j = W'_j\,\mathrm{Embed}(\langle\mathrm{SIG}_j\rangle) + b'_j, \quad W'_j \in \mathbb{R}^{d_{\mathrm{cond}} \times d_{\mathrm{sig}}}

These produce conditioning vectors c for the respective frozen diffusion decoders:

  • Stable Diffusion v1.5 for images
  • Zeroscope v2 for video
  • AudioLDM L-full for audio

Decoding is steered by cross-attention using the projected LLM signals u_j.
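As a concrete sketch, the input-projection adapter above is just a per-modality linear map into the LLM embedding space. Only d_model = 4096 is given in the text; the 1024-dim encoder output below is an assumption for illustration, not a documented ImageBind interface.

```python
import numpy as np

D_MODEL = 4096  # Vicuna-7B embedding width, per the text
D_IMG = 1024    # assumed width of the frozen encoder's image feature (illustrative)

rng = np.random.default_rng(0)

# W_i in R^{d_model x d_i} and bias b_i: the only trainable pieces of this stage
W_img = rng.standard_normal((D_MODEL, D_IMG)) * 0.02
b_img = np.zeros(D_MODEL)

def project(x: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """h_i = W_i x_i + b_i: map a frozen-encoder feature into LLM token space."""
    return W @ x + b

x_img = rng.standard_normal(D_IMG)    # stand-in for a frozen ImageBind feature
h_img = project(x_img, W_img, b_img)  # ready to be consumed as an LLM input token
```

In the real system one such adapter exists per modality, while the encoder producing x_i and the LLM consuming h_i both stay frozen.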

Parameterization Overview:

Component            Status    Parameters
ImageBind encoder    Frozen    1.2B
Input projections    Tunable   ~4M per modality
Vicuna-7B LLM        Frozen    7B
Output projections   Tunable   31M–32M per modality
Diffusion decoders   Frozen    1.3B / 1.8B / 975M
LoRA LLM layers      Tunable   ~33M
Total tunable        —         ~131M (≈1% of total)

During training, only the projection layers and LoRA-injected LLM weights are updated, yielding ≈10× efficiency savings compared to full-model finetuning.
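The parameter budget can be tallied back-of-the-envelope from the table above. Midpoints are assumed where the text gives ranges, and per-modality counts are multiplied by the three non-text modalities (image, video, audio), so the result lands near, not exactly on, the quoted ~131M / ≈1% figure.

```python
# Frozen components (parameter counts from the table)
frozen = {
    "imagebind_encoder": 1.2e9,
    "vicuna_7b": 7e9,
    "sd_v1_5": 1.3e9,
    "zeroscope_v2": 1.8e9,
    "audioldm_l": 975e6,
}

# Tunable components; per-modality counts multiplied by 3 (image, video, audio)
tunable = {
    "input_projections": 3 * 4e6,      # ~4M per modality
    "output_projections": 3 * 31.5e6,  # midpoint of 31M-32M per modality
    "lora_llm": 33e6,
}

n_tunable = sum(tunable.values())
n_total = n_tunable + sum(frozen.values())
print(f"tunable ≈ {n_tunable / 1e6:.0f}M ({100 * n_tunable / n_total:.1f}% of total)")
```

The point of the exercise is the ratio: roughly 10^8 trainable parameters against roughly 10^10 frozen ones, which is what makes single-node training feasible.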

2. Modality-Switching Instruction Tuning (MosIT)

To induce rich cross-modal conversational and generation ability, NExT-GPT introduces Modality-Switching Instruction Tuning (MosIT). Three categories of instruction-tuning (IT) data are considered:

  1. Text+X → Text: Standard multimodal QA.
  2. Text → Text+X: Text-to-multimodal generation.
  3. MosIT (novel): Multi-turn dialogues with modality changes on both input and output.

The MosIT dataset is manually curated:

  • Template-based multi-turn dialogues (3–7 turns each) are expanded to >100 topics using GPT-4.
  • Media is matched to non-text turns by retrieval or AIGC tools.
  • Human experts filter for quality, yielding 5,000 dialogues covering all 16 modality-switching cases.
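A MosIT record might be structured as below. The field names, topic, and file names are hypothetical, chosen only to show the modality-switching turn structure (non-text outputs referenced by the signal tokens the LLM emits):

```python
# Hypothetical MosIT-style dialogue record: modalities switch on both the user
# and assistant sides across turns, and generated media is requested via
# signal tokens such as <IMG_0> / <AUD_0>.
dialogue = {
    "topic": "travel planning",  # one of the >100 GPT-4-expanded topics
    "turns": [
        {"role": "user",      "text": "Show me a beach at sunset.",   "media": []},
        {"role": "assistant", "text": "Here is one: <IMG_0>",         "media": ["beach_sunset.png"]},
        {"role": "user",      "text": "Now add matching ocean sounds.", "media": []},
        {"role": "assistant", "text": "Sure: <AUD_0>",                "media": ["waves.wav"]},
    ],
}
```

Each such dialogue supplies supervision both for the LLM's text/signal-token stream and, via the referenced media, for the output-adapter alignment described next.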

The instruction-tuning objective is:

\mathcal{L}_{\mathrm{IT}} = -\mathbb{E}_{(X,Y)\sim\mathcal{D}_{\mathrm{IT}}} \sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, X) + \lambda\,\mathcal{L}_{\mathrm{dec\_align}}

where \mathcal{L}_{\mathrm{dec\_align}} (decoding alignment loss) minimizes the MSE between the output adapter's embedding u and the diffusion condition encoder's text embedding of the reference caption:

\mathcal{L}_{\mathrm{dec\_align}} = \mathbb{E}\,\| u - f_{\mathrm{cond}}(\mathrm{caption}) \|^2
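A minimal sketch of this alignment term follows. The 768-dim condition space is an assumption (the width of a CLIP-style text encoder), and f_cond below is a deterministic stand-in for the frozen diffusion condition encoder, not its real implementation.

```python
import numpy as np

D_COND = 768  # assumed width of the diffusion text-condition space (illustrative)

def f_cond(caption: str) -> np.ndarray:
    """Stand-in for the frozen condition encoder: deterministic per caption."""
    seed = abs(hash(caption)) % (2**32)
    return np.random.default_rng(seed).standard_normal(D_COND)

def dec_align_loss(u: np.ndarray, caption: str) -> float:
    """L_dec_align = ||u - f_cond(caption)||^2, averaged over dimensions (MSE)."""
    return float(np.mean((u - f_cond(caption)) ** 2))

rng = np.random.default_rng(0)
u = rng.standard_normal(D_COND)  # output-adapter embedding of a signal token
loss = dec_align_loss(u, "a dog running on the beach")
```

Because gradients flow only into u (the output adapter), this term teaches the adapter to speak the frozen decoder's conditioning language without touching the decoder itself.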

3. Training Paradigm and Efficiency

NExT-GPT is trained on a combination of:

  • LLaVA, MiniGPT-4 (≈300K), VideoChat (11K): existing multimodal IT datasets.
  • T2M: 14.7K auto-generated text→(text+X) samples (4.9K per modality).
  • MosIT: 5K human-curated dialogues.

All foundational encoders, the LLM backbone, and diffusion decoders remain frozen. Only 164M parameters (≈1.3% of 12.3B) are updated, including LoRA-injected LLM weights and input/output projections. This enables substantial savings in GPU memory and wall-clock time. Training typically uses 8 A100 GPUs, batch size 8–16, and a LoRA learning rate of 1×1041\times10^{-4} for ≈12 hours.

4. Quantitative Evaluation

NExT-GPT is evaluated across a range of benchmarks in text→X, X→text, text+X→X editing, and open-form any-to-any settings:

  • Text→Image (COCO-cap, FID↓): NExT-GPT: 11.28 (vs. Stable Diffusion 11.21, CoDi 11.26).
  • Text→Audio (AudioCaps, FD↓, IS↑): 23.58, 8.35 (Comparable to AudioLDM-L, CoDi).
  • Text→Video (MSR-VTT, FID↓, CLIPSIM↑): 13.04, 0.3085 (Best CLIPSIM vs. MakeVideo, CoDi).
  • Image Captioning (COCO): BLEU-4: 44.3, METEOR: 32.9, CIDEr: 156.7 (exceeding BLIP-2).
  • Audio Captioning (AudioCaps): SPIDEr: 0.521, CIDEr: 0.802 (best-in-class).
  • Video Captioning (MSR-VTT): BLEU-4: 58.4, METEOR: 38.5 (leading metrics).
  • Image Editing (COCO, CLIPAlignment↑, FID↓): 29.31, 6.52 / 27.29, 15.20.
  • Audio Editing (VCTK, MCD↓): 0.302.
  • Video Editing (DAVIS, CLIP-T & CLIP-I↑): 0.2683, 0.9645.
  • Human Evaluation (Any-to-Any QA, 1–10 scale):
    • Image: 8.0
    • Audio: 7.0
    • Video: 7.2
    • Mixed: 6.5

These results establish NExT-GPT as a tightly integrated multimodal LLM performing on par or better than prior models like BLIP-2, CoDi, mPLUG-2, DiffEdit, AudioLDM-L, Pix2Video, and others, across input-output permutation tasks (Wu et al., 2023).

5. Diffusion Decoding and Loss Functions

Each modality employs a latent diffusion approach, denoising a Gaussian-initialized latent into the output, conditioned on LLM-driven control signals:

\mathcal{L}_{\mathrm{diff}} = \mathbb{E}_{z_0 \sim p_{\mathrm{data}},\, \epsilon \sim \mathcal{N}(0, I),\, t \sim \mathcal{U}} \left\| \epsilon - \epsilon_\theta(z_t, t, c) \right\|^2

where z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon. During generation, u_j is injected into the UNet cross-attention as the conditioning vector c.
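The denoising objective can be sketched numerically. The eps_theta below is a trivial stand-in for the frozen UNet (it ignores its conditioning), so this only illustrates the loss plumbing, not a working denoiser.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_sample(z0, alpha_bar_t, eps):
    """Forward process: z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps."""
    return np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps

def diffusion_loss(z0, alpha_bar_t, c, eps_theta):
    """L_diff = ||eps - eps_theta(z_t, t, c)||^2, averaged over dimensions."""
    eps = rng.standard_normal(z0.shape)          # sampled Gaussian noise
    z_t = q_sample(z0, alpha_bar_t, eps)         # noised latent at step t
    return float(np.mean((eps - eps_theta(z_t, alpha_bar_t, c)) ** 2))

z0 = rng.standard_normal(64)   # data latent (toy size)
c = rng.standard_normal(768)   # conditioning vector, i.e. the LLM-side signal u_j
loss = diffusion_loss(z0, 0.5, c, eps_theta=lambda z, t, c: np.zeros_like(z))
```

In NExT-GPT the eps_theta weights stay frozen; only the path producing c is trained, per the alignment loss above.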

This approach decouples multimodal prompt understanding from generative fidelity, relying on robust, frozen decoders for high-quality image, audio, and video synthesis.

6. Limitations and Prospects for Extension

  • Modalities: Currently restricted to text, image, video, and audio. Extension to web pages, point clouds, tables, 3D, and heat maps is anticipated.
  • Generation Quality: Output fidelity is occasionally limited by frozen, off-the-shelf diffusion models. Integrating retrieval-augmented generation could mitigate this.
  • LLM Backbone: Presently based on Vicuna-7B. Employing larger or domain-specialized LLMs could advance reasoning abilities.
  • Instruction Tuning Scale: 5K MosIT dialogues form an initial base; larger-scale curation is likely to enhance cross-modal alignment and task following.

A plausible implication is that further expansion and generalization of the modality set, as well as alignment with state-of-the-art generative backbones and larger LLMs, could position any-to-any multimodal LLMs as central agents for universal modality modeling and complex cross-modal communication.

7. Significance within Multimodal AI Research

NExT-GPT introduces the first end-to-end architecture enabling both arbitrary multimodal input understanding and output synthesis, under a single, instruction-tuned LLM framework. By updating just ≈1% of total parameters, it demonstrates scalable efficiency while outperforming or matching task-specific models across diverse benchmarks. The modality-switching instruction-tuning and unification of perception and generation mark a substantive advance toward building AI agents with more human-like, generalizable multimodal abilities (Wu et al., 2023).

For further technical details, the original publication and project page provide additional experimental evidence and implementation specifics.
