NExT-GPT Multimodal LLM System
- NExT-GPT is an end-to-end multimodal LLM that fuses text, image, video, and audio inputs using frozen encoders and lightweight projection layers.
- It introduces Modality-Switching Instruction Tuning to enable complex, multi-turn dialogues and robust cross-modal reasoning for tasks like captioning and editing.
- The model achieves state-of-the-art results with only 1% tunable parameters, ensuring efficient training and rapid modality extension.
NExT-GPT is an end-to-end, general-purpose, any-to-any multimodal LLM system designed to perform both perception and generation across arbitrary combinations of text, images, video, and audio. Its architecture leverages frozen high-performance encoders and diffusion-based decoders, while confining all learning to lightweight projection and adaptation layers. This parameter-efficient approach enables rapid extension to new modalities, supports complex cross-modal reasoning, and delivers state-of-the-art results in both cross-modal understanding and generative tasks (Wu et al., 2023).
1. System Architecture and Components
NExT-GPT employs a unified, three-tier design comprising:
- Multimodal Encoding: Non-text inputs (image, video, audio) are processed by a shared, frozen ImageBind encoder, generating modality-specific feature vectors $F_m = \mathrm{Enc}(X_m) \in \mathbb{R}^{d_m}$. Each is linearly projected into the Vicuna-7B LLM embedding space via a lightweight adapter:

$$H_m = W_m F_m + b_m,$$

where $W_m \in \mathbb{R}^{d \times d_m}$ and $d_m$ is the output dimension of the encoder for modality $m$ (with $d$ the LLM hidden size). Text inputs are embedded natively by the LLM.
- LLM Reasoning: All projected multimodal features and tokenized text are fed to a frozen Vicuna-7B LLM, which jointly attends over these inputs. The LLM produces conventional text tokens and/or discrete “signal” tokens such as [IMG], [AUD], and [VID] to request the corresponding output modalities.
- Multimodal Decoding: Signal tokens are embedded by small transformer-based output adapters:

$$c_m = P_m\big(h^{(m)}_{\text{sig}}\big),$$

where $h^{(m)}_{\text{sig}}$ denotes the LLM hidden states at the modality-$m$ signal tokens. These produce conditioning vectors $c_m$ for the respective frozen diffusion decoders:
  - Stable Diffusion v1.5 for images
  - Zeroscope v2 for video
  - AudioLDM-L-full for audio

Decoding is steered by cross-attention, with the projected LLM signals $c_m$ serving as the conditioning input.
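This encode–reason–decode flow can be sketched with stand-in components. All dimensions, weight initializations, and the tanh placeholder for the LLM are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions are illustrative, not the system's actual sizes.
D_ENC = 1024   # hypothetical ImageBind feature dimension
D_LLM = 4096   # Vicuna-7B hidden size
D_COND = 768   # hypothetical diffusion text-encoder dimension

# Input projection: the only trainable piece on the encoding side.
W_in = rng.standard_normal((D_LLM, D_ENC)) * 0.01
b_in = np.zeros(D_LLM)

def project_to_llm(feat: np.ndarray) -> np.ndarray:
    """Linear adapter mapping a frozen encoder feature into LLM space."""
    return W_in @ feat + b_in

# Output projection: maps the LLM's signal-token hidden state to a
# conditioning vector for a frozen diffusion decoder (a single linear
# layer stands in here for the small transformer adapter).
W_out = rng.standard_normal((D_COND, D_LLM)) * 0.01

def signal_to_condition(h_sig: np.ndarray) -> np.ndarray:
    return W_out @ h_sig

image_feat = rng.standard_normal(D_ENC)   # frozen ImageBind output (stub)
h = project_to_llm(image_feat)            # token fed to the frozen LLM
h_sig = np.tanh(h)                        # stand-in for LLM processing
cond = signal_to_condition(h_sig)         # steers the diffusion decoder
print(h.shape, cond.shape)
```

The key design point survives even in this toy form: only the two projections carry trainable weights, while the encoder, LLM, and decoder endpoints stay frozen.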
Parameterization Overview:
| Component | Status | Params (M/B) |
|---|---|---|
| ImageBind Encoder | Frozen | 1.2B |
| Input Projections | Tunable | ~4M per modality |
| Vicuna-7B LLM | Frozen | 7B |
| Output Projections | Tunable | 31M–32M per modality |
| Diffusion Decoders | Frozen | 1.3B/1.8B/975M |
| LoRA LLM Layers | Tunable | ~33M |
| Total Tunable | Tunable | ~131M (≈1% of total) |
During training, only the projection layers and LoRA-injected LLM weights are updated, yielding ≈10× efficiency savings compared to full-model finetuning.
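Using the rounded figures from the table (an assumption; per-modality counts are multiplied by the three non-text modalities), the tunable fraction can be checked with a short calculation:

```python
# Back-of-the-envelope parameter accounting (counts in millions).
frozen = {
    "imagebind": 1_200,
    "vicuna_7b": 7_000,
    "decoders": 1_300 + 1_800 + 975,   # image / video / audio
}
tunable = {
    "input_projections": 4 * 3,    # ~4M per modality
    "output_projections": 31 * 3,  # ~31M per modality
    "lora_layers": 33,
}

total_tunable = sum(tunable.values())
total = total_tunable + sum(frozen.values())
print(f"tunable: {total_tunable}M "
      f"({100 * total_tunable / total:.1f}% of {total / 1000:.1f}B)")
# → tunable: 138M (1.1% of 12.4B)
```

The rounded totals land near the reported ~131M and ≈1% figures; small discrepancies come from rounding the per-modality counts.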
2. Modality-Switching Instruction Tuning (MosIT)
To induce rich cross-modal conversational and generation ability, NExT-GPT introduces Modality-Switching Instruction Tuning (MosIT). Three instruction-tuning (IT) data categories are considered:
- Text+X → Text: Standard multimodal QA.
- Text → Text+X: Text-to-multimodal generation.
- MosIT (novel): Multi-turn dialogues with modality changes on both input and output.
The MosIT dataset is manually curated:
- Template-based multi-turn dialogues (3–7 turns) are expanded to >100 topics using GPT-4.
- Media is matched to non-text turns by retrieval or AIGC tools.
- Human experts filter for quality, yielding 5,000 dialogues covering all 16 modality-switching cases.
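One plausible reading of the 16 modality-switching cases (an assumption: the input×output grid over the four modalities) can be enumerated directly:

```python
from itertools import product

# The four modalities on each side of a dialogue turn yield
# a 4 x 4 grid of input->output switching cases.
modalities = ["text", "image", "audio", "video"]
cases = list(product(modalities, modalities))
print(len(cases))  # → 16
print(cases[:3])
```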
The instruction-tuning objective combines the LLM's next-token prediction loss with a decoding-side alignment term:

$$\mathcal{L} = \mathcal{L}_{\text{LLM}} + \lambda\,\mathcal{L}_{\text{align}},$$

where $\mathcal{L}_{\text{align}}$ (decoding alignment loss) minimizes the MSE between the output adapter's signal-token embedding and the diffusion condition encoder's text embedding of the reference caption $y$:

$$\mathcal{L}_{\text{align}} = \frac{1}{d_c}\,\big\| P_m\big(h^{(m)}_{\text{sig}}\big) - \tau_m(y) \big\|_2^2,$$

with $\tau_m$ the frozen text encoder of the modality-$m$ diffusion model and $d_c$ its embedding dimension.
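The alignment term reduces to a plain MSE between two embeddings; the 768-dimensional vectors below are illustrative stand-ins for the adapter output and the frozen text encoder's caption embedding:

```python
import numpy as np

def alignment_loss(adapter_emb: np.ndarray, caption_emb: np.ndarray) -> float:
    """MSE between the output adapter's signal-token embedding and the
    frozen diffusion text encoder's embedding of the reference caption."""
    return float(np.mean((adapter_emb - caption_emb) ** 2))

rng = np.random.default_rng(1)
adapter_emb = rng.standard_normal(768)   # stand-in for P_m(h_sig)
caption_emb = rng.standard_normal(768)   # stand-in for tau_m(caption)
print(alignment_loss(adapter_emb, caption_emb))
```

Minimizing this term pulls the adapter's output toward the conditioning space the frozen decoder already understands, which is what lets the decoders stay untouched during training.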
3. Training Paradigm and Efficiency
NExT-GPT is trained on a combination of:
- LLaVA, MiniGPT-4 (≈300K), VideoChat (11K): existing multimodal IT datasets.
- T2M: 14.7K auto-generated text→(text+X) samples (4.9K per modality).
- MosIT: 5K human-curated dialogues.
All foundational encoders, the LLM backbone, and diffusion decoders remain frozen. Only 164M parameters (≈1.3% of 12.3B) are updated, comprising the LoRA-injected LLM weights and the input/output projections. This yields substantial savings in GPU memory and wall-clock time: training typically uses 8 A100 GPUs with batch sizes of 8–16, applying LoRA at a small learning rate, and completes in ≈12 hours.
4. Quantitative Evaluation
NExT-GPT is evaluated across a range of benchmarks in text→X, X→text, text+X→X editing, and open-form any-to-any settings:
- Text→Image (COCO-cap, FID↓): NExT-GPT: 11.28 (vs. Stable Diffusion 11.21, CoDi 11.26).
- Text→Audio (AudioCaps, FD↓, IS↑): 23.58, 8.35 (Comparable to AudioLDM-L, CoDi).
- Text→Video (MSR-VTT, FID↓, CLIPSIM↑): 13.04, 0.3085 (best CLIPSIM vs. Make-A-Video, CoDi).
- Image Captioning (COCO): BLEU-4: 44.3, METEOR: 32.9, CIDEr: 156.7 (exceeding BLIP-2).
- Audio Captioning (AudioCaps): SPIDEr: 0.521, CIDEr: 0.802 (best-in-class).
- Video Captioning (MSR-VTT): BLEU-4: 58.4, METEOR: 38.5 (leading metrics).
- Image Editing (COCO, CLIPAlignment↑, FID↓): 29.31, 6.52 / 27.29, 15.20.
- Audio Editing (VCTK, MCD↓): 0.302.
- Video Editing (DAVIS, CLIP-T & CLIP-I↑): 0.2683, 0.9645.
- Human Evaluation (Any-to-Any QA, 1–10 scale):
- Image: 8.0
- Audio: 7.0
- Video: 7.2
- Mixed: 6.5
These results establish NExT-GPT as a tightly integrated multimodal LLM performing on par or better than prior models like BLIP-2, CoDi, mPLUG-2, DiffEdit, AudioLDM-L, Pix2Video, and others, across input-output permutation tasks (Wu et al., 2023).
5. Diffusion Decoding and Loss Functions
Each modality employs a latent diffusion approach, denoising a noisy latent $z_t$ to generate output, conditioned on the LLM-driven control signals:

$$\mathcal{L}_{\text{DM}} = \mathbb{E}_{z,\,\epsilon,\,t}\Big[\big\| \epsilon - \epsilon_\theta(z_t, t, c_m) \big\|_2^2\Big],$$

where $\epsilon \sim \mathcal{N}(0, I)$. During generation, $c_m$ is injected into the UNet cross-attention layers as the conditioning vector.
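A single training term of this denoising objective can be sketched as follows. The linear "UNet", latent size, and noise-schedule value are illustrative assumptions (the real decoders are the frozen Stable Diffusion / Zeroscope / AudioLDM UNets, and the timestep input is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(2)

def diffusion_loss(z0, cond, eps_theta, alpha_bar_t):
    """One denoising training term: noise z0 into z_t, predict the noise
    with eps_theta(z_t, cond), and score the squared error."""
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha_bar_t) * z0 + np.sqrt(1 - alpha_bar_t) * eps
    return float(np.mean((eps - eps_theta(z_t, cond)) ** 2))

# Stand-in "UNet": a fixed linear map nudged by the conditioning signal;
# in NExT-GPT the conditioning enters through cross-attention instead.
W = rng.standard_normal((64, 64)) * 0.05
def eps_theta(z_t, cond):
    return W @ z_t + 0.01 * np.mean(cond)

z0 = rng.standard_normal(64)      # clean latent
cond = rng.standard_normal(768)   # c_m from the output adapter
loss = diffusion_loss(z0, cond, eps_theta, alpha_bar_t=0.5)
print(loss)
```

Because the decoder weights are frozen, this loss only shapes how the output adapters produce $c_m$, which is exactly the decoupling the next paragraph describes.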
This approach decouples multimodal prompt understanding from generative fidelity, relying on robust, frozen decoders for high-quality image, audio, and video synthesis.
6. Limitations and Prospects for Extension
- Modalities: Currently restricted to text, image, video, and audio. Extension to web pages, point clouds, tables, 3D, and heat maps is anticipated.
- Generation Quality: Output fidelity is occasionally limited by frozen, off-the-shelf diffusion models. Integrating retrieval-augmented generation could mitigate this.
- LLM Backbone: Presently based on Vicuna-7B. Employing larger or domain-specialized LLMs could advance reasoning abilities.
- Instruction Tuning Scale: 5K MosIT dialogues form an initial base; larger-scale curation is likely to enhance cross-modal alignment and task following.
A plausible implication is that further expansion and generalization of the modality set, as well as alignment with state-of-the-art generative backbones and larger LLMs, could position any-to-any multimodal LLMs as central agents for universal modality modeling and complex cross-modal communication.
7. Significance within Multimodal AI Research
NExT-GPT introduces the first end-to-end architecture enabling both arbitrary multimodal input understanding and output synthesis, under a single, instruction-tuned LLM framework. By updating just ≈1% of total parameters, it demonstrates scalable efficiency while outperforming or matching task-specific models across diverse benchmarks. The modality-switching instruction-tuning and unification of perception and generation mark a substantive advance toward building AI agents with more human-like, generalizable multimodal abilities (Wu et al., 2023).
For further technical details, the original publication and project page provide additional experimental evidence and implementation specifics.