NExT-GPT Multimodal LLM System
- NExT-GPT is an end-to-end multimodal LLM that fuses text, image, video, and audio inputs using frozen encoders and lightweight projection layers.
- It introduces Modality-Switching Instruction Tuning to enable complex, multi-turn dialogues and robust cross-modal reasoning for tasks like captioning and editing.
- The model achieves state-of-the-art results with only 1% tunable parameters, ensuring efficient training and rapid modality extension.
NExT-GPT is an end-to-end, general-purpose, any-to-any multimodal LLM system designed to perform both perception and generation across arbitrary combinations of text, images, video, and audio. Its architecture leverages frozen high-performance encoders and diffusion-based decoders, while confining all learning to lightweight projection and adaptation layers. This parameter-efficient approach enables rapid extension to new modalities, supports complex cross-modal reasoning, and delivers state-of-the-art results in both cross-modal understanding and generative tasks (Wu et al., 2023).
1. System Architecture and Components
NExT-GPT employs a unified, three-tier design comprising:
- Multimodal Encoding: Non-text inputs (image, video, audio) are processed by a shared, frozen ImageBind encoder, generating modality-specific feature vectors $F_m = \mathrm{Enc}(X_m) \in \mathbb{R}^{d_m}$. Each is linearly projected into the Vicuna-7B LLM embedding space via a lightweight adapter:

$$H_m = W_m F_m + b_m,$$

where $W_m \in \mathbb{R}^{d \times d_m}$ and $d_m$ is the output dimension of the encoder for modality $m$ (with $d$ the LLM hidden size). Text inputs are embedded natively by the LLM.
- LLM Reasoning: All projected multimodal features and tokenized text are fed to a frozen Vicuna-7B LLM, which jointly attends over these inputs. The LLM produces conventional text tokens and/or discrete “signal” tokens such as [IMG], [AUD], and [VID] to request the corresponding output modalities.
- Multimodal Decoding: Signal tokens are embedded by small transformer-based output adapters:

$$c_m = P_m\big(h^{(m)}_{\text{sig}}\big),$$

where $h^{(m)}_{\text{sig}}$ denotes the LLM hidden states at the modality-$m$ signal tokens. These produce conditioning vectors $c_m$ for the respective frozen diffusion decoders:
  - Stable Diffusion v1.5 for images
  - Zeroscope v2 for video
  - AudioLDM-L-full for audio

Decoding is steered by cross-attention, with the projected LLM signals $c_m$ serving as the conditioning input.
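This encode–reason–decode flow can be sketched with stand-in components. All dimensions, weight initializations, and the tanh placeholder for the LLM are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions are illustrative, not the system's actual sizes.
D_ENC = 1024   # hypothetical ImageBind feature dimension
D_LLM = 4096   # Vicuna-7B hidden size
D_COND = 768   # hypothetical diffusion text-encoder dimension

# Input projection: the only trainable piece on the encoding side.
W_in = rng.standard_normal((D_LLM, D_ENC)) * 0.01
b_in = np.zeros(D_LLM)

def project_to_llm(feat: np.ndarray) -> np.ndarray:
    """Linear adapter mapping a frozen encoder feature into LLM space."""
    return W_in @ feat + b_in

# Output projection: maps the LLM's signal-token hidden state to a
# conditioning vector for a frozen diffusion decoder (a single linear
# layer stands in here for the small transformer adapter).
W_out = rng.standard_normal((D_COND, D_LLM)) * 0.01

def signal_to_condition(h_sig: np.ndarray) -> np.ndarray:
    return W_out @ h_sig

image_feat = rng.standard_normal(D_ENC)   # frozen ImageBind output (stub)
h = project_to_llm(image_feat)            # token fed to the frozen LLM
h_sig = np.tanh(h)                        # stand-in for LLM processing
cond = signal_to_condition(h_sig)         # steers the diffusion decoder
print(h.shape, cond.shape)
```

The key design point survives even in this toy form: only the two projections carry trainable weights, while the encoder, LLM, and decoder endpoints stay frozen.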
Parameterization Overview:
| Component | Status | Params (M/B) |
|---|---|---|
| ImageBind Encoder | Frozen | 1.2B |
| Input Projections | Tunable | ~4M per modality |
| Vicuna-7B LLM | Frozen | 7B |
| Output Projections | Tunable | 31M–32M per modality |
| Diffusion Decoders | Frozen | 1.3B/1.8B/975M |
| LoRA LLM Layers | Tunable | ~33M |
| Total Tunable | Tunable | ~131M (≈1% of total) |
During training, only the projection layers and LoRA-injected LLM weights are updated, yielding ≈10× efficiency savings compared to full-model finetuning.
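Using the rounded figures from the table (an assumption; per-modality counts are multiplied by the three non-text modalities), the tunable fraction can be checked with a short calculation:

```python
# Back-of-the-envelope parameter accounting (counts in millions).
frozen = {
    "imagebind": 1_200,
    "vicuna_7b": 7_000,
    "decoders": 1_300 + 1_800 + 975,   # image / video / audio
}
tunable = {
    "input_projections": 4 * 3,    # ~4M per modality
    "output_projections": 31 * 3,  # ~31M per modality
    "lora_layers": 33,
}

total_tunable = sum(tunable.values())
total = total_tunable + sum(frozen.values())
print(f"tunable: {total_tunable}M "
      f"({100 * total_tunable / total:.1f}% of {total / 1000:.1f}B)")
# → tunable: 138M (1.1% of 12.4B)
```

The rounded totals land near the reported ~131M and ≈1% figures; small discrepancies come from rounding the per-modality counts.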
2. Modality-Switching Instruction Tuning (MosIT)
To induce rich cross-modal conversational and generation ability, NExT-GPT introduces Modality-Switching Instruction Tuning (MosIT). Three instruction-tuning (IT) data categories are considered:
- Text+X → Text: Standard multimodal QA.
- Text → Text+X: Text-to-multimodal generation.
- MosIT (novel): Multi-turn dialogues with modality changes on both input and output.
The MosIT dataset is manually curated:
- Template-based multi-turn dialogues (3–7 turns) are expanded to >100 topics using GPT-4.
- Media is matched to non-text turns by retrieval or AIGC tools.
- Human experts filter for quality, yielding 5,000 dialogues covering all 16 modality-switching cases.
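One plausible reading of the 16 modality-switching cases (an assumption: the input×output grid over the four modalities) can be enumerated directly:

```python
from itertools import product

# The four modalities on each side of a dialogue turn yield
# a 4 x 4 grid of input->output switching cases.
modalities = ["text", "image", "audio", "video"]
cases = list(product(modalities, modalities))
print(len(cases))  # → 16
print(cases[:3])
```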
The instruction-tuning objective combines the LLM's next-token prediction loss with a decoding-side alignment term:

$$\mathcal{L} = \mathcal{L}_{\text{LLM}} + \lambda\,\mathcal{L}_{\text{align}},$$

where $\mathcal{L}_{\text{align}}$ (decoding alignment loss) minimizes the MSE between the output adapter's signal-token embedding and the diffusion condition encoder's text embedding of the reference caption $y$:

$$\mathcal{L}_{\text{align}} = \frac{1}{d_c}\,\big\| P_m\big(h^{(m)}_{\text{sig}}\big) - \tau_m(y) \big\|_2^2,$$

with $\tau_m$ the frozen text encoder of the modality-$m$ diffusion model and $d_c$ its embedding dimension.
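The alignment term reduces to a plain MSE between two embeddings; the 768-dimensional vectors below are illustrative stand-ins for the adapter output and the frozen text encoder's caption embedding:

```python
import numpy as np

def alignment_loss(adapter_emb: np.ndarray, caption_emb: np.ndarray) -> float:
    """MSE between the output adapter's signal-token embedding and the
    frozen diffusion text encoder's embedding of the reference caption."""
    return float(np.mean((adapter_emb - caption_emb) ** 2))

rng = np.random.default_rng(1)
adapter_emb = rng.standard_normal(768)   # stand-in for P_m(h_sig)
caption_emb = rng.standard_normal(768)   # stand-in for tau_m(caption)
print(alignment_loss(adapter_emb, caption_emb))
```

Minimizing this term pulls the adapter's output toward the conditioning space the frozen decoder already understands, which is what lets the decoders stay untouched during training.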
3. Training Paradigm and Efficiency
NExT-GPT is trained on a combination of:
- LLaVA, MiniGPT-4 (≈300K), VideoChat (11K): existing multimodal IT datasets.
- T2M: 14.7K auto-generated text→(text+X) samples (4.9K per modality).
- MosIT: 5K human-curated dialogues.
All foundational encoders, the LLM backbone, and diffusion decoders remain frozen. Only 164M parameters (≈1.3% of 12.3B) are updated, comprising the LoRA-injected LLM weights and the input/output projections. This yields substantial savings in GPU memory and wall-clock time: training typically uses 8 A100 GPUs with batch sizes of 8–16, applying LoRA at a small learning rate, and completes in ≈12 hours.
4. Quantitative Evaluation
NExT-GPT is evaluated across a range of benchmarks in text→X, X→text, text+X→X editing, and open-form any-to-any settings:
- Text→Image (COCO-cap, FID↓): NExT-GPT: 11.28 (vs. Stable Diffusion 11.21, CoDi 11.26).
- Text→Audio (AudioCaps, FD↓, IS↑): 23.58, 8.35 (Comparable to AudioLDM-L, CoDi).
- Text→Video (MSR-VTT, FID↓, CLIPSIM↑): 13.04, 0.3085 (best CLIPSIM vs. Make-A-Video, CoDi).
- Image Captioning (COCO): BLEU-4: 44.3, METEOR: 32.9, CIDEr: 156.7 (exceeding BLIP-2).
- Audio Captioning (AudioCaps): SPIDEr: 0.521, CIDEr: 0.802 (best-in-class).
- Video Captioning (MSR-VTT): BLEU-4: 58.4, METEOR: 38.5 (leading metrics).
- Image Editing (COCO, CLIPAlignment↑, FID↓): 29.31, 6.52 / 27.29, 15.20.
- Audio Editing (VCTK, MCD↓): 0.302.
- Video Editing (DAVIS, CLIP-T & CLIP-I↑): 0.2683, 0.9645.
- Human Evaluation (Any-to-Any QA, 1–10 scale):
- Image: 8.0
- Audio: 7.0
- Video: 7.2
- Mixed: 6.5
These results establish NExT-GPT as a tightly integrated multimodal LLM performing on par or better than prior models like BLIP-2, CoDi, mPLUG-2, DiffEdit, AudioLDM-L, Pix2Video, and others, across input-output permutation tasks (Wu et al., 2023).
5. Diffusion Decoding and Loss Functions
Each modality employs a latent diffusion approach, denoising a noisy latent $z_t$ to generate output, conditioned on the LLM-driven control signals:

$$\mathcal{L}_{\text{DM}} = \mathbb{E}_{z,\,\epsilon,\,t}\Big[\big\| \epsilon - \epsilon_\theta(z_t, t, c_m) \big\|_2^2\Big],$$

where $\epsilon \sim \mathcal{N}(0, I)$. During generation, $c_m$ is injected into the UNet cross-attention layers as the conditioning vector.
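A single training term of this denoising objective can be sketched as follows. The linear "UNet", latent size, and noise-schedule value are illustrative assumptions (the real decoders are the frozen Stable Diffusion / Zeroscope / AudioLDM UNets, and the timestep input is omitted for brevity):

```python
import numpy as np

rng = np.random.default_rng(2)

def diffusion_loss(z0, cond, eps_theta, alpha_bar_t):
    """One denoising training term: noise z0 into z_t, predict the noise
    with eps_theta(z_t, cond), and score the squared error."""
    eps = rng.standard_normal(z0.shape)
    z_t = np.sqrt(alpha_bar_t) * z0 + np.sqrt(1 - alpha_bar_t) * eps
    return float(np.mean((eps - eps_theta(z_t, cond)) ** 2))

# Stand-in "UNet": a fixed linear map nudged by the conditioning signal;
# in NExT-GPT the conditioning enters through cross-attention instead.
W = rng.standard_normal((64, 64)) * 0.05
def eps_theta(z_t, cond):
    return W @ z_t + 0.01 * np.mean(cond)

z0 = rng.standard_normal(64)      # clean latent
cond = rng.standard_normal(768)   # c_m from the output adapter
loss = diffusion_loss(z0, cond, eps_theta, alpha_bar_t=0.5)
print(loss)
```

Because the decoder weights are frozen, this loss only shapes how the output adapters produce $c_m$, which is exactly the decoupling the next paragraph describes.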
This approach decouples multimodal prompt understanding from generative fidelity, relying on robust, frozen decoders for high-quality image, audio, and video synthesis.
6. Limitations and Prospects for Extension
- Modalities: Currently restricted to text, image, video, and audio. Extension to web pages, point clouds, tables, 3D, and heat maps is anticipated.
- Generation Quality: Output fidelity is occasionally limited by frozen, off-the-shelf diffusion models. Integrating retrieval-augmented generation could mitigate this.
- LLM Backbone: Presently based on Vicuna-7B. Employing larger or domain-specialized LLMs could advance reasoning abilities.
- Instruction Tuning Scale: 5K MosIT dialogues form an initial base; larger-scale curation is likely to enhance cross-modal alignment and task following.
A plausible implication is that further expansion and generalization of the modality set, as well as alignment with state-of-the-art generative backbones and larger LLMs, could position any-to-any multimodal LLMs as central agents for universal modality modeling and complex cross-modal communication.
7. Significance within Multimodal AI Research
NExT-GPT introduces the first end-to-end architecture enabling both arbitrary multimodal input understanding and output synthesis, under a single, instruction-tuned LLM framework. By updating just ≈1% of total parameters, it demonstrates scalable efficiency while outperforming or matching task-specific models across diverse benchmarks. The modality-switching instruction-tuning and unification of perception and generation mark a substantive advance toward building AI agents with more human-like, generalizable multimodal abilities (Wu et al., 2023).
For further technical details, the original publication and project page provide additional experimental evidence and implementation specifics.