JavisGPT: Unified Audio-Video MLLM
- JavisGPT is an end-to-end multimodal large language model that fuses audio and video streams for synchronized analysis and generation using a unified encoder–LLM–decoder architecture.
- Its SyncFusion module aligns audio and video embeddings at the patch level, using shared cross-attention to produce precise, temporally coherent audio-visual representations.
- A progressive three-stage training pipeline with multimodal instruction tuning and a dedicated dialogue dataset enhances its joint comprehension and generative capabilities.
JavisGPT is an end-to-end multimodal LLM (MLLM) specifically designed for joint comprehension and generation of synchronized sounding videos, where temporally aligned video and audio streams are processed together. Distinguished as the first unified MLLM for the joint audio-video (JAV) domain, JavisGPT pioneers a concise encoder–LLM–decoder architecture that enables both fine-grained multimodal understanding and temporally coherent generation from open multimodal instructions (Liu et al., 28 Dec 2025).
1. Model Architecture
JavisGPT adopts a modular encoder–LLM–decoder structure. On the encoder side, it leverages a frozen Qwen2.5-VL vision Transformer to extract visual features and a frozen BEATs audio encoder for audio representation. These streams are unified by the SyncFusion module, which fuses video and audio tokens at the patch level, resulting in “SyncAV” embeddings that capture temporally aligned audio-visual cues. User text tokens and a dedicated set of learnable query tokens (JavisQueries) are concatenated with these SyncAV embeddings and fed into a frozen Qwen2.5-LLM backbone with LoRA adapters for parameter-efficient tuning.
For generation, when prompted by a special AV-start token, the LLM emits two sets of conditioning embeddings through distinct multilayer perceptrons (MLPs): semantic conditions (JavisCond) and spatio-temporal (ST) priors. These condition a frozen, pretrained Joint Audio-Video DiT (JAV-DiT) diffusion generator to synthesize synchronized video and audio. This hierarchical, query-based interface robustly bridges the LLM's latent space and the diffusion model's conditioning inputs.
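A minimal PyTorch-style sketch of this query-based interface is shown below. The module names (JavisQueries, JavisCond, the ST-prior head) follow the description above, but all dimensions, the placeholder LLM, and the forward-pass signature are illustrative assumptions rather than the released implementation; SyncAV and text embeddings are assumed to be precomputed.

```python
# Illustrative sketch of the encoder-LLM-decoder interface (assumed shapes and module
# names, not the official code). Frozen encoders and SyncFusion are abstracted away:
# the model consumes precomputed SyncAV and text embeddings and returns the two sets
# of conditioning embeddings that would be passed to JAV-DiT.
import torch
import torch.nn as nn


class JavisInterfaceSketch(nn.Module):
    def __init__(self, d_llm=1024, d_cond=768, n_sem=32, n_st=64):
        super().__init__()
        # Synchrony-aware learnable query banks (JavisQueries).
        self.sem_queries = nn.Parameter(torch.randn(n_sem, d_llm) * 0.02)
        self.st_queries = nn.Parameter(torch.randn(n_st, d_llm) * 0.02)
        # Stand-in for the frozen Qwen2.5 LLM with LoRA adapters.
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_llm, nhead=8, batch_first=True), num_layers=2
        )
        # Two distinct MLP heads: semantic conditions (JavisCond) and ST priors.
        self.sem_mlp = nn.Sequential(nn.Linear(d_llm, d_llm), nn.GELU(), nn.Linear(d_llm, d_cond))
        self.st_mlp = nn.Sequential(nn.Linear(d_llm, d_llm), nn.GELU(), nn.Linear(d_llm, d_cond))

    def forward(self, sync_av, text_embeds):
        # sync_av: (B, Lav, d_llm) fused SyncAV tokens; text_embeds: (B, Lt, d_llm).
        b = sync_av.size(0)
        queries = torch.cat([self.sem_queries, self.st_queries], dim=0)
        queries = queries.unsqueeze(0).expand(b, -1, -1)           # (B, Nq, d_llm)
        hidden = self.llm(torch.cat([sync_av, text_embeds, queries], dim=1))
        q_hidden = hidden[:, -queries.size(1):]                    # hidden states at query positions
        sem_cond = self.sem_mlp(q_hidden[:, : self.sem_queries.size(0)])
        st_prior = self.st_mlp(q_hidden[:, self.sem_queries.size(0):])
        return sem_cond, st_prior                                  # conditions for JAV-DiT
```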
SyncFusion Module
SyncFusion is responsible for explicit spatio-temporal alignment. Given per-frame video patch features $V_t \in \mathbb{R}^{P \times d}$ (for frames $t = 1, \dots, T$) and audio features $A$, SyncFusion segments the audio features into frame-aligned chunks $A_t$ and applies a shared cross-attention block to inject temporally matched audio cues into each visual patch:

$$\hat{V}_t = \mathrm{CrossAttn}\big(Q = V_t,\; K = A_t,\; V = A_t\big), \qquad t = 1, \dots, T.$$
A subsequent MLP refines the fused patch embeddings. This results in patch-level SyncAV tokens with embedded, local audio patterns, enabling precise audio–video synchrony during comprehension.
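A hedged sketch of this patch-level fusion is given below, assuming PyTorch; the feature shapes, the even audio segmentation, and the single shared cross-attention block are assumptions based on the prose, not the official code.

```python
# Illustrative SyncFusion sketch (assumed shapes, not the released implementation):
# audio features are segmented per video frame, a shared cross-attention block injects
# each frame's audio cues into that frame's visual patches, and an MLP refines the fusion.
import torch
import torch.nn as nn


class SyncFusion(nn.Module):
    def __init__(self, d_video=1024, d_audio=768, d_out=1024, n_heads=8):
        super().__init__()
        self.audio_proj = nn.Linear(d_audio, d_video)
        self.cross_attn = nn.MultiheadAttention(d_video, n_heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d_video, 4 * d_video), nn.GELU(),
                                 nn.Linear(4 * d_video, d_out))

    def forward(self, video_feats, audio_feats):
        # video_feats: (B, T, P, Dv) patch features for T frames; audio_feats: (B, La, Da),
        # with La >= T so that the audio can be split into T frame-aligned chunks.
        b, t, p, dv = video_feats.shape
        audio = self.audio_proj(audio_feats)
        chunks = audio.chunk(t, dim=1)                 # simple even split per frame
        fused = []
        for i in range(t):
            q = video_feats[:, i]                      # (B, P, Dv) visual patches of frame i
            k = v = chunks[i]                          # matching audio segment as keys/values
            attn_out, _ = self.cross_attn(q, k, v)     # audio cues injected into each patch
            fused.append(attn_out)
        fused = torch.stack(fused, dim=1)              # (B, T, P, Dv)
        return self.mlp(fused).flatten(1, 2)           # patch-level SyncAV tokens: (B, T*P, d_out)
```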
Synchrony-Aware Learnable Queries
Two hierarchically organized query banks are central to generation:
- Semantic queries extract high-level, global intentions, later projected to semantic condition embeddings via a two-layer MLP.
- ST-prior queries capture fine-grained, temporally resolved priors, also projected via a dedicated MLP.
Alignment losses enforce that these query-based embeddings $c_{\text{sem}}$ and $c_{\text{st}}$ match JAV-DiT's native condition representations $c^{*}_{\text{sem}}$ and $c^{*}_{\text{st}}$:

$$\mathcal{L}_{\text{align}} = \big\lVert c_{\text{sem}} - c^{*}_{\text{sem}} \big\rVert_2^2 + \big\lVert c_{\text{st}} - c^{*}_{\text{st}} \big\rVert_2^2.$$
The overall diffusion objective for generation is

$$\mathcal{L}_{\text{diff}} = \mathbb{E}_{z,\, t,\, \epsilon}\Big[\big\lVert \epsilon - \epsilon_\theta\big(z_t,\, t,\, c_{\text{sem}},\, c_{\text{st}}\big) \big\rVert_2^2\Big],$$

where $\epsilon$ is the ground-truth diffusion noise and $\epsilon_\theta$ is the denoiser output of JAV-DiT.
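The sketch below shows how the alignment and diffusion terms might be combined during generation training. The MSE forms, the weighting factor lambda_align, and the denoiser call signature are assumptions consistent with the formulas above, not the paper's exact code.

```python
# Illustrative training-loss sketch for the generation branch (assumed forms and signatures).
import torch
import torch.nn.functional as F


def generation_losses(sem_cond, st_prior,          # query-based embeddings from the LLM
                      sem_target, st_target,       # JAV-DiT's native condition representations
                      noisy_latents, timesteps, noise, denoiser,
                      lambda_align=1.0):
    # Alignment: pull the query-projected conditions toward JAV-DiT's native ones.
    loss_align = F.mse_loss(sem_cond, sem_target) + F.mse_loss(st_prior, st_target)
    # Diffusion: standard epsilon-prediction objective, conditioned on both embedding sets.
    eps_pred = denoiser(noisy_latents, timesteps, sem_cond, st_prior)
    loss_diff = F.mse_loss(eps_pred, noise)
    return loss_diff + lambda_align * loss_align
```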
2. Progressive Three-Stage Training Pipeline
The training regime systematically builds multimodal comprehension and generation ability through three progressive stages.
- Stage I: Multimodal Pretraining (MM-PT): Tasks are Audio→Text and Text→Audio–Video, using 600K audio–text pairs and 1.5M sounding-video captions. Only the audio projector, the learnable queries, and their projectors are updated; next-token prediction teaches the LLM to ingest audio, while alignment losses keep the semantic queries consistent with JAV-DiT.
- Stage II: Audio–Video Fine-Tuning (AV-FT): Extends to AV→Text and Text→AV tasks on 360K–450K audio–video–text triplets. The SyncFusion module, queries, their projectors, and LLM LoRA adapters are trained. Losses now include both alignment and a diffusion loss, refining spatio-temporal synchrony.
- Stage III: Multimodal Instruction Tuning (MM-Inst): Trains all adaptation layers on ~200K open-ended, multi-turn audio–video–text dialogues (JavisInst-Omni), with the full set of loss objectives (next-token prediction, alignment, and diffusion). This stage enables dialogic, context-aware, generative multimodal capabilities.
All stages train for a single epoch using LoRA (rank 128), with decaying learning rates, batch sizes stepping down from 256 to 64 across stages, and a 0.03 warm-up ratio. The estimated training cost is approximately 30 GPU-days on eight A100 GPUs.
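For a compact view of the schedule, the stage configuration can be summarized as a plain data structure; the field names are illustrative, the values restate the text above, and learning-rate endpoints are omitted because they are not given here.

```python
# Illustrative summary of the three-stage schedule (field names are assumptions; values follow the text).
STAGES = {
    "I_MM-PT": {
        "tasks": ["Audio->Text", "Text->Audio-Video"],
        "data": "600K audio-text pairs + 1.5M sounding-video captions",
        "trainable": ["audio_projector", "queries", "query_projectors"],
        "losses": ["next_token", "alignment"],
    },
    "II_AV-FT": {
        "tasks": ["AV->Text", "Text->AV"],
        "data": "360K-450K audio-video-text triplets",
        "trainable": ["sync_fusion", "queries", "query_projectors", "llm_lora"],
        "losses": ["next_token", "alignment", "diffusion"],  # next_token assumed carried over
    },
    "III_MM-Inst": {
        "tasks": ["open-ended multi-turn AV dialogue"],
        "data": "~200K JavisInst-Omni dialogues",
        "trainable": ["all_adaptation_layers"],
        "losses": ["next_token", "alignment", "diffusion"],
    },
}
```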
3. JavisInst-Omni Instruction Dataset
To support instruction tuning and richer multimodal reasoning, the JavisInst-Omni dataset was constructed via GPT-4o-based synthesis and filtering. It contains over 200K dialogues in two primary subsets:
- JavisInst-Und (110K dialogues): Focused on comprehension (entity-, relation-, and global-level reasoning). Samples span single-turn QA (multiple choice, open-ended) and multi-turn QA, including scenarios where a generated sounding video is followed by understanding questions.
- JavisInst-Gen (90K dialogues): Focused on generation. Task types include text-to-AV instructions, conditional generation/editing, and proactive, multi-turn scenarios. Construction draws on TAVGBench captions, GPT-4o prompt templates, and domain-specific tools (e.g., FoleyCrafter), with 10–20% of instructions paraphrased for diversity.
This dataset is the foundation for model instruction-following, open-ended reasoning, and dialogic control across both comprehension and generative use cases.
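For concreteness, a single JavisInst-Omni-style training sample might be serialized roughly as follows; the schema, field names, and file paths are hypothetical, chosen only to illustrate the interleaved comprehension/generation turns described above.

```python
# Hypothetical serialization of one multi-turn sample (schema invented for illustration).
sample = {
    "id": "javisinst_gen_000001",
    "subset": "JavisInst-Gen",
    "turns": [
        {"role": "user",
         "content": "Generate a 5-second clip of waves crashing with matching surf audio."},
        {"role": "assistant",
         "content": "<AV_START>",                       # special AV-start token triggers generation
         "av_output": {"video": "gen_0001.mp4", "audio": "gen_0001.wav"}},
        {"role": "user",
         "content": "What sound dominates the second half of the clip you just produced?"},
        {"role": "assistant",
         "content": "The receding hiss of the water after the wave breaks."},
    ],
}
```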
4. Experimental Results and Comparative Analysis
JavisGPT exhibits state-of-the-art performance on joint audio–video comprehension and generation benchmarks. Notable results:
Joint Audio–Video Comprehension
On zero-shot video (ActivityNet-QA, Perception, MVBench), audio (ClothoAQA, TUT2017), and AV (AVQA, MU-AVQA, AVSD) benchmarks:
| Benchmark (accuracy) | JavisGPT | Prev. SOTA (Qwen2.5-Omni) |
|---|---|---|
| AVQA | 93.8% | 91.5% |
| MU-AVQA | 82.1% | 79.9% |
| AVSD | 62.2% | 62.8% |
JavisGPT matches or surpasses strong unimodal MLLMs (Qwen2.5-VL, Qwen2.5-Audio) and outperforms prior AV-LLMs.
Text-to-Audio-Video Generation
On JavisBench-mini, JavisGPT demonstrates:
| Metric | JavisGPT | JAV-DiT |
|---|---|---|
| FVD↓ | 317.5 | 327.8 |
| TV-IB↑ | 0.145 | 0.141 |
| TA-IB↑ | 0.180 | 0.184 |
| AVHScore↑ | 0.185 | 0.181 |
| JavisScore↑ | 0.157 | 0.153 |
Metrics span AV-Quality (FVD, KVD, FAD), Text-Consistency (TV-IB, TA-IB, CLIP-Score, CLAP-Score), AV-Consistency, and AV-Synchrony (JavisScore).
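As an illustration of the text-consistency family (TV-IB, TA-IB), such scores are typically mean cosine similarities between prompt embeddings and generated-media embeddings from a shared embedding model (e.g., ImageBind); the sketch below assumes precomputed embeddings and is not the benchmark's reference implementation.

```python
# Generic text-media consistency score over precomputed embeddings (illustrative only).
import torch
import torch.nn.functional as F


def text_media_consistency(text_emb: torch.Tensor, media_emb: torch.Tensor) -> torch.Tensor:
    """Mean cosine similarity between (N, D) prompt embeddings and (N, D) generated-media
    embeddings from a shared text/video/audio embedding model; higher is better."""
    return F.cosine_similarity(text_emb, media_emb, dim=-1).mean()
```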
Human Evaluation
In 100 multi-turn, interleaved comprehension/generation dialogues, three annotators rate five dimensions (instruction following, QA accuracy, generation quality, context reasoning, proactive thinking) on a 0–5 scale. JavisGPT consistently outperforms NExT-GPT and UnifiedIO-2, especially in proactive, context-aware generation.
Ablations
- SyncFusion vs. alternatives: SyncFusion achieves stronger AV comprehension (AVQA 93.4, MU-AVQA 81.4, AVSD 62.0) than concatenation, interleaving, or P-Former fusion, which score lower while requiring up to 3.5K tokens per frame and roughly 2× inference latency.
- Training pipeline: Omission of AV-FT or MM-PT degrades generation quality, text consistency, and synchrony; full three-stage setup is necessary.
- ST-prior query ablation: Slight drop in JavisScore (0.157→0.150) with negligible effect on FVD/FAD, supporting its targeted role in synchrony.
- Backbone scaling: Upgrading the 7B backbone to Qwen2.5-VL yields notable improvements, especially for AV generation.
5. Limitations and Future Directions
Architectural Gaps
- Objective Misalignment: The use of next-token prediction loss for comprehension versus diffusion loss for generation results in gradient inconsistency in the shared LLM space.
- Asymmetric I/O Modeling: Inputs rely on continuous AV embeddings while outputs depend on query-based condition interfaces, limiting bi-directional synergy between comprehension and generation.
Future directions proposed include fully unified autoregressive-diffusion models (EMU3-style) that handle comprehension and generation within a shared token space, and discrete/continuous multimodal AR paradigms enabling mutual enhancement.
Scaling and Modalities
Current studies employ a 7B LLM and existing AV datasets. Scaling to 70B+ LLMs and training on trillions of multimodal tokens is anticipated to further enhance capabilities. Integrating speech recognition/synthesis components (e.g., Whisper, WavTokenizer) is identified as a path toward handling spoken dialogue and richer modalities.
Human Preference Alignment
While instruction tuning provides base-level user alignment, reinforcement learning from human feedback (RLHF) is suggested for boosting reasoning quality, safety, and fidelity—especially in synchrony and style control.
6. Significance and Outlook
JavisGPT operationalizes a unified multimodal paradigm, combining a novel spatio-temporal fusion encoder (SyncFusion) with hierarchical, synchrony-aware query decoders to achieve both comprehensive joint audio-video understanding and temporally synchronized generation. The three-stage adaptive pipeline and the scale and diversity of JavisInst-Omni are foundational to its performance. The architecture and methodology of JavisGPT establish a blueprint for future large-scale, truly unified multimodal foundation models capable of bidirectional, context-aware, and temporally coherent reasoning and generation across audio-visual domains (Liu et al., 28 Dec 2025).