JavisDiT++: Unified Audio-Video Generation
- JavisDiT++ is a unified joint audio-video generation framework that improves audiovisual realism, synchrony, and preference alignment.
- It leverages modality-specific mixture-of-experts and temporal-aligned rotary encoding to maintain distinct quality while ensuring cross-modal consistency.
- AV-DPO optimizes outputs using human-preference data, enhancing generation quality, synchronization, and audiovisual harmony in practical scenarios.
JavisDiT++ is a unified joint audio-video generation (JAVG) model designed to generate a sounding video from a text prompt, with the key goal of improving generation quality of both audio and video, temporal synchrony between what is seen and what is heard, and alignment with human preferences, especially for realism, consistency, and audiovisual harmony. It is presented as a concise yet powerful framework for unified modeling and optimization of JAVG, built on Wan2.1-1.3B-T2V and trained with about 1M public data entries. The model introduces three targeted components: modality-specific mixture-of-experts (MS-MoE), temporal-aligned Rotary Position Encoding (TA-RoPE), and audio-video direct preference optimization (AV-DPO). It is positioned as an open-source step toward closing the gap with commercial systems such as Veo3, which current open-source JAVG models still lag behind in realism, synchronization, and subjective appeal (Liu et al., 22 Feb 2026).
1. Problem formulation and research context
JavisDiT++ addresses a core limitation of existing open-source JAVG methods: they can generate audio and video, but remain weaker than proprietary systems in visual quality and audio fidelity, cross-modal consistency, fine-grained synchrony, and human preference alignment (Liu et al., 22 Feb 2026). The task is not merely multimodal synthesis in parallel; it is the generation of synchronized and semantically aligned sound and vision from textual descriptions, where quality, temporal alignment, and subjective coherence must all be optimized simultaneously.
The paper frames prior open-source approaches as making difficult architecture trade-offs. Some methods use single shared latent spaces or unified token streams, which can cause modality information loss. Others use dual-stream or stitched expert architectures, which become large, complex, and inefficient. Still others rely on implicit synchronization modules such as ST-prior or cross-attention, which are described as indirect and expensive. The paper further argues that almost none explicitly incorporate preference learning for JAVG (Liu et al., 22 Feb 2026).
This positioning is best understood against the immediate predecessor, JavisDiT, a joint audio-video diffusion transformer for synchronized audio-video generation that emphasized fine-grained spatio-temporal alignment through a Hierarchical Spatial-Temporal Synchronized Prior Estimator and introduced JavisBench and JavisScore (Liu et al., 30 Mar 2025). JavisDiT++ preserves the ambition of synchronized end-to-end JAVG while reformulating the solution around a simpler unified design. This suggests a shift in emphasis from prior-guided synchronization toward integrated architectural and optimization mechanisms that jointly target quality, synchrony, and preference alignment.
2. Backbone, training scale, and optimization stages
JavisDiT++ is built on Wan2.1-1.3B-T2V (Liu et al., 22 Feb 2026). The backbone details given are a 30-layer DiT with hidden size 1536; the base model was originally for text-to-video generation; the video VAE is from Wan2.1, the audio VAE from AudioLDM2, and the text encoder is Wan2.1’s umT5-xxl.
The training scale is explicitly described as about 1M public data entries in total. The breakdown is 780K audio-text pairs for audio pretraining, 330K audio-video-text triplets for audio-video SFT, and 25K preference pairs for AV-DPO (Liu et al., 22 Feb 2026). The paper emphasizes this as a relatively efficient training scale compared with proprietary systems.
The optimization procedure is organized into three stages:
- Audio pretraining: train audio FFN, embedder, and head on 780K audio-text pairs for 50 epochs.
- Audio-video SFT: train LoRA modules on 330K triplets; the detailed schedule reports 2 epochs, while the main narrative mentions 1 epoch for the core SFT setting.
- Audio-video DPO: keep LoRA and train on 25K preference pairs (Liu et al., 22 Feb 2026).
A learning rate is specified separately for each stage. The appendix notes that the preferred value of the preference coefficient β differs between video and audio, and that one particular DPO learning rate gives the best convergence (Liu et al., 22 Feb 2026). A plausible implication is that the optimization problem is modality-asymmetric, requiring different preference sharpness settings for the video and audio branches rather than a single global preference coefficient.
3. MS-MoE: modality-specific mixture-of-experts
MS-MoE, or modality-specific mixture-of-experts, is the first of the three principal contributions in JavisDiT++ (Liu et al., 22 Feb 2026). It is not described as a classic routing MoE. Instead, it deterministically assigns audio tokens to an audio FFN and video tokens to a video FFN. The architecture works by flattening and concatenating audio and video tokens, applying shared multi-head self-attention so that audio and video can exchange information, splitting tokens by modality, and then sending them to modality-specific FFN branches.
The paper characterizes this as a compact alternative to dual-stream models. Its central design principle is a separation of functions: attention layers handle cross-modal interaction, while modality-specific FFNs handle intra-modal aggregation. This is presented as important because excessive mixing in the FFN stage can cause one modality to interfere with the other; MS-MoE preserves modality-specific feature modeling capacity after cross-modal information has already been shared (Liu et al., 22 Feb 2026).
The problem it is intended to solve is explicitly connected to earlier unified designs such as UniForm, which use a single shared FFN for all tokens and are said to struggle when extending a pretrained text-to-video model to audio-video generation. In contrast, MS-MoE is described as preserving strong video generation quality, improving audio quality, avoiding the inefficiency of dual-backbone systems, and increasing capacity without increasing per-token inference cost. The key efficiency statement is precise: total parameters rise from 1.3B to 2.1B, but the activated parameters per token remain 1.3B, so inference overhead stays low (Liu et al., 22 Feb 2026).
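The routing described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the attention is single-head with no learned projections, residual connections and normalization are omitted, and the names `shared_self_attention`, `FFN`, and `ms_moe_block` as well as the toy dimensions are hypothetical.

```python
import math
import random


def linear(x, w):
    """Multiply vector x (length d_in) by weight matrix w (d_in rows, d_out columns)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def make_weights(d_in, d_out, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(d_out)] for _ in range(d_in)]


def shared_self_attention(tokens):
    """Attention over ALL tokens, so audio and video can exchange information.
    Queries, keys, and values are the tokens themselves to keep the sketch short."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out


class FFN:
    """Two-layer feed-forward network with ReLU; one instance per modality."""

    def __init__(self, d, d_hidden, rng):
        self.w1 = make_weights(d, d_hidden, rng)
        self.w2 = make_weights(d_hidden, d, rng)

    def __call__(self, x):
        hidden = [max(0.0, v) for v in linear(x, self.w1)]
        return linear(hidden, self.w2)

    def num_params(self):
        return len(self.w1) * len(self.w1[0]) + len(self.w2) * len(self.w2[0])


def ms_moe_block(video_tokens, audio_tokens, video_ffn, audio_ffn):
    """Concatenate, attend jointly, split by modality, apply the matching FFN."""
    mixed = shared_self_attention(video_tokens + audio_tokens)
    n_video = len(video_tokens)
    video_out = [video_ffn(t) for t in mixed[:n_video]]   # deterministic routing:
    audio_out = [audio_ffn(t) for t in mixed[n_video:]]   # no learned router
    return video_out, audio_out


rng = random.Random(0)
d_model, d_hidden = 4, 8
video_ffn, audio_ffn = FFN(d_model, d_hidden, rng), FFN(d_model, d_hidden, rng)
video = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(3)]
audio = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(2)]
video_out, audio_out = ms_moe_block(video, audio, video_ffn, audio_ffn)

# Each token passes through exactly one FFN, so the FFN parameters activated
# per token stay at one FFN's worth even though two FFNs are stored.
total_ffn_params = video_ffn.num_params() + audio_ffn.num_params()
activated_ffn_params = video_ffn.num_params()
```

The last two lines mirror the efficiency argument in the text: storing a second FFN raises the total parameter count, while the count touched by any single token is unchanged.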
Ablation evidence on JavisBench-mini compares Shared-DiT + LoRA, Shared-DiT + Full-FT, and MS-MoE. The reported conclusion is that Shared-DiT + LoRA has insufficient capacity and poor audio quality and consistency; Shared-DiT + Full-FT harms video quality by over-updating shared parameters; and MS-MoE yields the best balance, improving both single-modal quality and audiovisual synchrony (Liu et al., 22 Feb 2026).
These results are presented as evidence for a specific trade-off resolution: preserving video quality while adding high-quality audio and better synchrony. Relative to JavisDiT’s earlier design, which used Spatio-Temporal Self-Attention, Spatio-Temporal Cross-Attention, and Multi-Modality Bidirectional Cross-Attention in a two-branch DiT backbone (Liu et al., 30 Mar 2025), JavisDiT++ moves toward a more compact unified architecture while retaining explicit mechanisms for cross-modal interaction.
4. TA-RoPE: explicit temporal alignment by positional encoding
TA-RoPE denotes Temporal-Aligned Rotary Position Encoding (Liu et al., 22 Feb 2026). It is designed to establish explicit frame-level synchronization between audio and video tokens by aligning their position IDs on a shared temporal axis. The paper contrasts this strategy with ST-Prior from JavisDiT and frame-level cross-attention in UniVerse-1, describing those alternatives as indirect mechanisms for modulating synchrony.
For video tokens, the model uses the usual 3D RoPE: a token at frame $t$, row $h$, and column $w$ receives the position ID $(t, h, w)$, and queries and keys are rotated by the corresponding position encodings.
For audio, the mel-spectrogram is treated like a 2D structure plus a temporal axis aligned to video time. An audio token, indexed by its timestamp and frequency bin, likewise receives a three-component position ID.
The interpretation provided in the paper is explicit. The first component aligns audio time steps to video frames; an offset shifts audio positions in the second dimension; and a further offset shifts audio positions in the third dimension. These offsets make audio and video position IDs non-overlapping while keeping them aligned along the temporal axis (Liu et al., 22 Feb 2026).
The appendix compares several schemes—Vanilla, Interpolate, Interleave, and Interleave + Offset, the latter being the final TA-RoPE. The reported findings are that preserving integer audio positions matters for audio quality, temporal alignment helps synchrony, and avoiding overlapping IDs improves video quality and disentangles modalities (Liu et al., 22 Feb 2026). The paper’s stated insight is therefore that alignment alone is not enough; alignment plus non-overlap is better.
An important systems-level point is that TA-RoPE adds zero inference cost because it is implemented through position ID manipulation rather than extra computation (Liu et al., 22 Feb 2026). Because Wan2.1 uses full attention, the temporal arrangement can be emulated without physically reordering tokens. This contrasts with the prior JavisDiT framework, where synchronization depended on learned priors injected via spatio-temporal cross-attention and bidirectional interaction modules (Liu et al., 30 Mar 2025).
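The idea of aligning position IDs on a shared temporal axis while keeping the two modalities disjoint can be illustrated with a small sketch. This is not the paper's formula, whose exact form is not reproduced in this article. The layout below is an assumption chosen only to satisfy the three properties stated above: integer IDs, a temporal component shared with video frames, and no overlap with video IDs. The function names and the use of the video height and width as offsets are hypothetical.

```python
def video_position_ids(num_frames, height, width):
    """Standard 3D layout: one (frame, row, column) triple per video token."""
    return [(t, h, w)
            for t in range(num_frames)
            for h in range(height)
            for w in range(width)]


def audio_position_ids(num_audio_steps, num_freq_bins, num_frames, height, width):
    """Hypothetical aligned layout for audio tokens (an assumption, not the paper's).

    - first component: the video frame that the audio step falls into
    - second component: frequency bin, shifted past the video rows
    - third component: sub-step within the frame, shifted past the video columns
    """
    if num_audio_steps % num_frames != 0:
        raise ValueError("sketch assumes a whole number of audio steps per frame")
    steps_per_frame = num_audio_steps // num_frames
    ids = []
    for step in range(num_audio_steps):
        frame = step // steps_per_frame        # integer temporal ID shared with video
        sub_step = step % steps_per_frame      # keeps audio IDs unique within a frame
        for freq in range(num_freq_bins):
            ids.append((frame, height + freq, width + sub_step))
    return ids


video_ids = video_position_ids(num_frames=2, height=2, width=2)
audio_ids = audio_position_ids(num_audio_steps=4, num_freq_bins=3,
                               num_frames=2, height=2, width=2)

# Temporal alignment: both modalities use the same set of frame indices.
shared_frames = {p[0] for p in video_ids} == {p[0] for p in audio_ids}
# Non-overlap: no audio ID coincides with a video ID.
disjoint = set(video_ids).isdisjoint(audio_ids)
```

Only the integer IDs fed to the rotary encoding change, and no module or attention pass is added, which matches the zero-inference-cost claim.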
The reported comparison with other synchronization mechanisms is as follows:
| Mechanism | JavisScore | DeSync | Latency |
|---|---|---|---|
| None | 0.142 | 0.942 | 1m4s |
| ST-Prior | 0.145 | 0.863 | 1m10s |
| FrameAttn | 0.124 | 0.850 | 1m22s |
| TA-RoPE | 0.153 | 0.807 | 1m4s |
Within this comparison, TA-RoPE gives the best synchrony with no added runtime (Liu et al., 22 Feb 2026). A plausible implication is that explicit temporal indexing can substitute for more computation-heavy synchronization modules when the backbone already supports sufficiently expressive full attention.
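The trade-off in the table can be made concrete with a short calculation. This is a sketch under one assumption: that lower DeSync is better, which is consistent with the text calling TA-RoPE the best-synchronized row. The helper `parse_latency` is a hypothetical name.

```python
def parse_latency(text):
    """Convert a string such as '1m10s' into seconds."""
    minutes, seconds = text.rstrip("s").split("m")
    return int(minutes) * 60 + int(seconds)


# (JavisScore, DeSync, latency) copied from the table above.
rows = {
    "None": (0.142, 0.942, "1m4s"),
    "ST-Prior": (0.145, 0.863, "1m10s"),
    "FrameAttn": (0.124, 0.850, "1m22s"),
    "TA-RoPE": (0.153, 0.807, "1m4s"),
}

baseline_desync = rows["None"][1]
baseline_latency = parse_latency(rows["None"][2])

summary = {}
for name, (javis_score, desync, latency) in rows.items():
    extra_seconds = parse_latency(latency) - baseline_latency
    desync_reduction_pct = round(100 * (baseline_desync - desync) / baseline_desync, 1)
    summary[name] = (extra_seconds, desync_reduction_pct)

for name, (extra_seconds, reduction) in summary.items():
    print(f"{name}: +{extra_seconds}s latency, DeSync reduced by {reduction}%")
```

On these figures, TA-RoPE is the only mechanism that lowers DeSync without adding seconds over the no-mechanism baseline.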
5. AV-DPO: preference alignment for joint audio-video generation
AV-DPO, or audio-video direct preference optimization, is the preference-alignment stage of JavisDiT++ (Liu et al., 22 Feb 2026). The paper presents it as tuning the model with pairwise preference data so that outputs better match human judgments of quality, consistency, and synchrony, and frames it as the first use of preference alignment for JAVG.
Preference data are constructed from a 30K prompt pool not overlapping with SFT data. For each prompt, the procedure generates 4 samples from the reference model, adds the ground-truth sample to stabilize preference learning, and scores all candidates using multiple reward models (Liu et al., 22 Feb 2026). The reward models are organized by modality-aware dimensions:
| Dimension | Reward models |
|---|---|
| Audio reward | AudioBox; ImageBind for text-audio alignment |
| Video reward | VideoAlign; ImageBind for text-video alignment |
| Audio-video alignment | ImageBind for AV semantic similarity; Syncformer for temporal synchrony |
The paper states that scores for each metric are normalized and averaged within each modality-aware dimension, after which winner-loser pairs are selected where the winner is better across the relevant dimensions. This yields about 25K preference pairs (Liu et al., 22 Feb 2026).
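The pair-construction procedure can be sketched as follows. Several details are assumptions made for illustration: min-max normalization (the text only says scores are normalized), higher-is-better for every metric, strict improvement in every dimension as the winner criterion, and the metric keys and example scores, which are hypothetical placeholders rather than real reward-model outputs.

```python
from itertools import combinations


def min_max_normalize(values):
    """Rescale one metric's scores across candidates to [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.5 for _ in values]
    return [(v - low) / (high - low) for v in values]


def dimension_scores(candidates, dimensions):
    """Normalize each metric across candidates, then average within each dimension."""
    metrics = [m for names in dimensions.values() for m in names]
    normalized = {m: min_max_normalize([c[m] for c in candidates]) for m in metrics}
    return [
        {dim: sum(normalized[m][i] for m in names) / len(names)
         for dim, names in dimensions.items()}
        for i in range(len(candidates))
    ]


def select_pairs(scores):
    """Keep (winner, loser) index pairs where the winner is better in EVERY dimension."""
    pairs = []
    for i, j in combinations(range(len(scores)), 2):
        if all(scores[i][d] > scores[j][d] for d in scores[i]):
            pairs.append((i, j))
        elif all(scores[j][d] > scores[i][d] for d in scores[i]):
            pairs.append((j, i))
    return pairs


# Hypothetical metric keys grouped by the three modality-aware dimensions.
dimensions = {
    "audio": ["audio_quality", "text_audio"],
    "video": ["video_quality", "text_video"],
    "audio_video": ["av_semantic", "av_sync"],
}

candidates = [
    {"audio_quality": 0.9, "text_audio": 0.8, "video_quality": 0.9,
     "text_video": 0.7, "av_semantic": 0.6, "av_sync": 0.9},   # strong overall
    {"audio_quality": 0.2, "text_audio": 0.3, "video_quality": 0.1,
     "text_video": 0.2, "av_semantic": 0.1, "av_sync": 0.2},   # weak overall
    {"audio_quality": 0.1, "text_audio": 0.2, "video_quality": 0.95,
     "text_video": 0.8, "av_semantic": 0.3, "av_sync": 0.5},   # good video, poor audio
]

scores = dimension_scores(candidates, dimensions)
pairs = select_pairs(scores)
```

In this toy example the third candidate forms no pair with either of the others, because it is better on video but worse on audio. Filtering out such mixed comparisons is the point of ranking per dimension instead of averaging everything into one score.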
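The AV-DPO objective is defined as a pairwise DPO-style objective over audio and video branches. Written in the standard Diffusion-DPO form (a reconstruction, with constant factors absorbed into the coefficients), let $e_\theta(x)$ denote the flow-matching error of the policy on sample $x$ and $e_{\mathrm{ref}}(x)$ that of the reference model. The differences between the policy and reference model on winner-versus-loser pairs are:
$$\Delta_v^{w} = e_\theta(x_v^{w}) - e_{\mathrm{ref}}(x_v^{w})$$
$$\Delta_v^{l} = e_\theta(x_v^{l}) - e_{\mathrm{ref}}(x_v^{l})$$
with analogous $\Delta_a^{w}$ and $\Delta_a^{l}$ for audio. The loss is:
$$\mathcal{L}_{\mathrm{AV\text{-}DPO}} = -\mathbb{E}\left[\log \sigma\left(-\beta_v\,(\Delta_v^{w} - \Delta_v^{l}) - \beta_a\,(\Delta_a^{w} - \Delta_a^{l})\right)\right]$$
where $\sigma$ is the sigmoid and $\beta_v$ and $\beta_a$ control preference sharpness for video and audio. A regular flow-matching loss is also kept to avoid overfitting (Liu et al., 22 Feb 2026).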
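A small numerical sketch shows how a DPO-style loss with separate video and audio sharpness coefficients behaves. It assumes the standard Diffusion-DPO form, in which each modality contributes the gap between policy and reference error on the winner minus the same gap on the loser. The function names, dictionary keys, and the error and beta values are hypothetical; the paper's exact scaling is not reproduced here.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def preference_margin(errors):
    """(policy - reference) error on the winner minus the same gap on the loser.
    Negative values mean the policy moved toward the winner relative to the reference."""
    win_gap = errors["policy_win"] - errors["ref_win"]
    lose_gap = errors["policy_lose"] - errors["ref_lose"]
    return win_gap - lose_gap


def av_dpo_loss(video_errors, audio_errors, beta_video, beta_audio):
    """Pairwise loss with one sharpness coefficient per modality."""
    logit = -(beta_video * preference_margin(video_errors)
              + beta_audio * preference_margin(audio_errors))
    return -log_sigmoid(logit)


# Policy identical to the reference: zero margin, so the loss equals log(2).
neutral = {"policy_win": 1.0, "ref_win": 1.0, "policy_lose": 1.0, "ref_lose": 1.0}
# Policy fits the winner better and the loser worse than the reference does.
improved = {"policy_win": 0.5, "ref_win": 1.0, "policy_lose": 1.5, "ref_lose": 1.0}

loss_neutral = av_dpo_loss(neutral, neutral, beta_video=1.0, beta_audio=2.0)
loss_improved = av_dpo_loss(improved, improved, beta_video=1.0, beta_audio=2.0)
```

Because the two margins are weighted independently, raising one beta sharpens the preference signal for that modality without changing the other. That independence is what allows a modality-specific setting.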
The paper emphasizes that pair selection must be modality-consistent. If a pair has a better video but worse audio, that can conflict with the intended objective; modality-aware ranking is therefore described as essential (Liu et al., 22 Feb 2026). The ablation findings state that naive averaging strategies help only a little, modality-aware ranking gives better improvements, normalization matters, and including ground-truth samples improves pair quality.
This preference-alignment stage marks an important distinction from JavisDiT, which focused on synchronized generation through latent spatio-temporal priors learned by contrastive synchronization training but did not include a human-preference optimization stage (Liu et al., 30 Mar 2025). This suggests an expansion of the JAVG objective from physical or semantic synchrony toward explicitly optimized user-perceived realism and audiovisual harmony.
6. Evaluation protocol, empirical performance, and comparative significance
The main benchmark for JavisDiT++ is JavisBench, with 10,140 prompts for full evaluation and 1,000 prompts in JavisBench-mini for ablations (Liu et al., 22 Feb 2026). All models generate 240p, 4-second sounding videos. JavisBench itself originated in JavisDiT as a benchmark of 10,140 high-quality text-captioned sounding videos spanning diverse scenes and complex real-world scenarios (Liu et al., 30 Mar 2025).
The paper evaluates 11 metrics across several dimensions: FVD and FAD for quality; TV-IB, TA-IB, CLIP, and CLAP for text consistency; AV-IB and AVHScore for AV semantic consistency; and JavisScore and DeSync for AV synchrony (Liu et al., 22 Feb 2026). JavisScore was introduced in JavisDiT as a synchronization metric designed to capture semantic synchronization over time and reported to outperform AV-Align on a 3,000-sample validation set, with AUROC 0.6533 and Accuracy 0.7514 versus AV-Align’s AUROC 0.5296 and Accuracy 0.5254 (Liu et al., 30 Mar 2025).
The baselines in the JavisDiT++ evaluation include TempoTkn, TPoS, ReWaS, SeeHear, FoleyC, MMAudio, MM-Diff, JavisDiT, and UniVerse-1, with additional qualitative comparison against commercial Veo3 (Liu et al., 22 Feb 2026). On JavisBench for 240p4s generation, JavisDiT++ achieves the best results among open-source methods in the paper’s main comparison. The reported scores are:
| Metric | JavisDiT++ |
|---|---|
| FVD | 141.5 |
| FAD | 5.5 |
| TV-IB | 0.282 |
| TA-IB | 0.164 |
| CLIP | 0.316 |
| CLAP | 0.424 |
| AV-IB | 0.198 |
| AVHScore | 0.184 |
| JavisScore | 0.159 |
| DeSync | 0.832 |
| Runtime | 10s |
These are stated to surpass JavisDiT and UniVerse-1, while doing so with lower parameter count than UniVerse-1 and much less complexity than dual-stream systems (Liu et al., 22 Feb 2026). The paper further states that JavisDiT++ closes much of the gap to Veo3 in qualitative comparisons, though not necessarily fully surpassing it.
For historical comparison, JavisDiT reported on JavisBench FVD 203.2, KVD 1.4, FAD 6.9, TA-IB 0.197, CLIP 0.325, CLAP 0.320, and JavisScore 0.158 (Liu et al., 30 Mar 2025). The proximity of JavisScore values between the two systems, together with the stronger FVD and FAD of JavisDiT++, suggests that the newer framework prioritizes joint gains across quality, consistency, and synchrony rather than maximizing synchrony in isolation. This interpretation is consistent with the paper’s broader claim that quality, consistency, and synchrony are not automatically aligned and must be optimized jointly but carefully (Liu et al., 22 Feb 2026).
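The comparison in this paragraph can be checked with simple arithmetic on the three metrics it names. The sketch assumes the two sets of figures are directly comparable, which the text does not confirm, and that FVD and FAD are lower-is-better while JavisScore is higher-is-better.

```python
# Values quoted in this article for the two systems on JavisBench.
javisdit = {"FVD": 203.2, "FAD": 6.9, "JavisScore": 0.158}
javisdit_pp = {"FVD": 141.5, "FAD": 5.5, "JavisScore": 0.159}


def relative_change_pct(old, new):
    """Signed percentage change from old to new."""
    return 100 * (new - old) / old


changes = {
    metric: round(relative_change_pct(old, javisdit_pp[metric]), 1)
    for metric, old in javisdit.items()
}

for metric, change in changes.items():
    print(f"{metric}: {change:+.1f}%")
```

On these figures FVD drops by roughly 30% and FAD by roughly 20%, while JavisScore moves by less than 1%. This matches the reading that the gains are concentrated in quality, not in the synchrony score.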
Human evaluation further supports the model’s reported improvements. The user study uses 100 prompts, 3 volunteers, and blind win-tie-lose pairwise judgments. The paper reports that JavisDiT++ beats JavisDiT and UniVerse-1 by more than 70% in human preference, and that AV-DPO alone improves human preference by over 25% (Liu et al., 22 Feb 2026). The stated significance is that the gains are not merely metric-driven.
7. Ablation findings, trade-offs, and broader implications
The ablation studies in JavisDiT++ are presented as revealing several trade-offs (Liu et al., 22 Feb 2026). For MS-MoE, too little adaptation capacity leads to poor audio, whereas too much full fine-tuning on shared weights harms video; modality-specific FFNs provide the best balance. For TA-RoPE, implicit synchrony methods help but add latency, whereas TA-RoPE improves synchrony with no extra runtime; overlapping position IDs hurt video quality, and alignment should be explicit and non-overlapping. For AV-DPO, preference alignment improves subjective quality and synchrony, but if rewards are not modality-aware, improvements are inconsistent; low beta can overfit bad preference data, high beta can preserve reference behavior too strongly, and the best setting depends on modality (Liu et al., 22 Feb 2026).
These findings situate JavisDiT++ in continuity with, but also in contrast to, JavisDiT. The earlier system argued that simple token sharing was not enough and that fine-grained prior-guided cross-attention was the key piece for synchronization in real-world complex scenes (Liu et al., 30 Mar 2025). JavisDiT++ does not reject that diagnosis so much as reformulate it: instead of relying on dedicated prior-estimation modules and heavier interaction blocks, it uses a unified architecture in which attention and FFN specialization are cleanly separated, temporal synchrony is encoded directly into positional indices, and subjective quality is addressed through explicit preference optimization (Liu et al., 22 Feb 2026).
Common misconceptions are directly addressed by the paper’s design logic. One misconception is that shared token streams are sufficient for JAVG if the model is large enough; the MS-MoE ablation is used to argue that shared FFNs can degrade modality-specific quality. Another is that synchronization can be solved only by adding explicit interaction modules such as frame-level cross-attention or prior networks; the TA-RoPE results argue that positional encoding design can produce better synchrony at zero inference cost. A third is that technically synchronized outputs are automatically preferred by humans; AV-DPO is introduced precisely because perceptual quality, harmony between modalities, and subjective preference remain imperfect even when synchronization metrics improve (Liu et al., 22 Feb 2026).
The bottom-line significance attributed to JavisDiT++ is that it provides a simple, efficient, and open-source-friendly recipe for native joint audio-video generation that is stronger than previous open-source systems in both objective metrics and human judgment (Liu et al., 22 Feb 2026). Its key strengths are summarized as follows: MS-MoE lets audio and video interact while preserving modality-specific quality; TA-RoPE gives explicit, frame-level temporal synchronization without extra inference cost; and AV-DPO aligns the model with human preferences across quality, consistency, and synchrony. Built on Wan2.1-1.3B-T2V and trained with only about 1M public samples, it is presented as setting a new open-source benchmark while narrowing the gap with commercial systems such as Veo3 (Liu et al., 22 Feb 2026).