Audio-Video Direct Preference Optimization
- Audio-Video Direct Preference Optimization is a family of methods that use direct preference optimization to jointly condition and refine audio-video models for better synchrony and fidelity.
- The approach employs modality-aware preference pairs, automated ranking, and rule-based negative constructions to optimize temporal alignment and cross-modal consistency.
- Empirical benchmarks demonstrate that AV-DPO improves multimodal generation metrics, including FVD, FAD, and synchronization scores across various tasks.
Audio-Video Direct Preference Optimization (AV-DPO) denotes a family of post-training methods that apply direct preference optimization to models whose conditioning, outputs, or both are jointly audio-video. In the literature, the term is used most explicitly for joint audio-video rectified-flow generation, where modality-aware preference pairs are used to optimize audio quality, video quality, audio-video consistency, and temporal synchrony relative to a frozen reference model; closely related formulations extend the same logic to video-audio joint generation, video-to-audio synthesis, video captioning, and audio transcription, positioning AV-DPO as a general alignment paradigm rather than a single architecture (Liu et al., 22 Feb 2026, Cheng et al., 12 May 2026, Chan et al., 11 Mar 2026, Tang et al., 2 Jul 2025, Quang et al., 13 May 2026).
1. Scope and problem setting
AV-DPO spans several neighboring problem classes. In joint audio-video generation, the objective is to synthesize synchronized sound and vision from text or other conditions, then align the generator with human or proxy preferences over quality and synchrony. In video-to-audio generation, the objective narrows to audio synthesis conditioned on video, but the preference signal remains inherently audiovisual because the generated audio is judged against the visible scene. In captioning and transcription settings, the output is text, yet preference optimization still targets audio-video faithfulness by rewarding temporally detailed captions, spatially grounded descriptions, or verbatim speech preservation rather than task-inappropriate paraphrase or translation (Liu et al., 22 Feb 2026, Chan et al., 11 Mar 2026, Tang et al., 2 Jul 2025, Quang et al., 13 May 2026).
The present literature therefore uses AV-DPO in both a narrow and a broad sense. Narrowly, it refers to modality-aware DPO objectives defined directly over joint audio and video branches, as in JavisDiT++ (Liu et al., 22 Feb 2026). Broadly, it describes the reuse of DPO machinery in audio-video systems whenever preferences are defined by audiovisual grounding, temporal alignment, or multimodal faithfulness, even if the model outputs only audio or only text (Cheng et al., 12 May 2026, Yang et al., 24 Sep 2025).
| Work | Setting | Preference signal |
|---|---|---|
| JavisDiT++ (Liu et al., 22 Feb 2026) | Joint audio-video generation | Winner must exceed loser in audio, video, and AV dimensions |
| SyncDPO (Cheng et al., 12 May 2026) | Video-audio joint generation | Aligned sample vs temporally distorted negative |
| V2A-DPO (Chan et al., 11 Mar 2026) | Flow-based video-to-audio | AudioScore best-vs-worst pairs |
| MultiSoundGen (Yang et al., 24 Sep 2025) | Video-to-audio | SF-CAVP ranks candidates; GT is winner |
| AVC-DPO (Tang et al., 2 Jul 2025) | Video captioning | Base caption vs aspect-enhanced caption |
| English-Mandarin audio DPO (Quang et al., 13 May 2026) | Code-switching ASR | Ground truth vs synthetic corruption |
A plausible implication is that AV-DPO is best understood as a multimodal alignment pattern: define modality-relevant failure modes, construct pairwise or ranked preferences, anchor updates to a reference policy, and optimize toward preferred outputs without explicit online RL.
2. Core objective formulations
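The most general formulation in text-generating multimodal models follows standard reference-model DPO. In video captioning, AVC-DPO uses this loss, which in its standard form reads

$$\mathcal{L}_{\mathrm{DPO}} = -\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$

with $\pi_\theta$ a trainable multimodal model, $\pi_{\mathrm{ref}}$ a frozen reference model, and $y_w$ and $y_l$ the preferred and dispreferred captions generated under different prompt conditions (Tang et al., 2 Jul 2025). The English-Mandarin code-switching study uses the same standard DPO structure for audio-conditioned transcription, with $x$ as audio plus prompt, $y_w$ as the ground-truth code-switching transcript, and $y_l$ as a synthetic corruption (Quang et al., 13 May 2026).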
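The standard reference-anchored DPO loss referred to above can be sketched for a single preference pair as follows. This is a minimal illustration, not code from any of the cited papers: the function names and the default `beta` are assumptions, and the inputs are taken to be summed sequence log-probabilities that have already been computed by the trainable and frozen models.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-example DPO loss.

    logp_w / logp_l: log-probability of the chosen / rejected output
    under the trainable policy. ref_logp_*: the same quantities under
    the frozen reference model. beta is a hypothetical default.
    """
    chosen_ratio = logp_w - ref_logp_w      # log pi_theta / pi_ref on winner
    rejected_ratio = logp_l - ref_logp_l    # log pi_theta / pi_ref on loser
    return -log_sigmoid(beta * (chosen_ratio - rejected_ratio))
```

When the policy equals the reference, both log-ratios vanish and the loss is log 2; it falls below that value only when the policy raises the winner's likelihood relative to the loser's, measured against the reference.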
In joint audio-video generation, JavisDiT++ adapts DPO to rectified-flow models by replacing log-probabilities with flow-matching residuals. It defines separate video and audio residual differences for the policy and reference branches, then combines them in a single logistic objective. Under this objective the winner must be preferred to the loser in both modality-specific branches while updates remain anchored to the frozen SFT reference model (Liu et al., 22 Feb 2026).
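The following sketch illustrates the general shape of such a residual-based objective. It assumes a Diffusion-DPO-style surrogate in which squared flow-matching errors stand in for negative log-likelihoods, and it assumes the video and audio terms are simply summed inside one sigmoid. The exact weighting, normalization, and timestep handling used by JavisDiT++ are not given in this text, so the combination rule, the `beta` default, and all names below are hypothetical.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def av_flow_dpo_loss(err, beta=1.0):
    """Residual-based preference loss over two modality branches.

    err[modality][role] is a pair (policy_error, reference_error) of
    squared flow-matching errors, with modality in {"video", "audio"}
    and role in {"winner", "loser"}. Summing the two branches is an
    assumption made for illustration.
    """
    total = 0.0
    for modality in ("video", "audio"):
        pol_w, ref_w = err[modality]["winner"]
        pol_l, ref_l = err[modality]["loser"]
        # Lower error than the reference on the winner, and higher on
        # the loser, both push this term negative.
        total += (pol_w - ref_w) - (pol_l - ref_l)
    # Negative total means the policy favours the winner, so the sign flips.
    return -log_sigmoid(-beta * total)
```

As with likelihood-based DPO, the loss equals log 2 when the policy and reference produce identical errors, and decreases as the policy fits winners better than the reference does.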
HuViDPO derives the analogous diffusion-model objective over full video trajectories, expressing the DPO comparison in terms of differences between model and reference noise-prediction errors along the diffusion path. MultiSoundGen and V2A-DPO transfer the same principle to conditional flow matching for video-to-audio generation, again using a reference model and squared residual differences as the surrogate for preference-aware likelihood ratios (Jiang et al., 2 Feb 2025, Yang et al., 24 Sep 2025, Chan et al., 11 Mar 2026).
A central theoretical refinement concerns diffusion and flow models themselves. "Beyond Reward Margin" identifies likelihood displacement, where the probabilities of chosen samples can decrease during DPO training, and attributes this to two failure modes: Optimization Conflict for small reward margins and Suboptimal Maximization for large reward margins. Its proposed Policy-Guided DPO introduces Adaptive Rejection Scaling and Implicit Preference Regularization to mitigate these effects in video generation (Xu et al., 24 Nov 2025). This suggests that AV-DPO is not merely a data-construction problem; objective design can materially affect whether preference alignment improves or destabilizes multimodal generation.
3. Preference data construction
A defining feature of AV-DPO is that the quality of the preference dataset often dominates the quality of the alignment result. The literature exhibits three main strategies: modality-aware automated ranking, rule-based negative construction, and prompt-conditioned synthetic preference generation.
Modality-aware automated ranking is exemplified by JavisDiT++. It begins from a prompt pool of 30k captions; for each prompt, the Stage 2 SFT reference model generates candidate audio-video pairs, and the ground-truth sample from TAVGBench is also added. Each candidate is evaluated with multiple reward models: AudioBox and ImageBind text-audio similarity for audio, VideoAlign and ImageBind text-video similarity for video, and ImageBind audio-video similarity plus Synchformer for audio-video alignment. Scores are normalized per metric and averaged within the three dimensions (audio quality, video quality, and audio-video alignment), and a pair is accepted only if the winner is better than the loser in all three aggregated dimensions. This yields approximately 25k preference pairs, with about 30% of winners being model outputs (Liu et al., 22 Feb 2026).
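The acceptance rule described above (normalize per metric, average within each dimension, keep a pair only when the winner dominates in every dimension) can be sketched as below. The min-max normalization, the metric keys, and the assumption that every score is oriented so that higher is better are illustrative choices, not details confirmed by the source.

```python
from itertools import permutations


def normalize_per_metric(cands):
    """Min-max normalize each metric across the candidates of one prompt."""
    out = [dict() for _ in cands]
    for metric in cands[0]:
        vals = [c[metric] for c in cands]
        lo, hi = min(vals), max(vals)
        span = hi - lo
        for o, c in zip(out, cands):
            # Constant metrics carry no ranking signal; map them to 0.5.
            o[metric] = (c[metric] - lo) / span if span > 0 else 0.5
    return out


def dimension_scores(norm, groups):
    """Average the normalized metrics inside each dimension."""
    return {d: sum(norm[m] for m in ms) / len(ms) for d, ms in groups.items()}


def accepted_pairs(cands, groups):
    """Return (winner_idx, loser_idx) pairs where the winner is strictly
    better in all aggregated dimensions."""
    dims = [dimension_scores(n, groups) for n in normalize_per_metric(cands)]
    return [
        (w, l)
        for w, l in permutations(range(len(cands)), 2)
        if all(dims[w][d] > dims[l][d] for d in groups)
    ]


# Hypothetical metric names, grouped into the three dimensions.
GROUPS = {"audio": ["a1", "a2"], "video": ["v1", "v2"], "av": ["av1", "av2"]}
```

Requiring strict dominance in every dimension discards mixed pairs (for example, better audio but worse video), which trades dataset size for less ambiguous preference labels.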
Rule-based negative construction is the central contribution of SyncDPO. Rather than sampling and ranking multiple candidates, it perturbs aligned audiovisual pairs on the fly using Scaling, Replacing, Shifting, Masking, and Synthesizing. Scaling rescales the temporal dimension by a scaling factor; Shifting applies a global temporal offset; Masking removes a portion of the signal set by a mask ratio. The winner is the synchronized original sample and the loser is a temporally distorted counterpart. SyncDPO further applies a curriculum that shifts training from coarse Replacing negatives to subtler Scaling negatives (Cheng et al., 12 May 2026).
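To make the rule-based perturbations concrete, here is a toy sketch that applies Shifting, Scaling, Masking, and Replacing to a one-dimensional list standing in for an audio track. Everything about the implementation (zero padding, nearest-neighbour resampling, a contiguous mask, the function names) is an assumption chosen for illustration; SyncDPO operates on real audiovisual data and its parameter ranges are not given in this text. Synthesizing is omitted because it requires a generative model.

```python
import random


def shift(audio, offset):
    """Global temporal offset: delay if offset > 0, advance if offset < 0.
    Vacated positions are zero-padded and the length is preserved."""
    n = len(audio)
    if offset >= 0:
        return ([0.0] * offset + audio)[:n]
    return (audio[-offset:] + [0.0] * (-offset))[:n]


def scale(audio, factor):
    """Nearest-neighbour time rescaling at fixed output length.
    factor > 1 compresses events in time; factor < 1 stretches them."""
    n = len(audio)
    out = []
    for i in range(n):
        src = int(i * factor)
        out.append(audio[src] if src < n else 0.0)
    return out


def mask(audio, ratio, rng):
    """Zero out one contiguous span covering `ratio` of the track."""
    n = len(audio)
    k = round(n * ratio)
    start = rng.randrange(0, n - k + 1)
    return audio[:start] + [0.0] * k + audio[start + k:]


def replace(audio, other):
    """Swap in audio from an unrelated clip, truncated to the same length."""
    return list(other)[:len(audio)]


def make_pair(video, audio, loser_audio):
    """Winner is the synchronized original; loser is the distorted copy."""
    return {"video": video, "winner": audio, "loser": loser_audio}
```

A pair could then be built as `make_pair(video, audio, shift(audio, 3))`, leaving the video untouched so that the only difference between winner and loser is the temporal relation between the two modalities.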
Scoring-system-driven pair mining is the pattern used by V2A-DPO. AudioScore combines five metrics (ImageBind video-audio semantic consistency, CLAP text-audio semantic consistency, DeSync temporal alignment, PANNs-based Inception Score, and PESQ for speech) into a 5D vector, then passes it through a 2-layer MLP trained on human labels "Good", "Medium", and "Bad". The preference pipeline samples multiple audios per video, computes AudioScore probabilities, chooses the winner and the loser as the candidates that maximize the respective AudioScore class probabilities, and constructs approximately 46K automatic pairs from 50K videos. It then adds 2K human-annotated pairs for a final DPO dataset of approximately 48K pairs. Pair difficulty is measured by a complexity score derived from the Good/Bad probability gap and used for a two-stage curriculum with human pairs forced into the second stage (Chan et al., 11 Mar 2026).
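A rough sketch of this mining step follows. It assumes that the winner is the candidate with the highest "Good" probability and the loser the one with the highest "Bad" probability, and it uses a made-up difficulty formula based on the Good/Bad gap; the paper's actual selection criteria, complexity score, and stage threshold are not given in this text, so all of these are hypothetical.

```python
def mine_pair(probs):
    """probs: one dict per candidate audio with keys 'good', 'medium', 'bad'
    (classifier probabilities). Returns a pair record or None."""
    idx = range(len(probs))
    winner = max(idx, key=lambda i: probs[i]["good"])
    loser = max(idx, key=lambda i: probs[i]["bad"])
    if winner == loser:
        return None  # no usable contrast for this video
    # Hypothetical difficulty: small Good/Bad gaps mean harder pairs.
    gap = (probs[winner]["good"] - probs[loser]["good"]) + (
        probs[loser]["bad"] - probs[winner]["bad"]
    )
    return {
        "winner": winner,
        "loser": loser,
        "difficulty": 1.0 - gap / 2.0,
        "human": False,
    }


def two_stage_curriculum(pairs, threshold=0.5):
    """Easy automatic pairs go to stage 1; hard pairs and all
    human-annotated pairs go to stage 2."""
    stage1 = [p for p in pairs if not p["human"] and p["difficulty"] < threshold]
    stage2 = [p for p in pairs if p["human"] or p["difficulty"] >= threshold]
    return stage1, stage2
```

Placing human-annotated pairs in the second stage regardless of difficulty mirrors the description above, in which the higher-trust labels are reserved for the later phase of training.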
AV representation models as reward models appear in MultiSoundGen. Its SF-CAVP reward model computes segment-level audio-video cosine similarities and aggregates the mean of the lowest quarter of similarities into a scalar alignment score. For each video, multiple generated audios are ranked by this score, the ground-truth audio is used as the winner, and the lowest-scoring generated audio is used as the loser (Yang et al., 24 Sep 2025).
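The aggregation rule (mean of the lowest quarter of segment-level similarities) and the resulting pair selection can be sketched as follows. The embeddings are assumed to be precomputed per segment by some audio-video encoder; the function names, the floor-based quarter size, and the data layout are illustrative assumptions rather than details from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def alignment_score(audio_segs, video_segs):
    """Mean of the lowest quarter of segment-level similarities.

    Focusing on the worst segments penalizes clips that are well
    aligned on average but badly misaligned in a few places.
    """
    sims = sorted(cosine(a, v) for a, v in zip(audio_segs, video_segs))
    k = max(1, len(sims) // 4)  # at least one segment
    return sum(sims[:k]) / k


def build_pair(gt_audio_id, generated, video_segs):
    """generated: dict mapping candidate id -> list of audio segment
    embeddings. The ground truth is the winner; the lowest-scoring
    generated candidate is the loser."""
    loser = min(generated, key=lambda cid: alignment_score(generated[cid], video_segs))
    return {"winner": gt_audio_id, "loser": loser}
```

Compared with a plain mean over all segments, the lowest-quarter aggregate is more sensitive to localized failures, which fits the multi-event setting that MultiSoundGen targets.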
Prompt-conditioned synthetic preferences dominate captioning and transcription variants. AVC-DPO generates a base caption under a general query and improved candidates under temporal or spatial enhanced prompts, then uses Qwen2.5-VL-72B as a grader to keep only pairs whose score difference exceeds a fixed threshold (Tang et al., 2 Jul 2025). The code-switching audio DPO study uses the ground-truth code-switched transcript as the chosen response and a Qwen3-32B-produced synthetic translation as the rejected response, with 80% Global Translation and 20% Partial Translation across 100,766 preference pairs and 566.8 hours of audio (Quang et al., 13 May 2026).
4. Architectural integration and training pipelines
AV-DPO has so far been implemented primarily as a post-training procedure on top of already competent multimodal backbones. The recurrent pattern is to keep a frozen reference snapshot, update a trainable copy or LoRA-adapted variant, and concentrate preference learning on temporal and cross-modal modules rather than re-learning the full data distribution from scratch.
JavisDiT++ uses a unified token sequence in which audio and video tokens are flattened, concatenated, and processed by shared self-attention, while modality-specific FFNs are implemented through an MS-MoE design. TA-RoPE assigns temporally aligned position IDs across modalities, and AV-DPO is applied only after an audio pre-training stage and an audio-video SFT stage. In Stage 3, only LoRA parameters are trainable (about 121M parameters), with 100 warmup steps, 1 epoch on 25k preference samples, and a compute budget of about 3 H100 GPU-days (Liu et al., 22 Feb 2026).
SyncDPO is built on Ovi, described as a twin-backbone cross-modal fusion flow-matching model over joint video and audio latents. It fine-tunes the flow transformer with LoRA, using rank 128 in the main experiments, the Adam optimizer, cosine learning-rate decay with 1000 warmup steps, EMA decay 0.9, and 4 × NVIDIA A800 GPUs. SFT uses 2 epochs with batch size 2, while DPO and SyncDPO use 1 epoch with batch size 1 to make total gradient steps and wall-clock time roughly comparable (Cheng et al., 12 May 2026).
HuViDPO, although a text-to-video method rather than a joint audio-video system, is methodologically important because it shows how DPO can be embedded in latent video diffusion with First-Frame-Conditioned training and SparseCausal Attention. The first frame is kept clean, later frames are denoised conditionally, and DPO fine-tuning updates LoRA weights on motion and attention modules while leaving content generation largely anchored by the first-frame model. This decomposition is directly relevant to AV-DPO designs that wish to separate content, motion, and synchrony (Jiang et al., 2 Feb 2025).
MultiSoundGen uses MM-DiT with Conditional Flow Matching. Audio is represented in a VAE latent space derived from mel-spectrograms; conditioning combines CLIP visual features, Synchformer-aligned temporal video features, and CLIP text features through global and frame-aligned conditioning. Its ablations report that full fine-tuning severely degrades performance, whereas partial fine-tuning of the last single-modal transformer layer, adaLNs, and 1D convolutions is effective (Yang et al., 24 Sep 2025).
A consistent cross-paper pattern is therefore visible: AV-DPO is typically reference-anchored, post-SFT, and often parameter-efficient. This suggests that preference optimization is being used chiefly to reshape multimodal decision boundaries—especially around fidelity, synchrony, and modality balance—rather than to replace large-scale generative pretraining.
5. Empirical behavior across tasks
The empirical record shows that AV-DPO-style methods improve different but related multimodal behaviors: temporal synchrony in generation, semantic consistency in video-to-audio synthesis, detail and grounding in captioning, and verbatim faithfulness in transcription.
In JavisDiT++, modality-aware AV-DPO improves the SFT baseline on JavisBench-mini from FVD 221.3 to 198.5, FAD 5.51 to 5.32, TV-IB 0.283 to 0.284, TA-IB 0.163 to 0.168, AV-IB 0.194 to 0.201, JavisScore 0.153 to 0.156, and DeSync 0.807 to 0.776. Human evaluation further reports that DPO-enhanced outputs are preferred in more than 25% more pairwise comparisons than the SFT baseline (Liu et al., 22 Feb 2026).
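Because these metrics live on very different scales, it can help to express the reported SFT-to-DPO deltas as relative changes. The short calculation below does this for the three lower-is-better metrics quoted above (FVD, FAD, DeSync); the numbers are copied from the text, and the helper name is an illustrative assumption.

```python
def relative_change(before, after):
    """Signed relative change; negative means the value decreased."""
    return (after - before) / before


# (SFT baseline, after AV-DPO) on JavisBench-mini, as quoted in the text.
reported = {
    "FVD": (221.3, 198.5),
    "FAD": (5.51, 5.32),
    "DeSync": (0.807, 0.776),
}

changes = {name: relative_change(b, a) for name, (b, a) in reported.items()}

for name, value in changes.items():
    print(f"{name}: {value:+.1%}")
```

Read this way, the FVD reduction is roughly ten percent, while the FAD and DeSync reductions are each a few percent, which makes the size of the video-quality gain relative to the audio and synchrony gains easier to see.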
SyncDPO concentrates on fine-grained temporal alignment. On LRS2, under out-of-domain Koala training, it improves lip-sync from LSE-D/LSE-C of 8.19/7.04 for vanilla DPO to 7.83/7.18. On AVSync15, out-of-domain DeSync improves from 0.75 for DPO and 0.92 for SFT to 0.67 for SyncDPO. On GreatestHits, out-of-domain DeSync improves from 0.38 for DPO to 0.29, and on VABench it improves from 0.68 for DPO and 0.49 for SFT to 0.43 while also improving VA-IB to 0.25. The study explicitly reports that Masking negatives are harmful, while Replacing and Scaling are the best or second-best negative-construction methods depending on domain (Cheng et al., 12 May 2026).
V2A-DPO shows that preference optimization can outperform DDPO in flow-based video-to-audio generation. For Frieren, the reported FD improves from 106.10 to 69.98, IS from 12.25 to 13.98, IB-score from 22.78 to 24.11, and DeSync from 0.85 to 0.62. For MMAudio, the reported FD improves from 60.60 to 51.38, two KL variants improve from 1.65 to 1.38 and from 1.40 to 1.34, IS from 17.40 to 19.21, IB-score from 33.22 to 34.08, and DeSync from 0.44 to 0.35. The paper states that the DPO-optimized MMAudio achieves state-of-the-art performance across multiple metrics and surpasses published V2A models (Chan et al., 11 Mar 2026).
MultiSoundGen, which frames AVP-RPO as an instance of AV-DPO, reports improvements on the multi-event VGG-SS-M benchmark from MMAudio to MultiSoundGen: two FD variants from 4.100 to 3.716 and from 39.519 to 39.073, KL 1.256 to 1.244, IS 6.295 to 6.354, IB-score 0.342 to 0.343, and DeSync 0.379 to 0.360. Its qualitative analysis attributes the gains to better handling of multiple sound sources, rapid event transitions, and segment-localized audio-video alignment (Yang et al., 24 Sep 2025).
Captioning and transcription variants show the same preference-optimization logic in text output spaces. AVC-DPO improves VDC average performance from 43.9 / 2.3 for base Qwen2.5-VL-7B to 47.7 / 2.5 for AVC-DPO-temporal-7B and 51.1 / 2.6 for AVC-DPO-spatial-7B, and its ablation reports that prompt engineering alone yields only 49.2 / 2.5 on the Background aspect compared with 54.7 / 2.8 for spatial DPO under the original prompt (Tang et al., 2 Jul 2025). The English-Mandarin code-switching study shows that DPO can produce a large behavioral shift from translation-like outputs to faithful mixed-language transcription, with MER reductions up to 89.6% relative in-distribution and 20.0% out-of-distribution (Quang et al., 13 May 2026).
Taken together, these results indicate that AV-DPO is especially effective when the failure mode is comparative rather than reconstructive: subtle timing errors, modality omission, translation-instead-of-transcription, over-generic captioning, or plausible-but-ungrounded generation are all easier to separate through winner-versus-loser structure than through MSE-style supervised objectives alone.
6. Limitations, debates, and open directions
Several unresolved issues now define the research frontier. The first is whether binary DPO is expressive enough for multimodal faithfulness. "Learning to Rank Caption Chains for Video-Text Alignment" argues that the standard Bradley-Terry winner-takes-all formulation is suboptimal for video-LLMs because a losing caption may still be visually faithful. It proposes totally ordered caption chains and a Plackett-Luce ranking objective, and reports that ranking optimization outperforms binary DPO for long-form content generation and assessment. It also finds that these approaches require finetuning of the vision encoder to be effective, directly challenging the view of DPO as purely a language-reweighting process (Blume et al., 26 Mar 2026). This suggests a plausible extension of AV-DPO toward ranked audiovisual preference chains rather than isolated binary pairs.
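The difference between a binary Bradley-Terry comparison and a Plackett-Luce ranking objective can be seen in a few lines. The sketch below computes the Plackett-Luce negative log-likelihood for a chain of scores listed from most to least preferred. Treating the scores as implicit rewards (for example, scaled policy-to-reference log-ratios) is an assumption made here for illustration, not a description of the cited paper's exact loss.

```python
import math


def logsumexp(xs):
    """Numerically stable log(sum(exp(x) for x in xs))."""
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))


def plackett_luce_nll(scores):
    """Negative log-likelihood of a full ranking.

    scores: implicit rewards ordered from most to least preferred.
    At each position, the item must win a softmax against everything
    ranked below it.
    """
    nll = 0.0
    for k in range(len(scores) - 1):
        nll -= scores[k] - logsumexp(scores[k:])
    return nll


def bradley_terry_nll(score_w, score_l):
    """Binary winner-versus-loser loss, -log sigmoid(score_w - score_l)."""
    return math.log1p(math.exp(-(score_w - score_l)))
```

With a chain of length two the Plackett-Luce loss reduces exactly to the Bradley-Terry loss, so binary DPO is the special case; longer chains additionally constrain the ordering among intermediate captions instead of treating every non-winner as equally wrong.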
A second debate concerns optimization pathologies in diffusion and flow models. "Beyond Reward Margin" shows that DPO can suffer from likelihood displacement in video generation, meaning that chosen-sample probabilities can decrease during training. Its analysis identifies Optimization Conflict for small reward margins and Suboptimal Maximization for large reward margins, then proposes PG-DPO with Adaptive Rejection Scaling and Implicit Preference Regularization. The reported outcome is that PG-DPO outperforms baseline DPO and SFT across CLIPScore, HPS-v2, VideoAlign, temporal flickering, aesthetic quality, VQA, and human study metrics (Xu et al., 24 Nov 2025). For AV-DPO, this is not a peripheral concern: multimodal branches may intensify exactly the same margin-sensitive conflicts.
A third issue is preference-label bias. JavisDiT++ explicitly notes reward model bias, imperfect metrics, and the limitations of a single round of offline AV-DPO on only 25k preference pairs (Liu et al., 22 Feb 2026). V2A-DPO relies on AudioScore, which is itself trained on 2K videos and 10K generated audios and therefore inherits the inductive biases of ImageBind, CLAP, Synchformer, PANNs, and PESQ (Chan et al., 11 Mar 2026). MultiSoundGen uses only automated preferences and notes that larger, higher-quality preference datasets and more fine-grained AV contrastive models would likely improve results (Yang et al., 24 Sep 2025).
A fourth issue is failure-mode coverage. SyncDPO shows that not all negative constructions are equally useful: Masking can severely hurt quality, while Scaling and Replacing are usually better-aligned with the target synchronization objective (Cheng et al., 12 May 2026). The code-switching audio DPO study shows the converse phenomenon: explicitly targeting translation-style corruptions can also reduce hallucination and language omission even without synthesizing those negatives directly (Quang et al., 13 May 2026). A plausible implication is that AV-DPO datasets should not be judged merely by size; they should be judged by whether their negatives capture the dominant error geometry of the target task.
The major open directions are already visible in the existing papers: scaling both preference data and model size; moving from offline automated ranking to richer human-in-the-loop supervision; exploring full-parameter AV-DPO rather than LoRA-only variants; building better reward models for audio-video synchrony and motion-sound causality; extending alignment from short clips to long videos and multi-source audio; and integrating controllability variables such as emotion, speaker identity, rhythm, or style into the preference space (Liu et al., 22 Feb 2026, Chan et al., 11 Mar 2026, Yang et al., 24 Sep 2025, Cheng et al., 12 May 2026). Across these directions, the central technical question remains stable: how to encode multimodal faithfulness, synchrony, and perceptual quality into preference structures that are informative enough to guide generation, yet stable enough not to damage the pretrained model’s underlying competence.