
DNA-Train: Duality-Normalized Advantage Training

Updated 2 January 2026
  • DNA-Train is a training paradigm for MLLMs that integrates dual-domain contrastive data with a two-phase SFT-RL strategy to mitigate visual ungrounded hallucinations.
  • It employs explicit duality-normalized advantage computation to balance learning signals between real and counterfactual video-QA pairs, ensuring stable policy optimization.
  • Empirical results show significant improvements, including gains of up to 27.8 percentage points on specialized hallucination benchmarks.

Duality-Normalized Advantage Training (DNA-Train) is a training paradigm for Multimodal LLMs (MLLMs) aimed at mitigating visual ungrounded hallucinations—particularly in counterfactual video understanding—by integrating dual-domain contrastive data, a two-phase SFT-RL regime, and explicit advantage normalization across paired video domains. It was introduced in the context of the DualityForge pipeline and the DualityVidQA benchmark as a principled method to balance learning signals between real and edited (counterfactual) video–question–answer pairs (Huang et al., 30 Dec 2025).

1. Motivation and Conceptual Underpinnings

MLLMs fine-tuned exclusively with supervised methods exhibit a pronounced reliance on language priors, generating answers that may be plausible yet lack empirical visual grounding. This overreliance surfaces most clearly when models confront visually plausible but counterfactual video content (i.e., anomaly-inducing edits that violate commonsense or typical semantics). Standard reinforcement learning (RL) approaches such as GRPO and DAPO, though in principle able to reward accurate grounding, suffer from learning instability and domain collapse, typically overfitting to the majority domain or underexploiting counterfactual distinctions.

DNA-Train addresses these deficiencies via a two-stage strategy:

  • Stage 1—Supervised Fine-Tuning (SFT): The model learns from a balanced set of real and edited video-QA instances to instill counterfactual awareness using token-level cross-entropy loss.
  • Stage 2—Reinforcement Learning with Duality-Normalized Advantages: RL is conducted on paired data, with domain-specific advantage signals normalized to achieve balanced, stable policy optimization between real and counterfactual scenarios.

2. Training Workflow and Algorithmic Details

DNA-Train comprises two distinct phases:

2.1 Supervised Fine-Tuning (SFT)

  • Dataset: DualityVidQA-SFT split, consisting of 104,000 QA pairs (54,000 real, 50,000 counterfactual), ensuring a 1:1 class balance.
  • Loss: Standard cross-entropy over all caption-question-answer tokens,

L_{\mathrm{SFT}}(\theta) = -\sum_{i=1}^{N} \log p_{\theta}(y_i \mid x_i)

where $(x_i, y_i)$ denotes the $i$-th input–target sequence pair.

  • Batching: Batches strictly contain equal proportions of real and edited samples to avoid source bias and guide the model toward counterfactual sensitivity.
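
To make the balanced-batch SFT objective concrete, the sketch below shows one possible PyTorch-style implementation. It assumes an HF-style causal LM exposing `model(input_ids).logits`, dataset items carrying a `domain` field ("real" or "cf"), and prompt tokens masked with label `-100`; the function names and sampler construction are illustrative, not the authors' released code.

```python
import torch.nn.functional as F
from torch.utils.data import DataLoader, WeightedRandomSampler

def make_balanced_loader(dataset, batch_size=4):
    """Sample real and counterfactual QA pairs with equal probability,
    so each SFT batch is (in expectation) 1:1 balanced across domains."""
    counts = {"real": 0, "cf": 0}
    for ex in dataset:
        counts[ex["domain"]] += 1
    weights = [1.0 / counts[ex["domain"]] for ex in dataset]
    sampler = WeightedRandomSampler(weights, num_samples=len(dataset))
    return DataLoader(dataset, batch_size=batch_size, sampler=sampler)

def sft_step(model, batch, optimizer):
    """One step of token-level cross-entropy (L_SFT) over answer tokens."""
    logits = model(batch["input_ids"]).logits              # (B, T, V)
    loss = F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),        # next-token prediction
        batch["labels"][:, 1:].reshape(-1),
        ignore_index=-100,                                  # prompt tokens are masked out
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```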

2.2 Reinforcement Learning with Duality-Normalized Advantage

  • Sampling: For each prompt $(q, v)$, $G$ outputs $\{o_i\}_{i=1}^{G}$ are sampled from the old policy $\pi_{\theta_{\rm old}}$; $G$ is set to 16.
  • Reward: Deterministic reward $R_i = r_c + r_f$, with $r_c = 1$ for exactly correct answers (0 otherwise) and $r_f$ a minor format bonus.
  • Token-Level Advantage: For sequence $o_i$ of length $|o_i|$,

\hat{A}_{i, t} = \frac{R_i - \bar{R}}{\sigma_R}

with $\bar{R}$ and $\sigma_R$ the sample mean and standard deviation of the rewards across the $G$ outputs.

  • Per-Sequence $\ell_1$ Advantage: Aggregate as

\hat{A}_i = \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \hat{A}_{i, t}, \qquad S = \frac{1}{G} \sum_{i=1}^{G} |\hat{A}_i|

With binary $R_i$, $\sigma_R = \sqrt{\bar{R}(1-\bar{R})}$ and the identity $S = 2\sqrt{\bar{R}(1-\bar{R})}$ holds.

  • Dual-Group Normalization: Each batch contains $G_R$ real and $G_{\rm CF}$ counterfactual groups with raw signal strengths $S_R$ and $S_{\rm CF}$. Compute

S_{\rm target} = \frac{S_R + S_{\rm CF}}{2}, \quad \alpha_R = S_{\rm target}/S_R, \quad \alpha_{\rm CF} = S_{\rm target}/S_{\rm CF}

and rescale per group as $\tilde{A}_{i, t} = \alpha_* \cdot \hat{A}_{i, t}$, with $* \in \{R, {\rm CF}\}$ (see the sketch after this list).

  • Policy Gradient: The update is

\nabla_{\theta} J(\theta) = \mathbb{E}_{(q, a),\, \{o_i\} \sim \pi_{\theta_{\rm old}}} \left[ \frac{1}{\sum_i |o_i|} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \tilde{A}_{i, t} \, \nabla_{\theta} \log \pi_{\theta}(o_{i,t} \mid q, o_{i,<t}) \right]

  • Training Parameters: RL learning rate $1 \times 10^{-6}$; batch size 64; the number of RL steps depends on the base model size (600 steps for the 7B model).
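
As a concrete illustration of the duality-normalized advantage computation, the following NumPy sketch z-scores rewards within each group of $G$ sampled outputs, computes each group's signal strength $S$, and rescales both domains toward the shared target $S_{\rm target}$. It assumes binary correctness rewards and sequence-level advantages (every token of output $o_i$ shares $\hat{A}_i$); all names are hypothetical, and the code is a sketch rather than the released implementation.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    """Z-score the G rewards of one prompt group: A_hat_i = (R_i - R_bar) / sigma_R.
    Returns the advantages and the group signal strength S = mean_i |A_hat_i|."""
    r = np.asarray(rewards, dtype=float)
    a_hat = (r - r.mean()) / (r.std() + eps)
    return a_hat, np.abs(a_hat).mean()

def duality_normalize(adv_real, s_real, adv_cf, s_cf, eps=1e-8):
    """Rescale per-domain advantages so both domains carry the same mean
    magnitude S_target = (S_R + S_CF) / 2."""
    s_target = 0.5 * (s_real + s_cf)
    return (s_target / (s_real + eps)) * adv_real, (s_target / (s_cf + eps)) * adv_cf

# Toy check with G = 16 binary rewards per domain.
a_r, s_r = group_advantages([1] * 12 + [0] * 4)    # easier real-domain prompt
a_c, s_c = group_advantages([1] * 2 + [0] * 14)    # harder counterfactual prompt
print(s_r, 2 * np.sqrt(0.75 * 0.25))               # binary-reward identity: both ~0.866
a_r_tilde, a_c_tilde = duality_normalize(a_r, s_r, a_c, s_c)
print(np.abs(a_r_tilde).mean(), np.abs(a_c_tilde).mean())  # both ~= S_target ~= 0.764
```

In a full RL step, the rescaled $\tilde{A}_{i,t}$ would replace $\hat{A}_{i,t}$ in the token-averaged policy-gradient update given above, leaving the DAPO-style clipping untouched.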

3. Integration with DualityForge and DualityVidQA

DNA-Train is designed around the DualityForge pipeline, which provides minimally edited, diffusion-based video pairs with structured real-versus-counterfactual QA annotations. DNA-Train employs the following data splits:

  • DualityVidQA-SFT: For SFT (104K pairs).
  • DualityVidQA-RL: For RL (20K contrastive real/counterfactual pairs sharing the same question, 40K QA instances in total).
  • DualityVidQA-Test: For benchmarking hallucination reduction (600 manually curated pairs across four anomaly categories).

The overall workflow harnesses DualityForge’s automated, anomaly-focused editing and QA synthesis to generate challenging visual anomalies, bridging the intrinsic data imbalance between text-dominated and video-grounded phenomena.

4. Hyperparameters and Implementation Considerations

The main hyperparameters for DNA-Train are as follows:

Stage | Learning Rate      | Batch Size | Sampling / Steps
SFT   | $1 \times 10^{-6}$ | 4          | 1 epoch; 8×H200 GPUs
RL    | $1 \times 10^{-6}$ | 64         | $G = 16$ samples/prompt; 600 steps (7B), 60 (32B), 20 (72B)

The DAPO clipping parameters ($\epsilon_{\rm low}$, $\epsilon_{\rm high}$) are inherited unchanged from the baseline DAPO method.

This suggests that DNA-Train remains robust across model scales and can be adapted to various base MLLMs without altering foundational optimization parameters.
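
For readers who want these values in one place, a hypothetical configuration object might look as follows; the dataclass and field names are illustrative, and the clipping values shown are the usual DAPO defaults rather than numbers stated in this text.

```python
from dataclasses import dataclass, field

@dataclass
class DNATrainConfig:
    # Stage 1: SFT on the balanced DualityVidQA-SFT split
    sft_lr: float = 1e-6
    sft_batch_size: int = 4
    sft_epochs: int = 1
    # Stage 2: RL with duality-normalized advantages on DualityVidQA-RL
    rl_lr: float = 1e-6
    rl_batch_size: int = 64
    samples_per_prompt: int = 16                      # G
    rl_steps_by_model: dict = field(
        default_factory=lambda: {"7B": 600, "32B": 60, "72B": 20}
    )
    # DAPO clipping, inherited unchanged from the baseline (values assumed)
    clip_eps_low: float = 0.2
    clip_eps_high: float = 0.28
```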

5. Empirical Results and Ablation Analysis

DNA-Train demonstrably reduces MLLM hallucinations and achieves general improvements on both specialized and broad video QA benchmarks.

  • DualityVidQA-Test: DNA-Train (7B) achieves 76.8% (vs. 52.8% for Qwen2.5-VL-7B, a +24.0 percentage point improvement).
  • EventHallusion: 61.3% (vs. 33.5%, +27.8 pp).
  • General-purpose QA (examples): TempCompass (+2.1 pp), MVBench (+1.2 pp), TOMATO (+5.8 pp), TVBench (+1.3 pp).

Ablation studies provide critical insight:

  • Advantage Normalization: On DualityVidQA-Test, DNA-Train surpasses both GRPO (74.6%) and vanilla DAPO (74.8%) by roughly +2 pp and yields a +9.5 pp mean improvement across hallucination benchmarks.
  • Paired vs. Single-Domain RL: Training on only real or only counterfactual data causes sharp drops in counterfactual performance (real-only: 29.0%; CF-only: 13.1%; paired: 70.6%).

These results indicate that dual-group advantage normalization is integral to stable, balanced signal propagation during RL in the presence of strong domain imbalance.

6. Significance and Theoretical Implications

DNA-Train establishes a foundational duality between video-grounded and linguistically plausible reasoning, leveraging paired, controlled data to explicitly penalize hallucinations and reward genuine visual grounding. The duality-normalized advantage computation alleviates the instabilities and learning signal collapse endemic to single-domain or standard RL approaches for MLLMs in counterfactual high-variance regimes.

A plausible implication is that this normalization technique can inform future RL-based training methodologies across other modalities and dual-domain contrastive learning setups where class imbalance or reward variance threaten policy convergence and generalization.

7. Benchmarking Context and Prospective Impact

The DNA-Train framework, in conjunction with the DualityForge pipeline and the DualityVidQA benchmark, directly targets the documented over-reliance of MLLMs on language priors for visual tasks. Open-sourcing efforts for datasets and code further promote reproducibility and external benchmarking. The methodology is complementary to existing RL frameworks but distinct in its explicit, mathematically principled advantage normalization to enforce balanced optimization across paired real and counterfactual visual domains (Huang et al., 30 Dec 2025).
