DNA-Train: Duality-Normalized Advantage Training
- DNA-Train is a training paradigm for MLLMs that integrates dual-domain contrastive data with a two-phase SFT-RL strategy to mitigate visually ungrounded hallucinations.
- It employs explicit duality-normalized advantage computation to balance learning signals between real and counterfactual video-QA pairs, ensuring stable policy optimization.
- Empirical results show significant improvements, including up to a 27.8 percentage point reduction in hallucination rates on specialized benchmarks.
Duality-Normalized Advantage Training (DNA-Train) is a training paradigm for Multimodal LLMs (MLLMs) aimed at mitigating visually ungrounded hallucinations—particularly in counterfactual video understanding—by integrating dual-domain contrastive data, a two-phase SFT-RL regime, and explicit advantage normalization across paired video domains. It was introduced in the context of the DualityForge pipeline and the DualityVidQA benchmark as a principled method to balance learning signals between real and edited (counterfactual) video–question–answer pairs (Huang et al., 30 Dec 2025).
1. Motivation and Conceptual Underpinnings
MLLMs, when fine-tuned exclusively with supervised methods, exhibit a pronounced reliance on language priors, generating answers that may be plausible yet lack empirical visual grounding. This overreliance particularly surfaces when models confront visually plausible but counterfactual video content (i.e., anomaly-inducing edits that violate commonsense or usual semantics). Standard reinforcement learning (RL) approaches such as GRPO and DAPO, though potentially rewarding accurate grounding, suffer from learning instability and domain collapse—typically overfitting to the majority domain or underleveraging counterfactual distinctions.
DNA-Train addresses these deficiencies via a two-stage strategy:
- Stage 1—Supervised Fine-Tuning (SFT): The model learns from a balanced set of real and edited video-QA instances to instill counterfactual awareness using token-level cross-entropy loss.
- Stage 2—Reinforcement Learning with Duality-Normalized Advantages: RL is conducted on paired data, with domain-specific advantage signals normalized to achieve balanced, stable policy optimization between real and counterfactual scenarios.
2. Training Workflow and Algorithmic Details
DNA-Train comprises two distinct phases:
2.1 Supervised Fine-Tuning (SFT)
- Dataset: DualityVidQA-SFT split, consisting of 104,000 QA pairs (54,000 real, 50,000 counterfactual), giving an approximately 1:1 class balance.
- Loss: Standard cross-entropy over all caption-question-answer tokens,
$$\mathcal{L}_{\text{SFT}} = -\sum_{t=1}^{|y|} \log \pi_\theta\left(y_t \mid y_{<t}, v\right),$$
where $(v, y)$ represents the (video, caption-question-answer) sequence pair.
- Batching: Batches strictly contain equal proportions of real and edited samples to avoid source bias and guide the model toward counterfactual sensitivity (see the batching sketch below).
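To make the balanced-batching constraint concrete, here is a minimal sketch that interleaves real and edited samples so every batch carries a strict 1:1 split. The dataset interface and field names are illustrative assumptions, not the authors' released code.

```python
import random

def balanced_sft_batches(real, edited, batch_size):
    """Yield SFT batches containing equal numbers of real and edited
    video-QA samples, so neither domain dominates the gradient signal."""
    assert batch_size % 2 == 0, "batch_size must be even for a 1:1 split"
    random.shuffle(real)
    random.shuffle(edited)
    half = batch_size // 2
    n = min(len(real), len(edited))
    for i in range(0, n - half + 1, half):
        batch = real[i:i + half] + edited[i:i + half]
        random.shuffle(batch)  # avoid a fixed real/edited ordering inside a batch
        yield batch

# Toy usage: every batch of 4 holds exactly 2 real and 2 edited samples.
real_samples = [{"domain": "real", "id": k} for k in range(8)]
edited_samples = [{"domain": "counterfactual", "id": k} for k in range(8)]
for batch in balanced_sft_batches(real_samples, edited_samples, batch_size=4):
    assert sum(s["domain"] == "real" for s in batch) == 2
```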
2.2 Reinforcement Learning with Duality-Normalized Advantage
- Sampling: For each prompt $q$, $G$ outputs $\{o_i\}_{i=1}^{G}$ are sampled from the live policy $\pi_\theta$; $G$ is set to 16.
- Reward: Deterministic reward $r_i = r_i^{\mathrm{acc}} + r_i^{\mathrm{fmt}}$, with $r_i^{\mathrm{acc}} = 1$ for exactly correct answers (0 otherwise), and $r_i^{\mathrm{fmt}}$ a minor format bonus.
- Token-Level Advantage: For sequence $o_i$ of length $|o_i|$,
$$\hat{A}_{i,t} = \frac{r_i - \mu}{\sigma},$$
with $\mu$ and $\sigma$ the sample mean and standard deviation of the rewards $\{r_j\}_{j=1}^{G}$ across the group.
- Per-Sequence Advantage: Aggregate as
$$A_i = \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \hat{A}_{i,t}.$$
With binary $r_i$, $A_i = \hat{A}_{i,t}$ holds for every token position $t$.
- Dual-Group Normalization: Batches contain $\mathcal{B}_R$ (real) and $\mathcal{B}_C$ (counterfactual) groups with raw signals $S_R = \sum_{i \in \mathcal{B}_R} |A_i|$ and $S_C = \sum_{i \in \mathcal{B}_C} |A_i|$. Compute the balanced target
$$\bar{S} = \tfrac{1}{2}\left(S_R + S_C\right),$$
and rescale per group: $\tilde{A}_i = A_i \cdot \bar{S} / S_d$, with $d = R$ or $d = C$ according to the sample's domain (see the sketch below).
- Policy Gradient: The update maximizes the DAPO-style clipped surrogate with duality-normalized advantages,
$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{\sum_{i=1}^{G} |o_i|} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \min\left(\rho_{i,t}(\theta)\,\tilde{A}_i,\ \operatorname{clip}\left(\rho_{i,t}(\theta),\, 1-\epsilon_{\mathrm{low}},\, 1+\epsilon_{\mathrm{high}}\right)\tilde{A}_i\right)\right],$$
where $\rho_{i,t}(\theta) = \pi_\theta(o_{i,t} \mid q, o_{i,<t}) / \pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})$ is the token-level importance ratio.
- Training Parameters: batch size 64; the number of RL steps depends on the base-model size (600 steps for 7B; see the table in Section 4).
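Putting the steps above together, the following NumPy sketch implements the advantage pipeline: group-standardized rewards, per-domain duality rescaling, and a DAPO-style clipped surrogate. The exact form of the duality rescaling (equalizing total absolute advantage mass across domains), the function names, and the clipping values (DAPO's published defaults) are assumptions consistent with the text, not the authors' implementation.

```python
import numpy as np

def group_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Standardize rewards within one prompt's G sampled outputs. Because the
    reward is sequence-level, every token of output i shares the same A_i."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def duality_normalize(adv: np.ndarray, is_cf: np.ndarray,
                      eps: float = 1e-6) -> np.ndarray:
    """Rescale real and counterfactual advantages toward a shared signal
    magnitude so neither domain dominates the policy update."""
    out = adv.astype(float).copy()
    s_real = np.abs(out[~is_cf]).sum()   # raw signal of the real group
    s_cf = np.abs(out[is_cf]).sum()      # raw signal of the counterfactual group
    s_bar = 0.5 * (s_real + s_cf)        # balanced target magnitude
    out[~is_cf] *= s_bar / (s_real + eps)
    out[is_cf] *= s_bar / (s_cf + eps)
    return out

def dapo_clipped_surrogate(ratios, adv, eps_low=0.2, eps_high=0.28):
    """Token-level clipped objective (to be maximized); the asymmetric clip
    bounds follow DAPO's published defaults, assumed unchanged here."""
    clipped = np.clip(ratios, 1.0 - eps_low, 1.0 + eps_high)
    return np.minimum(ratios * adv, clipped * adv).mean()

# Toy example: G = 16 rollouts each for one real and one counterfactual prompt.
rng = np.random.default_rng(0)
rewards_real = (rng.random(16) < 0.7).astype(float)  # accuracy reward in {0, 1}
rewards_cf = (rng.random(16) < 0.2).astype(float)    # counterfactuals are harder
adv = np.concatenate([group_advantages(rewards_real),
                      group_advantages(rewards_cf)])
is_cf = np.concatenate([np.zeros(16, bool), np.ones(16, bool)])
adv_tilde = duality_normalize(adv, is_cf)
ratios = rng.uniform(0.8, 1.2, size=32)              # stand-in policy ratios
loss = -dapo_clipped_surrogate(ratios, adv_tilde)    # minimize the negative
```

Without the `duality_normalize` step, the rarer or lower-reward domain contributes a smaller share of the total gradient signal, which is exactly the domain-collapse failure mode the method targets.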
3. Integration with DualityForge and DualityVidQA
DNA-Train is designed around the DualityForge pipeline, which provides minimally edited, diffusion-based video pairs with structured real-versus-counterfactual QA annotations. DNA-Train employs the following data splits:
- DualityVidQA-SFT: For SFT (104K pairs).
- DualityVidQA-RL: For RL (20K shared-question contrastive pairs, i.e., 40K QA instances in total).
- DualityVidQA-Test: For benchmarking hallucination reduction (600 manually curated pairs across four anomaly categories).
The overall workflow harnesses DualityForge's automated, anomaly-focused editing and QA synthesis to generate challenging visual anomalies, counteracting the intrinsic data imbalance between text-dominated priors and video-grounded evidence.
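To make the paired-data format concrete, one DualityVidQA-RL contrastive record could look roughly like the sketch below. All field names and values are hypothetical illustrations; the source does not specify the released schema.

```python
# Hypothetical record shape for one shared-question contrastive pair: the same
# question is asked against a real clip and its minimally edited twin.
pair = {
    "question": "What is the person pouring into the glass?",
    "real": {
        "video": "clip_0001_real.mp4",    # unedited source clip
        "answer": "water",
        "domain": "real",
    },
    "counterfactual": {
        "video": "clip_0001_edited.mp4",  # diffusion-edited anomalous twin
        "answer": "sand",
        "domain": "counterfactual",
    },
}
```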
4. Hyperparameters and Implementation Considerations
The main hyperparameters for DNA-Train are as follows:
| Stage | Learning Rate | Batch Size | Sampling / Steps |
|---|---|---|---|
| SFT | | 4 | 1 epoch; 8 H200 GPUs |
| RL | | 64 | $G = 16$ samples/prompt; steps = 600 (7B), 60 (32B), 20 (72B) |
DAPO clipping parameters ($\epsilon_{\mathrm{low}}$, $\epsilon_{\mathrm{high}}$) are inherited unchanged from the baseline DAPO method.
This suggests that DNA-Train remains robust across model scales and can be adapted to various base MLLMs without altering foundational optimization parameters.
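The reported settings can be consolidated into a single configuration sketch. Learning rates are omitted because they are not recoverable from the source, and the clipping values are DAPO's published defaults, assumed here rather than confirmed for DNA-Train.

```python
# Consolidated hyperparameters from the table above (illustrative structure).
DNA_TRAIN_CONFIG = {
    "sft": {
        "batch_size": 4,
        "epochs": 1,
        "hardware": "8x H200 GPUs",
    },
    "rl": {
        "batch_size": 64,
        "samples_per_prompt": 16,                     # G rollouts per prompt
        "steps": {"7B": 600, "32B": 60, "72B": 20},
        "clip": {"eps_low": 0.2, "eps_high": 0.28},   # assumed DAPO defaults
    },
}
```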
5. Empirical Results and Ablation Analysis
DNA-Train demonstrably reduces MLLM hallucinations and achieves general improvements on both specialized and broad video QA benchmarks.
- DualityVidQA-Test: DNA-Train (7B) outperforms the Qwen2.5-VL-7B baseline by +24.0 percentage points.
- EventHallusion: a +27.8 pp gain over the same baseline.
- General-purpose QA (examples): consistent percentage-point gains on TempCompass, MVBench, TOMATO, and TVBench.
Ablation studies provide critical insight:
- Advantage Normalization: On DualityVidQA-Test, DNA-Train surpasses both GRPO and vanilla DAPO (+2.0 pp) and yields a +9.5 pp mean improvement across hallucination benchmarks.
- Paired vs. Single-Domain RL: Training on only real or counterfactual data causes sharp drops in counterfactual performance (real-only: 29.0%; CF-only: 13.1%; paired: 70.6%).
These results indicate that dual-batch normalization is integral to stable, balanced signal propagation during RL in the presence of strong domain imbalance.
6. Significance and Theoretical Implications
DNA-Train establishes a foundational duality between video-grounded and linguistically plausible reasoning, leveraging paired, controlled data to explicitly penalize hallucinations and reward genuine visual grounding. The duality-normalized advantage computation alleviates the instabilities and learning signal collapse endemic to single-domain or standard RL approaches for MLLMs in counterfactual high-variance regimes.
A plausible implication is that this normalization technique can inform future RL-based training methodologies across other modalities and dual-domain contrastive learning setups where class imbalance or reward variance threaten policy convergence and generalization.
7. Benchmarking Context and Prospective Impact
The DNA-Train framework, in conjunction with the DualityForge pipeline and the DualityVidQA benchmark, directly targets the documented over-reliance of MLLMs on language priors for visual tasks. Open-sourcing efforts for datasets and code further promote reproducibility and external benchmarking. The methodology is complementary to existing RL frameworks but distinct in its explicit, mathematically principled advantage normalization to enforce balanced optimization across paired real and counterfactual visual domains (Huang et al., 30 Dec 2025).