- The paper introduces RealDPO, a framework that aligns video generative models by using real videos as win samples, overcoming the reward hacking and overfitting issues of existing approaches.
- It employs a Direct Preference Optimization (DPO) loss tailored to video diffusion models to improve motion smoothness and human action consistency.
- The approach is validated on the RealAction-5K dataset, demonstrating superior visual, textual, and motion quality compared to existing methods.
RealDPO: Preference Alignment for Video Generation via Real Data
Introduction and Motivation
The paper introduces RealDPO, a preference alignment paradigm for video generative models that leverages real-world data as positive samples in Direct Preference Optimization (DPO) training. The motivation stems from the persistent challenge in video synthesis: generating complex, natural, and contextually consistent human motions. Existing diffusion-based video models, even state-of-the-art DiT architectures, often produce unrealistic or unnatural movements, especially in human-centric scenarios. Traditional supervised fine-tuning (SFT) on curated datasets provides limited corrective feedback and is prone to overfitting, while reward-model-based preference learning suffers from reward hacking, scalability issues, and bias propagation.
Figure 1: RealDPO leverages real data as win samples for preference learning, circumventing reward model limitations and hacking issues.
RealDPO extends the DPO paradigm to video diffusion models by using real videos as win samples and synthetic outputs as lose samples. This approach directly addresses the distributional errors of pretrained generative models and eliminates the need for external reward models, thus avoiding reward hacking and bias propagation. The framework is built upon a tailored DPO loss for diffusion-based transformers, inspired by Diffusion-DPO, and is designed to efficiently align model outputs with human preferences.
Figure 2: The RealDPO framework utilizes real data for preference alignment, with a custom DPO loss and reference model update strategy.
The DPO loss for diffusion models is formulated as:
$$
\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}\Big[\log \sigma\Big(-\beta T\,\omega(\lambda_t)\big(\lVert x_0^w - \hat{x}_0^w\rVert_2^2 - \lVert x_0^w - \tilde{x}_0^w\rVert_2^2 - \big(\lVert x_0^l - \hat{x}_0^l\rVert_2^2 - \lVert x_0^l - \tilde{x}_0^l\rVert_2^2\big)\big)\Big)\Big]
$$
where $x_0^w$/$x_0^l$ are the original win/lose samples, $\hat{x}_0^w$/$\hat{x}_0^l$ are the latents predicted by the model being trained, and $\tilde{x}_0^w$/$\tilde{x}_0^l$ are the latents predicted by the reference model. Here $\sigma$ is the sigmoid function, $\beta$ controls the preference-regularization strength, $T$ is the number of diffusion timesteps, and $\omega(\lambda_t)$ is a timestep-dependent weighting, as in Diffusion-DPO. The reference model is updated via EMA to prevent over-optimization.
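For concreteness, the snippet below is a minimal PyTorch sketch of this objective, assuming all inputs are x0-space latents of shape [batch, ...]; the function and argument names are illustrative, not the authors' implementation.

```python
import torch.nn.functional as F

def realdpo_loss(x0_w, x0_l, pred_w, pred_l, ref_w, ref_l, beta_T, omega_t):
    """Sketch of the RealDPO / Diffusion-DPO-style objective above.

    x0_w, x0_l     : win (real) and lose (synthetic) latents
    pred_w, pred_l : x0 predictions from the model being trained
    ref_w, ref_l   : x0 predictions from the EMA reference model
    beta_T         : beta * T, the preference-regularization scale
    omega_t        : timestep weighting omega(lambda_t), scalar or per-sample
    """
    def sq_err(a, b):
        # Squared L2 error per sample, summed over all latent dimensions.
        return ((a - b) ** 2).flatten(1).sum(dim=1)

    # Improvement of the training model over the reference on the win sample...
    win_gap = sq_err(x0_w, pred_w) - sq_err(x0_w, ref_w)
    # ...and on the lose sample.
    lose_gap = sq_err(x0_l, pred_l) - sq_err(x0_l, ref_l)

    # -log sigmoid(-beta*T*omega_t*(win_gap - lose_gap)): the model is pushed to
    # reduce reconstruction error on real data more than on its own generations.
    return -F.logsigmoid(-beta_T * omega_t * (win_gap - lose_gap)).mean()
```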
Implementation Details
- Data Pipeline: RealAction-5K, a curated dataset of 5,000 high-quality videos of daily human activities, is used for win samples. Data is filtered using Qwen2-VL and manually inspected for quality.
- Negative Sampling: Synthetic videos are generated offline with diverse initial noise, conditioned on the same prompt as the corresponding win sample, and used as lose samples.
- Training: The DPO loss is computed for each win-lose pair, and the reference model is periodically updated via EMA (a minimal training-step sketch follows this list).
- Computational Efficiency: Offline negative sampling and latent-space training reduce pixel-space decoding overhead, enabling scalable training on high-resolution videos.
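These bullets translate into a fairly compact training step. The sketch below is a hypothetical illustration of that structure, reusing realdpo_loss from the earlier snippet; update_ema, the denoise() wrapper, the batch layout, and the decay value are assumptions rather than the paper's actual code.

```python
import torch

def update_ema(ref_model, model, decay=0.999):
    """EMA update of the reference model toward the training model (decay value is illustrative)."""
    with torch.no_grad():
        for p_ref, p in zip(ref_model.parameters(), model.parameters()):
            p_ref.mul_(decay).add_(p, alpha=1.0 - decay)

def train_step(model, ref_model, batch, optimizer, beta_T, ema_every, step):
    """One hypothetical RealDPO step on a pre-built (win, lose) latent pair."""
    # Assumed batch layout: real/synthetic latents, shared prompt embedding,
    # sampled timestep, per-pair noise, and the weighting term omega(lambda_t).
    x0_w, x0_l, prompt_emb, t, noise_w, noise_l, omega_t = batch

    # Hypothetical denoise() wrapper: noises x0 to timestep t and returns the
    # model's x0 prediction for that noised latent.
    pred_w = model.denoise(x0_w, noise_w, t, prompt_emb)
    pred_l = model.denoise(x0_l, noise_l, t, prompt_emb)
    with torch.no_grad():  # reference predictions carry no gradient
        ref_w = ref_model.denoise(x0_w, noise_w, t, prompt_emb)
        ref_l = ref_model.denoise(x0_l, noise_l, t, prompt_emb)

    loss = realdpo_loss(x0_w, x0_l, pred_w, pred_l, ref_w, ref_l, beta_T, omega_t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Periodic EMA refresh of the reference model, as described above.
    if step % ema_every == 0:
        update_ema(ref_model, model)
    return loss.item()
```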
Quantitative and Qualitative Results
Extensive experiments demonstrate that RealDPO achieves superior performance in video quality, text alignment, and motion realism compared to SFT, reward-model-based methods (LiFT, VideoAlign), and pretrained baselines.
- User Study: RealDPO outperforms all baselines in Overall Quality, Visual Alignment, Text Alignment, Motion Quality, and Human Quality.
- MLLM Evaluation: Using Qwen2-VL, RealDPO matches or exceeds baselines in all dimensions, with particularly strong results in motion and human quality.
- VBench-I2V Metrics: RealDPO achieves competitive scores across subject consistency, background consistency, motion smoothness, and aesthetic quality.
Figure 3: RealDPO generates more natural motion compared to SFT, as evidenced by qualitative comparisons.
Figure 4: RealDPO alignment significantly improves the naturalness and consistency of generated actions.
Figure 5: RealDPO demonstrates superior visual and semantic alignment compared to reward-model-based methods.
RealAction-5K Dataset
The RealAction-5K dataset is a key contribution, providing high-quality, diverse, and well-annotated videos of human actions. The dataset is constructed via a multi-stage pipeline: keyword-based collection, LLM-based filtering, manual inspection, and automated captioning.
Figure 6: Overview of RealAction-5K, including sample diversity, data processing, and caption statistics.
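As a rough illustration of how such a multi-stage pipeline can be organized (not the authors' tooling), the sketch below strings the stages together; mllm_filter, human_review, and mllm_caption are hypothetical callables standing in for the Qwen2-VL-based filtering, manual inspection, and automated captioning steps.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

@dataclass
class Clip:
    path: str          # location of the raw video clip
    keyword: str       # search keyword it was collected under (stage 1)
    caption: str = ""  # filled in by automated captioning (stage 4)

def curate(raw_clips: Iterable[Clip],
           mllm_filter: Callable[[str], bool],
           human_review: Callable[[str], bool],
           mllm_caption: Callable[[str], str]) -> List[Clip]:
    """Hypothetical skeleton of the multi-stage curation pipeline."""
    curated = []
    for clip in raw_clips:
        if not mllm_filter(clip.path):   # stage 2: MLLM-based quality/action filtering
            continue
        if not human_review(clip.path):  # stage 3: manual inspection
            continue
        clip.caption = mllm_caption(clip.path)  # stage 4: automated captioning
        curated.append(clip)
    return curated
```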
Trade-offs and Limitations
- Data Efficiency: RealDPO requires fewer high-quality samples than SFT, but the quality of real data is critical for effective alignment.
- Model Constraints: The effectiveness of RealDPO is bounded by the expressiveness of the underlying video generative model.
- Generalization: While RealDPO excels in human action synthesis, its extension to other domains (e.g., non-human motion, abstract scenes) requires further investigation.
Implications and Future Directions
RealDPO raises the upper bound of preference alignment in video generation by directly leveraging real data, offering a scalable and robust alternative to reward-model-based methods. The paradigm is particularly suited to complex motion synthesis, where reward models fall short. Future work may explore:
- Domain Extension: Adapting RealDPO to broader video domains, including multi-agent and non-human scenarios.
- Automated Data Curation: Integrating more sophisticated LLMs for automated filtering and annotation.
- Hybrid Alignment: Combining real-data preference learning with weak reward models for domains lacking sufficient real data.
Conclusion
RealDPO presents a data-efficient, robust framework for preference alignment in video generation, leveraging real-world data as win samples and a tailored DPO loss for diffusion-based transformers. The approach demonstrably improves motion realism, text alignment, and overall video quality, outperforming SFT and reward-model-based methods. The introduction of RealAction-5K further supports scalable and effective training. RealDPO sets a new standard for preference alignment in complex motion video synthesis and provides a foundation for future research in multimodal generative modeling.