
Improving Video Generation with Human Feedback (2501.13918v1)

Published 23 Jan 2025 in cs.CV, cs.AI, cs.GR, and cs.LG

Abstract: Video generation has achieved significant advances through rectified flow techniques, but issues like unsmooth motion and misalignment between videos and prompts persist. In this work, we develop a systematic pipeline that harnesses human feedback to mitigate these problems and refine the video generation model. Specifically, we begin by constructing a large-scale human preference dataset focused on modern video generation models, incorporating pairwise annotations across multi-dimensions. We then introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices impact its rewarding efficacy. From a unified reinforcement learning perspective aimed at maximizing reward with KL regularization, we introduce three alignment algorithms for flow-based models by extending those from diffusion models. These include two training-time strategies: direct preference optimization for flow (Flow-DPO) and reward weighted regression for flow (Flow-RWR), and an inference-time technique, Flow-NRG, which applies reward guidance directly to noisy videos. Experimental results indicate that VideoReward significantly outperforms existing reward models, and Flow-DPO demonstrates superior performance compared to both Flow-RWR and standard supervised fine-tuning methods. Additionally, Flow-NRG lets users assign custom weights to multiple objectives during inference, meeting personalized video quality needs. Project page: https://gongyeliu.github.io/videoalign.

Summary

  • The paper introduces a comprehensive pipeline leveraging human preference data and a dedicated reward model to enhance video generation quality and text alignment.
  • It adapts three alignment techniques—Flow-DPO, Flow-RWR, and Flow-NRG—for rectified flow-based models to optimize visual quality and motion stability.
  • Experimental results show that using a constant β in Flow-DPO and training a noisy-latent reward model in Flow-NRG significantly improve video quality and benchmark performance.

This paper introduces a comprehensive pipeline for improving video generation models, particularly those based on rectified flow, by leveraging human feedback (2501.13918). The core problem addressed is the persistence of issues like unsmooth motion and misalignment between generated videos and text prompts, even in advanced models. The proposed solution involves collecting human preference data, training a reward model, and developing alignment algorithms specifically adapted for flow-based generators.

1. Human Preference Data and Reward Model (VideoReward)

  • Dataset Construction: A new large-scale dataset was created containing approximately 182k annotated triplets (prompt, video A, video B). Videos were generated using 12 diverse text-to-video models, including modern ones like Luma, Gen3, and Kling. Human annotators provided pairwise preference labels across three dimensions:
    • Visual Quality (VQ): Static quality, clarity, detail, aesthetics.
    • Motion Quality (MQ): Stability, naturalness, fluidity, dynamic aspects.
    • Text Alignment (TA): Relevance to the prompt regarding subject, motion, environment, style.
  • VideoReward Model:
    • A multi-dimensional reward model named VideoReward was developed using Qwen2-VL-2B as the backbone.
    • Training: The model was trained with a Bradley-Terry-with-Ties (BTT) loss, which explicitly models tied preferences and was found to outperform both the standard Bradley-Terry (BT) loss and pointwise score regression, especially in distinguishing videos of similar quality (a minimal sketch of such a loss follows this list).
    • Architecture Detail: To decouple context-agnostic rewards (VQ, MQ) from the text prompt, separate special tokens are introduced in the input sequence. VQ/MQ tokens are placed before the prompt, while the TA token is placed after, leveraging the VLM's causal attention mechanism. This allows VQ/MQ scores to depend only on the video, enhancing interpretability and independent evaluation.
  • VideoGen-RewardBench: A new benchmark of 26.5k annotated video pairs, derived from the VideoGen-Eval dataset, was created to evaluate reward models specifically on outputs from modern video generators. On this benchmark, VideoReward outperformed baselines such as VideoScore, LiFT, and VisionReward.
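
As a concrete reference for the training bullet above, here is a minimal sketch of a Bradley-Terry-with-Ties pairwise loss. It uses the standard Rao-Kupper tie formulation; the paper's exact parameterization may differ, and the reward scores `r_a`, `r_b`, the tie parameter `theta`, and the label encoding are placeholder assumptions rather than the authors' implementation.

```python
import torch

def btt_loss(r_a: torch.Tensor, r_b: torch.Tensor, label: torch.Tensor,
             theta: float = 1.5) -> torch.Tensor:
    """Bradley-Terry-with-Ties (Rao-Kupper) negative log-likelihood.

    r_a, r_b : scalar rewards for video A and video B, shape (batch,)
    label    : 0 -> A preferred, 1 -> B preferred, 2 -> tie
    theta    : tie parameter (> 1); larger values assign more mass to ties
    Written for clarity, not numerical stability.
    """
    ea, eb = torch.exp(r_a), torch.exp(r_b)
    p_a = ea / (ea + theta * eb)            # P(A preferred over B)
    p_b = eb / (eb + theta * ea)            # P(B preferred over A)
    p_tie = 1.0 - p_a - p_b                 # remaining mass is the tie probability
    probs = torch.stack([p_a, p_b, p_tie], dim=-1)
    nll = -torch.log(probs.gather(-1, label.long().unsqueeze(-1)).squeeze(-1) + 1e-8)
    return nll.mean()
```

One such loss would typically be applied per preference dimension (VQ, MQ, TA), with the three reward scores read off the corresponding special tokens.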

2. Flow-Based Alignment Algorithms

The paper adapts three alignment techniques from reinforcement learning and diffusion models to the rectified flow framework, which predicts velocity fields instead of noise. These algorithms aim to maximize the reward predicted by VideoReward while regularizing against deviation from the original model (KL divergence).
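
In generic RLHF notation (which may differ cosmetically from the paper's), the shared objective behind all three algorithms can be written as:

```latex
\max_{p_\theta}\;
  \mathbb{E}_{c \sim \mathcal{D},\; x_0 \sim p_\theta(\cdot \mid c)}\big[ r(x_0, c) \big]
  \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big[ p_\theta(x_0 \mid c) \,\big\|\, p_{\mathrm{ref}}(x_0 \mid c) \big]
```

where $p_{\mathrm{ref}}$ is the pretrained (reference) model and $\beta$ controls how far the aligned model may drift from it.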

  • Flow-DPO (Direct Preference Optimization for Flow):
    • Adapts Diffusion-DPO to optimize the flow model directly on the collected preference pairs ($x_0^w \succ x_0^l$, i.e., the preferred video ranked above the rejected one).
    • The derived loss function relates the DPO objective to the L2 error in the predicted velocities.
    • Key Implementation Finding: The derivation yields a time-dependent KL-regularization coefficient β_t = β(1-t)². However, experiments showed this weighting scheme led to suboptimal performance (especially on TA) and potential reward hacking, while using a constant β across all timesteps yielded significantly better and more stable alignment results, making a constant β the preferred practical choice. Pseudocode is provided in the paper's appendix; a hedged training-loss sketch also follows this list.
  • Flow-RWR (Reward Weighted Regression for Flow):
    • Adapts RWR by weighting the standard flow matching loss for each sample based on its reward.
    • The objective encourages the model to fit the velocity fields of high-reward videos more closely.
  • Flow-NRG (Noisy Reward Guidance):
    • An inference-time alignment technique based on classifier guidance principles.
    • It modifies the predicted velocity at each step of the ODE solve by adding a term proportional to the gradient of the reward function with respect to the noisy intermediate representation $x_t$.
    • Key Implementation Detail: Calculating the reward gradient directly on $x_t$ is crucial. Since models often operate in a latent space, passing gradients through the VAE decoder is computationally expensive. The proposed solution is to train a lightweight, time-dependent reward function directly on noisy latent representations; this reward model can leverage the early layers of the pretrained video generation model. This approach proved effective, whereas a reward model trained only on clean videos (t=0) failed to provide useful guidance.
    • Flexibility: Flow-NRG allows users to apply custom weights to different reward dimensions (VQ, MQ, TA) during inference to tailor generation to specific needs without retraining the base model. Pseudocode is in the paper's appendix; a hedged guidance-step sketch follows the Flow-DPO sketch below.
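
The following is a minimal sketch of a Flow-DPO training loss with the constant β that the paper recommends. The base-model call signature, the latent tensor layout, and the rectified-flow convention $x_t = (1-t)\,x_0 + t\,\epsilon$ with target velocity $\epsilon - x_0$ are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def flow_dpo_loss(model, ref_model, x0_w, x0_l, cond, beta: float):
    """Sketch of a DPO-style loss for rectified flow with a constant beta.

    x0_w / x0_l : latents of the preferred / rejected videos, shape (B, C, F, H, W)
    cond        : text conditioning (whatever the base model expects)
    """
    b = x0_w.shape[0]
    t = torch.rand(b, device=x0_w.device).view(b, 1, 1, 1, 1)   # one shared timestep per pair
    noise_w, noise_l = torch.randn_like(x0_w), torch.randn_like(x0_l)

    xt_w = (1 - t) * x0_w + t * noise_w                          # assumed interpolation
    xt_l = (1 - t) * x0_l + t * noise_l
    u_w, u_l = noise_w - x0_w, noise_l - x0_l                    # assumed velocity targets

    def vel_err(net, xt, u):
        v = net(xt, t.flatten(), cond)                           # hypothetical call signature
        return ((v - u) ** 2).mean(dim=(1, 2, 3, 4))

    with torch.no_grad():                                        # frozen reference model
        ref_w, ref_l = vel_err(ref_model, xt_w, u_w), vel_err(ref_model, xt_l, u_l)

    diff_w = vel_err(model, xt_w, u_w) - ref_w                   # winner: policy vs. reference
    diff_l = vel_err(model, xt_l, u_l) - ref_l                   # loser:  policy vs. reference
    return -F.logsigmoid(-0.5 * beta * (diff_w - diff_l)).mean()
```

Intuitively, the loss rewards the policy for fitting the preferred video's velocity target better than the reference model does, relative to how it treats the rejected video.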

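A corresponding sketch of one reward-guided sampling step in the spirit of Flow-NRG is below. The reward heads, their call signatures, and the solver and sign conventions (integrating from t = 1, noise, down to t = 0, data) are assumptions for illustration.

```python
import torch

def flow_nrg_step(model, reward_heads, weights, x_t, t, dt, cond, scale=1.0):
    """One Euler step of reward-guided rectified-flow sampling (Flow-NRG-style).

    reward_heads : dict of lightweight reward models trained on *noisy* latents,
                   e.g. {"VQ": r_vq, "MQ": r_mq, "TA": r_ta}  (hypothetical names)
    weights      : user-chosen per-dimension guidance weights, e.g. {"VQ": 1.0, ...}
    """
    with torch.no_grad():
        v = model(x_t, t, cond)                                  # base velocity prediction

    # Gradient of the weighted noisy-latent reward w.r.t. the current latent.
    x_req = x_t.detach().requires_grad_(True)
    total_reward = sum(weights[k] * reward_heads[k](x_req, t).sum() for k in reward_heads)
    grad = torch.autograd.grad(total_reward, x_req)[0]

    v_guided = v - scale * grad                  # nudge the trajectory toward higher reward
    return (x_t - dt * v_guided).detach()        # Euler update toward t = 0 (the data end)
```
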
3. Experimental Validation

  • Setup: A pretrained transformer-based latent flow model was used as the base. Alignment methods were applied using LoRA for efficient fine-tuning. The collected preference dataset was relabeled using the trained VideoReward model to provide synthetic ground truth for the alignment experiments (a small sketch of this relabeling and of the win-rate metric follows this list).
  • Evaluation: Performance was measured using automatic metrics (VideoReward win rate against the pretrained model, VBench scores) and human evaluations. A challenging prompt set (TA-Hard) was created to specifically test text alignment.
  • Results:
    • Flow-DPO (with constant β) consistently outperformed the pretrained model, SFT (supervised fine-tuning on preferred samples only), and Flow-RWR across multiple dimensions and evaluation benchmarks. Human evaluations confirmed Flow-DPO's effectiveness.
    • Flow-NRG successfully improved specific quality dimensions at inference time, demonstrating the utility of the noisy reward model and the weighted guidance approach.
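
For completeness, a small sketch of the relabeling and win-rate computations mentioned in the Setup and Evaluation bullets; `reward_model(video, prompt) -> float` is an assumed interface, not the paper's API.

```python
def relabel_pairs(reward_model, pairs):
    """Relabel (prompt, video_a, video_b) triplets with the trained reward model."""
    return [(p, a, b, "a" if reward_model(a, p) >= reward_model(b, p) else "b")
            for p, a, b in pairs]

def win_rate(reward_model, aligned_videos, baseline_videos, prompts):
    """Fraction of prompts on which the aligned model's output scores higher."""
    wins = sum(reward_model(va, p) > reward_model(vb, p)
               for va, vb, p in zip(aligned_videos, baseline_videos, prompts))
    return wins / len(prompts)
```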

4. Practical Implications and Limitations

  • The work provides a practical pipeline for aligning modern flow-based video generators using human feedback.
  • The VideoReward model and VideoGen-RewardBench benchmark are valuable resources for evaluating T2V models.
  • Flow-DPO with a constant β appears to be the most effective training-time alignment method investigated.
  • Flow-NRG offers a flexible inference-time approach, but requires training an auxiliary reward model on noisy latents.
  • Limitations: Excessive DPO training can degrade overall video quality (LoRA helps mitigate this). Reward hacking remains a potential issue. The methods were validated on text-to-video generation; extending them to other conditional tasks is left as future work.

In summary, the paper makes significant contributions by creating tailored resources (dataset, reward model, benchmark) for modern video generation and adapting alignment algorithms (DPO, RWR, reward guidance) to the rectified flow paradigm, offering practical methods to improve video quality and prompt alignment (2501.13918). The findings that a constant β works best for Flow-DPO and that Flow-NRG requires a reward model trained on noisy latents are key practical insights for implementation.
