
Flow-RWR: Flow-based Reward-Weighted Regression

Updated 16 April 2026
  • Flow-RWR is an alignment algorithm designed for flow-based models that enhances video generation by using reward-weighted regression, omitting extra KL regularization.
  • It employs an EM-style optimization and exponential reward weighting to prioritize high-quality motion trajectories, reducing artifacts like frame jitter.
  • Experimental comparisons show improvements in visual and motion quality over the pretrained model on VideoGen-Eval, scores close to those of supervised fine-tuning, and more modest prompt alignment gains than Flow-DPO.

Flow-RWR (Flow-based Reward-Weighted Regression) is an alignment algorithm designed for flow-based generative models, with particular application to video generation conditioned on prompts. It extends the reward-weighted regression (RWR) paradigm from diffusion models to velocity-predicting rectified-flow architectures, incorporating human or automated feedback via reward functions. Flow-RWR optimizes models to generate outputs preferred by a reward model, enabling refinements in motion quality and, to a moderate extent, prompt alignment, without the need for additional KL-regularizers.

1. Theoretical Foundations

Flow-RWR is grounded in the reinforcement learning from human feedback (RLHF) framework, which optimizes a conditional generative policy $p_\theta(x_0 \mid y)$, where $x_0$ represents generated data (e.g., a video) and $y$ a prompt, subject to a regularizer that constrains deviation from a reference policy $p_{\rm ref}$:

$\max_{p_\theta} \Big\{ \mathbb{E}_{y \sim \mathcal D,\,x_0 \sim p_\theta(\cdot \mid y)} [r(x_0, y)] - \beta\, \mathrm{D}_{\mathrm{KL}}[p_\theta(\cdot \mid y) \,\|\, p_{\rm ref}(\cdot \mid y)] \Big\}.$

The analytic solution to this objective is a Boltzmann-reweighted distribution:

$p_\theta(x_0 \mid y) = \frac{1}{Z(y)}\, p_{\rm ref}(x_0 \mid y) \exp\left(\frac{1}{\beta} r(x_0, y)\right),$

where $Z(y)$ is the partition function that normalizes the distribution.
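The closed-form solution can be checked numerically on a small discrete example. The sketch below is illustrative only: the three-outcome reference distribution, the rewards, and $\beta = 0.5$ are made-up values, and the function names are hypothetical. It verifies that the Boltzmann-reweighted distribution attains the objective value $\beta \log Z$ and scores higher than the reference policy.

```python
import math


def kl_regularized_objective(p, p_ref, rewards, beta):
    """E_p[r] - beta * KL(p || p_ref) for discrete distributions."""
    expected_reward = sum(pi * ri for pi, ri in zip(p, rewards))
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, p_ref) if pi > 0)
    return expected_reward - beta * kl


def boltzmann_policy(p_ref, rewards, beta):
    """Return p*(x) = p_ref(x) * exp(r(x) / beta) / Z and the normalizer Z."""
    unnorm = [qi * math.exp(ri / beta) for qi, ri in zip(p_ref, rewards)]
    z = sum(unnorm)
    return [u / z for u in unnorm], z


# Toy values chosen for illustration (not from the source).
p_ref = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
beta = 0.5

p_star, z = boltzmann_policy(p_ref, rewards, beta)
j_star = kl_regularized_objective(p_star, p_ref, rewards, beta)
j_ref = kl_regularized_objective(p_ref, p_ref, rewards, beta)
print(f"J(p*) = {j_star:.4f}, beta*log Z = {beta * math.log(z):.4f}, J(p_ref) = {j_ref:.4f}")
```

A smaller $\beta$ concentrates $p^*$ on the highest-reward outcome, while a larger $\beta$ keeps it close to $p_{\rm ref}$.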

Flow-RWR adopts an EM-style learning approach, but omits the KL/divergence regularizer, instead weighting the standard regression loss by $\exp(r(x_0, y))$ to emphasize high-reward samples.

2. Formulation and Training Objective

For rectified-flow models, which predict the velocity field $v = \epsilon - x_0$ instead of directly generating data or denoising, Flow-RWR minimizes:

$\mathcal{L}_{\rm Flow\mbox{-}RWR}(\theta) = \mathbb{E}_{y,x_0,\epsilon,t} \Big[ \exp(r(x_0, y))\, \|v - v_\theta(x_t, t, y)\|^2 \Big],$

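where $t \in [0, 1]$ is the sampled timestep, $\epsilon \sim \mathcal{N}(0, I)$ is Gaussian noise, $x_t = (1 - t)\,x_0 + t\,\epsilon$ is the interpolated sample, and $v_\theta(x_t, t, y)$ is the model's velocity prediction at noisy input $x_t$. The reward $r(x_0, y)$ is typically obtained from a trained reward model such as VideoReward. This loss attaches exponentially greater importance to high-reward trajectories during optimization.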
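The loss can be written out term by term in plain Python. This is a framework-free sketch under stated assumptions: the linear interpolation $x_t = (1 - t)x_0 + t\epsilon$ is inferred from the velocity target $v = \epsilon - x_0$, timesteps are drawn uniformly from $[0, 1)$, and the function names and toy batch are hypothetical.

```python
import math
import random


def interpolate(x0, eps, t):
    """x_t = (1 - t) * x0 + t * eps, elementwise (assumed straight-line path)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, eps)]


def velocity_target(x0, eps):
    """v = eps - x0, the constant velocity along the straight path."""
    return [b - a for a, b in zip(x0, eps)]


def flow_rwr_loss(batch, v_theta):
    """Monte Carlo estimate of E[exp(r) * ||v - v_theta(x_t, t, y)||^2].

    batch: list of (x0, y, reward) triples; v_theta: callable (x_t, t, y) -> list.
    """
    total = 0.0
    for x0, y, reward in batch:
        t = random.random()                          # uniform timestep (assumption)
        eps = [random.gauss(0.0, 1.0) for _ in x0]   # Gaussian noise
        x_t = interpolate(x0, eps, t)
        v = velocity_target(x0, eps)
        pred = v_theta(x_t, t, y)
        sq_err = sum((a - b) ** 2 for a, b in zip(v, pred))
        total += math.exp(reward) * sq_err           # exponential reward weight
    return total / len(batch)


# Toy usage: a predictor that always outputs zero velocity.
zero_predictor = lambda x_t, t, y: [0.0] * len(x_t)
toy_batch = [([0.5, -0.5], "prompt a", 0.0), ([1.0, 2.0], "prompt b", 1.0)]
random.seed(0)
print(flow_rwr_loss(toy_batch, zero_predictor))
```

Because the weight is $\exp(r)$, adding a constant $c$ to every reward multiplies the loss by $e^{c}$ without changing its minimizer, which is one reason reward scale mainly affects the effective step size and the relative influence of samples.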

3. Implementation Details

Training follows a standard deep learning pipeline. For each minibatch, real data $x_0$ and prompts $y$ are sampled, and $x_0$ is perturbed with Gaussian noise $\epsilon$ and interpolated to $x_t$. The model predicts $v_\theta(x_t, t, y)$, which is compared to the ground-truth velocity $v = \epsilon - x_0$. The MSE loss is multiplied by $\exp(r(x_0, y))$, where the reward model provides $r(x_0, y)$. Optionally, the weights $\exp(r(x_0, y))$ are normalized to control training dynamics.

The per-minibatch procedure above corresponds to a single PyTorch-style training step.
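A minimal PyTorch sketch of one such training step follows. It is an illustration under assumptions, not the authors' code: the interpolation $x_t = (1 - t)x_0 + t\epsilon$, uniform timestep sampling, mean-normalized weights, and a reward model returning one scalar per sample are all assumed, and `flow_rwr_step`, `ToyVelocity`, the toy reward, and the learning rate are hypothetical placeholders.

```python
import torch
import torch.nn as nn


def flow_rwr_step(model, reward_model, optimizer, x0, y, normalize=True):
    """One reward-weighted regression update on a minibatch (x0, y)."""
    batch = x0.shape[0]
    t = torch.rand(batch, device=x0.device)        # uniform timesteps (assumption)
    eps = torch.randn_like(x0)                     # Gaussian noise
    t_b = t.view(batch, *([1] * (x0.dim() - 1)))   # broadcast t over data dims
    x_t = (1.0 - t_b) * x0 + t_b * eps             # noisy interpolated input
    v_target = eps - x0                            # rectified-flow velocity target

    # Rewards act as fixed per-sample weights; no gradient flows through them.
    with torch.no_grad():
        w = torch.exp(reward_model(x0, y))         # shape (batch,)
        if normalize:
            w = w / w.mean()                       # optional: average weight of 1

    v_pred = model(x_t, t, y)
    per_sample = ((v_pred - v_target) ** 2).flatten(1).mean(dim=1)
    loss = (w * per_sample).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


class ToyVelocity(nn.Module):
    """Hypothetical stand-in for a velocity network on flat vectors."""

    def __init__(self, dim, cond_dim):
        super().__init__()
        self.net = nn.Linear(dim + cond_dim + 1, dim)

    def forward(self, x_t, t, y):
        return self.net(torch.cat([x_t, y, t[:, None]], dim=1))


# Toy usage with random data and a made-up reward (higher for small-norm x0).
torch.manual_seed(0)
model = ToyVelocity(dim=4, cond_dim=3)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # placeholder value
toy_reward = lambda x0, y: -x0.pow(2).mean(dim=1)
x0, y = torch.randn(8, 4), torch.randn(8, 3)
loss_value = flow_rwr_step(model, toy_reward, optimizer, x0, y)
```

The per-sample error here is a mean over data dimensions rather than a sum; the two differ only by a constant factor. In a real pipeline the rewards could also be precomputed once over the dataset instead of being evaluated at every step.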

Key experimental hyperparameters include the Adam optimizer, batch size 64, a single training epoch over the relabeled dataset, and LoRA rank 64 with dropout 0 on transformer projections (Liu et al., 23 Jan 2025).

4. Comparative Performance

Flow-RWR has been systematically evaluated against Supervised Fine-Tuning (SFT), Flow-DPO (a Direct Preference Optimization method for flows), and raw pretrained models. Metrics span VBench (Total, Quality, and Semantic scores), VideoGen-Eval (Visual Quality, VQ; Motion Quality, MQ; Text Alignment, TA), and TA-Hard (a difficult prompt set, human-evaluated).

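| Method | VBench Total | VBench Quality | VBench Semantic | VQ | MQ | TA |
|---|---|---|---|---|---|---|
| Pretrained | 83.19 | 84.37 | 78.46 | 50.0 | 50.0 | 50.0 |
| SFT | 82.31 | 83.13 | 79.04 | 51.28 | 65.21 | 52.84 |
| Flow-RWR | 82.27 | 83.19 | 78.59 | 51.55 | 63.90 | 53.43 |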

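Both fine-tuned methods improve substantially over the pretrained model on VideoGen-Eval Motion Quality. Relative to SFT, Flow-RWR is marginally higher on VBench Quality (83.19 vs. 83.13), Visual Quality (51.55 vs. 51.28), and Text Alignment (53.43 vs. 52.84), and marginally lower on VBench Total (82.27 vs. 82.31), VBench Semantic (78.59 vs. 79.04), and Motion Quality (63.90 vs. 65.21). Flow-DPO, especially with a well-chosen temperature parameter $\beta$, typically achieves the best prompt alignment and overall scores (Liu et al., 23 Jan 2025).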
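The reported rows can be compared metric by metric with a few lines of Python. The numbers below are copied from the table; the dictionary layout and helper names are hypothetical conveniences for illustration.

```python
# Scores copied from the comparison table; column order is preserved.
results = {
    "Pretrained": {"Total": 83.19, "Quality": 84.37, "Semantic": 78.46,
                   "VQ": 50.0, "MQ": 50.0, "TA": 50.0},
    "SFT": {"Total": 82.31, "Quality": 83.13, "Semantic": 79.04,
            "VQ": 51.28, "MQ": 65.21, "TA": 52.84},
    "Flow-RWR": {"Total": 82.27, "Quality": 83.19, "Semantic": 78.59,
                 "VQ": 51.55, "MQ": 63.90, "TA": 53.43},
}


def delta(method, baseline, metric):
    """Score difference (method minus baseline), rounded to two decimals."""
    return round(results[method][metric] - results[baseline][metric], 2)


def wins(method, baseline):
    """Metrics on which `method` scores strictly higher than `baseline`."""
    return [m for m in results[method]
            if results[method][m] > results[baseline][m]]


for metric in results["Flow-RWR"]:
    print(f"{metric}: Flow-RWR vs SFT {delta('Flow-RWR', 'SFT', metric):+.2f}, "
          f"vs Pretrained {delta('Flow-RWR', 'Pretrained', metric):+.2f}")
```

The differences between Flow-RWR and SFT are all small next to the gap either method opens over the pretrained model on Motion Quality.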

5. Qualitative and Practical Considerations

Flow-RWR yields smoother video motion and fewer artifacts such as frame jitter than the pretrained model, because samples ranked highly by the reward model exert substantially greater influence during training. Its Motion Quality score is close to, but slightly below, that of SFT (63.90 vs. 65.21), and its Text Alignment does not surpass that of Flow-DPO under optimal hyperparameters.

The algorithm is sensitive to reward scaling. Overweighting a small subset of high-reward samples can destabilize training, which motivates normalizing the weights or tuning a temperature for stability. Computationally, Flow-RWR incurs additional expense from batchwise reward model evaluation and exponentiation at each step.
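One way to see how strongly a few samples dominate is the Kish effective sample size of the normalized weights. The sketch below is a diagnostic under assumptions: the `temperature` parameter is not part of the loss as stated in the source (which uses $\exp(r)$, i.e. temperature 1), and the function names and example rewards are hypothetical.

```python
import math


def normalized_weights(rewards, temperature=1.0):
    """Self-normalized weights proportional to exp(r / temperature), summing to 1."""
    scaled = [r / temperature for r in rewards]
    m = max(scaled)                              # subtract the max for numerical stability
    unnorm = [math.exp(s - m) for s in scaled]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    return sum(weights) ** 2 / sum(w * w for w in weights)


# One outlier reward dominates the batch at temperature 1.
rewards = [0.0, 0.0, 0.0, 10.0]
for temperature in (1.0, 10.0):
    ess = effective_sample_size(normalized_weights(rewards, temperature))
    print(f"temperature={temperature}: effective sample size {ess:.2f} of {len(rewards)}")
```

An effective sample size near 1 means the gradient is driven almost entirely by a single sample. Raising the temperature, or rescaling the rewards, spreads the weight more evenly.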

6. Limitations and Prospective Developments

Flow-RWR's principal challenges include managing variance in reward weighting and scaling for large reward model architectures. Future work proposed includes:

  • Temperature annealing or clipping for stable weight distributions,
  • Optional integration of a KL regularizer to limit drift from the reference model,
  • Cross-domain application to generation modalities beyond video (e.g., image-to-video, audio),
  • Exploration of policy-gradient methods such as PPO within flow-based model classes.
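The first item in the list can be sketched in a few lines. This is a hypothetical illustration of temperature annealing and weight clipping, not something the source specifies: the linear schedule, the start and end temperatures, and the clipping threshold are all assumed values.

```python
import math


def annealed_temperature(step, total_steps, t_start=5.0, t_end=1.0):
    """Linearly anneal the temperature from t_start (step 0) to t_end (last step)."""
    frac = min(max(step / max(total_steps - 1, 1), 0.0), 1.0)
    return t_start + frac * (t_end - t_start)


def clipped_weights(rewards, temperature=1.0, max_weight=10.0):
    """Weights exp(r / temperature), clipped from above at max_weight."""
    return [min(math.exp(r / temperature), max_weight) for r in rewards]


# Early steps use a high temperature (flatter weights); later steps sharpen them.
rewards = [0.0, 1.0, 5.0]
for step in (0, 50, 99):
    temperature = annealed_temperature(step, total_steps=100)
    weights = clipped_weights(rewards, temperature)
    print(step, round(temperature, 2), [round(w, 2) for w in weights])
```

Clipping bounds the influence of any single sample at the cost of no longer matching the exact exponential weighting, so the threshold is a trade-off between stability and fidelity to the objective.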

A plausible implication is that Flow-RWR's simplicity and compatibility with off-the-shelf reward models make it appealing for rapid alignment of flow-based generators when preference data is accessible, but further advances in reward modeling and optimization stability are needed to match the text alignment strengths of relative preference optimization approaches (Liu et al., 23 Jan 2025).

7. Relation to Broader Flow-Matching and RL Paradigms

Flow-RWR participates in a broader methodological context wherein reward-weighted objectives are used to adapt flow-matching (and diffusion) models for control, robotics, and aligned generation:

  • In generalist robotics, flow-matching with reward weighting can surpass suboptimal policy demonstrators, both by amplifying preferred behaviors seen in data and by augmenting exploration beyond the original support via trajectory perturbations (Pfrommer et al., 20 Jul 2025).
  • The exponential reward weighting paradigm is shared with RWR-style methods in RL and diffusion model alignment, embodying the principle that sample quality, as encoded by a learned reward, provides a robust signal for improving velocity-based generation systems.

This suggests that Flow-RWR serves as a key bridge from imitation and preference-based policy optimization to highly scalable and controllable generative modeling under flow-based architectures.

