Video Reward Models
- Video reward models are learned functions that map video inputs to scalar rewards, providing fine-grained evaluation for generative modeling and reinforcement learning.
- They utilize diverse architectures—including generative, discriminative, and semi-scalar approaches—built on video LLMs, transformers, and diffusion models to capture temporal and spatial fidelity.
- Applications range from text-to-video synthesis to RL from demonstrations, delivering improved preference accuracy and efficient optimization in video generation tasks.
Video Reward Models are learned functions or model-based algorithms that evaluate the quality, alignment, or behavioral fidelity of video sequences in generative modeling, reinforcement learning, and video understanding tasks. By mapping video inputs (often alongside prompts or questions) to scalar reward signals, these models provide critical feedback for preference optimization, supervision, and automated evaluation. They are integral to modern text-to-video synthesis, RL from demonstration, and multimodal alignment pipelines, enabling fine-grained supervision not just of semantic content but of temporal structure, safety, and fairness. Recent advances encompass discriminative models, mixture-of-experts architectures, patch-level rewards, latent-space reward modeling, and physically grounded predictive architectures.
1. Taxonomy and Core Definitions
Video Reward Models fall into three principal categories, systematically enumerated in VideoRewardBench (Zhang et al., 30 Aug 2025):
- Generative Reward Models: Prompted LVLMs (e.g., GPT-4o, LLaVA-Critic, UnifiedReward) output textual verdicts or perform pairwise ranking, including “fast-thinking” (direct ranking) and “slow-thinking” (chain-of-thought with RL) pipelines.
- Discriminative Reward Models: Directly output scalar reward values for (video, prompt, candidate-response) tuples (e.g., Skywork-VL-Reward, IXC-2.5-Reward (Zang et al., 21 Jan 2025)).
- Semi-Scalar Reward Models: Generate textual critiques, then map (video, prompt, response, critique) to scalars via secondary critics (e.g., MM-RLHF-Reward).
Video reward models are implemented atop backbone architectures such as VideoLLMs (InternVL2, Qwen2.5), discriminative vision encoders (CLIP, ViT), spatio-temporal transformers (VideoMAE, V-JEPA, Mantis-Idefics), or as adapters within diffusion-model latent spaces (Mi et al., 26 Nov 2025).
MRMs (Multimodal Reward Models) are formally defined as learned functions $r_\theta(x, y) \in \mathbb{R}$, scoring the match quality between a video-text prompt $x$ and a response $y$. In RL contexts, reward models serve as dense, temporally informative surrogate reward functions, obviating hand-engineering.
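As an illustration of the discriminative interface above, here is a minimal sketch of a scalar reward head over a pooled multimodal representation; `encoder` is a placeholder for any video-LLM or vision backbone, and the hidden size is an arbitrary assumption.

```python
# Minimal sketch of a discriminative video reward model: r_theta(video, prompt,
# response) -> scalar. The backbone is a placeholder; only the reward head is shown.
import torch
import torch.nn as nn


class DiscriminativeVideoRM(nn.Module):
    """Scores (video, prompt, response) tuples with a single scalar reward."""

    def __init__(self, encoder: nn.Module, hidden_dim: int = 1024):
        super().__init__()
        self.encoder = encoder  # placeholder: any video-LLM / CLIP / ViT backbone
        self.reward_head = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 2),
            nn.GELU(),
            nn.Linear(hidden_dim // 2, 1),  # scalar reward
        )

    def forward(self, video: torch.Tensor, prompt_ids: torch.Tensor,
                response_ids: torch.Tensor) -> torch.Tensor:
        # Assume the backbone returns one pooled joint embedding per example.
        h = self.encoder(video, prompt_ids, response_ids)  # (B, hidden_dim)
        return self.reward_head(h).squeeze(-1)             # (B,) scalar rewards
```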
2. Architectures, Training Objectives, and Benchmarks
Fine-Grained Reward Models:
MJ-VIDEO (Tong et al., 3 Feb 2025) advances a two-layer Mixture-of-Experts reward model: the first level routes each prompt-video pair to one of five aspect experts (Alignment, Safety, Fineness, Coherence & Consistency, Bias & Fairness); the second routes to 28 fine-grained criteria. Training optimizes an aggregated reward of the form $R(v, p) = \sum_i w_i \sum_{j \in \mathcal{C}_i} s_{ij}$, where $w_i$ are the aspect routing weights and $s_{ij}$ are the normalized criterion scores for the criteria $\mathcal{C}_i$ under aspect $i$.
The dataset, MJ-BENCH-VIDEO, covers 5,421 prompt-video pairs annotated along 28 criteria. Training objectives combine MSE regressions with pairwise logistic preference margins. Notably, MJ-VIDEO achieves +17.58% over prior baselines in strict preference accuracy.
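The following is a simplified sketch of this aggregation together with a pairwise logistic preference loss; the router, criterion scorers, and MJ-VIDEO's exact objectives are abstracted, and all tensor shapes are illustrative.

```python
# Simplified sketch of MoE-style reward aggregation plus a pairwise logistic
# (Bradley-Terry style) preference loss; abstracts MJ-VIDEO's actual routing.
import torch
import torch.nn.functional as F


def aggregate_reward(aspect_logits: torch.Tensor,
                     criterion_scores: torch.Tensor) -> torch.Tensor:
    """aspect_logits: (B, A) router outputs; criterion_scores: (B, A, C) in [0, 1]."""
    w = torch.softmax(aspect_logits, dim=-1)   # aspect routing weights w_i
    s = criterion_scores.mean(dim=-1)          # pool criterion scores within each aspect
    return (w * s).sum(dim=-1)                 # R = sum_i w_i * s_i, shape (B,)


def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor,
                    margin: float = 0.0) -> torch.Tensor:
    # Pairwise logistic preference loss with an optional margin.
    return -F.logsigmoid(r_chosen - r_rejected - margin).mean()


# Toy usage with random tensors standing in for model outputs.
B, A, C = 4, 5, 28                              # batch, 5 aspects, 28 criteria
r_w = aggregate_reward(torch.randn(B, A), torch.rand(B, A, C))
r_l = aggregate_reward(torch.randn(B, A), torch.rand(B, A, C))
loss = preference_loss(r_w, r_l)
```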
Latent Reward Models and Process-Aware Learning:
Mi et al. (Mi et al., 26 Nov 2025), as well as DOLLAR (Ding et al., 2024), demonstrate that efficient reward feedback learning and optimization can be performed entirely in the noisy latent space of modern video generators. The Process-Aware Video Reward Model (PAVRM) aggregates spatio-temporal features with a query-attention mechanism, computing a scalar reward $r_\theta(z_t, t)$ directly from noisy latents $z_t$ at arbitrary denoising timesteps $t$.
Preference optimization can then backpropagate gradients through the full video generation process in latent space, sharply reducing compute and memory while shaping both high-level motion and visual structure.
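A minimal sketch of such a latent-space reward head follows: a learned query cross-attends over noisy spatio-temporal latent tokens and emits a scalar at a given denoising timestep. The dimensions, timestep embedding, and module names are illustrative assumptions rather than PAVRM's exact design.

```python
# Sketch of a latent-space reward head: a learned query cross-attends over
# spatio-temporal latent tokens at denoising timestep t and emits a scalar.
import torch
import torch.nn as nn


class LatentQueryRewardHead(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.t_embed = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, dim))
        self.to_reward = nn.Linear(dim, 1)

    def forward(self, latent_tokens: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        """latent_tokens: (B, T*H*W, dim) noisy video latents; t: (B,) timesteps."""
        q = self.query.expand(latent_tokens.size(0), -1, -1)
        q = q + self.t_embed(t.float().unsqueeze(-1)).unsqueeze(1)  # timestep-aware query
        pooled, _ = self.attn(q, latent_tokens, latent_tokens)      # (B, 1, dim)
        return self.to_reward(pooled.squeeze(1)).squeeze(-1)        # (B,) rewards


# Because the head operates on latents, reward gradients can flow back through
# the generator's denoising steps without decoding to pixel space.
head = LatentQueryRewardHead()
r = head(torch.randn(2, 16 * 32, 768), torch.tensor([500, 120]))
```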
Patch-Level Reward Modeling:
HALO (Wang et al., 4 Feb 2025) distills patch-level reward models from GPT-4o labels, aligning patch rewards with global video scores (VideoScore backbone). Collaboration between global and local reward losses via Gran-DPO yields significant gains in VBench and VideoScore metrics, specifically suppressing localized hallucinations.
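Below is a simplified sketch of the patch-level idea: per-patch rewards are predicted from patch tokens and pooled to agree with a global video score, so that local defects can be localized and penalized. The alignment loss shown is a stand-in assumption, not HALO's actual Gran-DPO objective.

```python
# Sketch of patch-level reward modeling: per-patch scores are predicted from
# patch tokens and their mean is aligned with a global video reward.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchRewardHead(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        """patch_tokens: (B, P, dim) -> (B, P) per-patch rewards."""
        return self.scorer(patch_tokens).squeeze(-1)


def patch_global_alignment_loss(patch_rewards: torch.Tensor,
                                global_scores: torch.Tensor) -> torch.Tensor:
    # Encourage the pooled patch rewards to agree with the global video score.
    return F.mse_loss(patch_rewards.mean(dim=1), global_scores)


head = PatchRewardHead()
patch_r = head(torch.randn(2, 256, 768))          # e.g., 256 spatio-temporal patches
loss = patch_global_alignment_loss(patch_r, torch.tensor([0.8, 0.3]))
```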
Temporal Consistency and Physics-Based Rewards:
Video Consistency Distance (Aoshima et al., 22 Oct 2025) introduces frequency-domain Wasserstein-based frame distances as a differentiable reward for temporal alignment in I2V tasks. PhysicsIQ Challenge (Yuan et al., 22 Oct 2025) and VJEPA-2 enable physics-plausibility evaluation by measuring contrastive-predictive similarity between predicted and actual embeddings, enhancing motion realism by up to 6%.
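As a rough illustration of a frequency-domain consistency reward, the sketch below compares normalized FFT magnitude spectra of consecutive frames with a Wasserstein-style cumulative-difference distance; this approximates the idea rather than reproducing the Video Consistency Distance exactly.

```python
# Rough sketch of a differentiable frequency-domain consistency reward:
# compare normalized FFT magnitude spectra of consecutive frames via a
# cumulative-difference (Wasserstein-style) distance over frequency bins.
import torch


def spectral_distribution(frame: torch.Tensor) -> torch.Tensor:
    """frame: (H, W) grayscale -> flattened, normalized magnitude spectrum."""
    mag = torch.fft.rfft2(frame).abs().flatten()
    return mag / (mag.sum() + 1e-8)


def consistency_reward(video: torch.Tensor) -> torch.Tensor:
    """video: (T, H, W). Higher reward = more temporally consistent spectra."""
    dists = []
    for t in range(video.shape[0] - 1):
        p = spectral_distribution(video[t])
        q = spectral_distribution(video[t + 1])
        # Cumulative-difference distance between the two spectra (CDF gap).
        w1 = (torch.cumsum(p, 0) - torch.cumsum(q, 0)).abs().mean()
        dists.append(w1)
    return -torch.stack(dists).mean()   # negate: small distance => large reward


r = consistency_reward(torch.rand(8, 64, 64))
```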
Table: Key Reward Model Architectures
| Model Name | Type | Backbone/Expert Routing |
|---|---|---|
| MJ-VIDEO | MoE discriminative | InternVL2-2B, aspect+criteria gating |
| IXC-2.5-Reward | Scalar discriminative | InternLM-7B LLM + vision encoder |
| VR-Thinker | Reasoning/CoT | Qwen2.5-VL-7B with tool-calls, memory |
| HALO | Patch+Global DPO | Mantis-Idefics (VideoScore), GPT-4o |
| PAVRM / PRFL | Latent discriminative | DiT blocks, query-attention MLP |
3. Integration in RL and Generative Model Training
Video reward models serve two central functions:
- As surrogate reward functions driving RL agents (robotics, video-based policy learning). Examples: Diffusion Reward (Huang et al., 2023), TimeRewarder (Liu et al., 30 Sep 2025), VIPER (Escontrela et al., 2023), TeViR (Chen et al., 26 May 2025), GenReward (Wang et al., 30 Nov 2025). These approaches leverage generative models or temporal distance networks trained on passive expert videos to compute dense, context-aware per-timestep rewards, replacing sparse programmatic signals.
- TimeRewarder trains a CLIP-based, two-hot discretized temporal distance network, with stepwise reward $r_t = \hat{d}(o_t, o_{t+1})$, the predicted temporal progress between consecutive observations.
- Diffusion Reward penalizes the conditional entropy (generative diversity) of a diffusion model given expert context, yielding a per-step reward $r_t \propto -\hat{\mathcal{H}}(o_t \mid o_{t-k:t-1})$, the negative estimated conditional entropy of the current observation given recent context frames.
- As preference/alignment supervisors for video generation models (text-to-video, image-to-video diffusion): DOLLAR (Ding et al., 2024) and PRFL (Mi et al., 26 Nov 2025) refine student models by backpropagating reward gradients in latent space, using adapters regressed to pixel-space or human-preference scores.
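Building on the latent-space reward gradients just described, here is a condensed sketch of reward-gradient fine-tuning in the spirit of DOLLAR/PRFL/VADER-style pipelines; the `generator` and `reward_model` interfaces (including `sample_until_last_k` and `denoise_last_k`) are hypothetical placeholders, and truncating backpropagation to the last few denoising steps is a common memory-saving assumption.

```python
# Condensed sketch of reward-gradient fine-tuning: differentiate a learned
# reward through (part of) the sampling chain and update the generator.
# generator.* and reward_model are hypothetical placeholder interfaces.
import torch


def reward_gradient_step(generator, reward_model, prompts, optimizer, k_grad_steps=3):
    with torch.no_grad():
        z = generator.sample_until_last_k(prompts, k=k_grad_steps)  # no-grad prefix
    z = z.requires_grad_(True)
    video_latents = generator.denoise_last_k(z, prompts, k=k_grad_steps)  # grad suffix
    reward = reward_model(video_latents, prompts).mean()  # differentiable scalar reward
    loss = -reward                                        # maximize reward
    optimizer.zero_grad()                                 # optimizer wraps generator params
    loss.backward()
    optimizer.step()
    return reward.item()
```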
For RL, these reward models integrate with actor-critic methods (e.g., PPO, DrQv2, DreamerV3) and facilitate dense, temporally ordered feedback critical for learning long-horizon behaviors. In generative pipelines, reward models combine with diffusion samplers via reward gradients (VADER (Prabhudesai et al., 2024)) or DPO objectives (HALO (Wang et al., 4 Feb 2025)).
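For the RL usage, the schematic below shows a learned dense reward model (e.g., a temporal-distance/progress predictor) supplying per-step rewards in place of the sparse environment signal; the `env`, `policy`, and `reward_model` interfaces are placeholders.

```python
# Schematic rollout loop in which a learned video reward model supplies dense
# per-step rewards; the environment's own (sparse) reward is discarded.
import torch


def collect_rollout(env, policy, reward_model, horizon: int = 200):
    """Roll out `policy`, labeling each transition with a learned dense reward."""
    obs = env.reset()
    transitions = []
    for _ in range(horizon):
        with torch.no_grad():
            action = policy(torch.as_tensor(obs, dtype=torch.float32))
        next_obs, _, done, _ = env.step(action.numpy())   # gym-style placeholder env
        # Surrogate dense reward: e.g., predicted temporal progress obs -> next_obs.
        r_hat = float(reward_model(obs, next_obs))
        transitions.append((obs, action, r_hat, next_obs, done))
        obs = next_obs
        if done:
            break
    return transitions
```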
4. Evaluation, Benchmarks, and Empirical Performance
Benchmarks:
- MJ-BENCH-VIDEO (Tong et al., 3 Feb 2025): 5,421 video-query pairs, 28 criteria.
- VideoRewardBench (Zhang et al., 30 Aug 2025): 1,563 preference triplets spanning perception, reasoning, knowledge, and safety (a minimal sketch of the preference-accuracy protocol follows this list).
- VBench, VideoScore, GenAI-Bench: multi-dimensional video fidelity, consistency, and alignment metrics.
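As referenced above, the evaluation protocol shared by these preference benchmarks reduces to a simple check: the reward model is credited whenever it scores the human-preferred response above the rejected one. A minimal sketch, with the dataset and model calls as placeholders:

```python
# Sketch of preference-accuracy evaluation over (video, prompt, chosen, rejected)
# triplets; `reward_model` and `triplets` are placeholders.
def preference_accuracy(reward_model, triplets) -> float:
    correct = 0
    for video, prompt, chosen, rejected in triplets:
        r_chosen = reward_model(video, prompt, chosen)
        r_rejected = reward_model(video, prompt, rejected)
        correct += int(r_chosen > r_rejected)   # credited only for strict preference
    return correct / max(len(triplets), 1)
```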
Performance:
- MJ-VIDEO achieves 68.75% strict overall accuracy, outperforming the InternVL2-2B baseline by +17.58%, validated on both aspect-level and fine-grained criteria (Tong et al., 3 Feb 2025).
- TimeRewarder reaches near-perfect success on 9/10 Meta-World tasks with 200k interactions (Liu et al., 30 Sep 2025).
- Latent reward optimization (DOLLAR (Ding et al., 2024), PRFL (Mi et al., 26 Nov 2025)) yields marked gains in human-preference rates while substantially reducing training memory and compute relative to pixel-space reward feedback.
- In video understanding, discriminative MRMs such as IXC-2.5-Reward achieve state-of-the-art results on VL-RewardBench (video+image; 84.7% per the benchmark table below) and leading results on WildVision (Zang et al., 21 Jan 2025).
Ablations: Patch reward suppression, removal of aspect MoE layers, truncated reward backpropagation, and temporal window choices substantially affect success rates and alignment scores.
5. Applications and Impact
Video reward models are deployed in:
- Text-to-video diffusion alignment: Modular reward networks, patch-level supervision, and latent-space preference optimization steer outputs toward human-like fidelity, semantic alignment, and safety (Tong et al., 3 Feb 2025, Ding et al., 2024, Wang et al., 4 Feb 2025).
- Reinforcement learning from video demonstration: Dense rewards constructed from generative likelihoods, entropy, temporal distances, latent similarity, or critic scores accelerate learning and enable cross-domain generalization (Escontrela et al., 2023, Huang et al., 2023, Liu et al., 30 Sep 2025, Chen et al., 26 May 2025).
- Video-LLM reasoning and evaluation: MRMs facilitate automated judge selection, response filtering, RLHF, and data curation (ReAgent-V's multi-agent real-time reward pipeline (Zhou et al., 2 Jun 2025)).
- Robotics and embodied manipulation: Language-conditioned and video-derived critics (VLC (Alakuijala et al., 2024), Video2Reward (Zeng et al., 2024)) augment sample efficiency and transfer reward functions across robot types.
Contemporary Limitations: Reward model bias—stemming from annotation skew or expert video diversity—can propagate unfairness or poor safety judgments; context budgets, patch selection, and non-differentiable reward composition challenge generalization. Future work includes hybridization with multi-modal signals, real-time feedback, and continual learning (Tong et al., 3 Feb 2025).
6. Advanced Methodologies and Future Directions
Emerging paradigms expand video reward modeling via:
- Reasoning-based multimodal reward models: VR-Thinker (Wang et al., 12 Oct 2025) incorporates dynamic tool-calls, visual-memory windows, and chain-of-thought reinforcement, boosting long-video preference accuracy (82.3% on GenAI-Bench, with further gains on MJ-Bench-Video).
- Latent-space process-aware reward feedback: PRFL (Mi et al., 26 Nov 2025) demonstrates end-to-end preference supervision throughout the diffusion chain, guiding both early-stage motion and late-stage anatomy without VAE decoding.
- Physics-accuracy reward metrics: VJEPA-2 (Yuan et al., 22 Oct 2025) applies predictive-contrastive SSL embeddings to steer MAGI-1 generations toward plausible dynamical video continuations (PhysicsIQ: +6.3% absolute); a minimal similarity-reward sketch follows this list.
- Local-global reward integration: HALO (Wang et al., 4 Feb 2025) quantifies the value of spatial reward variance—patch-level defects detected and suppressed for globally elevated fidelity.
- Dense progress estimation from passive video: TimeRewarder (Liu et al., 30 Sep 2025) and Diffusion Reward (Huang et al., 2023) provide scalable, stepwise reward signals applicable even to human or out-of-domain videos.
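As noted in the physics-accuracy item above, a predictive-similarity reward can be sketched as follows: a frozen self-supervised video encoder and its predictor (placeholders here for a V-JEPA-style setup) embed the expected and the actually generated continuations, and cosine similarity serves as the plausibility score.

```python
# Minimal sketch of a predictive-similarity reward for physical plausibility.
# `encoder` and `predictor` stand in for a frozen SSL video encoder and its
# predictor head; scoring is done without gradients (e.g., for evaluation or
# RL-style selection).
import torch
import torch.nn.functional as F


@torch.no_grad()
def physics_plausibility_reward(encoder, predictor,
                                context_frames: torch.Tensor,
                                generated_frames: torch.Tensor) -> torch.Tensor:
    z_pred = predictor(encoder(context_frames))        # expected future embedding
    z_gen = encoder(generated_frames)                  # embedding of the generation
    return F.cosine_similarity(z_pred, z_gen, dim=-1)  # (B,) plausibility scores
```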
Table: Key Benchmarks and Reported Metrics
| Benchmark | #Samples | Top Reported Result (2025) | SOTA Model(s) |
|---|---|---|---|
| MJ-BENCH-VIDEO | 5,421 | 68.75% strict | MJ-VIDEO |
| VideoRewardBench | 1,563 | ~63% | LLaVA-Critic-72B |
| GenAI-Bench | >180k | 82.3% tau | VR-Thinker |
| VL-RewardBench | 1,250 | 84.7% | IXC-2.5-Reward |
7. Open Problems and Recommendations
- Cross-modal generalization remains challenging; RL-based MRMs can underperform strong SFT or critic-tuned baselines in video (Zhang et al., 30 Aug 2025).
- Aggregation of inference-time outputs (e.g., voting over sampled verdicts) aids generative, non-deterministic reward models but not discriminative ones; a tiny aggregation sketch follows this list.
- Frame sampling strategies, aspect routing, and reward multidimensionality must be harmonized with backbone LM architectures for scalable training.
- Explicit patch-level or temporally consistent reward heads are essential for controlling hallucinations and flicker (Wang et al., 4 Feb 2025, Aoshima et al., 22 Oct 2025).
- Integration with physics-predictive, language-scored, and user-in-the-loop feedback suggests rich future directions.
- Community open sourcing of reward heads, expert pools, and evaluation scripts will accelerate progress and reproducibility (Tong et al., 3 Feb 2025, Zhang et al., 30 Aug 2025).
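As mentioned in the aggregation item above, here is a tiny sketch of inference-time aggregation for generative reward models, assuming a hypothetical `judge` callable (e.g., a prompted video LLM) that returns a pairwise verdict.

```python
# Tiny sketch of inference-time aggregation: sample several pairwise verdicts
# from a generative judge and take a majority vote. `judge(...)` is a
# hypothetical placeholder returning "A" or "B".
from collections import Counter


def aggregated_verdict(judge, video, prompt, resp_a, resp_b, n_samples: int = 5) -> str:
    votes = Counter(judge(video, prompt, resp_a, resp_b) for _ in range(n_samples))
    return votes.most_common(1)[0][0]   # majority-vote preference ("A" or "B")
```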
Video Reward Models constitute the backbone of controlled, sample-efficient, and human-aligned video generation and understanding systems. Continued advances in latent-space modeling, multimodal reasoning, patchwise critique, and physical plausibility metrics will propel the field toward robust, generalizable, and interpretable video synthesis and RL.