WMReward: World-Model Rewards for Generation

Updated 3 July 2026
  • WMReward is a framework that uses pretrained world models to evaluate generative outputs through latent surprise and calibrated human feedback.
  • Its methodology employs sliding window latent prediction, cosine similarity-based surprise, and multi-head reward aggregation to ensure physical plausibility and semantic consistency.
  • Empirical results demonstrate improved physical realism and human preference in video generation and embodied tasks using gradient guidance and best-of-N search strategies.

WMReward (World-Model Reward) is a class of reward modeling frameworks designed to evaluate and guide generative models by leveraging the predictive capacities of learned or pretrained world models. WMReward provides scalar signals that judge the plausibility, fidelity, or preference-aligned quality of generative outputs—most commonly in video modeling and LLM alignment. By moving beyond direct pixel-level or likelihood-based criteria, WMReward allows alignment along semantic, dynamical, or human-preferred axes using proxies such as latent future consistency or calibrated human feedback aggregation.

1. Formal Evolution and Context

WMReward was originally introduced as an inference-time reward function applicable to video diffusion models for enforcing physics plausibility in generated rollouts (Yuan et al., 15 Jan 2026). Independently, the term "WMReward" has been adopted in the context of embodied world models to denote multi-dimensional reward functions capturing orthogonal criteria such as physical realism, temporal dynamics, and semantic logic (Peng et al., 18 Jan 2026). In reinforcement learning for LLMs, analogous constructs (e.g., WildReward) have demonstrated that world-model-style reward signals can be learned purely from in-the-wild human feedback, achieving pointwise calibration and high sample efficiency (Peng et al., 9 Feb 2026).

This convergence of reward designs across modalities rests on the observation that simple traditional rewards (e.g., MSE, CLIP similarity, or reconstruction likelihoods) are too coarse to capture real-world trustworthiness, task fitness, or nuanced semantics.

2. Definition and Mathematical Formulation

In its canonical instantiation for video generative models, WMReward computes the average latent "surprise" between predicted and actual future features using a strong world model such as VJEPA-2 (Yuan et al., 15 Jan 2026). Let $x$ be a generated video of $T$ frames. Given a context encoder $E_\theta(\cdot)$, a predictor $P_\phi(\Delta_m, z)$, and an EMA target encoder $\bar E_\theta$, the procedure is:

  1. For each sliding window $k$ over the video, mask out $M$ future frames.
  2. Predict masked-future features $\hat z_k^{\mathrm{fut}} = P_\phi(\ldots, E_\theta(\text{context frames}))$.
  3. Obtain ground-truth features $z_k^{\mathrm{fut}} = E_\theta(\text{context + future frames})$.
  4. Compute cosine surprise $s_k(x) = 1 - \cos(\hat z_k^{\mathrm{fut}}, z_k^{\mathrm{fut}})$.
  5. Average across windows:

$$r(x) = \frac{1}{|\mathcal K|} \sum_{k\in\mathcal K} s_k(x)$$

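Low surprise $s_k(x)$ indicates agreement with the world model's learned dynamics and thus physical plausibility. The average above is therefore a lower-is-better quantity; wherever a reward is maximized, it has to enter with its sign flipped (for example as the negated average surprise).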
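The five steps above can be sketched on a toy one-dimensional "video" of object positions. This is a minimal illustration under loud assumptions: `toy_predict_future` and `toy_encode_future` are hypothetical stand-ins for the predictor and encoder (a constant-velocity extrapolator and an offset encoder), not VJEPA-2, and the window sizes are arbitrary.

```python
import math

def cosine(u, v):
    """Cosine similarity; two zero vectors count as identical, a single zero vector as orthogonal."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0 if norm_u == norm_v else 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)

def toy_predict_future(context, m):
    """Hypothetical predictor: constant-velocity extrapolation of the next m offsets."""
    velocity = context[-1] - context[-2]
    return [velocity * (i + 1) for i in range(m)]

def toy_encode_future(context, future):
    """Hypothetical encoder: observed offsets of the future frames from the last context frame."""
    return [f - context[-1] for f in future]

def mean_surprise(frames, context_len=2, future_len=2, stride=1):
    """Average of s_k = 1 - cos(z_hat_k, z_k) over all sliding windows k."""
    window = context_len + future_len
    surprises = []
    for start in range(0, len(frames) - window + 1, stride):
        context = frames[start:start + context_len]
        future = frames[start + context_len:start + window]  # the masked frames
        z_hat = toy_predict_future(context, future_len)       # predicted future features
        z = toy_encode_future(context, future)                # features of what actually happened
        surprises.append(1.0 - cosine(z_hat, z))
    if not surprises:
        raise ValueError("video is shorter than one window")
    return sum(surprises) / len(surprises)

smooth = [0, 1, 2, 3, 4, 5, 6, 7]    # constant velocity throughout
bouncy = [0, 1, 2, 3, 2, 1, 0, -1]   # abrupt reversal after the fourth frame
print(mean_surprise(smooth))
print(mean_surprise(bouncy))
```

Under this toy model the constant-velocity sequence yields essentially zero surprise, while the sequence with the abrupt reversal is penalized only in the windows that straddle the reversal.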

In multi-dimensional frameworks, as in ReWorld (Peng et al., 18 Jan 2026), WMReward is a sum of head-specific scalar outputs:

$$R(x) = \sum_{d=1}^{D} r_d(x)$$

where each $r_d$ derives from a dedicated lightweight head on top of a frozen video backbone, trained to specialize via a dimension-targeted pairwise preference loss.
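A minimal sketch of the multi-head idea follows. It assumes linear heads over a fixed feature vector (standing in for frozen backbone features) and a Bradley-Terry style loss, $-\log\sigma(r_d(\text{preferred}) - r_d(\text{rejected}))$; the source does not state the exact loss or head architecture, so both are assumptions, as are the head names and all numbers.

```python
import math

def head_score(weights, features):
    """Scalar output of one linear head on frozen backbone features."""
    return sum(w * f for w, f in zip(weights, features))

def total_reward(heads, features):
    """Sum of the head-specific scalar outputs."""
    return sum(head_score(w, features) for w in heads.values())

def preference_loss(weights, preferred, rejected):
    """Assumed pairwise loss: -log sigmoid(score(preferred) - score(rejected))."""
    margin = head_score(weights, preferred) - head_score(weights, rejected)
    return math.log1p(math.exp(-margin))

def sgd_step(weights, preferred, rejected, lr=0.5):
    """One gradient-descent step on the pairwise loss for a single head."""
    margin = head_score(weights, preferred) - head_score(weights, rejected)
    push = 1.0 / (1.0 + math.exp(margin))  # sigmoid(-margin), magnitude of the gradient
    return [w + lr * push * (p - r) for w, p, r in zip(weights, preferred, rejected)]

# Hypothetical heads and features, for illustration only.
heads = {"physics": [1.0, 0.0], "semantics": [0.0, 0.5]}
print(total_reward(heads, [2.0, 4.0]))

w = [0.0, 0.0]
preferred, rejected = [1.0, 0.0], [0.0, 1.0]
loss_before = preference_loss(w, preferred, rejected)
w = sgd_step(w, preferred, rejected)
loss_after = preference_loss(w, preferred, rejected)
print(loss_before, loss_after)
```

Training each head only on preference pairs that differ along its own dimension is what would make the heads specialize; the sketch shows a single head and a single pair.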

3. Inference-time Alignment and Optimization

WMReward enables test-time alignment of generative models by tilting or searching the output distribution towards higher-rewarded samples:

  • Gradient-based guidance: add reward gradients to the base model's denoising step to favor samples $x$ with high $r(x)$ [guidance equation not recoverable from the source text].

  • Best-of-$N$ (BoN) search: sample $N$ independent outputs, score each with $r$, and select the argmax.
  • Combined scheme: generate $N$ guided samples and select the best by $r$, scaling search for higher plausibility.
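Best-of-N selection from the list above can be sketched in a few lines. The sampler and reward here are hypothetical toys (a standard normal draw and a quadratic reward peaking at 2.0); in practice the sampler would be the generative model and the reward a world-model score.

```python
import random

def best_of_n(sample, reward, n, seed=0):
    """Draw n candidates, score each with the reward, return the argmax and its score."""
    rng = random.Random(seed)
    candidates = [sample(rng) for _ in range(n)]
    scores = [reward(c) for c in candidates]  # n reward evaluations: cost is linear in n
    best = max(range(n), key=lambda i: scores[i])
    return candidates[best], scores[best]

def toy_sample(rng):
    """Hypothetical base sampler: one draw from a standard normal."""
    return rng.gauss(0.0, 1.0)

def toy_reward(x):
    """Hypothetical higher-is-better reward, maximal at x = 2."""
    return -(x - 2.0) ** 2

x1, r1 = best_of_n(toy_sample, toy_reward, n=1)
x16, r16 = best_of_n(toy_sample, toy_reward, n=16)
print(r1, r16)
```

Because both calls share a seed, the 16 candidates include the single candidate of the first call, so the selected reward cannot decrease as N grows. The combined scheme corresponds to passing a guided sampler as `sample`.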

Pseudocode and dynamic programming variants are specified for efficient implementation (Yuan et al., 15 Jan 2026).
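As a rough illustration of gradient-based guidance (not the procedure from the cited paper), consider tilting a base density $p(x)$ by $\exp(\lambda r(x))$: the score of the tilted density is the base score plus $\lambda \nabla r(x)$. The sketch below applies this to a one-dimensional standard normal with Langevin updates. The guidance scale `scale`, the quadratic reward, and the step size are all assumptions chosen so that the result can be checked analytically.

```python
import math
import random

def base_score(x):
    """Score (gradient of the log density) of a standard normal base model."""
    return -x

def reward_grad(x, target=2.0):
    """Gradient of the hypothetical reward r(x) = -(x - target)^2."""
    return -2.0 * (x - target)

def guided_langevin(scale, steps=400, step_size=0.05, noise=1.0, seed=0, x0=0.0):
    """Langevin updates using the guided score: base score + scale * reward gradient."""
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        score = base_score(x) + scale * reward_grad(x)
        x += step_size * score + noise * math.sqrt(2.0 * step_size) * rng.gauss(0.0, 1.0)
    return x

# With the noise switched off the iteration settles at the mode of the tilted density.
print(guided_langevin(scale=0.0, noise=0.0))  # unguided: stays at the base mode
print(guided_langevin(scale=1.0, noise=0.0))  # guided: pulled toward the reward peak
```

For this toy problem the tilted density is Gaussian with mean $4\lambda/(1+2\lambda)$, so with `scale=1.0` the noise-free iteration converges to 4/3, between the base mode at 0 and the reward peak at 2.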

4. World Models as Physics and Trustworthiness Priors

WMReward's effectiveness depends on the inductive biases encoded in the underlying world model. For video generation, VJEPA-2 is trained by reconstructing masked spatiotemporal cubes under self-supervised objectives, inducing a strong prior for physical dynamics, object continuity, and interactions—while largely ignoring superficial appearance (Yuan et al., 15 Jan 2026).

In embodied or reinforcement learning settings, WMReward can be implemented as a multi-head reward model with each head aligned to an orthogonal aspect of desirable behavior (physical fidelity, task completion, etc.), as in the InternVideo2-based architecture for ReWorld (Peng et al., 18 Jan 2026).

5. Empirical Evaluations and Key Results

WMReward has demonstrated strong empirical gains across multiple benchmarks and modalities:

  • PhysicsIQ (video generation): WMReward-equipped models achieve up to +6.8 percentage points higher plausibility versus baseline (62.0% vs. 55.2% in V2V PhysicsIQ), with first place (62.64%) in the ICCV 2025 PhysicsIQ Challenge (Yuan et al., 15 Jan 2026).
  • VideoPhy (video physics evaluation): Overall plausibility scores increase substantially when WMReward-guided generation is applied (Yuan et al., 15 Jan 2026).
  • Human preference studies: Side-by-side comparisons show significant user preference for WMReward-aligned outputs in both physical plausibility and visual quality, without loss in general perceptual metrics (Yuan et al., 15 Jan 2026).
  • Robot learning/embodied tasks: In ReWorld, WMReward alignment boosts aggregate realism and task success score S_ReWorld by nearly 14% over fine-tuned base models, with >85% human preference (Peng et al., 18 Jan 2026).
  • LLM reward modeling (WildReward): Ordinal regression on raw user feedback achieves calibration (ECE ≈ 2.8%), high ROC-AUC (≈ 0.91), and performance on preference benchmarks competitive with traditional reward models—without expensive preference-pair annotation (Peng et al., 9 Feb 2026).
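For readers unfamiliar with the calibration figure quoted above, expected calibration error (ECE) can be computed as sketched here. This uses the common equal-width binning of predicted probabilities against binary outcomes; how the cited work bins its scores is not stated in this text, so the binning scheme and the example data are assumptions.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted average over bins of |observed positive rate - mean predicted probability|."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        bins[index].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(p for p, _ in members) / len(members)
        positive_rate = sum(y for _, y in members) / len(members)
        ece += len(members) / len(probs) * abs(positive_rate - mean_prob)
    return ece

# Hypothetical predictions: the first set is calibrated, the second overconfident.
calibrated = expected_calibration_error([0.8] * 5, [1, 1, 1, 1, 0])
overconfident = expected_calibration_error([0.9] * 4, [1, 0, 1, 0])
print(calibrated, overconfident)
```

An ECE near 0.028 would mean that, averaged over bins, predicted probabilities differ from observed frequencies by about 2.8 percentage points.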

6. Limitations and Computational Considerations

WMReward inherits constraints from the supporting world model:

  • Failure to capture rare or abrupt events, fine material properties, or scene compositionality (in video, e.g., mirror reflections, fluid overflow) (Yuan et al., 15 Jan 2026).
  • Computational cost: gradient-based guidance multiplies per-sample inference cost [overhead factor not recoverable from the source text], BoN scales linearly with sample count, and joint application can multiply cost further (Yuan et al., 15 Jan 2026). In RL video settings, evaluation of high-dimensional reward heads increases per-update cost (Peng et al., 18 Jan 2026).
  • Text-agnostic design in canonical video WMReward may weaken alignment in semantic story or T2V tasks (Yuan et al., 15 Jan 2026).
  • Need for hand-tuning of reward weights in multi-head systems; future meta-learning of these weights is proposed (Peng et al., 18 Jan 2026).
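The cost scaling described in the list above can be made concrete with a back-of-the-envelope model. The multiplicative form and every number below are assumptions for illustration; in particular `guidance_factor` is a hypothetical placeholder, since the actual overhead of guidance is not given here.

```python
def inference_cost(base_cost, n_samples=1, guidance_factor=1.0):
    """Rough total cost: per-sample base cost, times guidance overhead, times sample count."""
    if n_samples < 1 or guidance_factor < 1.0:
        raise ValueError("n_samples must be >= 1 and guidance_factor >= 1.0")
    return base_cost * guidance_factor * n_samples

# Hypothetical numbers: one unit of cost per unguided sample.
print(inference_cost(1.0, n_samples=8))                       # BoN alone
print(inference_cost(1.0, guidance_factor=3.0))               # guidance alone
print(inference_cost(1.0, n_samples=8, guidance_factor=3.0))  # combined scheme
```

The model ignores the cost of the reward evaluations used for BoN selection, which adds a further term proportional to the sample count.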

7. Future Directions and Extensions

Key extensions include:

  • Scaling latent world models for broader or more specialized physical regimes.
  • Integrating compositional, text-conditioned world models for improved semantic flexibility.
  • Combining WMReward with other reward classes (semantic, style, or language-based) for multi-objective steering (Yuan et al., 15 Jan 2026).
  • Efficient search and RL optimization (e.g., SMC, SVDD, CFM-likelihood proxies, DPO with in-the-wild feedback) (Peng et al., 18 Jan 2026, Peng et al., 9 Feb 2026).
  • Deployment in online RL, controlling embodied agents, or zero-shot adaptation (Peng et al., 18 Jan 2026).
  • Compression for real-time inference or on-device alignment.

A plausible implication is that WMReward may expand beyond current generative paradigms, supporting calibration, sample selection, and adaptation in any setting where learned latent predictive models can serve as strong priors for evaluating or influencing generation.
