
Process-Aware Video Reward Model

Updated 6 December 2025
  • PAVRM is a framework that formalizes temporally grounded, process-aware reward functions using DTW and ROUGE metrics, enhancing video reasoning accuracy.
  • It employs group-wise reinforcement learning with distinct perception and reasoning segments to improve credit assignment and benchmark performance.
  • The architecture integrates with MLLMs and video diffusion models by aligning intermediate outputs, reducing computational overhead and improving sample efficiency.

The Process-Aware Video Reward Model (PAVRM) encompasses a class of methods and architectures designed to provide temporally grounded and interpretable supervision signals for video reasoning, understanding, and generation. PAVRM structures reward feedback—both for multimodal LLMs (MLLMs) and generative video diffusion models—by aligning internal process traces, intermediate reasoning steps, or latent states with reference annotations or human preferences. PAVRM confers improved consistency, fine-grained credit assignment, and sample efficiency across reinforcement learning paradigms and enables state-of-the-art performance on diverse video benchmarks (Tao et al., 25 Sep 2025, Jiang et al., 14 Nov 2025, Mi et al., 26 Nov 2025).

1. Mathematical Definitions of Process-Aware Reward

PAVRM formalizes process-centric supervision using reward functions that determine how closely a model’s intermediate outputs reflect reference processes—be they reasoning chains, perception observations, or latent video states.

For MLLM-based video reasoning (Tao et al., 25 Sep 2025), the process reward is defined as:

  • $R_{\text{proc}} = T(D_{\text{sdtw}})$, where $D_{\text{sdtw}}$ is the minimum cumulative distance between the chain-of-thought steps generated by the model and those of the reference annotation, found via a subsequence dynamic time warping (SDTW) algorithm. $T$ is a monotonically decreasing transformation (typically exponential decay).
  • The pairwise step distances are computed via ROUGE, with $d(g_j, r_i) = 1 - \text{ROUGE}_{\text{avg}}(g_j, r_i)$, and the alignment cost matrix $D$ collects all pairwise distances.
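As a concrete illustration of these definitions, the following minimal sketch builds the cost matrix $D$ from ROUGE-style step distances and applies an exponential-decay transform $T$. The ROUGE-1 F1 stand-in, the matrix orientation (rows = reference steps, columns = generated steps), and the decay constant tau are assumptions made for illustration, not values taken from the paper.

import numpy as np
from collections import Counter

def rouge1_f1(hyp: str, ref: str) -> float:
    """Unigram-overlap F1; a simple stand-in for the paper's averaged ROUGE score (assumption)."""
    h, r = Counter(hyp.lower().split()), Counter(ref.lower().split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    prec, rec = overlap / sum(h.values()), overlap / sum(r.values())
    return 2 * prec * rec / (prec + rec)

def step_cost_matrix(ref_steps, gen_steps):
    """D[i, j] = 1 - ROUGE(g_j, r_i): low cost when a generated step matches a reference step."""
    return np.array([[1.0 - rouge1_f1(g, r) for g in gen_steps] for r in ref_steps])

def process_reward(d_sdtw: float, tau: float = 1.0) -> float:
    """Monotonically decreasing transform T; exponential decay is one assumed choice."""
    return float(np.exp(-d_sdtw / tau))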

In process-separated reinforcement fine-tuning for large video LLMs (LVLMs) (Jiang et al., 14 Nov 2025), rewards are attributed to distinct segments:

  • Perception segment: $R_{\text{acc},\mathrm{P}}$ for sufficient observation, $R_{\text{form},\mathrm{P}}$ for format compliance, $R_{\text{len},\mathrm{P}}$ for minimal and sufficient length.
  • Reasoning segment: $R_{\text{acc},\mathrm{R}}$ for answer correctness (exact match, ROUGE, or numeric similarity), $R_{\text{form},\mathrm{R}}$, $R_{\text{len},\mathrm{R}}$.
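A minimal sketch of how one segment's reward terms might be combined additively; the weights, the length window, and the function name are illustrative assumptions rather than the paper's exact formulation.

def segment_reward(acc: float, format_ok: bool, n_tokens: int,
                   len_lo: int = 32, len_hi: int = 256,
                   w_acc: float = 1.0, w_form: float = 0.5, w_len: float = 0.5) -> float:
    """Additive composition of R_acc, R_form, R_len for one segment (perception or reasoning).
    Weights and the length window are assumptions, not values from the paper."""
    r_acc = w_acc * acc                                             # exact match / ROUGE / numeric similarity
    r_form = w_form * (1.0 if format_ok else 0.0)                   # template compliance
    r_len = w_len * (1.0 if len_lo <= n_tokens <= len_hi else 0.0)  # minimal yet sufficient length
    return r_acc + r_form + r_len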

For video generation (Mi et al., 26 Nov 2025), the reward model is parameterized as $r_\phi(x_t, t, p)$, a function operating in the denoising latent space at timestep $t$. Sequence-level rewards aggregate per-timestep predictions, $R_\phi(x_0, p) = \mathbb{E}_{t \sim U(0,1)} \left[ r_\phi(x_t, t, p) \right]$, supporting feedback throughout the denoising trajectory.
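In practice, the expectation over $t$ can be approximated by Monte Carlo sampling. A sketch under assumed interfaces follows; `reward_head` and `add_noise` are hypothetical callables, not a specific library API.

import torch

def sequence_level_reward(reward_head, add_noise, x0, prompt_emb, n_samples: int = 4):
    """Monte Carlo estimate of R_phi(x0, p) = E_{t ~ U(0,1)}[ r_phi(x_t, t, p) ]."""
    rewards = []
    for _ in range(n_samples):
        t = torch.rand(())                       # t ~ U(0, 1)
        x_t = add_noise(x0, t)                   # re-noise the clean latent to timestep t (assumed helper)
        rewards.append(reward_head(x_t, t, prompt_emb))
    return torch.stack(rewards).mean()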

2. Algorithmic Design and Credit Assignment

PAVRM is operationalized within group-wise reinforcement learning algorithms, notably Group Relative Policy Optimization (GRPO) and its process-aware variant (PA-GRPO).

  • In (Tao et al., 25 Sep 2025), each sampled model output is evaluated for accuracy, format, and process trace alignment; the total reward is $R_{\text{total}} = R_{\text{acc}} + R_{\text{fmt}} + R_{\text{proc}}$, with standardized advantages $A_i$ computed over reward groups. The process reward (subsequence DTW alignment) is calculated as follows; a short usage sketch follows the listing:

import numpy as np

def subsequence_dtw(D: np.ndarray, k_ref: int, k_gen: int) -> float:
    # D: n x m cost matrix between reference (rows) and generated (columns) steps,
    # with skip bounds k_ref <= n and k_gen <= m.
    n, m = D.shape
    P = np.full((n + 1, m + 1), np.inf)
    P[0, :] = 0.0                                    # free start anywhere in the generated sequence
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = P[i - 1, j - 1]
            up   = P[i - min(k_ref, i):i, j].min()   # min over P[i-k, j], 1 <= k <= min(k_ref, i)
            left = P[i, j - min(k_gen, j):j].min()   # min over P[i, j-k], 1 <= k <= min(k_gen, j)
            P[i, j] = D[i - 1, j - 1] + min(diag, up, left)
    return float(P[n, 1:].min())                     # free end anywhere in the generated sequence
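A short usage sketch of the routine above, with the decay constant, group size, and reward values chosen purely for illustration (GRPO-style standardization over a group of rollouts):

import numpy as np

# Hypothetical group of G = 4 rollouts for one query, each with precomputed
# accuracy/format rewards and an SDTW distance from subsequence_dtw(...).
sdtw_dists = np.array([0.8, 2.4, 1.1, 3.0])
r_acc = np.array([1.0, 0.0, 1.0, 0.0])
r_fmt = np.array([1.0, 1.0, 0.0, 1.0])

r_proc = np.exp(-sdtw_dists)                                        # T: assumed exponential decay
r_total = r_acc + r_fmt + r_proc                                    # R_total = R_acc + R_fmt + R_proc
advantages = (r_total - r_total.mean()) / (r_total.std() + 1e-8)    # standardized A_i over the group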

  • In (Jiang et al., 14 Nov 2025), PA-GRPO normalizes perception and reasoning rewards separately, computing $A_{i,k}$ and $\rho_{i,k}$ per segment $k \in \{\mathrm{P}, \mathrm{R}\}$, and maximizing the composite objective below (an illustrative sketch follows the formula):

$$\mathcal{J}_{\mathrm{PA\text{-}GRPO}}(\theta) = \mathbb{E}_{q,\{o_i\}} \left[ \frac{1}{G} \sum_{i=1}^G \sum_{k} \min\left(\rho_{i,k} A_{i,k},\, \mathrm{clip}(\rho_{i,k},\, 1-\epsilon,\, 1+\epsilon)\, A_{i,k}\right) - \beta\,\mathrm{KL}(\pi_\theta \parallel \pi_{\mathrm{ref}}) \right]$$
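An illustrative PyTorch sketch of this objective for a single query; the tensor shapes, the segment-wise normalization, and the way the KL penalty is passed in are assumptions made for compactness, not the reference implementation.

import torch

def pa_grpo_loss(logp_new, logp_old, rewards, eps: float = 0.2, beta: float = 0.01, kl=None):
    """Clipped PA-GRPO surrogate for one query.
    logp_new, logp_old: [G, K] per-rollout, per-segment log-probabilities (K = 2: perception, reasoning).
    rewards:            [G, K] per-segment rewards, normalized group-wise per segment (assumed shapes)."""
    adv = (rewards - rewards.mean(dim=0)) / (rewards.std(dim=0) + 1e-8)      # A_{i,k}
    rho = torch.exp(logp_new - logp_old)                                     # importance ratios rho_{i,k}
    surrogate = torch.min(rho * adv, torch.clamp(rho, 1 - eps, 1 + eps) * adv)
    obj = surrogate.sum(dim=1).mean()                                        # sum over segments, mean over group
    if kl is not None:
        obj = obj - beta * kl                                                # KL(pi_theta || pi_ref) penalty
    return -obj                                                              # negate for gradient descent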

  • Process reward feedback learning (PRFL) (Mi et al., 26 Nov 2025) aligns generator outputs in latent space by backpropagating reward gradients through all denoising steps without VAE decoding. At each step, the generator is updated to maximize $r_\phi(x_s, s, p)$ at randomly sampled timesteps $s$, using the one-step update and loss below (a training-step sketch follows the formulas):

$$x_s = x_{s+\Delta} - \Delta \cdot v_\theta(x_{s+\Delta},\, s+\Delta,\, p)$$

$$\mathcal{L}_{\mathrm{PRFL}}(\theta) = -\lambda\, \mathbb{E}_{p, s} \left[ r_\phi(x_s, s, p) \right]$$
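A training-step sketch of PRFL under assumed interfaces: `v_theta` stands for the generator's velocity predictor and `reward_head` for the latent-space reward $r_\phi$; neither is a specific library API, and the schedule handling is deliberately simplified.

import torch

def prfl_step(v_theta, reward_head, x_noise, prompt_emb, timesteps, lam: float = 1.0):
    """Roll the generator from noise toward a randomly chosen intermediate latent x_s
    (keeping gradients), then return L_PRFL = -lambda * r_phi(x_s, s, p).
    `timesteps` is assumed to be a decreasing 1-D tensor of noise levels, e.g. torch.linspace(1.0, 0.0, 21)."""
    stop = int(torch.randint(1, len(timesteps), (1,)))     # randomly sampled timestep index for s
    x = x_noise
    for i in range(stop):
        s_next, s = timesteps[i], timesteps[i + 1]
        delta = s_next - s
        x = x - delta * v_theta(x, s_next, prompt_emb)     # x_s = x_{s+Delta} - Delta * v_theta(x_{s+Delta}, s+Delta, p)
    s = timesteps[stop]
    return -lam * reward_head(x, s, prompt_emb)            # L_PRFL(theta) for this sample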

3. Architectural Integration

MLLM-based video understanding implements PAVRM by enforcing output templates with distinct <think> and <answer> tags (Tao et al., 25 Sep 2025, Jiang et al., 14 Nov 2025), supporting explicit segmentation of chain-of-thought reasoning over video observations. Sentence-splitting routines parse and align generated and reference reasoning steps, with ROUGE-based string matching offering interpretable, model-agnostic supervision.
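A minimal parsing sketch for such a template, assuming the tags are literally `<think>` and `<answer>` and using a naive regex sentence splitter as a stand-in for the paper's splitting routine:

import re

def parse_template(output: str):
    """Return (reasoning_steps, answer) extracted from a <think>/<answer>-templated response."""
    think = re.search(r"<think>(.*?)</think>", output, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    steps = []
    if think:
        # split the chain of thought into sentence-level steps on ., !, ? boundaries (simplification)
        steps = [s.strip() for s in re.split(r"(?<=[.!?])\s+", think.group(1).strip()) if s.strip()]
    return steps, (answer.group(1).strip() if answer else "")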

Process-aware video generation (Mi et al., 26 Nov 2025) realizes PAVRM as a lightweight head on a video diffusion backbone, scoring video latents at arbitrary timesteps via learnable pooled features. This optimizes for temporal structure, motion dynamics, and quality, from initial noise states through complete denoising trajectories, and supports efficient, scalable feedback learning.
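A minimal sketch of such a head, assuming the backbone's token features for the noisy latent are already prompt-conditioned; the mean pooling, MLP sizes, and timestep embedding are illustrative simplifications, not the paper's architecture.

import torch
import torch.nn as nn

class LatentRewardHead(nn.Module):
    """Lightweight head that scores noisy video latents at an arbitrary timestep t in [0, 1]."""
    def __init__(self, feat_dim: int, hidden: int = 512):
        super().__init__()
        self.t_embed = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, feat_dim))
        self.mlp = nn.Sequential(nn.Linear(feat_dim, hidden), nn.SiLU(), nn.Linear(hidden, 1))

    def forward(self, latent_feats: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # latent_feats: [B, N, feat_dim] backbone tokens for x_t; t: [B] denoising timesteps
        pooled = latent_feats.mean(dim=1)                # pooling simplified to a mean over tokens
        pooled = pooled + self.t_embed(t.unsqueeze(-1))  # condition the pooled feature on t
        return self.mlp(pooled).squeeze(-1)              # scalar reward r_phi(x_t, t, p) per sample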

In the MLLM reasoning setting, process trace alignment and reward computation are fully rule-based (SDTW, ROUGE), removing the need for ancillary neural reward models; in the generative setting, the lightweight latent-space head serves as the learned reward. Both are compatible with standard RLHF, PPO, and GRPO-style optimization.

4. Benchmark Results and Empirical Analysis

Empirical evaluations demonstrate that PAVRM consistently improves video reasoning and generation on diverse, large-scale test sets.

  • MOSS-ChatV with DTW-based process reward achieves 87.2% on the MOSS-Video test split, outperforming baselines and SFT-only models.
  • On MVBench, VideoMME, TempCompass, MMVU, VSI-Bench, and VCR-Bench, process-aware credit assignment boosts accuracy by 1–3 percentage points relative to process-agnostic or single-scalar reward methods.
  • Main results from (Jiang et al., 14 Nov 2025), accuracy (%):

| Model | VSI-Bench | MMUU | MMVU | VCR-Bench | MVBench | TempCompass | VideoMME | Avg |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B | 30.1 | 48.1 | 60.0 | 44.3 | 59.0 | 72.6 | 56.6 | 52.9 |
| Video-R1 (GRPO) | 35.8 | 52.3 | 63.8 | 49.0 | 63.9 | 73.2 | 59.3 | 56.8 |
| VideoRFT | 36.8 | 51.1 | 68.5 | 49.6 | 62.1 | 73.7 | 59.8 | 57.4 |
| VideoP2R (PA-GRPO) | 36.8 | 55.0 | 65.4 | 51.0 | 68.1 | 74.5 | 60.0 | 58.7 |

Ablation studies in both (Tao et al., 25 Sep 2025) and (Jiang et al., 14 Nov 2025) confirm that removal of process-specific rewards or failure to segment perception and reasoning degrades final performance, information density, and consistency.

  • PRFL substantially reduces peak memory (≈67 GB for 81 frames at 480P) and per-step training time (43.7 s for PRFL vs. 64.9 s for pixel-space RGB reward feedback learning).
  • Text-to-video alignment metrics improve by tens of points: Dynamic Degree (+46 pts), Human Anatomy (+10.49 pts), with no degradation in motion smoothness, subject consistency, or other fidelity measures.
  • Human A/B testing favors PRFL outputs in more than 70% of comparisons.

5. Theoretical Justification and Analytical Insights

Separate process-aware reward attribution addresses intrinsic limitations of scalar RL credit assignment in multi-stage reasoning (Jiang et al., 14 Nov 2025):

  • Group-wise normalization prevents advantage collapse, maintaining a learning signal for both perception and reasoning steps.
  • Segmentation suppresses think–answer mismatch, rewarding both faithful and correct inference, and penalizing reward hacking via short or incorrect process traces.
  • In generative models, latent-space reward enables early-stage and full-trajectory gradient flow, correcting temporal structure and motion coherence prior to visual quality emergence.

The use of rule-based, interpretable alignment strategies (DTW, ROUGE, SDTW) ensures debuggability and domain transferability. PAVRM’s readout can accommodate extensions via semantic similarity metrics, ensemble reference annotations, and alternate modalities (e.g., sequence-level symbolic event alignment for audio-visual tasks).

6. Practical Limitations, Failure Modes, and Extensions

Identified limitations include susceptibility to erroneous or sparse reference traces (overfitting to suboptimal reasoning patterns), under-rewarding of novel but valid reasoning, and exploitation of the process reward via short outputs (reward hacking). Remedies suggested in (Tao et al., 25 Sep 2025) include adaptive relaxation parameters ($\alpha$ annealing), teacher models that diversify reference traces, richer semantic metrics, and integration of learned reward critics for more nuanced scoring.

The latent-space PAVRM architecture (Mi et al., 26 Nov 2025) supports direct extension to multimodal inputs, alternate generative backbones, and all stages of the denoising process, facilitating broader applicability to video RLHF, generative alignment, and temporal process evaluation.

7. Impact and Future Directions

PAVRM establishes a unified methodology for providing temporally resolved and interpretable rewards across video reasoning and generation domains. Its rule-based, process-centric approach augments sample efficiency, robustness, and performance, while retaining transparency desirable in research and deployment. The architecture supports integration into standard RLHF and policy optimization loops with minimal engineering overhead, offering broad compatibility with current and future multimodal foundation models.

Future directions likely include semantic expansion (BERTScore), multireference trace aggregation, audiovisual sequence alignment, dynamic annealing of temporal reward strictness, and hybridization with learned (critic-based) reward signals for enhanced expressivity. This suggests PAVRM is positioned as a central module in evolving multimodal video understanding and generation pipelines.
