Multimodal Process Reward Models (MPRMs)

Updated 4 July 2026

Multimodal Process Reward Models (MPRMs) are reward models that score the intermediate steps of a reasoning process over multimodal inputs instead of assigning a single outcome score.
They operate at granularities ranging from token-level to trajectory-level, which allows reasoning errors to be localized and corrected in vision-language reasoning tasks.
MPRMs integrate diverse supervision strategies and data pipelines, such as those used in Perceval and VisualPRM, to improve training efficiency and inference robustness.

Multimodal Process Reward Models (MPRMs) are reward models that operate over multimodal inputs and intermediate processes rather than only final outcomes. In the vision–language setting, they take inputs such as an image, a text query, and a model-generated reasoning trace, and produce process-level supervision over the reasoning steps rather than a single scalar score for the final answer. A concrete recent instantiation is Perceval, a perception-centric process reward model for vision–language models (VLMs) that identifies hallucinated spans in a chain-of-thought by checking image-related claims against visual evidence (Min et al., 27 Apr 2026). Related work broadens the notion of multimodal process reward modeling to step-level verification for visual reasoning (Wang et al., 13 Mar 2025), data-efficient step supervision (Wang et al., 11 Jun 2025), selective training on Monte Carlo-annotated multimodal reasoning corpora (Li et al., 4 Feb 2026), instance- and domain-reweighted multimodal PRMs (Cao et al., 5 Sep 2025, Cao et al., 26 May 2025), and agentic process judges for GUI tasks (Xiong et al., 27 Sep 2025). The field therefore spans several process granularities: token-level span marking, step-level binary verification, sub-question-level structured verification, and trajectory-level process scoring.

1. Conceptual scope and relation to outcome reward modeling

Process reward models are distinguished from outcome reward models by the location of supervision. Outcome reward models assign a single scalar score to a complete response. Process reward models evaluate intermediate states, steps, or spans inside a reasoning trajectory. In multimodal settings, this distinction matters because reasoning failures often begin with perceptual or grounding errors long before the final answer is produced.

Perceval makes this distinction explicit. Standard RL for VLMs in the RLVR/GRPO style uses a scalar reward $R_i$ per sampled response $o_i$ and applies the sequence-level advantage

$\hat{A}_i = \frac{R_i - \text{mean}(\{R_j\}_{j=1}^G)}{\text{std}(\{R_j\}_{j=1}^G)}.$

That advantage is then shared by all tokens of $o_i$, which is coarse because it cannot diagnose where hallucination begins and gives all tokens equal credit or blame (Min et al., 27 Apr 2026). MPRMs instead provide localized supervision: Perceval identifies exact hallucinated substrings and enables penalties on only those spans, while VisualPRM predicts step quality for each reasoning step conditioned on image, question, and step prefix (Wang et al., 13 Mar 2025).
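As an illustration of why the sequence-level advantage is coarse, the following minimal sketch computes the group-normalized advantage for one group of sampled responses. The function name `group_advantages`, the small `eps` stabilizer, and the use of the population standard deviation are assumptions made for the example; the source only gives the formula.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantage (R_i - mean) / std for one sampled group.

    `eps` guards against division by zero when all rewards are equal;
    it is an illustrative assumption, not part of the formula in the text.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std, assumed for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses with binary outcome rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_advantages(rewards)

# Every token of response i would receive advantages[i], regardless of
# which tokens actually contain the error.
```

Because each response contributes a single number, a response that is correct except for one hallucinated clause is credited or blamed uniformly across all of its tokens.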

The literature uses several adjacent terms. “Multimodal reasoning reward models” that generate explicit reasoning trajectories and final preference labels have been interpreted as MPRMs because they supervise a generative judging process rather than only a scalar output (Yang et al., 2 Feb 2026). Structured multimodal verifiers such as StructVRM are not process reward models in the strict step-level sense, but they supply sub-question-level structured rewards that sit between final-answer supervision and full process supervision (Zhang et al., 7 Aug 2025). This suggests a broader family of multimodal process-sensitive reward models in which the essential property is finer-grained credit assignment over a multimodal reasoning process.

A plausible implication is that MPRMs are best understood as a spectrum rather than a single architecture class. At one end are token-level critics such as Perceval; in the middle are step-level verifiers such as VisualPRM and Athena-PRM; further toward structured task decomposition are sub-question-level verifiers such as StructVRM; and in agent settings there are trajectory-level multimodal process judges such as GUI-PRA (Min et al., 27 Apr 2026, Wang et al., 13 Mar 2025, Wang et al., 11 Jun 2025, Zhang et al., 7 Aug 2025, Xiong et al., 27 Sep 2025).

2. Core formulations and output granularities

A defining property of MPRMs is that they consume multimodal context together with a partial or complete reasoning trajectory and return process-sensitive supervision. Perceval takes a triplet $\langle v, q, o\rangle$, where $v$ is an image, $q$ a text query, and $o$ a model-generated response, typically a chain-of-thought. It outputs a structured verification text $V$ with a <think>...</think> phase and an <answer>...</answer> section that either states “The response is correct.” or returns a Python list of exact offending substrings from $o$ that contain perceptual errors (Min et al., 27 Apr 2026). Conceptually, it learns

$f_\theta : \langle v, q, o\rangle \mapsto V.$

VisualPRM instead models per-step quality. Given an image $I$, a question $q$, and a solution decomposed into steps $s = \{s_1, \dots, s_n\}$, it predicts whether each step is correct conditioned on the multimodal context and previous steps, and aggregates step scores into a trajectory score (Wang et al., 13 Mar 2025). Athena-PRM follows the same step-level paradigm, using a special <step> token to attach binary labels $y_i \in \{0, 1\}$ to each reasoning step $s_i$, with training objective

$\mathcal{L} = -\sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right], \quad p_i = p_\theta(\text{correct} \mid x, s_{1:i}).$

Here the multimodal problem $x$ can include text, diagrams, plots, or other images because the model is built on Qwen2.5-VL-7B (Wang et al., 11 Jun 2025).

StructVRM provides another granularity: the final answer is decomposed into sub-question-level binary sub-scores

$S = [\mathbf{s}_1, \mathbf{s}_2, \dots, \mathbf{s}_K],$

where each $\mathbf{s}_k$ is a list of binary sub-scores for sub-questions or blanks, determined by semantic or mathematical equivalence rather than rigid string matching. These are aggregated by averaging as

$R = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|\mathbf{s}_k|} \sum_{j=1}^{|\mathbf{s}_k|} s_{k,j},$

yielding structured partial credit for multimodal STEM problems (Zhang et al., 7 Aug 2025).
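A minimal sketch of structured partial credit follows. It assumes that binary sub-scores are averaged within each sub-question and then across sub-questions; the exact averaging scheme and the function name `structured_reward` are assumptions for illustration, since the text only states that sub-scores are averaged into a scalar.

```python
def structured_reward(sub_scores):
    """Average binary sub-scores into a scalar reward in [0, 1].

    sub_scores: list of lists of 0/1 values, one inner list per
    sub-question (hypothetical layout). Empty inner lists are skipped.
    """
    per_question = [sum(s) / len(s) for s in sub_scores if s]
    if not per_question:
        return 0.0
    return sum(per_question) / len(per_question)


# Three sub-questions: fully correct, wrong, and half-correct blanks.
reward = structured_reward([[1, 1], [0], [1, 0]])

# An all-or-nothing outcome reward would give 0 for the same answer.
all_or_nothing = float(all(v == 1 for s in [[1, 1], [0], [1, 0]] for v in s))
```

The contrast between `reward` and `all_or_nothing` is the point of the design: a partially correct multi-part answer still produces a learning signal.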

The following table summarizes the output types explicitly described in the cited works.

Model family	Multimodal input context	Process-level output
Perceval	image, query, full response	exact hallucinated substrings / token spans
VisualPRM	image, question, step prefix	binary step correctness / step score
Athena-PRM	multimodal problem, step-marked solution	binary score per `<step>` token
StructVRM	multimodal problem, reference answer, model answer	binary sub-question score vector
GUI-PRA	goal, screenshots, UI state, history, candidate step	scalar score per candidate action

These representations differ, but all shift supervision inside the reasoning or action process. This suggests that “process” in MPRMs is operationally defined by where reward is attached: spans, steps, sub-answers, or intermediate actions.

3. Data generation and supervision pipelines

The central bottleneck for MPRMs is process supervision. Several data generation paradigms appear in the literature.

One line uses Monte Carlo continuation from partial reasoning prefixes. VisualPRM400K adapts the MathShepherd-style pipeline to multimodal reasoning: for each image-question pair, multiple full solutions are sampled, split into steps, and for each step index $i$ a policy MLLM samples $K$ continuations from the prefix $s_{\le i}$. The expected accuracy

$mc_i = \frac{\text{number of continuations from } s_{\le i} \text{ that reach the correct answer}}{K}$

is then binarized to produce step labels (Wang et al., 13 Mar 2025). VisualPRM400K contains about 400K samples and about 2M steps with process supervision, and VisualProcessBench provides a human-labeled step-wise benchmark with 2,866 samples and 26,950 steps (Wang et al., 13 Mar 2025).
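The Monte Carlo labeling step can be sketched as follows. The continuation outcomes are supplied directly as booleans rather than sampled from a model, and the binarization threshold is a hypothetical parameter: the text says the expected accuracy is binarized but does not give the cutoff here.

```python
def mc_step_labels(continuation_outcomes, threshold=0.0):
    """Estimate per-step expected accuracy and binarize it.

    continuation_outcomes[i] holds one boolean per continuation sampled
    from the prefix ending at step i: True if that continuation reached
    the correct final answer. `threshold` is an assumed cutoff.
    Returns a list of (mc_score, label) pairs.
    """
    labels = []
    for outcomes in continuation_outcomes:
        if not outcomes:
            raise ValueError("each step needs at least one continuation")
        mc = sum(outcomes) / len(outcomes)
        labels.append((mc, 1 if mc > threshold else 0))
    return labels


# Two steps, four continuations each.
steps = [
    [True, True, False, True],     # most continuations recover
    [False, False, False, False],  # no continuation reaches the answer
]
labeled = mc_step_labels(steps)
```

Raising `threshold` makes the positive label stricter, which is one way the label-noise concerns raised by later work can be explored.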

Another line emphasizes label quality rather than sheer volume. Athena-PRM argues that conventional automated labeling methods such as Monte Carlo estimation produce noisy labels and proposes weak/strong completer consistency. For each step, a weak completer and a strong completer each generate hard Monte Carlo labels; a step is kept only if both agree: $y_i^{\text{weak}} = y_i^{\text{strong}}$. Using this filter, Athena-PRM reports that a 5K process-labeled dataset can match or beat PRMs trained on 300K vanilla Monte Carlo-labeled samples (Wang et al., 11 Jun 2025).
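The agreement filter itself is simple enough to sketch in a few lines. The function name and the `(step_index, label)` output format are assumptions for illustration; the hard labels are taken as given rather than produced by actual completer models.

```python
def consistency_filter(weak_labels, strong_labels):
    """Keep only steps whose weak- and strong-completer labels agree.

    Both inputs are lists of hard 0/1 Monte Carlo labels, one per step.
    Returns (step_index, label) pairs for the retained steps.
    """
    if len(weak_labels) != len(strong_labels):
        raise ValueError("label lists must cover the same steps")
    return [
        (i, w)
        for i, (w, s) in enumerate(zip(weak_labels, strong_labels))
        if w == s
    ]


weak = [1, 1, 0, 0]
strong = [1, 0, 0, 1]
kept = consistency_filter(weak, strong)
```

Steps on which the two completers disagree are discarded rather than relabeled, so the filter trades dataset size for label reliability.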

Perceval adopts a different annotation strategy. It is trained with perception-intensive supervised data built from visual search datasets such as DeepEyes and V*, plus a smaller portion from SophiaVL-R1. Responses are generated by Qwen2.5-VL-7B, then a strong external model, Gemini-2.5-Pro, is prompted to analyze $\langle v, q, o\rangle$, identify hallucinations, and produce detailed <think> reasoning and an <answer> list of erroneous substrings. Perceval is then trained by supervised fine-tuning on the resulting verification trajectories (Min et al., 27 Apr 2026).

Data efficiency itself has become a dedicated topic. The paper on “Training Data Efficiency in Multimodal Process Reward Models” studies VisualPRM400K-v1.1 and argues that informative gradient updates depend on two factors: label mixtures of positive/negative steps and label reliability, quantified by average Monte Carlo scores of positive steps. It proposes the Balanced-Information Score (BIS), which combines a rollout's positive/negative label mixture with the average MC score of its positive steps, to select informative rollouts without additional labeling cost (Li et al., 4 Feb 2026).

A related but distinct data-selection theme appears in DreamPRM and DreamPRM-1.5. DreamPRM performs domain-reweighted training via bi-level optimization over multiple multimodal reasoning datasets to mitigate quality imbalance across domains (Cao et al., 26 May 2025). DreamPRM-1.5 refines this to instance-level reweighting with either an Instance Table or Instance Net, again using bi-level optimization and a meta-validation set to prioritize training examples that improve downstream multimodal reasoning (Cao et al., 5 Sep 2025).

A plausible implication is that multimodal process supervision is shifting from naive large-scale Monte Carlo labeling toward learned or principled selection of high-value trajectories. That interpretation is supported by BIS, DreamPRM, DreamPRM-1.5, and Athena-PRM, but each work operationalizes “high value” differently: rollout mixture and reliability, domain-level utility, instance-level meta-loss, or cross-completer label agreement.

4. Architectures and representative systems

Perceval is implemented by fine-tuning a Qwen2.5-VL backbone in 3B and 7B variants as a judger. A ViT-like visual encoder processes the image into visual tokens, and the LLM receives the query, response, and visual tokens through a connector. It is prompted to parse claims, check them against the image, and output structured verification (Min et al., 27 Apr 2026). The reward model itself contains no scalar reward head; instead, its structured textual output is later converted into a token mask over hallucinated spans.
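The conversion from structured text to a mask can be sketched as below. This is an illustrative parser, not Perceval's implementation: the function names are hypothetical, the mask is built at character level, and mapping it onto tokens would additionally require tokenizer offsets, which are not shown.

```python
import ast
import re

CORRECT = "The response is correct."


def parse_verification(verification: str) -> list[str]:
    """Extract the list of offending substrings from the <answer> section."""
    match = re.search(r"<answer>(.*?)</answer>", verification, flags=re.DOTALL)
    if match is None:
        raise ValueError("no <answer> section found")
    body = match.group(1).strip()
    if body == CORRECT:
        return []
    spans = ast.literal_eval(body)  # safely parses a Python list literal
    if not isinstance(spans, list) or not all(isinstance(s, str) for s in spans):
        raise ValueError("<answer> must contain a list of strings")
    return spans


def char_mask(response: str, spans: list[str]) -> list[int]:
    """Mark every character of `response` covered by an offending span."""
    mask = [0] * len(response)
    for span in spans:
        start = response.find(span)
        while start != -1 and span:
            for k in range(start, start + len(span)):
                mask[k] = 1
            start = response.find(span, start + 1)
    return mask


response = "The cat is red. The dog sits."
verification = "<think>The cat is black.</think><answer>['cat is red']</answer>"
mask = char_mask(response, parse_verification(verification))
```

Exact substring matching is what makes the list-of-strings output format usable as a mask: spans that do not occur verbatim in the response are silently ignored in this sketch.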

VisualPRM is an 8B multimodal LLM. A vision foundation model produces visual embeddings from the input image, a connector maps them into the LLM hidden space, and the LLM processes both vision tokens and the text prefix. It is trained as a generative classifier that predicts discrete label tokens such as “+” or “-” after each reasoning step (Wang et al., 13 Mar 2025). This design makes the step score simply the probability of the positive token at that position.
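To make the generative-classifier scoring concrete, the sketch below turns the logits of the two label tokens at a step boundary into a step score. Renormalizing over only the “+” and “-” tokens is an assumption made here for simplicity; the probability could equally be read from the softmax over the full vocabulary. The function name is hypothetical.

```python
import math


def positive_token_probability(logit_pos: float, logit_neg: float) -> float:
    """Softmax restricted to the two label tokens, returning P("+").

    Subtracting the maximum logit keeps the exponentials numerically stable.
    """
    m = max(logit_pos, logit_neg)
    e_pos = math.exp(logit_pos - m)
    e_neg = math.exp(logit_neg - m)
    return e_pos / (e_pos + e_neg)


# Logits read at the position following one reasoning step.
step_score = positive_token_probability(2.0, 0.0)
```

No separate reward head is needed under this design: the language-model head already produces the quantities that define the score.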

Athena-PRM uses Qwen2.5-VL-7B with a classification head applied at <step> markers to produce binary correctness probabilities for each step (Wang et al., 11 Jun 2025). An important architectural variation in the same paper is ORM initialization: a large outcome reward model, Athena-ORM, is first trained on outcome-level labels and then used to initialize the PRM, which empirically improves step-level performance (Wang et al., 11 Jun 2025). This suggests that outcome reward modeling can serve as a representation-learning stage for MPRMs.

The “Entropy-Guided Data-Efficient Training for Multimodal Reasoning Reward Models” paper studies a multimodal reasoning reward model that takes an image, textual query, and two candidate responses, then generates a structured natural-language reasoning explanation plus a final preference label (Yang et al., 2 Feb 2026). Although not framed as an MPRM in the paper, it fits the broader family of multimodal generative judges with explicit reasoning trajectories. The reward model is used as a generative verifier rather than a scalar head.

UnifiedReward-Flex occupies a related but more specialized position for vision generation. It is a multimodal reward model that constructs a hierarchical evaluation process over prompts and candidate images or videos, generates dimensions and sub-dimensions, and then outputs dimension-level winners and an overall winner. The model is trained by SFT on structured traces and DPO on full reasoning trajectories (Wang et al., 2 Feb 2026). It evaluates candidate generations rather than reasoning traces from VLMs, but it exemplifies a multimodal process-oriented reward model whose “process” is the generated evaluation rubric.

GUI-PRA extends the notion of MPRM into agentic control. It is a process reward agent for GUI tasks that evaluates candidate thought-action pairs using goal text, compressed history, current screenshots, and tool-generated UI evidence. It introduces a Dynamic Memory mechanism to combat “lost in the middle” and an Adaptive UI Perception mechanism that uses tools such as OmniParser and Point to collect grounded visual evidence before assigning a scalar score (Xiong et al., 27 Sep 2025). This broadens MPRMs from passive critics to agentic judges.

5. Integration with reinforcement learning, decoding, and test-time scaling

Perceval provides the clearest example of direct RL integration. Starting from sequence-level GRPO with advantage $\hat{A}_i$, it defines token-level advantages using a hallucination mask $m_{i,t}$: $\hat{A}_{i,t} = \hat{A}_i - \alpha \, m_{i,t}$, where $\alpha$ is a penalty strength hyperparameter and $m_{i,t} \in \{0, 1\}$ indicates whether token $t$ lies in a hallucinated span (Min et al., 27 Apr 2026). Non-hallucinated tokens keep the original group-level advantage; hallucinated tokens receive reduced or more negative credit. This turns a process reward model into a token-level critic for RL.
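A minimal sketch of masked token-level credit assignment is shown below. It assumes a subtractive penalty, in which each hallucinated token has a fixed amount removed from the shared sequence-level advantage; that form is chosen because it matches the description (unchanged credit outside hallucinated spans, reduced or more negative credit inside), and the default value of `alpha` is arbitrary.

```python
def token_level_advantages(seq_advantage, hallucination_mask, alpha=0.5):
    """Per-token advantages from one sequence-level advantage and a 0/1 mask.

    Tokens with mask 0 keep `seq_advantage`; tokens with mask 1 are
    penalized by `alpha` (assumed subtractive form).
    """
    return [seq_advantage - alpha * m for m in hallucination_mask]


# A response with a positive group advantage whose middle tokens hallucinate.
token_adv = token_level_advantages(1.0, [0, 1, 1, 0], alpha=0.5)

# A response with a negative group advantage: hallucinated tokens become
# more negative than the rest.
token_adv_neg = token_level_advantages(-1.0, [0, 1], alpha=0.5)
```

With a positive sequence advantage the hallucinated tokens still receive credit, only less of it, so a larger `alpha` is needed if they should be actively discouraged in otherwise correct responses.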

At inference time, Perceval also supports test-time scaling through two iterative correction strategies. In Truncate–then–Regenerate, it finds the earliest hallucinated token, truncates the response there, and asks the policy to continue from the verified prefix; in Truncate–Thinking–then–Regenerate, it additionally inserts a reflective hint distilled from Perceval’s judgment before regeneration. On V* (all) for the 3B model, majority voting reaches 85.86, Truncate reaches 89.53, and Truncate–Thinking reaches 88.48 (Min et al., 27 Apr 2026).
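One round of the truncation strategy can be sketched as follows. The example works on characters rather than tokens, and `policy_continue` is a hypothetical callable standing in for the policy model; both are assumptions for illustration, and the reflective-hint variant is not shown.

```python
from typing import Callable


def truncate_at_first_hallucination(response: str, spans: list[str]) -> str:
    """Return the prefix of `response` before the earliest offending span."""
    positions = [response.find(s) for s in spans if s]
    positions = [p for p in positions if p != -1]
    if not positions:
        return response  # nothing flagged, keep the full response
    return response[: min(positions)]


def truncate_then_regenerate(
    response: str,
    spans: list[str],
    policy_continue: Callable[[str], str],
) -> str:
    """One correction round: cut at the first error, then let the policy continue."""
    prefix = truncate_at_first_hallucination(response, spans)
    if prefix == response:
        return response
    return prefix + policy_continue(prefix)


response = "The sign is blue. It says STOP."
flagged = ["It says STOP", "sign is blue"]
prefix = truncate_at_first_hallucination(response, flagged)
```

In an iterative setup the regenerated response would be verified again, with the loop stopping once no spans are flagged or a round limit is reached.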

VisualPRM is primarily used for Best-of-$N$ reranking. For a candidate solution $s = \{s_1, \dots, s_n\}$ with step scores $p_1, \dots, p_n$, step scores are aggregated by average,

$\text{score}(s) = \frac{1}{n} \sum_{i=1}^{n} p_i,$

and the highest-scoring solution is selected among sampled candidates (Wang et al., 13 Mar 2025). The paper reports that VisualPRM improves multiple MLLM families and scales under larger $N$, outperforming both outcome reward models and self-consistency in Best-of-$N$ reasoning (Wang et al., 13 Mar 2025).

Athena-PRM studies aggregation strategies for test-time scaling and finds that the minimum step score, PRM-min, is a robust way to score a trajectory $s$ with step scores $p_1, \dots, p_n$: $\text{score}(s) = \min_{i} p_i$. Using Qwen2.5-VL-7B as policy, Athena-PRM improves WeMath by 10.2 points and MathVista by 7.1 points in Best-of-$N$ evaluation (Wang et al., 11 Jun 2025). It is also used for reward-ranked fine-tuning: among multiple sampled correct solutions for a problem, the one with the highest PRM score is kept as pseudo-gold for supervised fine-tuning, yielding Athena-7B (Wang et al., 11 Jun 2025).
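The difference between mean and minimum aggregation in Best-of-N selection can be seen in a small sketch. The step scores are supplied by hand rather than produced by a PRM, and the function names and the `(solution, step_scores)` candidate format are assumptions for illustration.

```python
def aggregate(step_scores, how="min"):
    """Collapse per-step scores into one trajectory score."""
    if not step_scores:
        raise ValueError("a trajectory needs at least one step score")
    if how == "mean":
        return sum(step_scores) / len(step_scores)
    if how == "min":
        return min(step_scores)
    raise ValueError(f"unknown aggregation: {how}")


def best_of_n(candidates, how="min"):
    """Pick the solution with the highest aggregated score.

    candidates: list of (solution, step_scores) pairs.
    """
    return max(candidates, key=lambda c: aggregate(c[1], how))[0]


candidates = [
    ("A", [0.9, 0.9, 0.2]),  # strong steps, one likely error
    ("B", [0.6, 0.6, 0.6]),  # uniformly moderate steps
]
by_mean = best_of_n(candidates, how="mean")
by_min = best_of_n(candidates, how="min")
```

Mean aggregation lets confident steps compensate for a single weak one, whereas minimum aggregation treats a trajectory as only as reliable as its weakest step, which is why the two rules can select different candidates.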

StructVRM integrates structured sub-question rewards into PPO-style RLVR for multimodal STEM reasoning. The verifier produces sub-question score vectors, which are averaged into a scalar reward used in PPO updates (Zhang et al., 7 Aug 2025). This preserves the benefits of RL with verifiable rewards while avoiding all-or-nothing supervision on multi-question problems.

DreamPRM and DreamPRM-1.5 occupy a distinct place in the pipeline. They are not primarily used to shape token-level RL updates; instead, they improve the training of multimodal PRMs themselves via domain- or instance-level reweighting, then deploy those better-trained PRMs in test-time scaling on MMMU and related multimodal reasoning tasks (Cao et al., 26 May 2025, Cao et al., 5 Sep 2025).

A recurrent theme across these works is that PRMs are not only critics during training but also standalone search heuristics at inference. This suggests that MPRMs have dual value: they can shape gradients during RL and serve as decoders or rerankers that uncover latent reasoning ability in fixed policies.

6. Empirical patterns, misconceptions, and open problems

Several empirical regularities recur across the literature. First, process supervision improves multimodal reasoning beyond direct visual search tasks. Perceval, although applied only on perception-related RL data during training, reports gains that transfer to chart and math tasks relying on fine-grained visual perception (Min et al., 27 Apr 2026). Athena-PRM likewise improves both multimodal and text-only math benchmarks (Wang et al., 11 Jun 2025). This suggests that process supervision aimed at visual grounding can improve broader reasoning competence when perception is a latent bottleneck.

Second, data quality dominates raw data quantity. VisualPRM established the effectiveness of large-scale Monte Carlo supervision (Wang et al., 13 Mar 2025), but subsequent work repeatedly identified redundancy and noise in MC-annotated corpora. BIS shows that full-data performance can be reached with only 10% of the training rollouts on VisualPRM400K-v1.1 when data are selected by mixture and reliability (Li et al., 4 Feb 2026). Athena-PRM reports that 5K high-quality process-labeled samples can replace 300K vanilla MC-labeled samples (Wang et al., 11 Jun 2025). DreamPRM-1.5 shows that instance reweighting can outperform a vanilla PRM trained on the same data without selection (Cao et al., 5 Sep 2025). A common misconception is therefore that MPRMs are bottlenecked only by annotation volume; the cited work indicates that selection, weighting, and label quality are often more important.

Third, finer granularity does not always mean stepwise decoding is the best inference strategy. The paper on training vision-language PRMs for test-time scaling reports that VL-PRMs, when used as outcome reward models during test-time scaling, can outperform PRM-guided process step selection (Ong et al., 27 Sep 2025). This is a useful corrective to the assumption that a process reward model must always be used in a step-by-step search loop. A plausible implication is that process supervision during training and sequence-level usage at inference may sometimes be the best combination.

Fourth, multimodal reward models can overfit unimodal shortcuts. The paper on unimodal spurious correlations in multimodal reward models shows that text-only shortcuts can dominate even in ostensibly multimodal reward modeling, degrading cross-distribution generalization (Li et al., 5 Mar 2025). Although that work studies outcome-level reward models, the diagnosis is directly relevant to MPRMs: a process reward model might still rely on text-only trajectory cues unless its supervision explicitly forces multimodal grounding. This gives additional significance to perception-centric designs such as Perceval and to perception-focused supervision in vision-language PRMs (Min et al., 27 Apr 2026, Ong et al., 27 Sep 2025).

The main open problems are visible across the cited papers. Perceval is currently focused on perception; logically wrong but visually consistent reasoning is not labeled (Min et al., 27 Apr 2026). StructVRM gains robustness by ignoring intermediate reasoning and judging final sub-answers only, but that design cannot detect correct-by-luck answers with flawed chains of thought (Zhang et al., 7 Aug 2025). GUI-PRA demonstrates that agentic, tool-using judges can improve long-horizon control, but it is inference-time only and depends on tool reliability (Xiong et al., 27 Sep 2025). DreamPRM and DreamPRM-1.5 show the importance of meta-distribution alignment, but both rely on representative meta sets close to target tasks (Cao et al., 26 May 2025, Cao et al., 5 Sep 2025). The broader survey literature therefore frames MPRMs as promising but still constrained by label acquisition, multimodal grounding, long-horizon credit assignment, and robustness under distribution shift (Zheng et al., 9 Oct 2025).

A plausible synthesis is that the field is converging on three complementary design principles. First, process rewards should be localized to the error mechanism that matters most in the domain, such as perceptual hallucination, visual grounding, or GUI state change. Second, process supervision should be coupled with selective or reweighted data pipelines rather than assumed to scale linearly with more Monte Carlo rollouts. Third, MPRMs need not be static classifiers: they may be generative verifiers, hierarchical rubric constructors, or agentic judges with memory and tools. These principles are explicit in different subsets of the cited work, and together they define the current research frontier for Multimodal Process Reward Models (Min et al., 27 Apr 2026, Wang et al., 13 Mar 2025, Wang et al., 11 Jun 2025, Li et al., 4 Feb 2026, Zhang et al., 7 Aug 2025, Xiong et al., 27 Sep 2025).