VisualPRM: Step-wise Reward for MLLMs
- VisualPRM is a multimodal Process Reward Model that assigns step-level rewards to reasoning processes for enhanced evaluation in vision-language tasks.
- It employs an 8B-parameter architecture with a pre-trained vision encoder, chat-style text encoder, and a discrete classification head for step quality assessment.
- The method achieves significant gains, with improvements up to +8.9 points over traditional outcome reward models in Best-of-N CoT selection across various benchmarks.
VisualPRM is an advanced multimodal Process Reward Model (PRM) developed to enhance the step-wise reasoning evaluation capabilities of Multimodal LLMs (MLLMs). Unlike conventional Outcome Reward Models (ORMs) that score only the overall outcome of reasoning chains, VisualPRM assigns step-level rewards to reasoning processes involving both language and visual inputs. This enables more granular and interpretable evaluation, particularly important for complex, multi-step vision-language tasks where errors are often localizable to specific reasoning stages. VisualPRM’s architecture, data pipeline, and evaluation methodology were introduced to address the limitations of step-agnostic critics and propel advancements in the systematic evaluation and optimization of MLLMs (Wang et al., 13 Mar 2025).
1. Motivation and Context
Multimodal LLMs, exemplified by models such as InternVL2.5, QwenVL, and LLaVA, have demonstrated proficient recognition and perception skills but continue to lag behind proprietary systems (e.g., GPT-4o, Gemini) in executing complex, stepwise visual-linguistic reasoning. Traditional selection strategies, such as Best-of-N (BoN) with ORMs or self-consistency voting, are effective in text-only domains but encounter two major hurdles in multimodal settings: the absence of a sufficiently strong critic able to reliably differentiate among similar multimodal Chain-of-Thoughts (CoTs), and the absence of dedicated benchmarks for "step judges" tailored to multimodal reasoning.
Standard ORMs, which collapse entire CoT traces into a single scalar reward, lack diagnostic resolution over process steps and are often unable to detect or localize intermediate reasoning errors. VisualPRM addresses this by constructing a critic capable of providing per-step evaluation, hence facilitating more effective BoN strategies and offering actionable diagnostic feedback on the reasoning trajectory (Wang et al., 13 Mar 2025).
2. Model Architecture and Step-wise Reward Computation
VisualPRM utilizes an 8-billion-parameter backbone built on the InternVL2.5-8B connector architecture. The core components include:
- Vision Encoder: A pre-trained VFM provides image features.
- Text Encoder/Decoder: A multi-turn, chat-style LLM contextualizes the image, question, and evolving reasoning steps; each turn includes a new reasoning step appended to the dialogue.
- Classification Head: At each step, a lightweight head predicts a discrete step-quality token ("+" for correct, "–" for incorrect).
Given an image–question pair $(I, q)$ and a chain of reasoning steps $s_1, \dots, s_T$, VisualPRM defines the process reward

$$R(I, q, s_{1:T}) = \frac{1}{T} \sum_{t=1}^{T} r_t,$$

with step reward $r_t$ determined by one of two schemes:
- Value-based PRM: Estimates $mc(s_{\le t})$, the empirical rate of correct completions from the step prefix $s_{\le t}$, and sets $r_t = 1$ if $mc(s_{\le t}) > \alpha$ for a correctness threshold $\alpha$, otherwise $r_t = 0$.
- Advantage-based PRM: Compares the current prefix to the previous one, assigning $r_t = +1$ if $mc(s_{\le t}) > mc(s_{\le t-1})$, $r_t = 0$ if unchanged, and $r_t = -1$ if worsened.
In practice, the model outputs a token distribution at each step, and the CoT score is the average step score over turns.
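The two labeling schemes and the averaged CoT score can be sketched as follows. This is a minimal illustration: the threshold `alpha` and the baseline `mc0` are assumed hyperparameters, and the per-step `P('+')` values would in practice come from the classification head.

```python
def value_based_labels(mc, alpha=0.0):
    """Value-based labeling: a step is '+' when the Monte Carlo
    correctness estimate mc(s<=t) of its prefix exceeds a threshold
    alpha (an assumed hyperparameter), else '-'."""
    return ["+" if m > alpha else "-" for m in mc]

def advantage_based_labels(mc, mc0=0.0):
    """Advantage-based labeling: compare each prefix's estimate with
    the previous prefix's; mc0 is an assumed baseline before any step."""
    labels, prev = [], mc0
    for m in mc:
        labels.append("+1" if m > prev else ("0" if m == prev else "-1"))
        prev = m
    return labels

def score_cot(step_plus_probs):
    """CoT-level score: the mean of per-step P('+') read off the
    classification head, one value per dialogue turn."""
    return sum(step_plus_probs) / len(step_plus_probs)
```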
3. Data Pipeline and the VisualPRM400K Corpus
Training data for VisualPRM is derived from MMPR v1.1 (a multimodal preference dataset) and base CoTs from InternVL2.5 models. The pipeline follows a Monte Carlo process:
- Each solution is broken into up to 12 steps.
- For step index $t$, the prefix $s_{\le t}$ is fixed; 16 random completions of the remaining steps are sampled.
- The expected correctness $mc(s_{\le t})$ is computed as the proportion of completions culminating in a correct final answer.
This process yields approximately 400,000 (image, question, step) triples, covering around 2 million steps (mean 5.6 steps per CoT), with about 10% of steps labeled incorrect and an average step length of 22.6 words.
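The Monte Carlo annotation loop above can be sketched as below; `sample_completion` and `is_correct` are hypothetical hooks standing in for the sampling MLLM and the answer checker, not the paper's API.

```python
def estimate_mc(prefix_steps, sample_completion, is_correct, k=16):
    """Monte Carlo estimate of expected correctness for a step prefix:
    sample k completions of the remaining steps and return the fraction
    whose final answer checks out."""
    hits = sum(is_correct(sample_completion(prefix_steps)) for _ in range(k))
    return hits / k

def annotate_solution(steps, sample_completion, is_correct, k=16):
    """Attach an mc estimate to every prefix of a solution (the paper
    caps solutions at 12 steps; no cap is enforced here)."""
    return [estimate_mc(steps[: t + 1], sample_completion, is_correct, k)
            for t in range(len(steps))]
```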
4. Evaluation: VisualProcessBench and Best-of-N Selection
VisualPRM evaluation employs VisualProcessBench, a benchmark consisting of 2,866 image–question (IQ) pairs and 26,950 reasoning steps, sampled from datasets including MMMU, MathVision, MathVerse-VO, DynaMath, and WeMath. Solutions are generated by four MLLMs (GPT-4o, Claude-3.5, QvQ-72B, InternVL2.5-78B), and are annotated by 13 experts on a per-step basis as correct, incorrect, or neutral.
VisualPRM’s selection mechanism under BoN proceeds as follows:
- For each IQ, N CoTs are sampled from the policy MLLM.
- Each CoT is scored by averaging step-level scores.
- The CoT with the maximal score is chosen: $c^{*} = \arg\max_{1 \le i \le N} R(c_i)$, where $R(c_i)$ is the mean step score of candidate $c_i$.
Compared strategies include self-consistency (SC) voting and an outcome-only ORM. BoN evaluation demonstrates VisualPRM's superior CoT selection: gains relative to SC and ORM increase as $N$ grows, with the PRM outperforming both by up to +4.3 points on InternVL2.5-8B.
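A minimal sketch of the two selection strategies being compared; `prm_score` is a hypothetical hook standing in for the reward model's mean step score.

```python
from collections import Counter

def best_of_n(cots, prm_score):
    """Best-of-N selection: score each sampled CoT with the PRM
    (mean per-step score) and keep the highest-scoring candidate."""
    return max(cots, key=prm_score)

def self_consistency(final_answers):
    """Self-consistency baseline: majority vote over final answers."""
    return Counter(final_answers).most_common(1)[0][0]
```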
Overall Accuracy across Benchmarks (Best-of-8 strategy):
| Model | Base | +VisualPRM | Δ |
|---|---|---|---|
| MiniCPM-V2.6-8B | 29.5 | 37.5 | +8.0 |
| Qwen2.5-VL-7B | 41.4 | 45.1 | +3.7 |
| InternVL2.5-8B | 32.8 | 41.2 | +8.4 |
| InternVL2.5-26B | 36.9 | 45.8 | +8.9 |
| InternVL2.5-38B | 44.4 | 50.7 | +6.3 |
| InternVL2.5-78B | 46.0 | 51.9 | +5.9 |
VisualPRM achieves a macro-F1 of 62.0 on VisualProcessBench, exceeding GPT-4o (60.3), narrowly trailing Gemini-2.0-Flash (62.3), and surpassing open-source MLLMs (50–53) (Wang et al., 13 Mar 2025).
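Macro-F1 over the three step-judgment classes (correct, incorrect, neutral) is the unweighted mean of per-class F1. The sketch below is a generic implementation of that metric, not the benchmark's released scoring script.

```python
def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 over the given class set."""
    f1s = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```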
5. Advancements: Comparison with VRPRM
VRPRM ("Visual Reasoning Process Reward Model") extends VisualPRM by integrating explicit chain-of-thought (CoT) reasoning and a policy-style multimodal reward head (Chen et al., 5 Aug 2025). VRPRM employs a two-stage training regime:
- Stage I (Supervised Fine-Tuning): Uses 3.6K high-quality CoT-PRM samples as a cold start, produced via prompting and strict filtering.
- Stage II (Reinforcement Learning): Reinforces the CoT-activated judging skills using 50K non-CoT, step-wise annotated samples. The reward integrates both format compliance and process accuracy, optimized via Group Relative Policy Optimization (GRPO).
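The group-relative advantage at the heart of GRPO, and one possible shape for the composite reward, can be sketched as follows. The reward weighting is an illustrative assumption, not a value from the paper.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantage as used in GRPO: normalize each sampled
    response's scalar reward against its own group's mean and standard
    deviation (population std assumed here). A sketch of the advantage
    computation only, not the full policy-gradient update."""
    mu = statistics.mean(rewards)
    sd = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mu) / sd for r in rewards]

def vrprm_reward(format_ok, step_accuracy, w_format=0.5):
    """Hypothetical combination of the two reward terms the paper names
    (format compliance and process accuracy); the 50/50 weighting is an
    illustrative assumption."""
    return w_format * float(format_ok) + (1.0 - w_format) * step_accuracy
```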
VRPRM’s architecture fuses visual and text representations at each step:

$$r_t = g\big(\phi_v(I),\; \phi_\ell(q, s_{\le t})\big),$$

where $\phi_v$ is the visual encoder, $\phi_\ell$ the text encoder, and $g$ the reward predictor.
Despite using only 53.6K training samples (compared to VisualPRM's 400K), VRPRM achieves substantial improvements: on VisualProcessBench, VRPRM-7B yields AEI = 66.00 versus VisualPRM's 61.03, and in BoN evaluation on InternVL2.5-8B it lifts accuracy from VisualPRM's 42.06% to 59.65%, a relative gain of 41.8%. This highlights the efficacy of combining a small amount of high-quality CoT annotation with reinforcement learning on inexpensive, step-labeled data (Chen et al., 5 Aug 2025).
6. Strengths, Limitations, and Open Challenges
VisualPRM is the first multimodal PRM to demonstrate consistent BoN gains across diverse MLLM families and scales, including substantial improvements on larger models such as InternVL2.5-78B (+5.9 points). It provides step-level granularity and diagnostic power competitive with proprietary systems using only an 8B-parameter backbone. The macro-F1 results and BoN performance underscore its effectiveness over both SC and ORM baselines.
Original limitations include noisy and imbalanced training data (from the Monte Carlo pipeline), non-trivial inference latency, and lack of architectural scaling beyond the 8B-parameter model. VRPRM addresses several of these by using a small but high-fidelity CoT-annotated set and a reinforcement learning stage, reducing annotation cost by over 80% and further improving overall process judgment metrics.
Remaining challenges involve dependence on synthetic CoT annotation quality, additional computational overhead for RL-based methods, and the absence of extensions to video, 3D, or other richer modalities. Expansion of VisualPRM400K with more diverse domains and higher-fidelity human annotations, integration of ORM-PRM hybrids, exploration of lighter-weight critics, and extensions to other modalities are identified directions for future research (Wang et al., 13 Mar 2025, Chen et al., 5 Aug 2025).
7. Impact and Future Directions
VisualPRM and its successors instantiate a new paradigm for data-efficient, step-reward-based evaluation in multimodal reasoning. By bridging the multimodal reasoning gap for open-source MLLMs and enabling effective test-time selection strategies based on granular process supervision, VisualPRM supports systematic advances in model transparency, robustness, and performance. Future research is expected to build upon its stepwise, visually grounded reward modeling, extending to more diverse domains, richer multimodal contexts (including video), and integrating reward models directly into policy learning to enhance long-horizon, multimodal reasoning abilities (Wang et al., 13 Mar 2025, Chen et al., 5 Aug 2025).