VisualProcessBench: Multimodal Reasoning Benchmark

Updated 5 December 2025
  • VisualProcessBench is a multimodal benchmark delivering fine-grained, human-annotated evaluation of step-level reasoning in vision-language tasks.
  • It employs detailed annotations and robust metrics—such as macro-F1 and Best-of-N reranking—to precisely identify reasoning errors in complex problem-solving.
  • The benchmark drives advancements in test-time scaling and PRM development by enabling systematic comparison and rigorous error analysis in multimodal AI.

VisualProcessBench is a multimodal benchmark specifically designed to evaluate step-level reasoning correctness in visual LLMs and Process Reward Models (PRMs) for complex, multi-step vision-language reasoning tasks. It provides fine-grained, human-annotated labels on individual reasoning steps, enabling systematic measurement of PRM performance as intermediate “critics” in visual chain-of-thought (CoT) scenarios. It has become a crucial tool for advancing test-time scaling methods, PRM development, and rigorous error analysis in multimodal AI research (Wang et al., 13 Mar 2025, Wang et al., 11 Jun 2025).

1. Motivation and Benchmark Scope

The primary objective of VisualProcessBench is to measure the ability of PRMs and Multimodal LLMs (MLLMs) to accurately assess the correctness of each step in a visual CoT solution, rather than merely identifying the first erroneous step. This distinction is critical for robust performance in advanced test-time scaling (TTS) and Best-of-N (BoN) reranking workflows, where precise detection of all flawed reasoning steps supports downstream selection of optimal candidate solutions. Previous benchmarks only located the first error, underestimating detection difficulty when AI systems demonstrate self-reflection or produce plausible but faulty multi-step chains.

It addresses the need for a multimodal, human-annotated standard, exposing current models’ strengths and weaknesses in visuo-linguistic stepwise judgment and supporting systematic comparison across PRMs, ORMs, and alternative critic strategies (Wang et al., 13 Mar 2025).

2. Dataset Structure and Annotation Protocol

VisualProcessBench is constructed from five visual math-reasoning sources: MMMU (multidisciplinary), MathVision, MathVerse (vision-only split), DynaMath, and WeMath. Each sample comprises an image, a problem prompt, the ground-truth answer, and a full solution decomposed into “natural” steps (delimited by blank lines or explicit separators). The dataset is summarized as follows:

Subset       Problems   Solution Sources
MMMU         267        GPT-4o, Claude-3.5, QvQ-72B, InternVL2.5-78B
MathVision   712        As above
MathVerse    1,026      As above
DynaMath     570        As above
WeMath       291        As above
Total        2,866

In total, the 2,866 problems comprise 26,950 human-annotated reasoning steps.

Each step is annotated with one of three human-defined labels by expert annotators:

  • “+”: Correct and contributes valid reasoning
  • “−”: Incorrect due to logical or factual error
  • “o”: Neutral—provides no reasoning or purely formatting (ignored in evaluation)

Thirteen annotators (≥ university degree) completed 39 person-days of labeling, with 10% spot-checking per batch and a policy allowing annotators to skip incomprehensible steps. Solutions were originally generated by current frontier MLLMs, increasing domain realism but limiting diversity to model-generated chains (Wang et al., 13 Mar 2025).
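
To make the annotation scheme concrete, the sketch below shows how a single sample and its step labels might be represented. The field names and the worked example are illustrative assumptions, not the official release schema.

```python
# Hypothetical representation of one VisualProcessBench sample.
# Field names and values are illustrative; consult the official release for the real schema.
sample = {
    "image": "mathverse_0001.png",        # problem image
    "question": "What is the area of the shaded region?",
    "answer": "12",                       # ground-truth final answer
    "solution_source": "GPT-4o",          # MLLM that generated the solution chain
    "steps": [
        {"text": "The rectangle measures 4 by 6, so its area is 24.",    "label": "+"},
        {"text": "Half of the rectangle is shaded, giving 24 / 2 = 12.", "label": "+"},
        {"text": "Therefore, the answer is 12.",                         "label": "o"},
    ],
}

# Neutral "o" steps are ignored during evaluation; only "+" and "-" steps are scored.
evaluated_steps = [s for s in sample["steps"] if s["label"] in {"+", "-"}]
```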

3. Evaluation Metrics and Protocols

VisualProcessBench enables both step-wise and end-to-end evaluations for critic models:

  • Single-Pass Step Judgment: For each annotated step, the critic model produces a binary correctness judgment (“+” vs. “−”); steps labeled “o” are excluded. The following definitions apply:

    • Precision and recall per label:

      $P = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \quad R = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$

    • F1 per label:

      $F1 = \frac{2PR}{P + R}$

    • Macro-F1:

      $\mathrm{Macro\text{-}F1} = \frac{1}{2}\left(F1_{+} + F1_{-}\right)$

    • Micro-averaged F1 (as used by Athena-PRM):

      $F1_{\mathrm{micro}} = \frac{2\,\mathrm{TP}_{\mathrm{all}}}{2\,\mathrm{TP}_{\mathrm{all}} + \mathrm{FP}_{\mathrm{all}} + \mathrm{FN}_{\mathrm{all}}}$

    A computational sketch of these step-level metrics appears after this list.

  • Best-of-N (BoN) Reranking: The MLLM generates N candidate solutions; the critic model scores each candidate and reranks them by mean step-level reward, and the highest-scoring chain’s answer is submitted for task evaluation (see the reranking sketch below). BoN performance is measured by task-level accuracy over seven downstream multimodal reasoning benchmarks.
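
The following is a minimal sketch of how the step-level metrics above can be computed from human labels and model predictions. It is an illustrative implementation rather than the official evaluation script, and it treats “+” as the positive class for the micro-averaged variant, which is one plausible reading of the formula.

```python
# Illustrative step-level metric computation for VisualProcessBench.
# Inputs: parallel lists of human labels and model predictions, each "+" or "-"
# ("o"-labeled steps are assumed to have been filtered out beforehand).

def f1_for_label(labels, preds, target):
    """Per-label F1, treating `target` as the positive class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == target and p == target)
    fp = sum(1 for y, p in zip(labels, preds) if y != target and p == target)
    fn = sum(1 for y, p in zip(labels, preds) if y == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def macro_f1(labels, preds):
    """Unweighted mean of the F1 scores for the "+" and "-" classes."""
    return 0.5 * (f1_for_label(labels, preds, "+") + f1_for_label(labels, preds, "-"))

def micro_f1(labels, preds, positive="+"):
    """Micro-averaged F1 pooled over all steps (one reading of the Athena-PRM metric)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == positive and p == positive)
    fp = sum(1 for y, p in zip(labels, preds) if y != positive and p == positive)
    fn = sum(1 for y, p in zip(labels, preds) if y == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy example with five evaluated steps.
labels = ["+", "-", "+", "+", "-"]
preds  = ["+", "+", "+", "-", "-"]
print(macro_f1(labels, preds), micro_f1(labels, preds))
```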

This dual evaluation paradigm supports direct comparison between PRMs, ORMs, self-consistency, and fallback baseline models (Wang et al., 13 Mar 2025, Wang et al., 11 Jun 2025).
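
Below is a minimal sketch of the BoN reranking protocol, under the assumption that the critic returns one reward per step; `generate_candidates` and `score_steps` are hypothetical placeholders for the MLLM sampler and the PRM scorer, not a documented API.

```python
# Minimal sketch of Best-of-N (BoN) reranking with a step-level critic.
# `generate_candidates` and `score_steps` are hypothetical placeholders.
from statistics import mean

def best_of_n(image, question, generate_candidates, score_steps, n=8):
    """Return the candidate chain with the highest mean step-level reward."""
    candidates = generate_candidates(image, question, n=n)  # N candidate step lists

    def chain_score(steps):
        rewards = score_steps(image, question, steps)       # one reward per step
        return mean(rewards) if rewards else float("-inf")

    # The answer extracted from the top-ranked chain is what gets submitted
    # for task-level accuracy evaluation.
    return max(candidates, key=chain_score)
```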

4. Baseline Model Performance

VisualProcessBench provides detailed open-source and proprietary baseline results for single-pass step-level judgment (macro-F1) and BoN-8 accuracy. Selected results are summarized below:

Model                        Macro-F1 (%)
Random Guessing              50.0
GPT-4o-Mini                  57.9
GPT-4o                       60.3
Gemini-2.0-Flash             62.3
VisualPRM-8B                 62.0
Athena-PRM (Qwen2.5-VL-7B)   65.9
Qwen2.5-VL-72B               60.5
InternVL2.5-78B              52.6

Athena-PRM, which uses data-efficient consistency-based labeling and ORM initialization, establishes a new state-of-the-art macro-F1 (65.9%, +3.9 points over VisualPRM-8B) and shows marked gains, especially on the MMMU and WeMath subsets (+15.6 and +12.0 F1, respectively) (Wang et al., 11 Jun 2025). Ablation studies indicate incremental improvements from ORM initialization, filtered data, and negative up-sampling.

Open-source MLLMs exhibit high recall but poor precision on incorrect steps, often failing to localize errors precisely, a gap that PRMs narrow. VisualPRM and Athena-PRM, which use advanced label filtering and negative mining, approach or surpass proprietary models such as Gemini-2.0-Flash in macro-F1.

5. Critical Properties and Limitations

VisualProcessBench is characterized by:

  • Multidomain coverage: spanning multidisciplinary, mathematical, and logical reasoning over images.
  • Rigorous human annotation: stepwise supervision from expert annotators ensures label quality and detection of nuanced logical flaws.
  • Granular step correctness: evaluation focuses on exact identification of all erroneous reasoning steps, not solely the first failure.
  • Flexible error analysis: metrics support macro- and micro-averaged F1 and ablation of specific error types/difficulties.

Limitations include:

  • Annotation cost and dataset scale (≈2.9K questions, 27K steps).
  • Solutions derived from a limited set of high-performing MLLMs, potentially restricting reasoning diversity.
  • Neutral/formatting steps are ignored, and partial credit for intra-step errors is not awarded.
  • The focus is exclusively on reasoning step correctness, not on answer span scoring or generative quality (Wang et al., 13 Mar 2025).

6. Integration Within the Multimodal Reasoning Ecosystem

VisualProcessBench catalyzed the development and evaluation of leading PRMs, including VisualPRM (Wang et al., 13 Mar 2025) and Athena-PRM (Wang et al., 11 Jun 2025). It enables:

  • Systematic benchmark-driven comparison between critic architectures.
  • Support for advanced TTS and BoN reranking pipelines, contributing to higher final task accuracy (e.g., Athena-PRM yielded +10.2 points on WeMath in TTS).
  • Fine-grained diagnostic studies, exposing current model failure modes (e.g., systematic overprediction of correctness by open models).
  • Robust reward modeling, crucial for reinforcement learning from reasoning processes (as opposed to pure outcome rewards).

Related resources such as VisualPRM400K (for PRM training) and alternate visual prompting benchmarks (e.g., VP-Bench for visual cue interpretation) address complementary facets of multimodal reasoning and grounding, but VisualProcessBench uniquely targets step-level visuo-linguistic CoT verification (Wang et al., 13 Mar 2025, Xu et al., 14 Nov 2025).

7. Access and Reproducibility

VisualProcessBench and associated PRMs/datasets are publicly available at https://internvl.github.io/blog/2025-03-13-VisualPRM/, with version 1.0 stable as of March 2025. This accessibility supports reproducible research and rapid iteration on visuo-linguistic reward models and evaluation protocols. Users are encouraged to cite Wang et al. (CVPR 2025) for benchmark usage (Wang et al., 13 Mar 2025).
