ConfProBench: Benchmark for MPJ Confidence
- ConfProBench is a comprehensive benchmark that evaluates the reliability of confidence scores from MLLM-based Process Judges using controlled adversarial perturbations.
- The benchmark introduces three novel metrics—CRS, CSS, and CCS—to assess robustness, sensitivity to error types, and calibration, providing fine-grained insights into model performance.
- Empirical results show that proprietary models lead in error-type sensitivity and calibration while several open-source models are more robust to perturbation, highlighting key challenges in managing overconfidence and error detection in multimodal reasoning.
ConfProBench is a comprehensive benchmark designed to rigorously evaluate the reliability of step-level confidence scores generated by Multimodal LLM (MLLM)-based Process Judges (MPJs). MPJs have become central in assessing the correctness of intermediate reasoning steps in complex multimodal tasks. While traditional benchmarks focused on accuracy in step classification or error-type identification, ConfProBench uniquely targets the robustness, sensitivity, and calibration of the confidence outputs, addressing a critical gap for downstream applications such as model calibration, risk assessment, and active learning (Zhou et al., 6 Aug 2025).
1. Motivation and Scope
ConfProBench was motivated by the proliferation of MPJs, which automatically classify each reasoning step in multimodal problem-solving pipelines and output a probabilistic confidence value. Reliable step-level confidence is essential—it directly informs automated trust calibration procedures, triggers corrective mechanisms (e.g., human intervention, re-generation), and underpins advanced workflows in multimodal QA systems. Previous benchmarks such as VisualProcessBench, MPBench, and ProJudgeBench provided step correctness and error identification metrics but did not systematically evaluate the quality of confidence estimates, particularly their robustness to minor perturbations and their calibration across diverse reasoning errors and modalities. ConfProBench directly addresses this shortfall by introducing a controlled suite of adversarial perturbations and novel confidence metrics (Zhou et al., 6 Aug 2025).
2. Dataset Construction and Perturbation Protocols
ConfProBench builds upon a 1,200-problem subset of ProJudgeBench, sampled to ensure diverse coverage: three difficulty levels (Middle-School, High-School, Competition), four science disciplines, three input types (single-image, multi-image, pure text), and seven error categories. Each problem includes student-generated solutions decomposed into reasoning steps, annotated with gold correctness and error types.
To test confidence robustness, three semantically preserving adversarial perturbation protocols are introduced:
- Synonym Substitution: For each step, GPT-4o produces five versions by replacing non-technical terms with semantically equivalent synonyms (e.g., "add" → "combine").
- Syntactic Transformation: GPT-4o generates five rewrites by applying structural changes such as voice alternation, adverbial repositioning, clause order swaps, phrase expansion, inversion/emphasis, and conditional restructuring (e.g., "From the equation, we isolate x" → "x is isolated from the equation").
- Image Perturbation: Each image is perturbed by a random low-level transformation (scaling, rotation, Gaussian noise, or color inversion) to alter appearance without changing underlying visual semantics.
All perturbed samples undergo manual verification to ensure semantic fidelity, thus isolating MPJ confidence responses to controlled, non-destructive input changes.
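The image perturbation protocol can be illustrated with a minimal, dependency-free sketch. Everything here is an assumption made for illustration rather than the benchmark's actual tooling: images are modeled as nested lists of 0-255 grayscale values, rotation is simplified to a 90-degree turn, scaling is nearest-neighbor upsampling, and the function names and the noise level `sigma` are hypothetical.

```python
import random

def invert_colors(img):
    # Color inversion: map each pixel value p to 255 - p.
    return [[255 - p for p in row] for row in img]

def add_gaussian_noise(img, rng, sigma=5.0):
    # Add zero-mean Gaussian noise and clamp back into the 0-255 range.
    return [[min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row]
            for row in img]

def rotate_90(img):
    # Clockwise quarter turn (a simplification of arbitrary-angle rotation).
    return [list(col) for col in zip(*img[::-1])]

def scale_nearest(img, factor=2):
    # Nearest-neighbor upscaling by an integer factor.
    return [[p for p in row for _ in range(factor)]
            for row in img for _ in range(factor)]

def perturb_image(img, rng):
    # Pick one low-level transformation at random, as in the protocol above.
    name = rng.choice(["scale", "rotate", "noise", "invert"])
    if name == "scale":
        return name, scale_nearest(img)
    if name == "rotate":
        return name, rotate_90(img)
    if name == "noise":
        return name, add_gaussian_noise(img, rng)
    return name, invert_colors(img)

if __name__ == "__main__":
    rng = random.Random(0)  # seeded so a run is reproducible
    kind, out = perturb_image([[0, 128], [255, 64]], rng)
    print(kind, out)
```

Passing an explicit `random.Random` instance keeps each perturbation reproducible, which matters when perturbed samples are later verified by hand.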
3. Evaluation Metrics: CRS, CSS, and CCS
ConfProBench introduces three complementary metrics for a multidimensional evaluation of confidence assignment:
3.1 Confidence Robustness Score (CRS)
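CRS assesses the stability of confidence under perturbation. Formally, let $c_i$ and $c_i'$ denote the step confidences for the original and perturbed inputs, respectively, over $N$ steps. Three sub-metrics are computed:
- Change Rate (CCR): the fraction of steps whose confidence changes, $\mathrm{CCR} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}[c_i' \neq c_i]$.
- Average Change Magnitude (ACCM): the average size of the confidence change $|c_i' - c_i|$.
- Significant Change Rate (SCCR): the fraction of steps whose change $|c_i' - c_i|$ exceeds a significance threshold.
CRS aggregates these three sub-metrics into a single score.
Higher CRS implies greater robustness to input changes.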
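A minimal sketch of the three robustness sub-metrics, given paired confidences for original and perturbed inputs, might look as follows. Several details are assumptions not fixed by the text: the significance threshold `tau` (0.2 is a placeholder), the tolerance `eps` used to decide that a confidence "changed", averaging ACCM over all steps, and the function name itself. The final CRS aggregation is deliberately left out.

```python
def robustness_submetrics(original, perturbed, tau=0.2, eps=1e-9):
    """Return CCR, ACCM and SCCR for paired step confidences.

    tau (significance threshold) and eps (float tolerance) are
    illustrative assumptions, not values taken from the benchmark.
    """
    if len(original) != len(perturbed) or not original:
        raise ValueError("need two non-empty sequences of equal length")
    # Absolute confidence change per step.
    deltas = [abs(p - o) for o, p in zip(original, perturbed)]
    n = len(deltas)
    ccr = sum(d > eps for d in deltas) / n   # share of steps that changed
    accm = sum(deltas) / n                   # mean absolute change (all steps)
    sccr = sum(d > tau for d in deltas) / n  # share with a large change
    return {"CCR": ccr, "ACCM": accm, "SCCR": sccr}

if __name__ == "__main__":
    orig = [0.9, 0.8, 0.6, 0.7]
    pert = [0.9, 0.7, 0.9, 0.7]
    print(robustness_submetrics(orig, pert))
```

Lower values of all three sub-metrics correspond to a more stable judge.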
3.2 Confidence Sensitivity Score (CSS)
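CSS quantifies the drop in confidence in the presence of specific error types. For error type $e$,
$\mathrm{CSS}_e = \bar{c}_{\text{true}} - \bar{c}_{e}$
where $\bar{c}_{\text{true}}$ is the mean confidence on true steps, and $\bar{c}_{e}$ is the mean confidence on steps with error type $e$. Aggregate sensitivity combines $\mathrm{CSS}_e$ over all error types.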
Larger positive CSS indicates improved error-type distinguishability.
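The per-error-type sensitivity can be sketched as a difference of mean confidences. This is an illustrative reading, not the authors' code: the input layout (a list of `(confidence, error_type)` pairs with `None` marking correct steps), the error-type labels in the example, and the use of an unweighted mean as the aggregate are all assumptions.

```python
from collections import defaultdict
from statistics import mean

def confidence_sensitivity(steps):
    """steps: iterable of (confidence, error_type); error_type is None
    for correct steps. Returns (per-type scores, aggregate score)."""
    steps = list(steps)
    correct = [c for c, e in steps if e is None]
    by_error = defaultdict(list)
    for c, e in steps:
        if e is not None:
            by_error[e].append(c)
    if not correct or not by_error:
        raise ValueError("need both correct and erroneous steps")
    base = mean(correct)  # mean confidence on correct steps
    # Positive score: confidence drops on this error type.
    per_type = {e: base - mean(cs) for e, cs in by_error.items()}
    # Assumption: aggregate as an unweighted mean over error types.
    aggregate = mean(list(per_type.values()))
    return per_type, aggregate

if __name__ == "__main__":
    data = [(0.9, None), (0.7, None),
            (0.4, "calculation"), (0.6, "calculation"),
            (0.9, "question_understanding")]
    print(confidence_sensitivity(data))
```

A negative per-type score, as in the `question_understanding` entry of this toy input, means the judge is more confident on steps with that error than on correct steps.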
3.3 Confidence Calibration Score (CCS)
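Calibration is measured using Expected Calibration Error (ECE) over $M$ bins:
$\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{N}\,\bigl|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\bigr|$
where $B_m$ is the set of steps whose confidence falls in bin $m$, and $\mathrm{acc}(B_m)$ and $\mathrm{conf}(B_m)$ are the accuracy and mean confidence within that bin, with class-specific ECEs $\mathrm{ECE}_{\text{correct}}$, $\mathrm{ECE}_{\text{incorrect}}$, and inter-class gap
$\Delta_{\mathrm{ECE}} = \bigl|\mathrm{ECE}_{\text{correct}} - \mathrm{ECE}_{\text{incorrect}}\bigr|$
These quantities are combined into the final CCS.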
A CCS of 1 is ideal, signifying perfect calibration and parity across correctness classes.
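The ingredients of the calibration score can be sketched with the standard equal-width-bin ECE. The sketch assumes 10 bins by default, assumes that the class-specific ECEs split steps by their gold label, and uses hypothetical function and parameter names. It stops at the inter-class gap and does not attempt the final CCS combination.

```python
def expected_calibration_error(confidences, judged_right, num_bins=10):
    """Standard ECE with equal-width bins on [0, 1].

    judged_right[i] is True when the judge's verdict on step i matches
    the gold label. num_bins=10 is an assumed default.
    """
    n = len(confidences)
    if n == 0 or n != len(judged_right):
        raise ValueError("need two non-empty sequences of equal length")
    bins = [[] for _ in range(num_bins)]
    for c, ok in zip(confidences, judged_right):
        idx = min(int(c * num_bins), num_bins - 1)  # c == 1.0 goes in last bin
        bins[idx].append((c, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(1 for _, ok in b if ok) / len(b)
        ece += len(b) / n * abs(acc - avg_conf)  # bin weight times gap
    return ece

def class_ece_gap(confidences, judged_right, gold_is_correct, num_bins=10):
    """ECE on gold-correct steps, on gold-incorrect steps, and their gap.

    Splitting by gold label is an assumption about how the
    class-specific ECEs are defined.
    """
    def subset(flag):
        idx = [i for i, g in enumerate(gold_is_correct) if g == flag]
        return expected_calibration_error(
            [confidences[i] for i in idx],
            [judged_right[i] for i in idx],
            num_bins,
        )
    ece_correct, ece_incorrect = subset(True), subset(False)
    return ece_correct, ece_incorrect, abs(ece_correct - ece_incorrect)

if __name__ == "__main__":
    conf = [0.9, 0.9, 0.6, 0.6]
    right = [True, True, False, False]
    gold = [True, True, False, False]
    print(expected_calibration_error(conf, right))
    print(class_ece_gap(conf, right, gold))
```

In the toy input, the judge is confidently wrong on the gold-incorrect steps, so the second class-specific ECE is much larger than the first, which is the overconfidence pattern the gap term is meant to expose.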
4. Model Coverage and Experimental Methodology
ConfProBench's evaluation covers 14 state-of-the-art MPJs:
- Proprietary: GPT-4o, GPT-4o-Mini, GPT-4.1, Gemini-2.5-flash (with and without chain-of-thought)
- Open-source: InternVL3-8B/14B/38B, Qwen2.5-VL-3B/7B/32B/72B, MiniCPM-V-2_6, QVQ-72B
For each model, every reasoning step $s_i$ receives a confidence $p_i$. Binary correctness labels $\hat{y}_i$ are obtained by thresholding at 0.5, and the final confidence $c_i$ is computed from $p_i$ and $\hat{y}_i$.
Perturbed and original confidences are compared by all metrics. Experiments are run on the full benchmark, with each third corresponding to one perturbation class.
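One plausible reading of the thresholding step is sketched below. It is an assumption, not the paper's confirmed rule: the raw score is treated as the probability that the step is correct, the label is taken as "correct" when that score is at least 0.5, and the final confidence is the probability assigned to the predicted label. The function name `judge_step` is hypothetical.

```python
def judge_step(p_correct, threshold=0.5):
    """Map a raw correctness probability to (label, confidence).

    Assumed rule: label is True (step judged correct) when
    p_correct >= threshold; confidence is the probability of the
    predicted label, so it always lies in [0.5, 1.0] for threshold 0.5.
    """
    if not 0.0 <= p_correct <= 1.0:
        raise ValueError("p_correct must lie in [0, 1]")
    label = p_correct >= threshold
    confidence = p_correct if label else 1.0 - p_correct
    return label, confidence

if __name__ == "__main__":
    for p in (0.8, 0.3, 0.5):
        print(p, judge_step(p))
```

Under this reading, a judge that outputs 0.3 is treated as 70% confident that the step is wrong, which puts confidences for "correct" and "incorrect" verdicts on the same scale.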
5. Empirical Results
Analysis of ConfProBench evaluations yields the following findings:
- Confidence Robustness (CRS): Qwen2.5-VL-32B leads among open-source models (CRS ≈ 0.81), followed by InternVL3-8B (0.77). GPT-4.1 is the top proprietary model (0.74). Several open-source models surpass flagship proprietary systems in robustness, indicating that sheer model scale does not guarantee optimal confidence robustness.
- Robustness Breakdown: Qwen2.5-VL-32B achieves the lowest CCR (15.8%), ACCM (6.3%), and near-zero SCCR (0.04%). In contrast, InternVL3-38B is least robust (CCR ≈ 61.7%, ACCM ≈ 9.3%, SCCR ≈ 6.9%).
- Confidence Sensitivity (CSS): Proprietary models excel: Gemini-2.5-flash (48.3), its no-thinking variant (42.1), and GPT-4.1 (38.5). Some open-source MPJs exhibit negative sensitivity on Question Understanding Errors (e.g., MiniCPM-V-2_6, –21.6), suggesting pathologically high confidence in the presence of certain mistakes.
- Confidence Calibration (CCS): GPT-4o achieves the best calibration (62.0); the Gemini variants score 48–51 and GPT-4.1 scores 37.7. Several open-source models are poorly calibrated (MiniCPM-V-2_6: CCS ≈ –48), with serious overconfidence on erroneous steps (ECE_incorrect ≫ ECE_correct).
- Aggregate Ranking: Gemini-2.5-flash attains the highest mean across CRS/CSS/CCS (≈53.3), with GPT-4o and GPT-4.1 close behind. Open-source models cluster between 30 and 46, and performance is not strictly monotonic in parameter count.
- Scale & Reasoning Mode: Scaling model size does not universally improve robustness or calibration, though CSS and CCS show moderate positive trends. Enabling chain-of-thought reasoning in Gemini boosts CRS and CSS but slightly diminishes CCS, indicating possible tradeoffs between these properties.
6. Implications and Future Work
ConfProBench’s results reveal that even leading MPJs demonstrate imperfect robustness to semantically neutral perturbations, uneven error-type sensitivity, and persistent calibration shortcomings—especially for incorrectly judged steps. The benchmark and its three metrics (CRS, CSS, CCS) provide new baselines and a multidimensional foundation for future research (Zhou et al., 6 Aug 2025).
Planned directions include the collection of human-annotated confidence judgments to align MPJ outputs with expert uncertainty estimates, extension of ConfProBench into safety-critical application domains (e.g., medical diagnosis, autonomous driving), and the investigation of model- or training-level interventions (such as adversarial confidence regularization) targeting simultaneous improvements in robustness, sensitivity, and calibration. These efforts are expected to drive further progress in trustworthy reasoning with MLLMs.