FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models

Published 11 May 2026 in cs.AI | (2605.10141v1)

Abstract: Recent neural theorem provers use reinforcement learning with verifiable rewards (RLVR), where proof assistants provide binary correctness signals. While verifiable rewards are cheap and scalable without reward hacking issues, they suffer from sparse credit assignment: models receive no learning signal from difficult problems where partial progress goes unrewarded. This motivates learned reward models that can evaluate proof quality beyond binary verification. However, comparing reward models is challenging since it typically requires expensive RL training ablations. To address this, we introduce FormalRewardBench, the first benchmark for evaluating reward models in formal theorem proving with Lean 4. Our benchmark consists of 250 preference pairs where correct proofs are paired with incorrect variants generated through five expert curated error injection strategies: forced mistakes, minimal single-point variations, verbose incorrect proofs, natural language justification, and Python code injection. We evaluate frontier LLMs (e.g., Claude Opus 4.5), judge LLMs (e.g., CompassJudger-1-14B), general-purpose LLMs (e.g., Qwen2.5-72B-Instruct), and specialized theorem proving models (e.g., DeepSeek-Prover-V2-7B). Our results reveal that frontier LLMs achieve the highest performance (59.8%) while specialized theorem provers perform the worst (24.4%), suggesting that theorem proving ability does not transfer to proof evaluation. We provide further insights on various error injection mechanisms, highlighting the challenging nature of most injection mechanisms. We release FormalRewardBench publicly to encourage more research on developing reward models in formal mathematics.


Authors (4)

Summary

The paper presents FormalRewardBench, a benchmark designed to evaluate fine-grained reward models for formal theorem proving using expert-curated error strategies.
It uses 250 preference pairs with diverse error injection methods to simulate realistic LLM failure modes and assess evaluation accuracy.
Empirical results reveal that general-purpose judge LLMs outperform specialized provers, exposing a generation-evaluation gap in formal proof verification.

FormalRewardBench: A Benchmark for Reward Models in Formal Theorem Proving

Motivation and Context

Recent advances in neural formal theorem proving are driven largely by reinforcement learning with verifiable rewards (RLVR), in which a proof assistant returns a binary correctness signal. These signals are cheap to verify and scale well, but they make credit assignment extremely sparse: every failed attempt receives the same reward, however much partial progress it made. This motivates learned reward models that can grade proof quality at a finer granularity. Comparing such reward models has been difficult, however, because it typically requires expensive RL training ablations. The paper "FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models" (2605.10141) addresses this gap with FormalRewardBench, a dedicated, controlled benchmark that enables direct, low-cost assessment of reward models for formal mathematical proofs in Lean 4.

Benchmark Construction and Methodology

FormalRewardBench comprises 250 preference pairs, constructed via expert-curated error injection into Lean 4 proofs sourced from MiniF2F, a diverse, olympiad-level mathematics dataset. For each pair, a correct, machine-verified proof is accompanied by an incorrect variant generated using one of five distinct strategies designed to emulate realistic LLM failure modes:

Minimal Single-Point Variations: Subtle yet semantically decisive modifications (e.g., hypothesis swaps, operator misselection) that preserve syntactic plausibility.
Natural Language Justification: Augmentation with misleading natural language commentary, exploiting verbosity bias in LLM judges.
Python Code Injection: Replacement of Lean proof steps with valid Python code, highlighting the distinction between computational and formal correctness.
Forced LLM Mistakes: Injection of errors typical of contemporary LLMs, such as wrong tactic application or improper lemma instantiation.
Verbose Incorrect Proofs: Elaboration of superficially sophisticated proofs whose extended reasoning masks fundamental flaws.
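As a rough illustration of what a single-point error can look like, here is a toy Lean 4 pair. It was written for this summary and is not taken from the benchmark; it uses only the core lemma `Nat.add_comm`. The incorrect variant is left commented out so the snippet still compiles.

```lean
-- Correct: the lemma is instantiated so that its statement matches the goal.
example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- Incorrect variant: only the argument order changes. Lean rejects it with a
-- type mismatch, because `Nat.add_comm b a` proves `b + a = a + b`.
-- example (a b : Nat) : a + b = b + a :=
--   Nat.add_comm b a
```

The two versions differ by one token and read equally plausibly, so a judge cannot separate them on surface features and has to track what the lemma actually states.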

Each incorrect proof is filtered on three criteria: it must be syntactically valid, it must be rejected by the type checker, and it must be non-trivial, so that telling it apart from the correct proof requires understanding formal semantics rather than pattern-matching or shallow heuristics. Proof generation employs Claude Opus 4.5 to avoid bias toward any single prover system and to broaden the coverage of error modes.

Evaluation Protocol and Model Selection

FormalRewardBench adopts both pointwise and pairwise evaluation paradigms. In the pointwise setting, the reward model scores each proof independently, and a pair counts as correct when $r(P_{\text{correct}}) > r(P_{\text{incorrect}})$. In the pairwise setting, the model is prompted to choose explicitly between the two proofs, and results are reported under position-consistent protocols to control for order bias.
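The two metrics can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' evaluation code: `PreferencePair`, `score`, and `choose` are hypothetical names, ties in the pointwise setting are counted as failures, and "position-consistent" is assumed to mean that a pair is credited only if the correct proof is chosen in both presentation orders.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class PreferencePair:
    """One benchmark item (hypothetical field names)."""
    theorem: str    # Lean 4 theorem statement
    correct: str    # machine-verified proof
    incorrect: str  # error-injected variant


def pointwise_accuracy(
    pairs: Sequence[PreferencePair],
    score: Callable[[str, str], float],
) -> float:
    """Fraction of pairs where r(P_correct) > r(P_incorrect); ties fail."""
    if not pairs:
        raise ValueError("need at least one pair")
    hits = sum(
        score(p.theorem, p.correct) > score(p.theorem, p.incorrect)
        for p in pairs
    )
    return hits / len(pairs)


def pairwise_consistent_accuracy(
    pairs: Sequence[PreferencePair],
    choose: Callable[[str, str, str], int],
) -> float:
    """choose(theorem, first, second) returns 0 or 1, the preferred position.

    A pair is credited only if the correct proof wins in both orders
    (assumed definition of position consistency).
    """
    if not pairs:
        raise ValueError("need at least one pair")
    hits = 0
    for p in pairs:
        wins_when_first = choose(p.theorem, p.correct, p.incorrect) == 0
        wins_when_second = choose(p.theorem, p.incorrect, p.correct) == 1
        hits += int(wins_when_first and wins_when_second)
    return hits / len(pairs)
```

Under this definition, a judge that always picks the first option scores zero, which is the behavior a position-controlled protocol is meant to expose.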

Model categories evaluated include:

Frontier LLMs (Claude Opus 4.5, GPT-5.2, Gemini 2.5 Flash, etc.)
Judge LLMs (CompassJudger, Selene, LMUnit, Skywork-Critic, RISE-Judge)
General-Purpose Math/Code LLMs (Qwen2.5-72B-Instruct, DeepSeek-Coder-V2-Lite)
Theorem Proving Specialists (DeepSeek-Prover, Gödel-Prover, etc.)

No fine-tuning or specialized prompt engineering is applied, which keeps the comparison uniform across models and easier to reproduce.

Empirical Results

The primary findings are summarized as follows:

Frontier LLMs outperform all other classes in both pointwise and pairwise settings, with Claude Opus 4.5 achieving 70.1% pointwise and 59.8% pairwise accuracy.
Judge LLMs perform substantially better than specialized theorem proving models (52.8% vs 12.8% pointwise accuracy among open-weight models), despite not being trained on Lean or proof data.
Domain-specific theorem provers exhibit the lowest accuracy (roughly 0–13% pointwise and 0–9% pairwise), indicating that proof generation ability does not transfer to proof evaluation: a generation-evaluation gap.
Model size is not the primary driver: Con-J-Qwen2-7B (7B parameters) achieves 52.8% pointwise, outperforming multiple 70B+ models, which suggests that training objective and data matter more than scale.
Error injection strategies create a clear difficulty hierarchy: most models detect Python code replacement almost perfectly but struggle with verbose incorrect proofs and forced LLM mistakes, which shows that the benchmark can separate shallow from deep formal reasoning.
Substantial order (position) biases exist in pairwise judgment, especially among specialized provers (e.g., DeepSeek-Prover-V2-7B: 94.9% selection when the correct proof is first, but only 9.8% when second).
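A short calculation shows why such an asymmetry matters. It rests on assumptions the source does not state: that the two percentages are the rates at which the correct proof is chosen when shown first and second, that both are measured on the same set of pairs, and that a position-consistent score credits a pair only when the correct proof wins in both orders. Writing $p_1$ and $p_2$ for the two rates, the consistent accuracy $a$ is then bounded by

$$\max(0,\; p_1 + p_2 - 1) \;\le\; a \;\le\; \min(p_1,\; p_2).$$

With $p_1 = 0.949$ and $p_2 = 0.098$ this gives $0.047 \le a \le 0.098$. Under these assumptions, a high rate in one order cannot compensate for a low rate in the other, since the weaker order caps the consistent score.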

Analysis and Implications

The most striking outcome is that specialization for proof generation does not translate to effective proof evaluation. Specialized provers, trained via SFT/RL on correct proofs, appear not to have internalized robust error detection or grading mechanisms, which suggests that their representations support constructive synthesis but not critique. In contrast, judge LLMs and frontier generalists, exposed to broad preference data and trained for discernment rather than construction, carry their preference-evaluation capabilities over to the formal domain, even in the absence of Lean-specific data.

This has substantial implications for both methodology and system design in formal mathematics and software verification:

Reward modeling for formal reasoning will likely require data and objectives explicitly involving error detection and critique, not merely reinforcement on correct solutions.
The generation-evaluation gap mirrors analogous deficiencies in code LLMs, where writing code does not imply debugging proficiency.
Improvements in reward models on FormalRewardBench may enable denser, more informative credit assignment in RL for theorem proving, potentially accelerating progress in autoformalization and certified mathematics.
There is an evident need for hybrid pipelines where learned reward models augment (but do not replace) formal verification, particularly for inference-time proof selection, curriculum construction, or adversarial example mining.
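One way to picture the hybrid pipeline for inference-time proof selection is the sketch below. It is an illustrative assumption, not a design from the paper: `select_proof`, `reward`, `verify`, and `max_checks` are hypothetical names, and `verify` stands in for a call to the Lean 4 checker, which is not implemented here.

```python
from typing import Callable, Optional, Sequence


def select_proof(
    candidates: Sequence[str],
    reward: Callable[[str], float],
    verify: Callable[[str], bool],
    max_checks: int,
) -> Optional[str]:
    """Rank candidates with a learned reward model, then verify in rank order.

    The reward model only decides which candidates are checked first;
    acceptance is always decided by the formal verifier.
    """
    # Highest learned score first (sorted is stable, so ties keep input order).
    ranked = sorted(candidates, key=reward, reverse=True)
    # Spend at most `max_checks` verifier calls on the top-ranked candidates.
    for proof in ranked[:max_checks]:
        if verify(proof):
            return proof
    return None  # no verified proof found within the budget
```

Because the verifier keeps the final say, a weak reward model in this arrangement wastes checks but cannot cause an incorrect proof to be accepted.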

Future Directions

Future research is suggested in the following areas:

Development of robust, fine-grained reward models trained on explicit proof evaluation data—including negative and partially correct examples—which could improve RL training and proof selection schemes for neural provers.
Process-level evaluation: Extending the benchmark to handle step-by-step, intermediate proof state comparisons for even finer feedback.
Benchmark and evaluation expansion: Adapting FormalRewardBench methodologies to other proof assistants such as Coq or Isabelle, and to domains beyond mathematics, including program verification and logic.
Systematic mitigation of position bias in judge LLM architectures and inference procedures.

Conclusion

FormalRewardBench (2605.10141) provides a controlled platform for evaluating reward models in formal theorem proving. Its construction from expert-designed error strategies exposes the limitations of current generation-focused approaches, in particular the failure of specialized provers to evaluate proof correctness robustly. The results indicate that preference-based LLM training transfers to formal proof evaluation, and they underscore the need for objectives that train critique directly in future formal reasoning systems. The benchmark is released publicly to support the development of reliable, informative reward models for mathematics, logic, and related domains.
