- The paper presents FormalRewardBench, a benchmark designed to evaluate fine-grained reward models for formal theorem proving using expert-curated error strategies.
- It uses 250 preference pairs with diverse error injection methods to simulate realistic LLM failure modes and assess evaluation accuracy.
- Empirical results reveal that general-purpose judge LLMs outperform specialized provers, exposing a generation-evaluation gap in formal proof verification.
Motivation and Context
Recent advances in neural formal theorem proving are primarily driven by RLVR (Reinforcement Learning with Verifiable Rewards), leveraging binary correctness signals output by proof assistants. While these signals are naturally verifiable and operationally scalable, their binary nature hinders granular learning due to extremely sparse credit assignment: all non-successful attempts, regardless of partial progress, are penalized identically. This motivates the creation of learned reward models capable of producing fine-grained evaluations of proof quality. However, prior attempts to evaluate such reward models for formal theorem proving have been stymied by the high cost of RL training runs required for empirical comparison. The paper "FormalRewardBench: A Benchmark for Formal Theorem Proving Reward Models" (2605.10141) addresses this gap by introducing a dedicated, controlled benchmark—FormalRewardBench—that enables direct, cost-effective assessment of reward models for formal mathematical proofs in Lean 4.
Benchmark Construction and Methodology
FormalRewardBench comprises 250 preference pairs, constructed via expert-curated error injection into Lean 4 proofs sourced from MiniF2F, a diverse, olympiad-level mathematics dataset. For each pair, a correct, machine-verified proof is accompanied by an incorrect variant generated using one of five distinct strategies designed to emulate realistic LLM failure modes:
- Minimal Single-Point Variations: Subtle yet semantically decisive modifications (e.g., hypothesis swaps, operator misselection) that preserve syntactic plausibility.
- Natural Language Justification: Augmentation with misleading natural language commentary, exploiting verbosity bias in LLM judges.
- Python Code Injection: Replacement of Lean proof steps with valid Python code, highlighting the distinction between computational and formal correctness.
- Forced LLM Mistakes: Injection of errors typical of contemporary LLMs, such as wrong tactic application or improper lemma instantiation.
- Verbose Incorrect Proofs: Elaboration of superficially sophisticated proofs whose extended reasoning masks fundamental flaws.
Each incorrect proof is filtered for syntactic validity, typechecker rejection, and non-triviality, so that telling it apart from the correct proof requires actual comprehension of formal semantics rather than pattern-matching or shallow heuristics. Proof generation employs Claude Opus 4.5 to avoid bias toward any single prover system and to cover as wide a range of error modes as possible.
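The three filters can be pictured as a small predicate over candidate pairs. The following is a minimal sketch, not the paper's pipeline: the `PreferencePair` record, the `keep_pair` function, and the caller-supplied `parses` and `typechecks` callables are all hypothetical names. The whitespace-insensitive comparison is only an assumed stand-in for the non-triviality check, whose exact definition is not given here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PreferencePair:
    theorem: str    # Lean 4 statement
    correct: str    # machine-verified proof
    incorrect: str  # error-injected variant
    strategy: str   # label for the injection strategy that produced it


def keep_pair(
    pair: PreferencePair,
    parses: Callable[[str], bool],
    typechecks: Callable[[str, str], bool],
) -> bool:
    """Return True if the candidate pair passes all three (assumed) filters."""
    # Non-triviality (assumed proxy): the variant must differ from the
    # correct proof by more than whitespace.
    if pair.incorrect.split() == pair.correct.split():
        return False
    # Syntactic validity: the variant must be accepted by the supplied parser check.
    if not parses(pair.incorrect):
        return False
    # Typechecker rejection: the reference proof passes, the variant does not.
    return typechecks(pair.theorem, pair.correct) and not typechecks(
        pair.theorem, pair.incorrect
    )
```

In practice `typechecks` would wrap a call to the Lean 4 toolchain; it is left as an injected callable here so the sketch makes no claims about any particular interface.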
Evaluation Protocol and Model Selection
FormalRewardBench adopts both pointwise and pairwise evaluation paradigms. In the pointwise setting, reward models independently score each proof, requiring $r(P_{\text{correct}}) > r(P_{\text{incorrect}})$. In the pairwise setting, the model is prompted to explicitly choose between the two proofs, with results reported under position-consistent protocols to control for order bias.
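The two protocols can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the benchmark's actual harness: `score` and `choose` are hypothetical callables standing in for a reward model, and "position-consistent" is assumed here to mean that a pair counts only if the model picks the correct proof in both presentation orders.

```python
from typing import Callable, Sequence, Tuple

Pair = Tuple[str, str]  # (correct_proof, incorrect_proof)


def pointwise_accuracy(pairs: Sequence[Pair], score: Callable[[str], float]) -> float:
    """Fraction of pairs where the correct proof gets the strictly higher score."""
    hits = sum(1 for good, bad in pairs if score(good) > score(bad))
    return hits / len(pairs)


def position_consistent_accuracy(
    pairs: Sequence[Pair], choose: Callable[[str, str], int]
) -> float:
    """Fraction of pairs judged correctly in both presentation orders.

    choose(first, second) returns 0 to prefer the first proof shown
    and 1 to prefer the second.
    """
    hits = 0
    for good, bad in pairs:
        right_when_first = choose(good, bad) == 0
        right_when_second = choose(bad, good) == 1
        if right_when_first and right_when_second:
            hits += 1
    return hits / len(pairs)
```

Under this reading, a judge that always picks the first proof shown scores zero on the position-consistent metric, even though it is right half the time in any single ordering.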
Model categories evaluated include:
- Frontier LLMs (Claude Opus 4.5, GPT-5.2, Gemini 2.5 Flash, etc.)
- Judge LLMs (CompassJudger, Selene, LMUnit, Skywork-Critic, RISE-Judge)
- General-Purpose Math/Code LLMs (Qwen2.5-72B-Instruct, DeepSeek-Coder-V2-Lite)
- Theorem Proving Specialists (DeepSeek-Prover, Goedel-Prover, etc.)
No fine-tuning or specialized prompt engineering is applied, ensuring a fair, reproducible comparison.
Empirical Results
The primary findings are summarized as follows:
- Frontier LLMs outperform all other classes in both pointwise and pairwise settings, with Claude Opus 4.5 achieving 70.1% pointwise and 59.8% pairwise accuracy.
- Judge LLMs perform substantially better than specialized theorem proving models (52.8% vs. 12.8% pointwise accuracy among open-weight models), despite not being trained on Lean or proof data.
- Domain-specific theorem provers exhibit the lowest accuracy (roughly 0–13% pointwise and 0–9% pairwise), indicating that proof generation ability does not transfer to proof evaluation competence: a generation-evaluation gap.
- Model size is not the primary driver: Con-J-Qwen2-7B (7B parameters) achieves 52.8% pointwise, outperforming multiple 70B+ models, which suggests that training objective and data matter more than parameter count.
- Error injection strategies create a clear difficulty hierarchy: Most models perform near-perfectly at detecting Python code replacement but are challenged by verbose incorrect proofs and forced LLM mistakes, highlighting the benchmark’s capacity to differentiate between shallow and deep formal reasoning.
- Substantial order (position) biases exist in pairwise judgment, especially among specialized provers (e.g., DeepSeek-Prover-V2-7B: 94.9% selection when the correct proof is first, but only 9.8% when second).
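The position-bias figures quoted for DeepSeek-Prover-V2-7B can be turned into a short worked calculation. The sketch below assumes the two percentages are accuracies measured on the same set of pairs in each ordering. The function name and dictionary keys are illustrative. The bounds on the share of pairs judged correctly in both orders follow from elementary probability: that share is at most the smaller of the two accuracies and at least their sum minus one.

```python
def position_bias_summary(acc_correct_first: float, acc_correct_second: float) -> dict:
    """Summarize order sensitivity from two per-position accuracies in [0, 1]."""
    a, b = acc_correct_first, acc_correct_second
    return {
        # Difference between the two orderings; 0 means no order sensitivity.
        "gap": a - b,
        # Accuracy averaged over both orderings.
        "mean_accuracy": (a + b) / 2,
        # Bounds on the fraction of pairs judged correctly in BOTH orderings.
        "consistent_upper": min(a, b),
        "consistent_lower": max(0.0, a + b - 1.0),
    }


# Figures quoted above for DeepSeek-Prover-V2-7B (94.9% and 9.8%).
summary = position_bias_summary(0.949, 0.098)
for name, value in summary.items():
    print(f"{name}: {value:.1%}")
```

With these inputs the gap is about 85 percentage points, the order-averaged accuracy is close to chance, and at most 9.8% of pairs can be judged correctly in both orders, which is the pattern expected from a model that mostly selects whichever proof appears first.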
Analysis and Implications
The most striking outcome is that specialization for proof generation does not translate to effective proof evaluation. Specialized provers, trained via SFT/RL on correct proofs, have not internalized robust error detection or grading mechanisms; this suggests that their representations are not suited for critique, only constructive synthesis. In contrast, judge LLMs and frontier generalists, exposed to broad preference data and trained for discernment rather than construction, generalize preference-evaluation capabilities to the formal domain—even in the absence of Lean-specific data.
This has substantial implications for both methodology and system design in formal mathematics and software verification:
- Reward modeling for formal reasoning will likely require data and objectives explicitly involving error detection and critique, not merely reinforcement on correct solutions.
- The generation-evaluation gap mirrors analogous deficiencies in code LLMs, where writing code does not imply debugging proficiency.
- Improvements in reward models on FormalRewardBench may enable denser, more informative credit assignment in RL for theorem proving, potentially accelerating progress in autoformalization and certified mathematics.
- There is an evident need for hybrid pipelines where learned reward models augment (but do not replace) formal verification, particularly for inference-time proof selection, curriculum construction, or adversarial example mining.
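One way to read the hybrid-pipeline point for inference-time proof selection is sketched below: a learned reward model only orders the candidates, and the proof assistant remains the sole arbiter of correctness. This is a minimal sketch under assumptions, not a design from the paper. `select_proof`, `score`, `verify`, and `budget` are hypothetical names, and `verify` stands in for a call to a formal checker such as the Lean 4 typechecker.

```python
from typing import Callable, Optional, Sequence


def select_proof(
    candidates: Sequence[str],
    score: Callable[[str], float],
    verify: Callable[[str], bool],
    budget: int,
) -> Optional[str]:
    """Return the first formally verified candidate among the top-scored ones.

    The reward model decides which candidates are checked first; only the
    verifier decides whether a candidate is accepted.
    """
    # Highest learned-reward score first.
    ranked = sorted(candidates, key=score, reverse=True)
    # Spend at most `budget` verifier calls.
    for proof in ranked[:budget]:
        if verify(proof):
            return proof
    return None
```

A better reward model in this setup does not change which proofs are accepted, only how many verifier calls are spent before a valid one is found.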
Future Directions
Future research is suggested in the following areas:
- Development of robust, fine-grained reward models trained on explicit proof evaluation data—including negative and partially correct examples—which could improve RL training and proof selection schemes for neural provers.
- Process-level evaluation: Extending the benchmark to handle step-by-step, intermediate proof state comparisons for even finer feedback.
- Benchmark and evaluation expansion: Adapting FormalRewardBench methodologies to other proof assistants such as Coq or Isabelle, and to domains beyond mathematics, including program verification and logic.
- Systematic mitigation of position bias in judge LLM architectures and inference procedures.
Conclusion
FormalRewardBench (2605.10141) establishes a foundational, controlled platform for evaluating reward models in formal theorem proving. Its comprehensive construction using expert-designed error strategies reveals the limitations of current generation-focused approaches, particularly the failure of specialized provers to robustly evaluate proof correctness. The benchmark elucidates the transferability of preference-based LLM training and underscores the need for direct, critical evaluation objectives in future formal reasoning systems. Public availability of the benchmark promises to catalyze progress in the development of reliable, informative reward modeling tools for mathematics, logic, and beyond.