Training Step-Level Reasoning Verifiers with Formal Verification Tools
(2505.15960v1)
Published 21 May 2025 in cs.CL
Abstract: Process Reward Models (PRMs), which provide step-by-step feedback on the reasoning generated by LLMs, are receiving increasing attention. However, two key research gaps remain: collecting accurate step-level error labels for training typically requires costly human annotation, and existing PRMs are limited to math reasoning problems. In response to these gaps, this paper aims to address the challenges of automatic dataset creation and the generalization of PRMs to diverse reasoning tasks. To achieve this goal, we propose FoVer, an approach for training PRMs on step-level error labels automatically annotated by formal verification tools, such as Z3 for formal logic and Isabelle for theorem proof, which provide automatic and accurate verification for symbolic tasks. Using this approach, we synthesize a training dataset with error labels on LLM responses for formal logic and theorem proof tasks without human annotation. Although this data synthesis is feasible only for tasks compatible with formal verification, we observe that LLM-based PRMs trained on our dataset exhibit cross-task generalization, improving verification across diverse reasoning tasks. Specifically, PRMs trained with FoVer significantly outperform baseline PRMs based on the original LLMs and achieve competitive or superior results compared to state-of-the-art PRMs trained on labels annotated by humans or stronger models, as measured by step-level verification on ProcessBench and Best-of-K performance across 12 reasoning benchmarks, including MATH, AIME, ANLI, MMLU, and BBH. The datasets, models, and code are provided at https://github.com/psunlpgroup/FoVer.
This paper, "Training Step-Level Reasoning Verifiers with Formal Verification Tools" (Kamoi et al., 21 May 2025), introduces FOVER, a novel approach for training Process Reward Models (PRMs) using automatically generated step-level error labels derived from formal verification tools. The core motivation is to address two significant limitations in current PRM development: the high cost and difficulty of obtaining accurate human-annotated step-level labels, and the limited generalization of PRMs, which are often primarily trained and evaluated on mathematical reasoning tasks.
FOVER tackles these challenges by leveraging the automatic and accurate verification capabilities of formal tools like Z3 (for formal logic) and Isabelle (for formal theorem proof). The key idea is that tasks compatible with these tools can have their reasoning steps verified programmatically, providing a source of large-scale, accurate step-level error labels without human intervention. Although the training data is synthesized only for these specific symbolic tasks, the trained LLM-based PRMs demonstrate cross-task generalization, showing improved reasoning verification capabilities across a broad range of out-of-distribution tasks.
The practical implementation of FOVER involves a dataset creation pipeline and the training of PRMs.
Dataset Creation Process:
Initial Response Generation: LLMs (Llama 3.1 8B and Qwen 2.5 7B in this work) generate step-by-step solutions for formal logic and formal theorem proof problems.
For formal logic tasks, LLMs are directly prompted to generate formal solutions in a format compatible with Z3.
For formal theorem proof tasks, the LLMs first generate informal step-by-step reasoning for math word problems drawn from datasets such as GSM8K, MetaMathQA, and Big-Math.
Informal to Formal Conversion (for Formal Theorem Proof): Since the base LLMs used are relatively small and struggle with complex Isabelle syntax, a stronger LLM (Llama 3.3 70B) is used to translate the informal reasoning steps into formal Isabelle proofs. This step relies on few-shot prompting and utilizes Isabelle's Sledgehammer tool to simplify the conversion by removing the need to specify supporting lemmas.
Automatic Error Annotation: Formal verification tools are applied to the generated formal solutions/proofs.
Formal Logic: Z3 is used. Each step is checked independently for logical validity given the premises and the preceding steps (see the Z3 sketch after this list).
Formal Theorem Proof: Isabelle/HOL is used. Because Isabelle normally stops at the first error, a custom process inserts the sorry keyword, a placeholder that Isabelle accepts without proof, into every step except the one being checked, so that the syntax and logical validity of the target step can be verified in isolation (a sketch of this isolation trick also follows the list).
Dataset Assembly: The process yields a dataset of LLM responses with automatically annotated step-level error labels (correct/incorrect). The dataset is structured to provide balanced step-level labels for training by selectively masking steps (Table 2c, Appendix H.1), ensuring the PRM training sees a mix of correct and incorrect reasoning steps.
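The per-step Z3 check can be viewed as an entailment test: a step is valid iff the premises together with all preceding steps entail it. Below is a minimal sketch using Z3's Python bindings (the z3-solver package); the toy propositions and the step_is_valid helper are illustrative assumptions, not the paper's pipeline code:

```python
from z3 import Bool, Implies, Not, Solver, unsat

def step_is_valid(premises, prior_steps, step):
    # A step is valid iff premises + prior steps entail it, i.e.,
    # premises AND prior_steps AND NOT(step) is unsatisfiable.
    s = Solver()
    s.add(*premises, *prior_steps, Not(step))
    return s.check() == unsat

# Toy problem: premises p, p -> q, q -> r; candidate steps derive q, then r.
p, q, r = Bool("p"), Bool("q"), Bool("r")
premises = [p, Implies(p, q), Implies(q, r)]
steps = [q, r]

labels = [step_is_valid(premises, steps[:i], step)
          for i, step in enumerate(steps)]
print(labels)  # [True, True]; an invalid step would be labeled False
```

Checking each step against only the premises and prior steps, rather than the full solution, is what yields independent step-level labels even when an earlier step is wrong.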
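The sorry-based isolation for Isabelle can be sketched similarly. This sketch assumes proof steps of the form have "..." by ...; the isolate_step helper and its string handling are hypothetical simplifications of the custom verification code the paper describes (Appendix F.3):

```python
def isolate_step(steps: list[str], target: int) -> str:
    """Build a proof variant in which only step `target` is actually checked."""
    lines = []
    for i, step in enumerate(steps):
        if i == target:
            lines.append(step)               # keep the real justification
        else:
            # keep the stated goal but discharge it with the sorry placeholder
            goal = step.split(" by ")[0]     # assumes "have ... by ..." steps
            lines.append(goal + " sorry")
    return "\n".join(lines)

# Illustrative Isabelle-style steps; one verification run per variant,
# so step i is labeled correct iff variant i passes Isabelle's check.
steps = [
    'have "x = (2::nat) * 3" by simp',
    'then have "x = 6" by simp',
]
variants = [isolate_step(steps, i) for i in range(len(steps))]
```

Each variant requires a separate Isabelle run, which is why this stage dominates the dataset-creation cost noted under the limitations below.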
Training the PRMs:
The generated FoVer dataset is used to fine-tune base LLMs (Llama 3.1 8B and Qwen 2.5 7B) to act as PRMs.
The LLMs are fine-tuned to predict a binary label ("correct" or "incorrect") for each input step.
The input format presents the problem context and the solution steps in a conversational turn-taking format, where the PRM is prompted after each step to provide a label (Appendix G.1); a sketch of this format follows below.
Training is performed by fine-tuning all model parameters using the AdamW optimizer. Hyperparameters like learning rate are tuned based on performance on validation tasks (Appendix H.2).
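To make the turn-taking format concrete, the following is a minimal sketch of how an annotated solution could be serialized into a chat-style fine-tuning example. The exact prompt wording lives in the paper's Appendix G.1, so the message strings and the build_chat helper below are assumptions:

```python
def build_chat(problem: str, steps: list[str], labels: list[bool]) -> list[dict]:
    """Serialize one annotated solution into turn-taking fine-tuning messages."""
    messages = [{"role": "user", "content": f"{problem}\n\nStep 1: {steps[0]}"}]
    for i, ok in enumerate(labels):
        # supervision target: the PRM's binary verdict for step i
        messages.append({"role": "assistant",
                         "content": "correct" if ok else "incorrect"})
        if i + 1 < len(steps):
            messages.append({"role": "user",
                             "content": f"Step {i + 2}: {steps[i + 1]}"})
    return messages

# Example: the second step contains an error, so its label is "incorrect".
chat = build_chat(
    "Compute x given x = 2 * 3.",
    ["x = 2 * 3 is given.", "Therefore x = 5."],
    [True, False],
)
# During fine-tuning, the loss is applied to the assistant turns; all model
# parameters are updated with AdamW, per the paper's training setup.
```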
Practical Applications and Evaluation:
The trained FoVer PRMs can be used as reasoning verifiers. The paper evaluates them in two standard settings:
Best-of-K Re-ranking: The PRM scores multiple generated solutions (K=5) for a given problem; each candidate is assigned an aggregate score (here, the minimum step-level score across all of its steps), and the candidate with the highest aggregate score is selected as the "best" response (see the sketch after this list). This evaluates the PRM's ability to distinguish better overall solutions based on step-level correctness, and is tested on 12 diverse reasoning benchmarks (Table 3, Appendix I.1).
Step-Level Binary Classification: The PRM's ability to correctly label individual steps as correct or incorrect is directly evaluated on datasets with human-annotated step-level labels, such as ProcessBench (Table 4).
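As a compact illustration of the Best-of-K procedure described above, the sketch below applies min-aggregation over step scores; prm_step_scores and best_of_k are hypothetical stand-ins, not the paper's code:

```python
from typing import Callable

def best_of_k(
    problem: str,
    candidates: list[list[str]],  # K candidate solutions, each a list of steps
    prm_step_scores: Callable[[str, list[str]], list[float]],
) -> list[str]:
    """Return the candidate whose weakest step the PRM trusts most."""
    def solution_score(steps: list[str]) -> float:
        # min-aggregation: one flawed step should sink the whole solution
        return min(prm_step_scores(problem, steps))
    return max(candidates, key=solution_score)
```

Min-aggregation reflects the step-level framing: a solution is only as trustworthy as its least trustworthy step.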
Key Results and Practical Implications:
Automatic Annotation is Effective: Training PRMs on the FoVer dataset significantly improves their performance on diverse reasoning tasks compared to using the original base LLMs as verifiers (Figure 2, Table 3, Table 4). This demonstrates that labels from formal verification tools are a viable and effective alternative to human annotation.
Cross-Task Generalization: A notable finding is that PRMs trained only on formal logic and theorem proof tasks show improved verification capabilities across a wide range of out-of-distribution tasks, including informal math (GSM8K, MATH), logic (FOLIO, LogicNLI), NLI (ANLI, HANS), and multi-task understanding (MMLU-Pro-NoMath) (Table 3). This suggests that training on rigorous, symbolic tasks imparts generalizable reasoning verification skills.
Competitive Performance: FoVer PRMs achieve performance competitive with or superior to state-of-the-art PRMs trained on human-annotated or stronger-model-annotated labels (Table 3, Table 4), despite not using these expensive resources.
Genuine Improvement: Manual analysis confirms that FoVer genuinely improves step-level verification accuracy compared to baseline PRMs and rarely degrades it (Figure 6, Appendix J.3).
Implementation Considerations and Limitations:
Computational Cost: While training is relatively fast (around 30 minutes for an 8B model on four A100 GPUs), the dataset creation process, particularly step-level verification using Isabelle, is computationally intensive and time-consuming, taking weeks on multiple servers for the dataset size used (Appendix L). Z3 verification is faster.
Formal Tool Integration: Integrating formal verification tools like Z3 and Isabelle into the data generation pipeline requires expertise in these tools and careful handling of their input/output formats. Custom code is needed for step-level verification in Isabelle (Appendix F.3).
Informal-to-Formal Conversion: The reliance on a stronger LLM for converting informal math steps to formal Isabelle proofs is an external dependency. It suffices for the base LLMs used here, but applying this framework to state-of-the-art LLMs might require different strategies or more complex formalisms, especially if those LLMs can already generate formal proofs directly.
Scalability: The manual setup and computational cost of Isabelle verification could be a bottleneck for generating significantly larger datasets or applying the method to more complex formalisms.
Model Size: The paper focuses on 8B-class models due to computational constraints (Appendix A). Applying FoVer to larger, more capable LLMs might reveal different scaling properties or require training data derived from even more challenging problems to induce meaningful errors for verification.
Hyperparameter Tuning: The paper indicates that hyperparameter tuning, especially learning rate, is important but complex for cross-task generalization (Appendix J.2). More stable training strategies may be needed.
In summary, FoVer presents a practical and effective method for training PRMs by replacing expensive human annotation with automatic verification from formal tools. The observed cross-task generalization, even from symbolic training data, highlights the potential of this approach for developing general-purpose reasoning verifiers, offering a scalable alternative or complement to existing methods. The primary practical hurdle lies in the computational cost and technical complexity of integrating and running formal verification tools, particularly theorem provers like Isabelle, at scale for dataset generation.