Learning to Generate Formally Verifiable Step-by-Step Logic Reasoning via Structured Formal Intermediaries

Published 31 Mar 2026 in cs.AI and cs.LG | (2603.29500v1)

Abstract: LLMs have recently demonstrated impressive performance on complex, multi-step reasoning tasks, especially when post-trained with outcome-rewarded reinforcement learning (Guo et al., 2025). However, it has been observed that outcome rewards often overlook flawed intermediate steps, leading to unreliable reasoning steps even when final answers are correct. To address this unreliable reasoning, we propose PRoSFI (Process Reward over Structured Formal Intermediates), a novel reward method that enhances reasoning reliability without compromising accuracy. Instead of generating formal proofs directly, which is rarely accomplishable for a modest-sized (7B) model, the model outputs structured intermediate steps aligned with its natural language reasoning. Each step is then verified by a formal prover. Only fully validated reasoning chains receive high rewards. The integration of formal verification guides the model towards generating step-by-step machine-checkable proofs, thereby yielding more credible final answers. PRoSFI offers a simple and effective approach to training trustworthy reasoning models.


Authors (3)

Summary

The paper presents PRoSFI, a process reward built on Structured Formal Intermediaries (SFI): LLMs generate structured, machine-verifiable reasoning steps so that the logical soundness of each step can be checked.
It demonstrates that SFI achieves high answer accuracy while markedly improving reasoning reliability compared to outcome-only approaches.
Empirical evaluations on benchmarks like ProverQA show that SFI enhances generalization and robustness in step-by-step logical reasoning.


Motivation and Problem Formulation

LLMs exhibit strong performance on multi-step logical reasoning tasks, particularly when outcome-based rewards are leveraged during reinforcement learning. Conventional RL fine-tuning strategies provide supervision at the level of final answer correctness but systematically fail to enforce the reliability and logical soundness of intermediate reasoning steps. This discrepancy leads to models that can produce correct answers with flawed or invalid reasoning chains, which is problematic in mathematical and safety-critical reasoning domains.

The paper introduces Structured Formal Intermediaries (SFI), a methodology in which LLMs generate stepwise structured reasoning aligned with their natural language explanations. Each reasoning step is represented in a machine-verifiable intermediate format (e.g., JSON/YAML) and then checked by a formal prover. Only chains that reach the correct answer and pass every step verification receive the maximal reward, which enforces reliability at the level of individual steps.

Figure 1: Pipeline of SFI, contrasting stepwise formal verification with traditional outcome reward; even correct answers can be penalized for flawed intermediate logic.

Methodological Framework

Structured Formal Intermediates

Rather than prompting LLMs to emit full formal proofs (e.g., Lean 4 scripts), which is rarely achievable for modest-sized (7B) models, SFI proposes a lightweight structured format for intermediate steps:

Each step is atomic and encodes its dependencies, its formal conclusion, and the logical rule applied (e.g., Modus Ponens, Hypothetical Syllogism).
Steps are machine-parsable and serve as sub-problems for formal verification.
This decouples LLM generation from full proof construction, minimizing the structural misalignment observed in direct formal output.
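The step format described above can be sketched as follows. This is a minimal illustration, not the paper's schema: the field names (`premises`, `rule`, `conclusion`), the string encoding of implications as `A -> B`, and the toy rule checker standing in for a real formal prover are all assumptions made for the example.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    premises: tuple   # formulas this step depends on (hypothetical field name)
    rule: str         # name of the inference rule applied
    conclusion: str   # formula the step claims to derive


def parse_steps(raw):
    """Parse a JSON array of step objects; raise ValueError if malformed."""
    items = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of steps")
    steps = []
    for item in items:
        if not isinstance(item, dict) or set(item) != {"premises", "rule", "conclusion"}:
            raise ValueError(f"malformed step: {item!r}")
        prem = item["premises"]
        if not isinstance(prem, list) or not all(isinstance(p, str) for p in prem):
            raise ValueError("premises must be a list of strings")
        steps.append(Step(tuple(prem), item["rule"], item["conclusion"]))
    return steps


def split_implication(formula):
    """Return (antecedent, consequent) for a flat 'A -> B' string, else None."""
    if "->" not in formula:
        return None
    left, right = formula.split("->", 1)
    return left.strip(), right.strip()


def check_step(step, known):
    """Toy stand-in for a prover: premises must be known and the rule must fit."""
    if len(step.premises) != 2 or any(p not in known for p in step.premises):
        return False
    # Try both premise orders so the caller need not order them.
    for a, b in (step.premises, step.premises[::-1]):
        imp = split_implication(b)
        if imp is None:
            continue
        if step.rule == "modus_ponens" and imp[0] == a and step.conclusion == imp[1]:
            return True
        if step.rule == "hypothetical_syllogism":
            imp_a = split_implication(a)
            if imp_a and imp_a[1] == imp[0] and step.conclusion == f"{imp_a[0]} -> {imp[1]}":
                return True
    return False


def verify_chain(facts, steps):
    """Check steps in order; a verified conclusion becomes usable by later steps."""
    known = set(facts)
    results = []
    for step in steps:
        ok = check_step(step, known)
        results.append(ok)
        if ok:
            known.add(step.conclusion)
    return results


raw = json.dumps([
    {"premises": ["rainy", "rainy -> wet"], "rule": "modus_ponens", "conclusion": "wet"},
    {"premises": ["rainy -> wet", "wet -> slippery"],
     "rule": "hypothetical_syllogism", "conclusion": "rainy -> slippery"},
    {"premises": ["slippery", "rainy -> wet"], "rule": "modus_ponens", "conclusion": "rainy"},
])
facts = ["rainy", "rainy -> wet", "wet -> slippery"]
results = verify_chain(facts, parse_steps(raw))
print(results)  # [True, True, False]: the third step uses an unestablished premise
```

Because each step names its own premises and rule, it can be checked in isolation, which is what lets a verifier pinpoint the first unsound step instead of rejecting a whole proof.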

Reward Construction

The reward for RL post-training is tiered accordingly:

$R=1.0$ : Answer correct, all steps verified.
$R=0.3$ : Answer correct, but some steps failed verification.
$R=0.1$ : Format correct, but answer incorrect.
$R=0.0$ : Format incorrect or other failures.

This fine-grained reward mechanism exploits the capability of formal provers to supply stepwise supervision and aligns model behavior toward credibly structured reasoning chains.
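The four reward tiers listed above map directly onto a small function. The sketch below is illustrative: the function name and argument names are hypothetical, and treating an empty step list as "not fully verified" is an assumption the summary does not address.

```python
def tiered_reward(format_ok, answer_correct, step_results):
    """Return the reward tier for one sampled response.

    format_ok      -- the structured output parsed successfully
    answer_correct -- the final answer matches the reference
    step_results   -- one bool per step, True if the prover accepted it
    """
    if not format_ok:
        return 0.0  # format incorrect or other failure
    if not answer_correct:
        return 0.1  # well-formed, but wrong answer
    # Assumption: an empty chain does not count as "all steps verified".
    if step_results and all(step_results):
        return 1.0  # correct answer, every step verified
    return 0.3      # correct answer, at least one step failed


print(tiered_reward(True, True, [True, True]))   # 1.0
print(tiered_reward(True, True, [True, False]))  # 0.3
```

The gap between 0.3 and 1.0 is what distinguishes this scheme from an outcome-only reward, under which both calls above would score the same.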

Empirical Evaluations

Performance on ProverQA

Experiments are conducted on ProverQA, a synthetic first-order logic benchmark with ground-truth stepwise reasoning annotations. Qwen2.5-7B-Instruct serves as the base model, trained with the GRPO RL framework under various supervision protocols.

With outcome-based RL (Outcome-CoT), answer correctness surges to 91.3% on ProverQA-Hard but GPT Soundness remains low at 21.97%, showing poor reasoning faithfulness.
In contrast, RL with SFI (referred to as PRoSFI) yields similar answer accuracy (92.97%) but boosts soundness to 76.07%, marking a significant improvement in logical path validity.
Out-of-distribution tests on ProverQA-Extra confirm that SFI generalizes, achieving high soundness metrics even on unseen complexities.

Figure 2: GPT Soundness increases with the number of sampled reasoning paths under SFI; Outcome-CoT displays stagnant scaling due to lack of structure.
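To see how a tiered reward turns into a training signal under GRPO, it may help to recall that GRPO, as commonly described, scores each sampled response relative to the other responses for the same prompt. The sketch below normalizes rewards by the group mean and standard deviation; the function name, the use of the population standard deviation, and the `eps` constant are assumptions for illustration, not details taken from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize one prompt's rewards to zero mean and (roughly) unit variance."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std
    # eps keeps the division defined when every reward in the group is equal
    return [(r - mean) / (std + eps) for r in rewards]


# One group containing each reward tier (hypothetical sample)
advantages = group_relative_advantages([1.0, 0.3, 0.1, 0.0])
print(advantages)
```

Within such a group, a response with the correct answer but an unverified step (reward 0.3) lands below the group mean, so it is pushed down relative to a fully verified one rather than being reinforced equally.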

Comparison with Direct Formal Generation

Attempts to have 7B LLMs emit Lean 4 code directly fail, yielding extremely low rewards and brittle, unstructured proofs. Even when the code compiles, the proof does not reliably align with the model's reasoning trace, which indicates that the intermediate scaffolding provided by SFI is necessary.

Test-Time Scaling and Robustness

SFI is compatible with test-time scaling protocols like Don't Trust: Verify (DTV), allowing for batch sampling of multiple paths and selection via formal verification. This mechanism is absent in outcome-only protocols, where answer voting cannot filter unsound reasoning.
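A verification-filtered selection of this kind can be sketched in a few lines. The function name and the `(answer, verified)` input shape are hypothetical, and falling back to a plain majority vote when no sample verifies is an assumption of this sketch, not something stated in the summary.

```python
from collections import Counter


def select_answer(samples):
    """Pick an answer from sampled paths, preferring formally verified ones.

    samples -- list of (answer, verified) pairs, one per sampled reasoning path
    """
    verified = [answer for answer, ok in samples if ok]
    # Assumption: if nothing verifies, fall back to voting over all samples.
    pool = verified if verified else [answer for answer, _ in samples]
    if not pool:
        return None
    return Counter(pool).most_common(1)[0][0]


samples = [("True", False), ("True", False), ("False", True), ("True", False), ("False", True)]
print(select_answer(samples))  # False: only the verified paths get a vote
```

In the example, a plain majority vote would return "True" (three of five samples), while the filtered vote returns "False" because only those paths passed verification.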

Metric Correlation and Reliability

Correlation analysis shows that SFI reward hits align more closely with reasoning soundness than answer correctness alone does, validating the efficacy of the stepwise verification reward model.

Figure 3: Strong correlation between SFI Reward Hit and GPT Soundness, illustrating the utility of formal step validation versus mere answer accuracy.
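One way to quantify agreement between two binary per-sample indicators, such as "reward hit" and "judged sound", is the phi coefficient (Pearson correlation for binary variables). The summary does not say which statistic the paper uses, so the choice of phi here is an assumption, and the function name is hypothetical.

```python
import math


def phi_coefficient(xs, ys):
    """Phi coefficient between two equal-length sequences of 0/1 indicators."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n11 = sum(1 for x, y in zip(xs, ys) if x and y)
    n10 = sum(1 for x, y in zip(xs, ys) if x and not y)
    n01 = sum(1 for x, y in zip(xs, ys) if not x and y)
    n00 = sum(1 for x, y in zip(xs, ys) if not x and not y)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return float("nan")  # undefined when either indicator is constant
    return (n11 * n00 - n10 * n01) / denom
```

A value near 1 means the two indicators almost always agree; a value near 0 means knowing one says little about the other, which is the situation the confusion analysis attributes to answer correctness versus soundness under outcome-only training.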

Logical Reasoning Generalization

Extensive tests on the Knights and Knaves dataset further show that SFI-based RL consistently improves logical consistency across varying difficulty levels, with OOD robustness for large problem sizes (up to 8 characters).

Analysis of Reasoning Path Soundness

Outcome-CoT confusion analyses highlight the problem: a considerable proportion of correct answers are underpinned by unsound intermediate reasoning steps, reinforcing the inadequacy of answer-level supervision.

Figure 4: Confusion matrix for Outcome-CoT reveals frequent misalignment between answer correctness and reasoning soundness; sound reasoning is not guaranteed by correct final answers.

SFI addresses these pitfalls, ensuring that verified, atomic logic is enforced throughout the reasoning chain. Typical failure modes in outcome-only setups include semantic confusion and commonsense shortcuts, both systematically mitigated by structured formal intermediaries.

Practical and Theoretical Implications

The proposed paradigm represents a practical shift from outcome-centric evaluations toward reasoning chain reliability. SFI's modular framework bridges natural language reasoning with formal logic, enabling:

Efficient RL for modest-sized LLMs utilizing stepwise formal feedback.
Facilitation of robust test-time scaling and improved faithfulness.
Generalization across domains (e.g., mathematical, algorithmic puzzles).
Integration with more advanced verification tools and domain-specific formal models.

Theoretically, enforcing machine-verifiable intermediates offers a path to scalable reliable AI reasoning, reducing the reliance on expensive human annotation and circumventing the computational bottlenecks of full formal proof generation in resource-constrained settings.

Future Directions

Extension of SFI to broader forms of reasoning, including general NLP and complex theorem proving, necessitating richer formal datasets and stronger base models.
Adoption of more granular reward structures for RL to further enhance credit assignment and learning stability.
Deep integration with evolving formal automated reasoning tools (e.g., DeepSeek-Prover, Z3, Isabelle) to increase coverage and rigor.

Conclusion

SFI introduces a reinforcement learning framework that leverages structured intermediate representations for step-level formal verification, enforcing logical soundness in LLM-generated reasoning chains without sacrificing final answer accuracy. Empirical results across multiple datasets underscore its ability to deliver credible, machine-verifiable reasoning that generalizes robustly, representing a significant advance in trustworthy logical reasoning with LLMs.
