Reinforcement Learning from Proof Assistant Feedback
- RLPAF is a paradigm that leverages granular, symbolic proof assistant feedback to optimize large language models for theorem proving and formal verification.
- It integrates candidate proof generation, detailed verification signals, and policy optimization (e.g., GRPO, PPO) to address the sparse reward challenge.
- Empirical studies show RLPAF significantly improves proof success rates in systems like Lean, Isabelle, and HOL-based environments.
Reinforcement Learning from Proof Assistant Feedback (RLPAF) is a paradigm that leverages the granular, symbolic feedback returned by formal proof assistants to directly optimize LLMs for theorem proving and formal verification. In RLPAF frameworks, an LLM generates candidate proofs, and the proof assistant provides feedback ranging from a binary verdict to structured diagnostics (success, error types, verification state), which is used as the reward signal during policy optimization. This approach has proven pivotal for scaling automated theorem proving (ATP) in domains where correctness is syntactically and semantically strict, such as Lean, Isabelle, and HOL-based environments.
1. Conceptual Foundations and Motivation
RLPAF addresses the core bottleneck of formal theorem proving with LLMs: aligning neural proof generation with the rigid, mechanized standards of proof assistants. Traditional RL for sequence models relies on learned or heuristic reward functions defined over whole trajectories in natural or formal language tasks. In RLPAF, the proof assistant itself provides precise, machine-checkable signals about the validity, type-correctness, and intermediate progress of generated proofs. The atomic reward is typically binary (1 for “verified proof,” 0 for failure), but more nuanced signals (such as tactic success/failure, error types, or partial progress) can be extracted depending on the system (Xin et al., 15 Aug 2024, Ji et al., 11 Jul 2025, Rao et al., 23 Apr 2025).
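As a minimal illustration of this reward signal, the sketch below maps a proof assistant verdict to a scalar reward. The `VerifierResult` structure, its fields, and the optional partial-credit shaping are illustrative assumptions rather than the interface of any particular system.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class VerifierResult:
    """Hypothetical structured verdict returned by a proof assistant wrapper."""
    complete: bool             # proof type-checks and closes all goals
    error_type: Optional[str]  # e.g. "syntax", "type_mismatch", "tactic_failed"
    goals_closed: int          # subgoals discharged before failure (if any)
    goals_total: int

def rlpaf_reward(result: VerifierResult, partial_credit: bool = False) -> float:
    """Binary reward by default; optionally a shaped reward from partial progress."""
    if result.complete:
        return 1.0
    if partial_credit and result.goals_total > 0:
        # Optional shaping: fraction of subgoals discharged, kept strictly below 1.0
        # so that only fully verified proofs receive the maximal reward.
        return 0.5 * result.goals_closed / result.goals_total
    return 0.0
```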
Motivations for RLPAF include:
- Formal correctness checking is fully automatable, providing a reliable, unambiguous reward signal for RL training.
- Supervised fine-tuning (SFT) alone yields models susceptible to hallucinating plausible but unverified proofs.
- RLPAF enables "self-correction" in LLMs, as models are exposed to their own failures and reward is only granted for proofs that pass formal verification.
2. RLPAF Workflow and Algorithms
A canonical RLPAF framework (e.g., DeepSeek-Prover-V1.5 or Leanabell-Prover-V2) implements the following workflow:
- The LLM policy $\pi_\theta$ produces a group of candidate proofs $\{y_1, \dots, y_G\}$ for a formal statement $x$.
- The proof assistant checks each $y_i$: it assigns reward $r_i = 1$ if $y_i$ is a fully valid proof (type-checks and solves the goal); else $r_i = 0$.
- Policy optimization is performed, e.g., using Group Relative Policy Optimization (GRPO, (Xin et al., 15 Aug 2024)) or DAPO (Ji et al., 11 Jul 2025). The objective is typically a clipped surrogate of the form $\mathcal{J}(\theta) = \mathbb{E}\big[\tfrac{1}{G}\sum_{i=1}^{G} \min\big(\rho_i \hat{A}_i,\ \operatorname{clip}(\rho_i, 1-\varepsilon, 1+\varepsilon)\,\hat{A}_i\big)\big]$ with $\rho_i = \pi_\theta(y_i \mid x)/\pi_{\theta_{\mathrm{old}}}(y_i \mid x)$, optionally augmented with a KL penalty toward a reference policy.
- Samples are grouped (e.g., a group of 32 candidates per statement), and reward normalization is applied within the group for advantage calculation: $\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)}$. This provides stable updates and mitigates the high-variance, sparse-reward issue.
- Policy updates use PPO-style clipped objectives (with a learned critic in PPO variants, or critic-free group baselines in GRPO/DAPO), applied at the token or whole-proof level.
For proof-incremental models (step-wise or recursive frameworks), RLPAF can be integrated at each proof-step or subgoal, allowing localized credit assignment.
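The group-relative advantage and clipped surrogate above can be sketched in a few lines of PyTorch. The tensor shapes, binary rewards, and the absence of a KL term are simplifying assumptions for illustration.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within one sampling group (one statement, G candidate proofs).

    rewards: shape (G,), entries in {0, 1} from the proof assistant.
    Returns one scalar advantage per candidate proof.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def grpo_loss(logp_new: torch.Tensor,
              logp_old: torch.Tensor,
              advantages: torch.Tensor,
              clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate over token log-probabilities.

    logp_new / logp_old: shape (G, T), token log-probs of the sampled proofs under
    the current and behaviour policies; advantages: shape (G,), one scalar per proof
    shared by all of its tokens.
    """
    ratio = torch.exp(logp_new - logp_old)           # (G, T) importance ratios
    adv = advantages.unsqueeze(-1)                   # (G, 1) -> broadcast over tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    return -torch.min(unclipped, clipped).mean()     # maximize surrogate -> minimize negative
```

A full pipeline would typically also add a KL penalty toward the SFT reference policy and mask padding tokens before the final reduction.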
3. Two-Stage Training Protocol
Optimal RLPAF deployment uses a two-stage pipeline:
| Stage | Input Data | Objective | Reward Source |
|---|---|---|---|
| SFT | (statement, gold proof) | Next-token prediction | No reward; cross-entropy |
| RLPAF | (statement, candidate proof(s)) | Policy optimization | Proof assistant verification |
- Stage 1: Supervised Fine-Tuning (SFT) on high-quality proof corpora (e.g., DeepSeek-Prover's multi-million proof dataset, FVEL (Rao et al., 23 Apr 2025)).
- Stage 2: RLPAF exposes the model to the environment (proof assistant feedback), with policy rollouts sampled, checked, and used to refine the model (see above).
This protocol is universally adopted in SOTA pipeline architectures (Xin et al., 15 Aug 2024, Ji et al., 11 Jul 2025, Rao et al., 23 Apr 2025).
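The two-stage protocol can be summarized as the training-loop skeleton below. The helper callables (`sft_step`, `sample_proofs`, `grpo_update`) and the verifier interface are placeholders for the components described above, not the API of any particular codebase.

```python
import random

def train_rlpaf(model, sft_corpus, statements, verifier,
                sft_step, sample_proofs, grpo_update,
                group_size=32, sft_epochs=3, rl_iterations=1000):
    """Two-stage RLPAF pipeline: SFT on gold proofs, then RL against the verifier."""
    # Stage 1: supervised fine-tuning on (statement, gold proof) pairs.
    for _ in range(sft_epochs):
        for statement, gold_proof in sft_corpus:
            sft_step(model, statement, gold_proof)                # cross-entropy on next tokens

    # Stage 2: policy optimization with proof assistant feedback.
    for _ in range(rl_iterations):
        statement = random.choice(statements)
        proofs = sample_proofs(model, statement, n=group_size)    # policy rollouts
        rewards = [1.0 if verifier.check(statement, p).complete else 0.0
                   for p in proofs]                               # sparse binary reward
        grpo_update(model, statement, proofs, rewards)            # group-normalized advantages
    return model
```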
4. Feedback Granularity and Error Correction
RLPAF’s distinctive power lies in the verifier-integrated feedback. Instead of mere binary success/failure, advanced systems (e.g., Leanabell-Prover-V2 (Ji et al., 11 Jul 2025), ProofAug (Liu et al., 30 Jan 2025)) feed error details and intermediate state back to the LLM. This enables:
- Verifier-integrated RL ("multi-turn"), in which, at each round, the model receives both its own output and the associated verifier message.
- Explicit error correction: The model learns to repair failures based on precise, actionable feedback (e.g., a syntax error at a reported line, a type mismatch, or a failed tactic).
- Feedback token masking: During RL, only model-generated tokens (not error message tokens) contribute to the loss, preventing collusion or spurious learning (see (Ji et al., 11 Jul 2025)).
This drives models to “notice” failure and reflectively adjust output, yielding substantial gains in sample efficiency and pass rates even for small models.
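The feedback-token masking described above can be sketched as follows: when a verifier error message is spliced into the dialogue for a correction round, its tokens are excluded from the loss. The segment representation and token ids below are illustrative assumptions.

```python
import torch

def build_loss_mask(segments):
    """Build a per-token loss mask for a multi-turn correction episode.

    segments: list of (token_ids, source) pairs in dialogue order, where source is
    "model" for model-generated proof attempts and "verifier" for error messages
    fed back by the proof assistant. Only model tokens receive gradient.
    """
    token_ids, loss_mask = [], []
    for ids, source in segments:
        token_ids.extend(ids)
        loss_mask.extend([1.0 if source == "model" else 0.0] * len(ids))
    return torch.tensor(token_ids), torch.tensor(loss_mask)

# Example episode: attempt 1 (model), Lean error message (verifier), repaired attempt (model).
tokens, mask = build_loss_mask([
    ([101, 7, 42, 9], "model"),
    ([55, 56, 57],    "verifier"),   # masked out: no loss on verifier text
    ([101, 7, 43, 9], "model"),
])
# The per-token loss is multiplied by `mask` before reduction, so the policy is never
# rewarded or penalized for tokens it did not generate.
```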
5. Empirical Outcomes and Benchmark Results
RLPAF yields consistent empirical benefits:
- miniF2F (Lean 4): DeepSeek-Prover-V1.5 trained with RLPAF achieves a higher pass rate on the miniF2F test set than both its SFT-only counterpart and RL on a naive objective (Xin et al., 15 Aug 2024). Leanabell-Prover-V2 improves pass@128 by +2.0% over its baseline (Ji et al., 11 Jul 2025).
- ProofNet: RLPAF-based DeepSeek-Prover-V1.5 also achieves strong pass rates on undergraduate-level mathematics proofs (Xin et al., 15 Aug 2024).
- Generalization: On real-world policy benchmarks (e.g., AWS S3), RLPAF-trained models verify 96% of policies (manual set) and 69.1% (LLM-generated set) (Rao et al., 23 Apr 2025)—outperforming non-RL baselines.
Moreover, RLPAF more than doubles efficiency relative to SFT at high sample budgets, and proves robust for both mathematical and practical (code, policy) verification.
6. System Integration and Broader Applications
RLPAF underpins nearly all modern LLM-based theorem proving frameworks targeting Lean, Isabelle, and related systems:
- ProofSeek (Rao et al., 23 Apr 2025): Combines LLM synthesis, automated prover validation (Sledgehammer, Z3, Vampire), and a heuristic module based on ProofAug. RLPAF training after SFT supports robust proof search and efficient curation.
- DeepSeek-Prover-V1.5 (Xin et al., 15 Aug 2024): RLPAF is central to both whole-proof and step-wise modes; integrated into the RMaxTS Monte-Carlo Tree Search for scalable, exploration-rich search.
- Leanabell-Prover-V2 (Ji et al., 11 Jul 2025): Achieves verifier-integrated RL through multi-turn Lean interaction, explicit feedback masking, and DAPO optimization.
- ProofAug (Liu et al., 30 Jan 2025): Plug-and-play recursive correction modules augment proof steps at arbitrary granularity based on verifier signals.
Such architectures have been extended to security applications (policy-language verification), agentic formal verification (Tredici et al., 14 Oct 2025), and experimental hybrid frameworks that mix symbolic and neural search.
7. Limitations and Open Problems
Despite clear successes, RLPAF has notable limitations:
- Reward Sparsity: Even with elaborate search strategies, proof success rates remain bounded by the gap between model prior and environment requirements. Sparse reward can slow convergence on hard benchmarks.
- Reward Hacking: Overfitting to the reward signal (“gaming” proof assistants with trivial but formally correct outputs) can arise, necessitating tight curation, realistic sample budgets, and domain shifts (e.g., FVEL filtering (Rao et al., 23 Apr 2025)).
- Scaling Barriers: While RLPAF scales well to 7B–32B models, further efficiency at 100B+ scale is gated by dataset diversity and verifier throughput.
- Integration Cost: Multi-turn, verifier-in-the-loop RL remains computationally intensive, especially for iterative correction protocols.
Possible advancements include deploying fine-grained process rewards (step/trace-level), improved error signal utilization, and direct reward learning from critiques or value-guided critics.
8. Summary Table: RLPAF Design Patterns
| Component | Description | Example |
|---|---|---|
| Proof feedback | Proof assistant returns pass/fail, error type, or state | Lean 4 verifier |
| RL policy update | Group-based advantage normalization (GRPO), DAPO for token-level PPO | DeepSeek-Prover |
| Feedback granularity | Single binary, stepwise, or multi-turn error reflection | Leanabell V2 |
| Reward masking | Masking verifier tokens in loss | Leanabell V2 |
| Automated fallback | On failure, invoke ATP/ERP/heuristics, else revert to prior valid block (sketched below) | ProofSeek, ProofAug |
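The automated-fallback row can be read as the control flow sketched below, paraphrasing the ProofAug/ProofSeek descriptions above; the verifier and ATP interfaces are hypothetical placeholders.

```python
def repair_proof(statement, proof_blocks, verifier, atp_tactics=("sledgehammer", "smt")):
    """Check a candidate proof block-by-block; on a failing block, try automated
    provers as a substitute, otherwise revert to the last valid prefix.

    Assumed (hypothetical) verifier interface:
      verifier.accepts(statement, blocks) -> bool   # prefix elaborates without error
      verifier.try_tactic(statement, prefix, tac) -> replacement block or None
    """
    accepted = []
    for block in proof_blocks:
        if verifier.accepts(statement, accepted + [block]):
            accepted.append(block)             # keep the verified block
            continue
        for tac in atp_tactics:                # fallback: ATP on the failing step
            patch = verifier.try_tactic(statement, accepted, tac)
            if patch is not None:
                accepted.append(patch)
                break
        else:
            break                              # revert to the prior valid block(s)
    return accepted                            # longest verified (possibly patched) prefix
```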
9. Impact and Outlook
RLPAF is foundational to sample-efficient, robust, and verifiable neural theorem proving. By shifting reward assignment from sequence-level prediction to direct evaluation against mechanized correctness, it enables "reflective" AI mathematicians—models that can learn from their mistakes via explicit feedback. While its initial application was in synthetic mathematical domains, RLPAF is now a cross-cutting methodology for code verification, formal policy analysis, and agentic scientific reasoning.
Current research is focused on enhancing RLPAF with self-correction loops, fine-grained process rewards, and hybrid proof-state critics, and on scaling agentic tool-use architectures that combine RLPAF-powered LLMs with explicit tool invocation and environment search (Tredici et al., 14 Oct 2025, Liu et al., 30 Jan 2025). The paradigm will likely remain central as ATP advances toward more general and trustworthy applications.