Discover And Prove (DAP) Framework

Updated 4 July 2026

DAP is a two-stage formal reasoning paradigm that separates discovery of critical objects or helper lemmas from subsequent formal proof construction.
In Hard Mode ATP, DAP transforms complex theorems by using an iterative discovery module with self-verification to generate essential solutions before proof search.
In program verification, DAP integrates formal methods with lemma synthesis, bridging program semantics with verification conditions to enhance automation.

Discover And Prove (DAP) denotes a two-stage formal-reasoning pattern in which a system first discovers a latent object needed for the proof (such as a missing answer or a helper lemma) and then constructs a machine-checked proof relative to that discovery. In recent work, the term is used explicitly in two settings: an open-source agentic framework for “Hard Mode” automated theorem proving in Lean 4, introduced in “Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4”, and a program-verification formulation in which helper lemmas and proofs are jointly synthesized for verification conditions. A plausible antecedent is deductive theory exploration, where equational lemmas are conjectured symbolically and then proved in a feedback-driven loop, even though that earlier work does not use the DAP name (Liu et al., 17 Apr 2026, Zhao et al., 23 Mar 2026, Singher et al., 2020).

1. Conceptual scope and problem decomposition

DAP is defined by a separation between discovery and proof. In the Lean 4 theorem-proving formulation, the discovery target is the critical solution $ans$ that a human solver would have to derive before the final formal argument can even be stated. The framework then rewrites the original problem into a conventional form solvable by existing ATP systems. In the program-verification formulation, the discovery target is a set of helper lemmas $L$ that are not explicitly present in the proof-targeted verification conditions but are needed to discharge them in Coq/Rocq (Liu et al., 17 Apr 2026, Zhao et al., 23 Mar 2026).

This division is not merely organizational. The Hard Mode ATP work states that standard benchmarks often embed the final answer within the formal statement, a convention it calls Easy Mode, whereas Hard Mode requires the system to discover that answer independently before proof search begins. The program-verification work makes an analogous claim at the level of verification conditions: VC proving benefits from program comprehension because human proof engineers often discover and apply helper lemmas based on program semantics not directly reflected in the VCs produced by VC generators (Liu et al., 17 Apr 2026, Zhao et al., 23 Mar 2026).

A common misunderstanding is to treat DAP as a single benchmark protocol or a single algorithm. The current literature supports a broader reading: DAP names a general discover-then-prove paradigm, instantiated differently in Lean 4 Hard Mode ATP and in agentic deductive verification. This suggests that the unifying feature is not a particular logic or prover, but the explicit factoring of latent-solution search from downstream proof construction.

2. Formalization in Hard Mode automated theorem proving

The Hard Mode formulation is given explicitly in logical terms. In standard ATP benchmarks, an Easy Mode theorem can be written as

$\Phi \;\vdash\; Q(ans),$

where $\Phi$ denotes the premises and the critical solution $ans$ already appears syntactically in $Q$. In Hard Mode, the same problem is encoded with two sorry-goals: $\Phi \;\vdash\; (\exists\, ans,\; \Psi(ans)) \quad \text{and} \quad \Psi(ans) \;\vdash\; Q(ans),$ or, in Lean-idiomatic form,

$\texttt{theorem hard\_problem (… ) :}$

$\texttt{ (ans : Type) → … → ans = ? → … → Q(ans) := by}$

$\texttt{ sorry -- discover ans}$

$\texttt{ sorry -- prove Q(ans)}$

(Liu et al., 17 Apr 2026)
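To make the contrast concrete, the following Lean 4 fragment is a toy sketch of the two encodings. It is not taken from the benchmarks: the names `toy_easy`, `toy_answer`, and `toy_hard` are hypothetical, and placing the unknown answer in a separate `sorry` definition is one assumed way to realize the two-sorry form.

```lean
-- Easy Mode: the answer 4 is already written into the statement.
theorem toy_easy (x : Nat) (h : x + 3 = 7) : x = 4 := by
  omega

-- Hard Mode (assumed encoding): the answer is a placeholder to be discovered.
def toy_answer : Nat := sorry      -- first sorry: discover ans

theorem toy_hard (x : Nat) (h : x + 3 = 7) : x = toy_answer := by
  sorry                            -- second sorry: prove Q(ans)
```

In the Easy Mode version a prover only has to close the goal; in the Hard Mode version nothing in the statement reveals that the intended value is 4.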

The distinction is also expressed as a difference in search problems. Easy Mode provers solve

$\text{find a proof of } \Phi \;\vdash\; Q(ans) \text{ with } ans \text{ given},$

whereas Hard Mode solvers must solve

$\text{find } ans \text{ and a proof of } \Phi \;\vdash\; Q(ans).$

The paper emphasizes that nothing in the Hard Mode statement fixes $ans$, so a Hard Mode prover must explore the space of candidate answers as well as the space of proofs. It further states that, in complexity terms, if proof search on the Easy Mode problem is already intractable, appending an unbounded answer-search layer cannot make the problem any easier, and in practice it dramatically expands the search space (Liu et al., 17 Apr 2026).
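The difference between the two search problems can be illustrated with a toy counter. This is a minimal sketch rather than the paper's formalism: `check_proof` is a hypothetical stand-in for a full proof search, and the candidate space is a small finite range, whereas real Hard Mode problems have an unbounded answer space.

```python
from typing import Callable, Iterable, Optional, Tuple


def easy_mode(check: Callable[[int], bool], given_ans: int) -> Tuple[bool, int]:
    """The answer is supplied, so a single proof attempt is made."""
    return check(given_ans), 1


def hard_mode(
    check: Callable[[int], bool], candidates: Iterable[int]
) -> Tuple[Optional[int], int]:
    """The answer is unknown, so each candidate may need its own proof attempt."""
    calls = 0
    for cand in candidates:
        calls += 1
        if check(cand):
            return cand, calls
    return None, calls


def check_proof(ans: int) -> bool:
    """Hypothetical proof oracle for the toy statement x + 3 = 7."""
    return ans + 3 == 7


easy_result = easy_mode(check_proof, 4)               # one attempt
hard_result = hard_mode(check_proof, range(0, 100))   # blind enumeration
```

Here the Easy Mode call makes one attempt, while the blind Hard Mode enumeration makes five before reaching the answer. A discovery stage aims to replace that enumeration with a single well-chosen candidate.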

This formalization matters because it isolates a capability that Easy Mode can mask. Hard Mode benchmarks were introduced precisely to measure systems that must first recover the missing mathematical object and only then execute formal proof search.

3. DAP as a Lean 4 agentic framework

The Lean 4 DAP framework is an end-to-end agentic system with two principal modules: a Discovery Module and a Statement Rewriting stage, after which an existing proving system is invoked. The architecture is summarized as “Natural-language problem $\to$ Discovery Module $\to$ Rewriter $\to$ Easy Mode Lean statement $\to$ Proving Module $\to$ formal proof” (Liu et al., 17 Apr 2026).

The Discovery Module uses an LLM—specifically GPT-OSS-120B—to generate a chain-of-thought solution and then perform explicit self-verification and self-correction before committing to an answer. Its pseudocode has the form DISCOVER(problem, max_iters), with three key stages: LLM.generate(prompt) to produce the initial chain of thought, LLM.self_verify(z) to return a structured error report identifying incorrect steps, and LLM.self_correct(z, report) to revise the solution. The report-driven loop continues for up to max_iters, and the paper states that allowing up to 10–30 verification iterations saturates accuracy on Hard Mode benchmarks (Liu et al., 17 Apr 2026).
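The generate, verify, correct loop described above can be sketched as follows. This is an illustrative reading of the DISCOVER(problem, max_iters) pseudocode, not the authors' implementation: the `LLM` protocol, the `Report` type, the prompt text, and the deterministic `ToyLLM` stub are all assumptions made so that the example runs without a model.

```python
from dataclasses import dataclass, field
from typing import List, Protocol


@dataclass
class Report:
    """Structured error report; an empty list means no incorrect step was found."""
    errors: List[str] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        return not self.errors


class LLM(Protocol):
    def generate(self, prompt: str) -> str: ...
    def self_verify(self, solution: str) -> Report: ...
    def self_correct(self, solution: str, report: Report) -> str: ...


def discover(llm: LLM, problem: str, max_iters: int = 10) -> str:
    """Draft a solution, then verify and correct it for at most max_iters rounds."""
    z = llm.generate(f"Solve step by step and state the final answer:\n{problem}")
    for _ in range(max_iters):
        report = llm.self_verify(z)
        if report.ok:            # commit once verification finds no errors
            break
        z = llm.self_correct(z, report)
    return z


class ToyLLM:
    """Deterministic stand-in: the first draft is wrong, one correction fixes it."""

    def generate(self, prompt: str) -> str:
        return "x = 5"

    def self_verify(self, solution: str) -> Report:
        return Report() if solution == "x = 4" else Report(["x + 3 != 7"])

    def self_correct(self, solution: str, report: Report) -> str:
        return "x = 4"


answer = discover(ToyLLM(), "Find x with x + 3 = 7.")
```

With the stub, the first draft fails verification once and the corrected answer is committed on the second pass. Setting `max_iters=0` returns the unverified draft, which corresponds to running discovery without self-reflection.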

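Once $ans$ has been discovered, DAP transforms the original Hard Mode Lean 4 statement into an Easy Mode statement by replacing the first sorry with $ans$ and deleting that sorry-goal. In terms of the Lean-idiomatic form given in Section 2, the rewriting operator takes the two-sorry Hard Mode statement together with the discovered value, instantiates the placeholder $\texttt{ans = ?}$ with that value, and returns a statement in which only the goal $Q(ans)$ remains to be proved.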

The rewriting is itself performed by the LLM via a prompt that injects $ans$ into the Lean source and removes the now-trivial first sorry-goal. Because the rewritten statement contains only one sorry-goal, it can be fed directly to an existing ATP system; the reported implementation uses Goedel-Prover-V2 (Liu et al., 17 Apr 2026).
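The effect of the rewriting step can be illustrated with a purely textual stand-in. The source states that the actual rewriting is done by an LLM prompt; the function below is only a sketch of the input and output shapes. The name `rewrite_to_easy_mode`, the `abbrev ... := sorry` layout of the Hard Mode source, and the toy theorem are assumptions for illustration.

```python
def rewrite_to_easy_mode(hard_src: str, ans: str) -> str:
    """Replace the first `sorry` (the answer placeholder) with the discovered answer.

    Assumes the Hard Mode source contains exactly two `sorry` placeholders:
    the first for the answer and the second for the proof.
    """
    marker = "sorry"
    if hard_src.count(marker) != 2:
        raise ValueError("expected exactly two sorry placeholders")
    # Parenthesize so that compound answers stay a single Lean term.
    return hard_src.replace(marker, f"({ans})", 1)


HARD = (
    "abbrev toy_solution : Nat := sorry\n"
    "\n"
    "theorem toy (x : Nat) (h : x + 3 = 7) : x = toy_solution := by\n"
    "  sorry\n"
)

easy = rewrite_to_easy_mode(HARD, "4")
```

After the call, `easy` contains a single remaining `sorry`, which is the shape an Easy Mode prover expects. A string replacement like this would be brittle on real benchmark files, which is one plausible reason for delegating the step to an LLM.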

The resulting architecture is modular in a precise sense stated by the paper: it cleanly separates answer search, which is informal and natural-language driven, from proof search, which is formal. A plausible implication is that improvements in either module can produce immediate gains without changing the other.

4. DAP in agentic program verification

In the program-verification setting, DAP is formulated as a joint synthesis problem over source code, specifications, helper lemmas, and machine-checked proofs. The inputs are annotated source code with its specifications and a trusted VC generator such as Frama-C/WP, which produces the proof-targeted verification conditions. The unknowns are a set $L$ of helper lemmas to be discovered and a collection of machine-checked proofs in Coq/Rocq. DAP is then the problem of finding lemmas and proofs such that every verification condition is discharged with the help of $L$, where a proof counts only if Coq’s kernel accepts it (Zhao et al., 23 Mar 2026).
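One way to write this joint-synthesis problem symbolically is sketched below. The symbols $P$ (annotated program), $S$ (specifications), $G$ (VC generator), $V$ (verification conditions), $\Pi$ (proof collection), and $\vdash_{\mathrm{Coq}}$ (acceptance by Coq's kernel) are notation assumed here for illustration; only $L$ for the helper-lemma set is used elsewhere in this article.

```latex
V = G(P, S) = \{v_1, \dots, v_n\}, \qquad
\text{find } (L, \Pi) \ \text{such that}\ \
\forall v_i \in V:\ \ L \vdash_{\mathrm{Coq}} v_i \ \ \text{with proof } \pi_i \in \Pi .
```

Under this reading, discovery corresponds to choosing $L$ and proving corresponds to constructing $\Pi$, with the kernel check as the only trusted acceptance criterion.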

The LemmaNet instantiation has five principal components. The VC Generator produces the verification conditions. The Offline Lemma Synthesizer consists of a Program Semantic Analyzer (PSA), which consumes the annotated program and emits a semantics-aware VC plus a skeletal proof, and an Obligation-Aligned Lemma Synthesizer, which takes the PSA output together with each verification condition and outputs an initial set of offline helper lemmas and a proof plan. A Tactic-by-Tactic Proof Agent then drives Coq proofs using standard tactics augmented by retrieved lemmas. An Adaptive Lemma Maintainer (ALM) maintains a working lemma library. Finally, an Online Lemma Adapter monitors proof failures and invokes Feedback-Guided Lemma Adaptation to refine or generate lemmas on the fly, producing online lemmas and updating the working library (Zhao et al., 23 Mar 2026).

The offline stage is explicitly driven by program comprehension. The PSA uses an LLM prompt that, given the code of a function and its ACSL annotations, asks for a Coq lemma stated in source terms together with a proof. From loops, pointer arithmetic, type predicates, and related patterns, it extracts high-level invariants such as “pointer increment by two preserves address order” as well as conditional (if-then) facts. The paper also lists template families including monotonic pointer shift, zero-shift identity, type-predicate range extraction, and arithmetic rewriting of one expression as an equivalent one (Zhao et al., 23 Mar 2026).
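For orientation, the named template families might be instantiated along the following lines. These formulas are illustrative assumptions suggested by the template names, not the paper's statements: $\mathrm{shift}(p, i)$ stands for a hypothetical pointer-offset function and $\mathrm{is\_uint8}$ for a hypothetical type predicate.

```latex
\begin{aligned}
&\text{monotonic pointer shift:} && i \le j \;\Rightarrow\; \mathrm{shift}(p, i) \le \mathrm{shift}(p, j) \\
&\text{zero-shift identity:} && \mathrm{shift}(p, 0) = p \\
&\text{type-predicate range extraction:} && \mathrm{is\_uint8}(x) \;\Rightarrow\; 0 \le x < 256
\end{aligned}
```

Facts of this kind are obvious at the source level but are typically not stated in the generated VCs, which is why they are natural candidates for helper lemmas.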

The online stage is proof-state driven. At each step the prover maintains a set of open goals, a proof context, and the current lemma library. Whenever standard tactics fail on some goal, the agent invokes Feedback-Guided Lemma Adaptation on that goal, updates the lemma library, and retries. The adaptation procedure first tries to Refine an existing lemma in the library, for example by fixing type or name mismatches, and then to Generate a new lemma by invoking the LLM on the failing goal together with failure diagnostics. The underlying VCs are encoded over quantified or quantifier-free linear integer arithmetic, bitvector theory for overflow, an uninterpreted memory function, and pointer arithmetic axioms (Zhao et al., 23 Mar 2026).
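The fail, adapt, retry cycle can be sketched as a small control loop. This is a schematic under stated assumptions, not LemmaNet's code: goals and lemmas are plain strings, the three callbacks are hypothetical stand-ins for the tactic engine and the LLM-backed Refine and Generate steps, and appending a refined lemma (rather than replacing the old one) is a simplification.

```python
from typing import Callable, List, Optional

Goal = str
Lemma = str


def prove_with_adaptation(
    goal: Goal,
    library: List[Lemma],
    try_tactics: Callable[[Goal, List[Lemma]], Optional[str]],
    refine: Callable[[Goal, List[Lemma], str], Optional[Lemma]],
    generate: Callable[[Goal, str], Optional[Lemma]],
    max_rounds: int = 3,
) -> bool:
    """Try tactics; on failure, refine or generate a lemma, update the library, retry.

    `try_tactics` returns None on success, or a failure diagnostic string.
    """
    for round_idx in range(max_rounds + 1):
        diagnostics = try_tactics(goal, library)
        if diagnostics is None:
            return True
        if round_idx == max_rounds:
            break
        # Refine first; fall back to generating a fresh lemma.
        lemma = refine(goal, library, diagnostics) or generate(goal, diagnostics)
        if lemma is None or lemma in library:
            return False          # no new information, so give up
        library.append(lemma)
    return False


# Toy callbacks: the goal closes once a lemma named "shift_zero" is available.
def toy_tactics(goal: Goal, library: List[Lemma]) -> Optional[str]:
    return None if "shift_zero" in library else "no lemma rewrites `shift p 0`"


def toy_refine(goal: Goal, library: List[Lemma], diag: str) -> Optional[Lemma]:
    # Repair a name mismatch if a near-miss lemma is already in the library.
    return "shift_zero" if "Shift_Zero" in library else None


def toy_generate(goal: Goal, diag: str) -> Optional[Lemma]:
    return "shift_zero"


lib: List[Lemma] = []
proved = prove_with_adaptation("shift p 0 = p", lib, toy_tactics, toy_refine, toy_generate)
```

In the toy run the first tactic attempt fails, Refine finds nothing to repair, Generate supplies a lemma, and the retry succeeds. The round limit and the duplicate check are design choices of this sketch, added to keep the loop from cycling.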

5. Deductive theory exploration as a precursor pattern

A plausible antecedent to DAP is the deductive theory-exploration framework developed for bottom-up lemma discovery. That work takes as input algebraic inductive datatypes and a vocabulary of constructor and recursively defined symbols, and seeks an ideally complete set of new equational lemmas over that vocabulary that are not already provable from the input definitions (Singher et al., 2020).

Its core inference machinery already exhibits a discover-then-prove structure. It maintains a set of known equations and uses: constructor introduction / structural induction to generate base and step obligations; congruence closure / rewrite to treat the known equations as bidirectional rewrite rules; symbolic observational equivalence (SOE) to conjecture an equation between two terms when symbolic examples up to fixed depth rewrite into the same congruence-closure class; and speculative generalization, which replaces repeated placeholders with fresh ones and attempts to prove the more general formula first (Singher et al., 2020).

The overall pipeline proceeds by iterative deepening on term depth. A growing e-graph contains all terms over the vocabulary up to the current depth, modulo congruence closure under the currently known equations. At each depth, the procedure enumerates terms, inserts them into the e-graph, infers conjectures by SOE, screens out conjectures whose two sides are already equivalent in the e-graph, and then attempts proofs by one round of structural induction on the first inductive-type placeholder. If all cases close using only congruence closure plus the known equations, the equation is added to the known set, the e-classes are merged, and the process returns to conjecture inference so that newly discovered lemmas can seed further discoveries (Singher et al., 2020).
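The conjecture-inference step can be approximated in a few lines. The sketch below is a deliberate simplification: it replaces the e-graph and congruence closure with direct evaluation of terms on symbolic example lists (tuples of distinct element names), covers only list concatenation, and omits the proof step. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

VARS = ("xs", "ys", "zs")


def terms_up_to(depth):
    """Enumerate terms over variables and binary `app` up to the given depth."""
    terms = list(VARS)
    for _ in range(depth):
        terms = list(VARS) + [("app", a, b) for a, b in product(terms, repeat=2)]
    return terms


def evaluate(term, env):
    """Evaluate a term; lists are tuples, so `app` is tuple concatenation."""
    if isinstance(term, str):
        return env[term]
    _, left, right = term
    return evaluate(left, env) + evaluate(right, env)


def symbolic_examples(max_len=2):
    """Each variable ranges over lists of distinct symbolic elements, lengths 0..max_len."""
    per_var = {
        v: [tuple(f"{v[0]}{i}" for i in range(n)) for n in range(max_len + 1)]
        for v in VARS
    }
    for combo in product(*(per_var[v] for v in VARS)):
        yield dict(zip(VARS, combo))


def conjectures(depth=2):
    """Group terms that agree on every symbolic example; each group is a conjecture."""
    envs = list(symbolic_examples())
    classes = defaultdict(list)
    for term in terms_up_to(depth):
        signature = tuple(evaluate(term, env) for env in envs)
        classes[signature].append(term)
    return [group for group in classes.values() if len(group) > 1]


def show(term):
    return term if isinstance(term, str) else f"({show(term[1])} ++ {show(term[2])})"


found = {frozenset(show(t) for t in group) for group in conjectures()}
```

Among the groups in `found` is the pair `((xs ++ ys) ++ zs)` and `(xs ++ (ys ++ zs))`, which is the associativity conjecture. In the full method each such conjecture would then be sent to the structural-induction prover rather than accepted on the evidence of examples.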

The running list example makes the loop concrete. Starting from the definitions of list concatenation and filter, term enumeration at small depth yields nested concatenation terms, including the two bracketings of a triple concatenation. SOE over symbolic examples conjectures associativity of concatenation, which is then proved by one-step structural induction on the first list placeholder. Once associativity enters the set of known equations, the system can derive filter fusion. The paper contrasts this purely deductive method with testing-based explorers such as IsaCoSy and QuickSpec/Hipster, arguing that symbolic examples, congruence closure, and shallow rewriting avoid random-testing filters and repeated SMT queries, yield more nonredundant lemmas, produce fewer spurious conjectures, and offer comparable or better runtime on standard benchmarks (Singher et al., 2020).
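The proof half of this example, one round of structural induction on the first list, looks as follows in Lean 4. This is a sketch for illustration only: the cited work does not use Lean, and the theorem name `app_assoc_sketch` is hypothetical. `List.nil_append` and `List.cons_append` are the defining equations of concatenation in Lean's core library.

```lean
theorem app_assoc_sketch {α : Type} (xs ys zs : List α) :
    (xs ++ ys) ++ zs = xs ++ (ys ++ zs) := by
  induction xs with
  | nil =>
    -- Base case: both sides reduce to ys ++ zs.
    simp only [List.nil_append]
  | cons x tl ih =>
    -- Step case: push the head outward, then rewrite with the hypothesis.
    simp only [List.cons_append, ih]
```

Both cases close by rewriting with the definitions plus the induction hypothesis, which matches the description of a proof that needs only congruence-style rewriting once the right equation has been conjectured.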

6. Empirical findings, significance, and open issues

The Hard Mode Lean 4 work evaluates DAP on four benchmarks: MiniF2F-Hard with 244 total problems and 197 “solution-style” problems, FIMO-Hard with 149 total and 70 solution-style, a Hard Mode CombiBench variant with 100 total and 45 solution-style combinatorics problems, and PutnamBench Hard Mode with 660 total and 340 solution-style problems. After rewriting, the system invokes Goedel-Prover-V2 at Pass@32. The reported table compares two configurations, DAP (w/o Agent) and DAP (w/ Agent), on Putnam, Combi, miniF2F, and FIMO. The same work states that on PutnamBench it is the first public system to solve any Hard Mode problems, with 36 total theorems, and that on CombiBench Hard Mode it improves from the prior 8 to 10 theorems solved; the abstract also reports an increase from 7 (previous SOTA, Pass@16) to 10 (Liu et al., 17 Apr 2026).

The discovery-only numbers are substantially higher than the formal-proof numbers. The Discovery Module's answer accuracy is reported on PutnamBench and MiniF2F-Hard both without and with self-reflection; with self-reflection the PutnamBench figure rises, while the MiniF2F-Hard figure is unchanged. By contrast, the same source states that the formal prover succeeds on only a small fraction of PutnamBench problems in pure Easy Mode, and summarizes the resulting gap as high answer accuracy versus low formal-proof success on the same Hard Mode problems (Liu et al., 17 Apr 2026).

The LemmaNet instantiation reports results on 941 VCs from two suites, SV-COMP and NTP4VC, including Linux-kernel modules, Contiki OS, the standard C++ library, and an X.509 parser, with a 10 min timeout. Across the two suites LemmaNet proves a total of 364 VCs, compared with AutoRocq at 287 total, Copra at 240, and CoqHammer at 123. The paper states improvements over both AutoRocq and CoqHammer, with median time per VC on SV-COMP and NTP4VC comparable to baselines (Zhao et al., 23 Mar 2026).
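For scale, the totals quoted above can be turned into rates by direct arithmetic. The figures below are computed here from those totals and are not quoted from the paper, which may report its improvements on a different basis.

```latex
\begin{aligned}
\text{LemmaNet coverage:} \quad & 364 / 941 \approx 38.7\% \\
\text{relative gain over AutoRocq:} \quad & (364 - 287) / 287 \approx 26.8\% \\
\text{relative gain over Copra:} \quad & (364 - 240) / 240 \approx 51.7\% \\
\text{relative gain over CoqHammer:} \quad & (364 - 123) / 123 \approx 195.9\%
\end{aligned}
```

Even the best system in this comparison leaves more than half of the 941 VCs unproved within the timeout.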

Two substantive conclusions recur across these works. First, DAP exposes a structural bottleneck: discovering the right mathematical object or lemma can be much easier than converting that discovery into a fully formal proof. Second, the discovery stage is highly domain-sensitive. In Hard Mode ATP, explicit self-reflection improves answer discovery; in program verification, program comprehension bridges the gap between source-level semantics and mechanically encoded VCs. The main open challenges are also explicit in the literature: offline synthesis may miss deep domain-specific invariants, proof-state-driven generation can introduce redundancy, lemma-selection policies remain to be optimized, and extending to richer theories such as floating-point and concurrency requires new templates and adaptation strategies (Zhao et al., 23 Mar 2026).

Taken together, these results place DAP at the intersection of theorem proving, deductive verification, and theory exploration. The common pattern is stable across domains: discover the missing object, rewrite or recontextualize the task around that object, and then prove within an existing formal kernel or ATP stack.