AutoRocq: Autonomous Proof Automation
- AutoRocq is an LLM-based system that automates formal program verification in Rocq using iterative, feedback-driven tactics.
- It dynamically manages proof trees by continuously refining tactics based on real-time responses from the trusted Rocq kernel.
- Reported results show AutoRocq achieving higher proof success rates than prior automated provers on benchmarks such as CoqGym and SV-COMP, with competitive proof times.
AutoRocq is an LLM-based agentic proof automation system for formal program verification in Rocq (formerly Coq), designed for autonomous, iterative, and context-aware theorem proving. It departs from both batch proof synthesis and static retrieval approaches by embedding a feedback-driven agent in a closed loop with the interactive theorem prover. AutoRocq reports state-of-the-art program verification performance on several challenging benchmarks by dynamically retrieving context, maintaining a proof tree, and refining tactic choices based on real-time feedback, with formal correctness resting on the trusted Rocq kernel (Tu et al., 21 Nov 2025, Kozyrev et al., 5 Feb 2026). More broadly, the term also covers fully automatic mechanisms for optimizing proof-generation agents in Rocq, including prompt bootstrapping, memory retrieval, and optimization of the agent's control structure.
1. Motivation and Context
Program verification has become increasingly critical as AI-generated code proliferates, bringing new risks of subtle semantic errors and security vulnerabilities. Traditional formal methods leveraging Coq/Rocq can provide machine-checked correctness, but their manual use is labor-intensive and generally impractical at scale. Prior automated theorem proving systems either produce entire scripts in a single pass (e.g., PALM), rely on retrieval based on static similarity (Rango), or treat proof search as reinforcement learning over tactic sequences (QEDCartographer), but lack agentic, context-adaptive reasoning.
AutoRocq targets this gap by constructing proofs through iterative collaboration between an LLM-based agent and the Rocq theorem prover. The agent interprets evolving proof states, issues context-specific queries, and updates its approach in response to Rocq’s feedback, autonomously managing the structure of the proof tree. This agentic design is intended to scale verification to large or machine-generated codebases while upholding formal guarantees (Tu et al., 21 Nov 2025).
2. System Architecture and Algorithmic Workflow
AutoRocq’s pipeline consists of an agent loop tightly coupled with Rocq through the CoqPyt interface. The core workflow proceeds as follows:
- Input: The agent receives a lemma statement and context (comprising global environment and local hypotheses).
- Decision Loop: The LLM-based agent, operating deterministically through its temperature setting, inspects the partial proof tree and history, then chooses between producing a Rocq tactic and issuing a Rocq query (e.g., “Search”).
- Prover Interaction: The Rocq backend processes the tactic or query, returning updated subgoals, contextual information, or error messages.
- Proof Tree Management: The agent dynamically expands the proof tree, maintaining tactic histories and subgoal structures.
- Refinement Mechanism: Upon failure, the agent uses Rocq’s feedback to revise tactics, retrieve additional context, or adjust the proof path; after a set number of unsuccessful attempts (a threshold chosen empirically), a context query is triggered.
- Termination and Certification: Once completed, the resulting tactic sequence is validated by Rocq’s trusted kernel, ensuring formal soundness.
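The iterative refinement algorithm can be summarized as a loop: the agent proposes a tactic or query, Rocq executes it, the proof tree is updated from the feedback, and the cycle repeats until no open subgoals remain.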
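The workflow above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: `ToyProver` is a hypothetical stand-in for the Rocq backend (it is not the CoqPyt API), `scripted_policy` replaces the LLM with a fixed rule set, and the `max_steps` and `query_after` values are arbitrary rather than the settings reported for AutoRocq.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Feedback:
    ok: bool               # did the prover accept the step?
    goals: List[str]       # open subgoals after the step
    message: str = ""      # error text or query results


@dataclass
class ProofNode:
    goals: List[str]
    tactic: Optional[str] = None
    children: List["ProofNode"] = field(default_factory=list)


class ToyProver:
    """Hypothetical stand-in for the Rocq backend (not the CoqPyt API)."""

    def __init__(self, goal: str, rules: Dict[Tuple[str, str], List[str]],
                 lemmas: List[str]):
        self.goals = [goal]
        self.rules = rules      # (goal, tactic) -> resulting subgoals
        self.lemmas = lemmas    # searchable lemma statements

    def run_tactic(self, tactic: str) -> Feedback:
        key = (self.goals[0], tactic)
        if key not in self.rules:
            return Feedback(False, list(self.goals), f"tactic failed: {tactic}")
        self.goals = self.rules[key] + self.goals[1:]
        return Feedback(True, list(self.goals))

    def run_query(self, query: str) -> Feedback:
        hits = [lemma for lemma in self.lemmas if query in lemma]
        return Feedback(True, list(self.goals), "; ".join(hits))


Step = Tuple[str, str, Feedback]
Policy = Callable[[List[str], List[Step], bool], Tuple[str, str]]


def prove(prover: ToyProver, decide: Policy,
          max_steps: int = 20, query_after: int = 2):
    """Run the decide/execute/update loop; return (script or None, proof tree)."""
    root = ProofNode(goals=list(prover.goals))
    node, script, history, failures = root, [], [], 0
    for _ in range(max_steps):
        if not prover.goals:
            break
        # After `query_after` failed tactics the policy is told to query.
        kind, text = decide(prover.goals, history, failures >= query_after)
        if kind == "query":
            fb = prover.run_query(text)
            failures = 0
        else:
            fb = prover.run_tactic(text)
            if fb.ok:
                # Record the accepted tactic as a new node (linear chain here).
                child = ProofNode(goals=fb.goals, tactic=text)
                node.children.append(child)
                node, failures = child, 0
                script.append(text)
            else:
                failures += 1
        history.append((kind, text, fb))
    return (script if not prover.goals else None), root


def scripted_policy(goals, history, must_query):
    """Stub standing in for the LLM; a real agent would prompt a model here."""
    if goals[0] == "n = n":
        return "tactic", "reflexivity"
    if any(kind == "query" and "add_0_r" in fb.message
           for kind, _, fb in history):
        return "tactic", "rewrite add_0_r"
    if must_query:
        return "query", "add_0"
    return "tactic", "reflexivity"


prover = ToyProver(
    goal="n + 0 = n",
    rules={("n + 0 = n", "rewrite add_0_r"): ["n = n"],
           ("n = n", "reflexivity"): []},
    lemmas=["add_0_r: forall n, n + 0 = n"],
)
script, tree = prove(prover, scripted_policy)
```

In this toy run the policy fails twice with `reflexivity`, is pushed into a context query, finds `add_0_r`, and then closes the goal. In the real system the finished script would additionally be checked by Rocq's kernel, which the toy prover does not model.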
3. Automatic Optimization Strategies in AutoRocq
Comprehensive studies have been conducted on methods for optimizing Rocq agents through fully automatic feedback loops (Kozyrev et al., 5 Feb 2026). Key approaches include:
- BootstrapFewShot: In-context learning using k successful proof traces as exemplars, with k selected to maximize success on the training set. This “few-shot” approach provides concrete recipes, boosting proof rates significantly at low engineering cost.
- MIPROv2: Bayesian optimization over discrete choices of instruction templates and in-context demonstrations, maximizing empirical proof success.
- SIMBA: Random prompt search via synonym swaps and example reordering, iteratively accepting improvements.
- GEPA: Evolutionary prompt adaptation with LLM-guided reflection and example crossover, although not consistently yielding gains.
- ACE (Agentic Context Engineering): Curated and iteratively refined contextual memory, injected into inference-time prompts.
- ReasoningBank: Similarity-based retrieval of past (state, reasoning) pairs by semantic embedding for injection into the agent prompt.
- ADAS: LLM-driven rewrite of the agent’s control logic (the “decide(…)” function), optimizing agentic behavior.
Among these, prompt-based few-shot bootstrapping is empirically the most robust, particularly when paired with the ReAct agent structure. Bayesian or random search over prompt variations (MIPROv2, SIMBA) can match or slightly exceed few-shot performance, but gains saturate once the number of exemplars grows past a certain point, due to context window limitations.
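The few-shot bootstrapping idea, and the saturation effect just described, can be sketched as follows. This is a simplified illustration and not the DSPy implementation of BootstrapFewShot: the `bootstrap_few_shot` and `toy_evaluate` functions, the trace fields, and the token budget are all hypothetical. The evaluator rewards coverage of distinct proof patterns and penalizes prompts that exceed the budget.

```python
from typing import Callable, Dict, List, Tuple

Trace = Dict[str, object]


def bootstrap_few_shot(traces: List[Trace],
                       evaluate: Callable[[List[Trace]], float],
                       max_k: int) -> Tuple[List[Trace], float]:
    """Keep successful traces, then pick the exemplar count k that scores best."""
    successes = [t for t in traces if t["proved"]]
    best_k, best_score = 0, evaluate([])
    for k in range(1, min(max_k, len(successes)) + 1):
        score = evaluate(successes[:k])
        if score > best_score:          # strict: prefer fewer exemplars on ties
            best_k, best_score = k, score
    return successes[:best_k], best_score


def toy_evaluate(exemplars: List[Trace], budget: int = 60) -> float:
    """Hypothetical training-set score: pattern coverage, minus an overflow penalty."""
    covered = {e["pattern"] for e in exemplars}
    tokens = sum(e["tokens"] for e in exemplars)
    rate = len(covered) / 4
    return rate if tokens <= budget else rate - 0.5


traces = [
    {"pattern": "induction", "tokens": 20, "proved": True},
    {"pattern": "rewrite", "tokens": 20, "proved": True},
    {"pattern": "auto", "tokens": 10, "proved": False},
    {"pattern": "induction", "tokens": 20, "proved": True},
    {"pattern": "case", "tokens": 30, "proved": True},
]
chosen, score = bootstrap_few_shot(traces, toy_evaluate, max_k=4)
```

With these toy numbers the third exemplar adds no new pattern and the fourth overflows the budget, so the search settles on two exemplars. In practice `evaluate` would run the agent on held-out training lemmas, which is far more expensive than this stub.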
4. Empirical Performance and Comparative Assessment
AutoRocq was evaluated on structured benchmarks: SV-COMP (C program verification), CoqGym (mathematical and data-structure theorems), and Linux kernel module correctness (Tu et al., 21 Nov 2025, Kozyrev et al., 5 Feb 2026). Notable performance findings include:
| Tool | CoqGym (%) | SV-COMP (%) | Avg. Proof Time (s) |
|---|---|---|---|
| AutoRocq | 51.1 | 30.9 | 21.3 |
| Rango | 42.3 | 21.7 | 105.5 |
| PALM | 11.5 | 10.1 | 45.4 |
| QEDC | 17.6 | 20.4 | 5.1 |
| P9001 | 12.6 | 17.6 | 22.6 |
On Linux kernel case studies, AutoRocq outperformed all baselines, proving 20% of selected lemmas versus 17% (PALM) and 5% (Rango, QEDC). Its agentic retrieval and iterative tactic revision yield consistent gains across lemma complexity and proof patterns. Its average proof time is competitive: roughly 5x faster than Rango on SV-COMP, though slower than QEDC in the table above.
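The relative figures can be checked directly from the table. The short calculation below assumes that the "Avg. Proof Time (s)" column is the basis for the speed comparison; the numbers are copied from the table and nothing else is added.

```python
# Figures copied from the comparison table above.
avg_time_s = {"AutoRocq": 21.3, "Rango": 105.5, "PALM": 45.4,
              "QEDC": 5.1, "P9001": 22.6}
coqgym = {"AutoRocq": 51.1, "Rango": 42.3, "PALM": 11.5,
          "QEDC": 17.6, "P9001": 12.6}
svcomp = {"AutoRocq": 30.9, "Rango": 21.7, "PALM": 10.1,
          "QEDC": 20.4, "P9001": 17.6}

# How many times longer each baseline takes than AutoRocq (below 1 means faster).
time_ratio = {tool: round(t / avg_time_s["AutoRocq"], 2)
              for tool, t in avg_time_s.items() if tool != "AutoRocq"}


def margin(results):
    """AutoRocq's lead over the best baseline, in percentage points."""
    best_baseline = max(v for tool, v in results.items() if tool != "AutoRocq")
    return round(results["AutoRocq"] - best_baseline, 1)


coqgym_margin = margin(coqgym)
svcomp_margin = margin(svcomp)
```

The ratio for Rango comes out just under 5, consistent with the "5x" figure, while the ratio for QEDC is below 1, reflecting that QEDC has the shortest average time in the table despite its lower success rates.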
Ablation studies confirm that agentic context retrieval is critical: disabling context search (“NoContext”) reduces proof success by 14 percentage points. Conversely, proof-tree awareness and feedback contribute measurable but not dominant gains.
5. Key Insights, Failure Modes, and Best Practices
AutoRocq’s effectiveness is primarily attributed to tightly coupled prompt engineering and context retrieval mechanisms. Empirical evidence demonstrates that, for agentic Rocq pipelines:
- Few-shot exemplars should be selected for maximal coverage of prototypical proof patterns.
- Long proof traces saturate context window budgets (adding exemplars beyond a certain count yields diminishing returns).
- Semantically meaningful query embeddings are essential for retrieval-based memory injection.
- Monolithic control rewrites (ADAS) risk overfitting; modular planning + execution + reflection remains optimal.
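The retrieval point in the list above can be made concrete with a small sketch of similarity-based memory lookup over (state, reasoning) pairs. It is an illustration only: the `embed` function here is a bag-of-words counter used as a placeholder for a learned semantic embedding, and the memory entries are invented examples rather than data from the cited studies.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Placeholder embedding: token counts. A real system would use a learned model."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(bank: List[Tuple[str, str]], state: str,
             top_k: int = 3) -> List[Tuple[str, str]]:
    """Return the top_k (state, reasoning) pairs most similar to the query state."""
    query = embed(state)
    ranked = sorted(bank, key=lambda entry: cosine(query, embed(entry[0])),
                    reverse=True)
    return ranked[:top_k]


# Invented memory entries for illustration.
bank = [
    ("forall n, n + 0 = n", "induction on n, then simplify"),
    ("forall l, rev (rev l) = l", "induction on l, use rev_app_distr"),
    ("forall n m, n + m = m + n", "induction on n, rewrite with add_succ_r"),
]
hits = retrieve(bank, "forall k, k + 0 = k", top_k=2)
```

Token overlap ranks the two arithmetic goals above the list goal here, but it would miss goals that are related in meaning while sharing few tokens, which is why semantically meaningful embeddings matter for this step.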
Failure cases predominantly occur for extremely complex goals (where neither the LLM nor heuristic queries suffice) or when critical auxiliary lemmas are absent from the context. The agent is robust to hallucinated tactics, as Rocq’s kernel rejects any invalid steps, preserving logical soundness.
6. Theoretical Guarantees and Limitations
The soundness of AutoRocq is inherited from Rocq’s trusted kernel: for any proof script generated under a given context, acceptance by the kernel implies that the stated lemma holds in that context. Dependence on closed-source LLMs introduces cost and occasional hallucinated tactics, although the kernel rejects the latter. The current system’s context retrieval is syntax-driven (“Search” commands) and may benefit from more sophisticated semantic search or proof-pattern mining. Multi-threaded and heap-manipulating software remains outside the current scope, and loop invariant generation relies on lightweight property testing rather than full formal inference.
7. Prospective Directions and Hybrid Approaches
Future research avenues and best practices for AutoRocq systems include:
- Joint prompt + retrieval architectures: keep a fixed set of few-shot traces and dynamically retrieve the top 3 to 5 relevant past proof states.
- Lightweight Bayesian tuning of hyperparameters (e.g., generation temperature, prompt verbosity).
- Minimal “reflection” phases interleaved every fixed number of tactics to check for induction or other standard strategies.
- Integration with automated theorem provers (ATPs), such as CoqHammer, for subgoal discharge.
- Closing the generate-and-validate loop by coupling with AI coding agents for end-to-end trusted automatic programming.
- Reinforcement learning for adaptive retrieval/generation on agentic success metrics.
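Two of the directions above, the joint prompt + retrieval layout and the periodic reflection phase, can be combined in one prompt-assembly step. The sketch below is hypothetical throughout: the `build_prompt` function, the prompt wording, and the `reflect_every` interval are illustrative choices, not part of any published AutoRocq configuration.

```python
from typing import List, Tuple


def build_prompt(goal: str,
                 few_shot: List[str],
                 retrieved: List[Tuple[str, str]],
                 step: int,
                 reflect_every: int = 5) -> str:
    """Assemble fixed exemplars, retrieved memory, and an optional reflection cue."""
    parts = ["You are proving a lemma in Rocq."]
    # Fixed few-shot traces: the same exemplars on every call.
    parts += [f"Example proof:\n{trace}" for trace in few_shot]
    # Dynamically retrieved (state, reasoning) pairs for the current goal.
    parts += [f"Related past state: {state}\nReasoning used: {reasoning}"
              for state, reasoning in retrieved]
    # Reflection cue every `reflect_every` tactics (never at step 0).
    if step > 0 and step % reflect_every == 0:
        parts.append("Reflect: would induction or another standard strategy "
                     "fit better?")
    parts.append(f"Current goal: {goal}")
    return "\n\n".join(parts)


few_shot = ["intros n. induction n; simpl; auto."]
retrieved = [("forall n, n + 0 = n", "induction on n, then simplify")]
prompt_plain = build_prompt("forall k, k + 0 = k", few_shot, retrieved, step=3)
prompt_reflect = build_prompt("forall k, k + 0 = k", few_shot, retrieved, step=5)
```

Keeping the exemplars fixed and the retrieved entries few bounds the prompt length, which is one way to respect the context window limits noted earlier.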
This suggests that next-generation AutoRocq agents will benefit from structured hybrid optimization—balancing prompt engineering, semantic retrieval, and modular reasoning control for scalable, robust formal verification (Tu et al., 21 Nov 2025, Kozyrev et al., 5 Feb 2026).