Proof2Silicon: Prompt Repair for Verified Code and Hardware Generation via Reinforcement Learning

Published 7 Sep 2025 in cs.AI | (2509.06239v1)

Abstract: LLMs have demonstrated impressive capabilities in automated code generation but frequently produce code that fails formal verification, an essential requirement for hardware and safety-critical domains. To overcome this fundamental limitation, we previously proposed PREFACE, a model-agnostic framework based on reinforcement learning (RL) that iteratively repairs the prompts provided to frozen LLMs, systematically steering them toward generating formally verifiable Dafny code without costly fine-tuning. This work presents Proof2Silicon, a novel end-to-end synthesis framework that embeds the previously proposed PREFACE flow to enable the generation of correctness-by-construction hardware directly from natural language specifications. Proof2Silicon operates by: (1) leveraging PREFACE's verifier-driven RL agent to optimize prompt generation iteratively, ensuring Dafny code correctness; (2) automatically translating verified Dafny programs into synthesizable high-level C using Dafny's Python backend and PyLog; and (3) employing Vivado HLS to produce RTL implementations. Evaluated rigorously on a challenging 100-task benchmark, PREFACE's RL-guided prompt optimization consistently improved Dafny verification success rates across diverse LLMs by up to 21%. Crucially, Proof2Silicon achieved an end-to-end hardware synthesis success rate of up to 72%, generating RTL designs through Vivado HLS synthesis flows. These results demonstrate a robust, scalable, and automated pipeline for LLM-driven, formally verified hardware synthesis, bridging natural-language specification and silicon realization.


Authors (3)

Summary

The paper introduces Proof2Silicon, an end-to-end framework built around PREFACE, a previously proposed reinforcement learning-based method that iteratively repairs prompts so that frozen LLMs generate formally verified Dafny code.
It details an end-to-end synthesis pipeline that converts verified Dafny code to Python and then to HLS C code optimized for FPGA synthesis.
On a 100-task benchmark, prompt repair improves Dafny verification success rates by up to 21%, and end-to-end hardware synthesis succeeds at a rate of up to 72%.

Introduction

Recent advances in LLMs have improved automated code generation, yet the generated code frequently fails formal verification, which is essential in domains such as aerospace, cryptography, and hardware design. Proof2Silicon addresses this with a framework that uses reinforcement learning (RL) to repair prompts, carrying a natural-language specification through verified Dafny code to hardware synthesis.

Methodology

PREFACE Framework

At the heart of Proof2Silicon lies PREFACE, an RL-driven framework that refines the prompts given to a frozen LLM so that it generates verifiable Dafny code without fine-tuning. PREFACE uses a small language model (SLM), trained with Proximal Policy Optimization (PPO), to iteratively adapt the prompt based on feedback from the formal verifier.
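The verifier-in-the-loop idea can be illustrated with a minimal sketch. It is not PREFACE's implementation: the three callables (`generate`, `verify`, `repair`) and the `max_iters` budget are hypothetical stand-ins for the frozen LLM, the Dafny verifier, and the trained SLM policy, and PPO training of the policy is left out entirely.

```python
from typing import Callable, Optional, Tuple

# All three callables are hypothetical stand-ins, not PREFACE's real interfaces.
Generate = Callable[[str], str]             # frozen LLM: prompt -> Dafny source
Verify = Callable[[str], Tuple[bool, str]]  # verifier: source -> (ok, error text)
Repair = Callable[[str, str], str]          # SLM policy: (prompt, errors) -> new prompt


def repair_loop(prompt: str, generate: Generate, verify: Verify,
                repair: Repair, max_iters: int = 5) -> Tuple[Optional[str], int]:
    """Return (verified_code, iterations_used), or (None, max_iters) on failure."""
    for step in range(1, max_iters + 1):
        code = generate(prompt)          # the LLM itself is never updated
        ok, errors = verify(code)
        if ok:
            return code, step
        # Only the prompt changes between attempts, driven by verifier feedback.
        prompt = repair(prompt, errors)
    return None, max_iters
```

Keeping the three roles as plain callables mirrors the model-agnostic claim: any LLM can be swapped in behind `generate` without touching the loop.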

Figure 1: Overview of the PREFACE framework forming the prompt-optimization core embedded within the Proof2Silicon pipeline.

Proof2Silicon Workflow

Proof2Silicon extends PREFACE into an end-to-end synthesis pipeline:

Dafny Code Verification: Starting from a natural-language specification, PREFACE refines the prompt until the LLM produces Dafny code that passes verification.
Dafny to Python Transformation: The verified Dafny code is translated to Python and sanitized to remove dependencies on Dafny-specific libraries.
PyLog Integration: The Python code is annotated with PyLog decorators, which guide its conversion to HLS-compatible C code through loop and memory optimizations.
FPGA Synthesis with Vivado HLS: Finally, Vivado HLS synthesizes the optimized HLS C code into RTL for FPGA platforms.

Figure 2: Overview of the proposed Proof2Silicon framework.
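The four stages above can be pictured as a chain in which each stage consumes the previous stage's artifact. The sketch below is a generic orchestration pattern under that assumption, not the authors' tooling: `run_pipeline`, `PipelineResult`, and the stage callables are hypothetical names, and real stages would invoke Dafny, PyLog, and Vivado HLS rather than Python functions.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A stage takes the previous artifact and returns the next one, or raises on failure.
Stage = Tuple[str, Callable[[str], str]]


@dataclass
class PipelineResult:
    artifact: Optional[str]       # final output if every stage succeeded
    failed_stage: Optional[str]   # name of the first stage that raised, if any
    error: Optional[str] = None   # message from that stage's exception


def run_pipeline(spec: str, stages: List[Stage]) -> PipelineResult:
    """Run stages in order, stopping at the first one that raises."""
    artifact = spec
    for name, stage in stages:
        try:
            artifact = stage(artifact)
        except Exception as exc:  # record where the flow stopped
            return PipelineResult(None, name, str(exc))
    return PipelineResult(artifact, None)
```

Recording the first failing stage makes it possible to separate verification failures from translation or synthesis failures when tallying results per task.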

Implementation Results

Verification Success

Embedding PREFACE's prompt optimization improves verification success: on the 100-task benchmark, RL-guided prompt repair raised Dafny verification success rates across diverse LLMs by up to 21%.

(Table 1)

Table 1: Verification success rates across LLMs using various prompt strategies, highlighting improvements with trained SLM.

Hardware Synthesis

The evaluation also measured how many verified programs could be synthesized to RTL with Vivado HLS. Reported results include:

HLS synthesis success rates of up to 72.4% for Gemini-2-Flash
Predictable synthesis latencies and memory profiles

(Table 2)

Table 2: Proof2Silicon hardware synthesis results, with metrics for synthesis readiness and performance.

Challenges and Future Work

Proof2Silicon, while promising, still faces challenges, chiefly constructs that the synthesis flow does not support. Future work involves refining the translation flow to reduce synthesis failures and integrating performance optimization directly into the prompting process. These improvements aim to narrow the gap between formal verification and practical hardware realization.
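One way to surface unsupported constructs before synthesis is a static check on the intermediate Python code. The sketch below uses Python's standard `ast` module; the deny-list is purely illustrative, since the text does not enumerate which constructs fail, and `find_flagged` is a hypothetical helper rather than part of Proof2Silicon.

```python
import ast
from typing import List, Tuple, Type

# Illustrative deny-list only; the constructs that actually fail in the
# Proof2Silicon flow are not enumerated in the text above.
DEFAULT_DENY: Tuple[Type[ast.AST], ...] = (ast.Try, ast.Lambda, ast.With)


def find_flagged(source: str,
                 deny: Tuple[Type[ast.AST], ...] = DEFAULT_DENY) -> List[Tuple[int, str]]:
    """Return (line number, node type name) for each deny-listed node, in line order."""
    tree = ast.parse(source)
    hits = [(node.lineno, type(node).__name__)
            for node in ast.walk(tree) if isinstance(node, deny)]
    return sorted(hits)
```

A report like this could, in principle, be fed back into the prompt so that later generations avoid the flagged constructs, which is in the spirit of the future work described above.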

Conclusion

Proof2Silicon offers a cohesive pipeline that extends prompt optimization through to verified hardware synthesis. It relies on RL-driven prompt repair to raise formal verification success and carries the verified code through translation and HLS synthesis to RTL. This positions Proof2Silicon as a scalable approach to formally verified, trustworthy hardware generation from natural-language specifications.
