
From Evaluation to Enhancement: Large Language Models for Zero-Knowledge Proof Code Generation (2509.11708v1)

Published 15 Sep 2025 in cs.SE

Abstract: Zero-knowledge proofs (ZKPs) are increasingly deployed in domains such as privacy-preserving authentication, blockchain scalability, and secure finance. However, authoring ZK programs remains challenging: unlike mainstream programming, ZK development requires reasoning about finite field arithmetic, constraint systems, and gadgets, making it knowledge-intensive and error-prone. While LLMs have demonstrated strong code generation capabilities in general-purpose languages, their effectiveness for ZK programming, where correctness hinges on both language mastery and gadget-level reasoning, remains unexplored. To address this gap, we propose ZK-Eval, a domain-specific evaluation pipeline that probes LLM capabilities at three levels: language knowledge, gadget competence, and end-to-end program generation. Our evaluation of four state-of-the-art LLMs reveals that models excel at surface-level syntax but struggle with gadget usage and semantic correctness, often yielding incorrect programs. Based on these insights, we introduce ZK-Coder, an agentic framework that augments LLMs with constraint sketching, guided retrieval, and interactive repair. Experiments on Circom and Noir show substantial gains, with success rates improving from 17.35% to 83.38% and from 32.21% to 90.05%, respectively. With ZK-Eval and ZK-Coder, we establish a foundation for systematically measuring and augmenting LLMs in ZK code generation to lower barriers for practitioners and advance trustworthy computation.

Summary

  • The paper showcases a novel framework (ZK-Coder) that significantly improves LLM accuracy for zero-knowledge proof code synthesis.
  • It introduces ZK-Eval, a benchmark evaluating LLM performance on syntax, gadget implementation, and end-to-end ZKP generation.
  • Empirical results show up to 94.12% accuracy on Noir and 89.23% on Circom, underscoring the benefits of agentic augmentation.

LLMs for Zero-Knowledge Proof Code Generation: Evaluation and Enhancement

Introduction

This paper addresses the intersection of LLMs and zero-knowledge proof (ZKP) program synthesis, focusing on the unique challenges posed by ZK programming compared to mainstream software development. ZKPs require developers to encode mathematical constraints over finite fields, leveraging domain-specific languages (DSLs) such as Circom and Noir, and to compose reusable gadgets for constraint systems. The authors introduce ZK-Eval, a comprehensive benchmark for evaluating LLM capabilities in ZK code generation, and ZK-Coder, an agentic framework that augments LLMs with constraint sketching, retrieval-augmented generation (RAG), and interactive repair. The empirical results demonstrate substantial improvements in end-to-end ZK program synthesis, with ZK-Coder achieving up to 94.12% accuracy on Noir and 89.23% on Circom, compared to baseline rates below 33%.

Figure 1: Comparison between mainstream and ZK programming workflows, highlighting the constraint-oriented nature of ZK development.

ZK Programming: Challenges and Requirements

ZK programming diverges fundamentally from imperative paradigms. Instead of specifying a computation, developers encode a relation R(x, w) = 1 over public inputs x and private witnesses w, which is compiled into an arithmetic circuit. DSLs such as Circom and Noir expose different abstraction levels: Circom requires explicit circuit wiring and signal management, while Noir offers Rust-like syntax and type safety. The complexity of ZK development is compounded by the need to compose and implement gadgets (modular building blocks for constraints) whose correctness is not always enforced by compilers.
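To make the relational style concrete, here is a minimal Python sketch (not Circom or Noir code; the square-root relation is chosen purely for illustration, over the BN254 scalar field that Circom uses by default):

```python
# Minimal sketch of the relational view of ZK programming: a relation
# R(x, w) over a prime field, where x is public and w is a private
# witness. Real ZK DSLs compile such relations into arithmetic
# circuits; here we merely evaluate R directly.
P = 21888242871839275222246405745257275088548364400416034343698204186575808495617  # BN254 scalar field modulus

def R(x: int, w: int) -> bool:
    """R(x, w) = 1 iff w is a square root of x modulo P."""
    return (w * w) % P == x % P

# A prover who knows w = 3 can convince a verifier that x = 9
# has a square root, without revealing w.
assert R(9, 3)
assert R(9, P - 3)   # -3 mod P is the other root
assert not R(9, 4)
```

The asymmetry between evaluating R (easy) and proving knowledge of w without revealing it (the hard part that circuits and proof systems handle) is exactly what the DSLs abstract.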

Figure 2: Overview of the ZK-Eval benchmark and ZK-Coder agentic framework.

ZK-Eval: Benchmarking LLMs for ZK Code Generation

ZK-Eval is designed to probe LLM capabilities at three granular levels:

  1. Language and Toolchain Knowledge: Assessed via multiple-choice questions (MCQs) covering syntax, advanced features, API references, and compiler principles for Circom and Noir.
  2. Gadget-Level Competence: Evaluated by tasks requiring the use or implementation of 35 representative gadgets, spanning logical, arithmetic, and composite operators.
  3. End-to-End Program Generation: Adapted from HumanEval, reformulated as verification problems suitable for ZK DSLs, with dual test suites for soundness and completeness.

Figure 3: Design of the MCQ benchmark for evaluating LLM knowledge of ZK languages.

Figure 4: Example MCQ on Circom syntax and summary of knowledge categories.

Figure 5: Gadget benchmark design for assessing LLMs' ability to encode constraints using gadgets.

Figure 6: End-to-end ZK program generation benchmark adapted from HumanEval.
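The dual-suite idea can be sketched in a few lines of Python; `circuit` below is a hypothetical stand-in for a compiled circuit, not the paper's actual harness:

```python
# Sketch of dual-suite testing for a ZK program: completeness tests
# feed valid (input, witness) pairs that the circuit must accept,
# while soundness tests feed invalid witnesses it must reject.
def circuit(public_x: int, witness_w: int) -> bool:
    # Toy relation: the witness squares to x modulo a small prime.
    return (witness_w * witness_w) % 97 == public_x % 97

completeness_cases = [(25, 5), (4, 2)]   # must all be accepted
soundness_cases = [(25, 6), (4, 3)]      # must all be rejected

complete = all(circuit(x, w) for x, w in completeness_cases)
sound = all(not circuit(x, w) for x, w in soundness_cases)
```

A program that passes only one suite is exactly the failure mode the benchmark is designed to expose: an over-constrained circuit fails completeness, an under-constrained one fails soundness.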

Empirical Evaluation: LLM Performance on ZK-Eval

Four LLMs were evaluated: GPT-o4-mini, GPT-o3, DeepSeek-V3, and Qwen3. Key findings include:

  • Language Knowledge: Reasoning models (GPT-o4-mini, GPT-o3) achieve near-human expert accuracy (88.1% and 87.2%), outperforming open-source models (DeepSeek-V3, Qwen3 at ~79%).
  • Gadget Competence: All models struggle, with logical gadgets reaching only ~52% accuracy and arithmetic/composite gadgets dropping below 20%. Typing errors and signal mismanagement are prevalent.
  • End-to-End Generation: Baseline pass rates are low (Circom: 17.35–20.29%, Noir: 27.94–32.21%), with semantic correctness lagging behind syntactic validity.

Figure 7: Accuracy of LLMs and human experts on the MCQ benchmark for ZK language knowledge.

Figure 8: Error distribution in gadget implementations across languages, types, and causes.

ZK-Coder: Agentic Enhancement of LLMs

ZK-Coder is introduced to bridge the gap between surface-level language knowledge and reliable ZK program synthesis. Its pipeline consists of:

  1. Constraint Sketching: Translates natural language specifications into ZKSL, an intermediate sketch language abstracting constraints.
  2. Constraint-Guided Retrieval: Analyzes sketches to extract required gadgets and retrieves implementation hints from a curated knowledge base, enforcing operator/type/arity matching.
  3. Interactive Generation and Repair: Iteratively generates code, compiles it, and tests against semantic oracles, repairing based on diagnostics and counterexamples until correctness is achieved.

Figure 9: Overview of ZK-Coder's design, illustrating the agentic workflow from sketching to repair.
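The control flow of this pipeline can be sketched as a runnable Python toy. Every component below (`compile_zk`, `oracle`, `repair`) is a hypothetical stand-in; only the agentic loop structure mirrors the paper:

```python
from dataclasses import dataclass
from typing import Optional

# Toy sketch of ZK-Coder's generate/compile/test/repair loop.
# The compiler, oracle, and "LLM repair" are trivial stand-ins.

@dataclass
class Diagnostic:
    ok: bool
    message: str = ""

def compile_zk(program: str) -> Diagnostic:
    # Stand-in compiler: reject drafts missing a constraint operator.
    if "===" not in program:
        return Diagnostic(False, "missing constraint operator '==='")
    return Diagnostic(True)

def oracle(program: str) -> Optional[str]:
    # Stand-in semantic oracle: demand the squaring constraint;
    # return a counterexample message otherwise.
    return None if "w * w === x" in program else "fails test: x = 9, w = 3"

def repair(program: str, feedback: str) -> str:
    # Stand-in "LLM repair": simply substitute the correct constraint.
    return "w * w === x"

def zk_coder_loop(program: str, budget: int = 3) -> Optional[str]:
    for _ in range(budget):
        diag = compile_zk(program)
        if diag.ok:
            cex = oracle(program)
            if cex is None:
                return program       # compiles and passes the oracle
            feedback = cex
        else:
            feedback = diag.message
        program = repair(program, feedback)
    return None                      # repair budget exhausted

result = zk_coder_loop("w + w == x")  # wrong draft, repaired in one round
```

The bounded `budget` matters: the paper's failure analysis attributes most remaining errors to exhausting exactly this repair budget.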

Experimental Results: ZK-Coder vs. Baseline

ZK-Coder demonstrates substantial improvements over direct LLM prompting:

  • Circom: GPT-o4-mini achieves 83.38% overall accuracy, GPT-o3 89.23%.
  • Noir: GPT-o4-mini reaches 90.05%, GPT-o3 94.12%.
  • Token Cost: The agentic pipeline incurs higher token usage but remains cost-effective (<0.1 USD per task).

Ablation studies confirm the necessity of each component: removing sketching, RAG, or repair loops results in significant accuracy drops (e.g., disabling syntax repair reduces accuracy to 35% on Circom).

Figure 10: Accuracy comparison of ZK-Coder ablation variants, highlighting the impact of each design component.

Robust Generalization and Failure Analysis

On the contamination-free LiveCodeBench benchmark, ZK-Coder maintains strong performance (Circom: 72.25%, Noir: 82.32%), far exceeding baseline rates. Failure analysis reveals that most errors stem from exceeding the repair budget or incorrect sketch generation, underscoring the need for improved ZK-specific training data and more robust sketch grounding.

Illustrative Example: Sudoku Verification

The paper provides a detailed workflow example where ZK-Coder generates a ZKP program for verifying Sudoku correctness, demonstrating the translation from natural language to ZKSL, constraint-guided retrieval, and iterative repair.

Figure 11: ZK-Coder's workflow for proving Sudoku correctness.

Figure 12: Example Circom program generated by ZK-Coder for Sudoku verification.

Figure 13: Example Noir program generated by ZK-Coder for Sudoku verification.
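Independently of either DSL, the relation being proved in this example can be stated in plain Python (a sketch of the constraints a real circuit would encode as field equations, not the paper's generated code):

```python
# The relation behind Sudoku verification: the public input is the
# puzzle (0 = blank cell), the private witness is the completed grid.
# A ZK circuit would express these same checks as field constraints;
# this function only states the relation R(puzzle, solution).
def sudoku_relation(puzzle, solution) -> bool:
    digits = set(range(1, 10))
    # The witness must agree with every given clue.
    consistent = all(
        puzzle[r][c] in (0, solution[r][c])
        for r in range(9) for c in range(9)
    )
    rows_ok = all(set(row) == digits for row in solution)
    cols_ok = all({solution[r][c] for r in range(9)} == digits for c in range(9))
    boxes_ok = all(
        {solution[br + r][bc + c] for r in range(3) for c in range(3)} == digits
        for br in (0, 3, 6) for bc in (0, 3, 6)
    )
    return consistent and rows_ok and cols_ok and boxes_ok

# Usage: a valid grid built from a standard shifted pattern.
grid = [[(r * 3 + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]
puzzle = [[grid[r][c] if (r + c) % 2 == 0 else 0 for c in range(9)] for r in range(9)]
assert sudoku_relation(puzzle, grid)
assert not sudoku_relation(puzzle, [[1] * 9] * 9)
```

Proving this relation in zero knowledge lets a prover demonstrate the puzzle is solvable without revealing the solution, which is exactly the witness-hiding property the workflow illustrates.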

Implications and Future Directions

The results establish that while LLMs possess strong surface-level knowledge of ZK DSLs, reliable gadget construction and end-to-end synthesis require agentic augmentation. ZK-Coder's pipeline—combining sketching, RAG, and interactive repair—substantially lowers the barrier for ZK program development and advances trustworthy computation. The findings suggest several avenues for future research:

  • Circuit Optimization: Guiding LLMs to produce efficient circuits, not just correct ones, by incorporating ZK-friendly primitives and optimization strategies.
  • Training Data Expansion: Systematic generation of ZK DSL examples to support targeted LLM fine-tuning and domain adaptation.
  • Benchmark Extension: Incremental addition of advanced and cryptographic gadgets to the evaluation suite.

Conclusion

This work presents ZK-Eval, the first systematic benchmark for LLM-based ZK program synthesis, and ZK-Coder, an agentic framework that significantly enhances LLM performance in this domain. The empirical evidence demonstrates that agentic augmentation is essential for bridging the gap between language knowledge and reliable ZK code generation, with practical implications for privacy-preserving computation and secure software engineering.
