
SolEval+: Benchmark for Solidity LLMs

Updated 3 February 2026
  • The SolEval+ Benchmark is a comprehensive suite that evaluates contract-level Solidity code generation by leveraging real on-chain contracts and rigorous dynamic testing.
  • It employs a dynamic, transaction-driven framework that replays 1,000 transactions per contract to ensure byte-for-byte equivalence in execution and state.
  • The benchmark integrates meticulous context extraction and strict validation protocols to address syntax variations, dependency hierarchies, and inter-function consistency.

The SolEval+ Benchmark is a comprehensive suite for evaluating the contract-level code generation capabilities of LLMs in Solidity, focusing on realistic smart contract development with explicit functional, contextual, and security correctness criteria. It represents a major advancement beyond prior benchmarks, which predominantly address only function-level code generation or synthetic tasks, and is constructed on a foundation of real-world, on-chain contracts and rigorous dynamic validation. SolEval+ is referenced interchangeably with SolContractEval in the literature and is now the de facto standard for contract-level Solidity code generation benchmarking (Ye et al., 28 Sep 2025, Chen et al., 30 Jan 2026).

1. Motivation and Benchmark Evolution

Historically, evaluations of LLMs for Solidity code generation were limited to function-level or repository-level tasks, as exemplified by SolBench (Chen et al., 3 Mar 2025) and SolEval (Peng et al., 26 Feb 2025). These earlier benchmarks neglected key phenomena in smart contract engineering, such as inter-function consistency, inheritance hierarchies, version-dependent syntax, and behaviors emergent only under realistic transaction sequences. Such gaps motivated the creation of SolEval+, the first contract-level benchmark. SolEval+ evaluates the capability to generate entire contracts reflective of real on-chain artifacts, integrates a realistic dependency context, and employs dynamic, transaction-driven evaluation—addressing the limitations of static or isolated function-level tests (Ye et al., 28 Sep 2025).

2. Benchmark Construction and Task Design

SolEval+ tasks are derived from verified mainnet contracts mined from Ajibola et al.’s Etherscan dataset. Contracts are filtered for ≥1,000 real on-chain transactions, deduplicated via SimHash, balanced for broad compiler version coverage (0.4.x–0.8.x), and restricted to contracts with substantive implementation (excluding interfaces and trivial constructors). This rigorous curation yields 124 contract-level tasks spanning nine critical application domains: DeFi, GameFi, ERC20 and ERC721 token standards, DAOs, proxy patterns, oracles, marketplaces, and an "other" category (Ye et al., 28 Sep 2025).

Each task is structured for LLM input as follows:

  • Dependency context: A minimal but complete set of external definitions—imports, libraries, inherited contracts—extracted via a static call-graph analysis. Algorithm 1 formalizes this process, using tools such as Slither to enumerate and resolve all call dependencies (a minimal sketch follows this list).
  • Structured contract framework: Includes license specifiers, pragma directives, original imports, state variable declarations, constructors, and full function signatures (retaining NatSpec documentation). Only function bodies are omitted for generation.
  • Natural-language prompt: A concise, NatSpec-format textual description of each function's intended behavior, coupled with the required Solidity compiler version. All NatSpec annotations are independently authored and cross-validated by experienced developers to prevent leakage and ensure semantic accuracy.
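
As a rough illustration of the dependency-context extraction in the first bullet above (not the benchmark's actual Algorithm 1), the context can be assembled by taking the transitive closure of a call graph that a static analyzer such as Slither has already produced. The `dependency_closure` helper and the contract names in the example are hypothetical.

```python
# Hypothetical sketch: transitive dependency collection over a call graph
# that a static analyzer (e.g., Slither) has already produced. The graph
# maps each definition to the names it references via imports, inheritance,
# library usage, or external calls.
from collections import deque

def dependency_closure(call_graph: dict[str, set[str]], target: str) -> set[str]:
    """Return every definition reachable from `target` in the call graph."""
    seen: set[str] = set()
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for dep in call_graph.get(node, set()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen

# Illustrative only: a sale contract that inherits Ownable and uses SafeMath,
# both of which must appear in the prompt so the generated contract compiles.
graph = {"TokenSale": {"Ownable", "SafeMath"}, "Ownable": set(), "SafeMath": set()}
assert dependency_closure(graph, "TokenSale") == {"Ownable", "SafeMath"}
```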

3. Dynamic Multidimensional Evaluation Framework

SolEval+ employs dynamic, on-chain–inspired functional correctness evaluation. For each target contract, the first 1,000 transactions (with opcode traces, storage slot changes, event logs, and full call context) are replayed deterministically on a local Hardhat network using the same compiler version as the original deployment. Success criteria require that generated contracts:

  • Produce identical execution status and return values for every transaction (byte-for-byte equivalence).
  • Emit matching events in both signature and payload, including indexed and unindexed event fields.
  • Maintain storage slot state evolution sequences that are SHA3-256 hash-identical to those of the original on-chain contract.
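
A minimal sketch of this per-transaction check is shown below, assuming each replay yields a record of status, return data, logs, and ordered storage writes. The `TxOutcome` record and its field names are illustrative, not the benchmark's actual harness.

```python
# Illustrative per-transaction equivalence check (not the benchmark's harness).
# A replay on the local Hardhat network is assumed to yield one TxOutcome per
# transaction for both the original and the generated contract.
import hashlib
from dataclasses import dataclass

@dataclass
class TxOutcome:
    status: int                              # 1 = success, 0 = revert
    return_data: bytes                       # raw return value
    logs: list[tuple[bytes, bytes]]          # (event signature topic, encoded payload)
    storage_writes: list[tuple[int, int]]    # (slot, value) in write order

def storage_digest(outcome: TxOutcome) -> str:
    """Hash the storage-slot evolution so the two sequences can be compared cheaply."""
    h = hashlib.sha3_256()
    for slot, value in outcome.storage_writes:
        h.update(slot.to_bytes(32, "big") + value.to_bytes(32, "big"))
    return h.hexdigest()

def equivalent(original: TxOutcome, candidate: TxOutcome) -> bool:
    """Byte-for-byte agreement on status, return value, events, and storage state."""
    return (original.status == candidate.status
            and original.return_data == candidate.return_data
            and original.logs == candidate.logs
            and storage_digest(original) == storage_digest(candidate))
```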

Key metrics are formalized as follows:

  • Pass@k: The probability that at least one of $k$ generations passes both compilation and dynamic simulation. The unbiased estimator is $\operatorname{Pass@}k = \mathbb{E}_i\left[1 - \frac{\binom{n-c_i}{k}}{\binom{n}{k}}\right]$, where $n$ is the number of samples and $c_i$ the number of successful generations for task $i$ (a numerical sketch follows this list).
  • Compile@k: As above, but counting compilable outputs only.
  • Function Pass Rate (FPR): The fraction of isolated functions passing their dynamic tests.
  • Contract Full Pass Rate (CFPR): The proportion of contracts for which all functions pass dynamic evaluation.
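
For reference, the unbiased Pass@k estimator above can be computed directly from the per-task counts; the function below is a straightforward transcription, not the benchmark's released code. Compile@k uses the same estimator with $c_i$ counting compilable generations instead of passing ones.

```python
# Unbiased Pass@k estimator: for each task i with n samples and c_i passes,
# pass@k_i = 1 - C(n - c_i, k) / C(n, k); the benchmark score is the mean over tasks.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k drawn samples (out of n) passes."""
    if n - c < k:          # every size-k draw must contain a passing sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(successes_per_task: list[int], n: int, k: int) -> float:
    return sum(pass_at_k(n, c, k) for c in successes_per_task) / len(successes_per_task)

# Example: 3 tasks, n = 5 samples each, with 0, 2, and 5 passing generations.
print(round(benchmark_pass_at_k([0, 2, 5], n=5, k=1), 3))  # mean of 0.0, 0.4, 1.0 -> 0.467
```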

4. Experimental Protocols and Model Baselines

Six mainstream LLMs from Anthropic, OpenAI, Google DeepMind, DeepSeek-AI, and Alibaba are evaluated using a uniform pipeline:

  • Generation: For each task, n = 5 candidates are produced either by greedy decoding (k = 1) or nucleus sampling (k > 1, temperature = 0.2, top_p = 0.95).
  • Prompt composition: Prompt includes full extracted context, contract scaffold, and cross-validated NatSpec descriptions.
  • Test environment: All dynamic evaluations are performed on Ubuntu 20.04.5 with 128-core Xeon Platinum CPUs, 8×NVIDIA A800 GPUs, and a Hardhat network.
  • Compiler fidelity: The Solidity compiler version matches the contract’s original pragma (Ye et al., 28 Sep 2025).
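
One concrete way to meet the compiler-fidelity requirement is to read the version constraint from the original contract's pragma and pin the matching solc release before compiling candidates. The regex helper below is an illustrative sketch, not the benchmark's released tooling.

```python
# Illustrative sketch: extract the pragma constraint from the original contract
# source so the matching solc version can be installed and selected.
import re

PRAGMA_RE = re.compile(r"pragma\s+solidity\s+([^;]+);")

def required_solc_version(source: str) -> str:
    """Return the version constraint declared in the first pragma directive."""
    match = PRAGMA_RE.search(source)
    if match is None:
        raise ValueError("no pragma solidity directive found")
    return match.group(1).strip()

source = "pragma solidity ^0.6.12;\ncontract Vault { uint256 public total; }"
print(required_solc_version(source))  # ^0.6.12 -> compile with solc 0.6.12
```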

Recent studies also evaluate agent-based frameworks, such as SolAgent, on a reduced SolEval+ benchmark (81 files, 1,125 functions, 1,188 tests). SolAgent uses a dual-loop refinement protocol, alternating between Forge-based compilation/testing and Slither-driven static security analysis. Comparative baselines include plain LLMs, AI IDE agents (Copilot, MetaGPT, DeepCode), and ablations of tool feedback (Chen et al., 30 Jan 2026).
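
The dual-loop idea can be pictured as two nested feedback loops, sketched below in schematic form: the `regenerate` callback is a placeholder for the LLM rewrite step, and only the standard `forge test` and `slither` command-line invocations are assumed.

```python
# Control-flow sketch of a dual-loop refinement agent (illustrative only):
# the inner loop repairs compile/test failures reported by `forge test`,
# the outer loop feeds Slither's static-analysis output back for a
# security-oriented revision.
import subprocess

def run(cmd: list[str], cwd: str) -> subprocess.CompletedProcess:
    return subprocess.run(cmd, cwd=cwd, capture_output=True, text=True)

def refine(project_dir: str, regenerate, max_rounds: int = 3) -> bool:
    """`regenerate(feedback)` stands in for the LLM rewriting the contract."""
    for _ in range(max_rounds):
        test = run(["forge", "test"], project_dir)   # functional loop
        if test.returncode != 0:
            regenerate(test.stdout + test.stderr)
            continue
        audit = run(["slither", "."], project_dir)   # security loop
        if audit.returncode != 0:                    # non-zero typically signals findings
            regenerate(audit.stderr)
            continue
        return True                                  # tests pass, no reported findings
    return False
```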

5. Results, Analysis, and Challenges

SolEval+ reveals a marked performance gap between contract-level Solidity generation and general-purpose code generation:

  • LLM performance: The best LLM (Claude-3.7-Sonnet) achieves Pass@1 of 40.7% and Compile@1 of 85.5% across 124 tasks. Even high-capacity models underperform compared to their class-level results on generic programming tasks (e.g., GPT-3.5: ClassEval Pass@1 ≈29.6% vs. SolEval+ Pass@1 ≈12.1%) (Ye et al., 28 Sep 2025).
  • Domain variance: Models excel on standardized ERC20 contracts (reflecting pattern familiarity) but suffer on complex ERC721, DeFi, and contracts with intricate ownership logic or multi-level dependencies.
  • Primary error categories: A significant proportion of failures arise from version-specific syntax (25.6% ParserErrors), the storage/memory distinction (12.4% TypeErrors), declaration resolution across inheritance hierarchies (21.7% DeclarationErrors), and inter-function consistency lapses. Notably, function-level FPRs (~95%) far exceed contract-level Pass@1 (~41%), highlighting the challenge of integrated contract behaviors under realistic state evolution.
  • Agent-based advances: SolAgent (with Claude-Sonnet-4.5), using refinement guided by full Forge unit testing and Slither feedback, achieves a Pass@1 of 64.39% on the 81-file subset, with up to a 39.8% reduction in detected vulnerabilities versus the human-written baseline (Chen et al., 30 Jan 2026).

Model/Framework | Pass@1 (%) | Compile rate (%) | Vulnerability reduction
Claude-3.7-Sonnet | 40.7 | 85.5 | N/A
SolAgent + Claude | 64.4 | 95.1 | −15.7%
Copilot (Claude) | 10.0 | 32.1 | N/A
Human-written baseline | 100.0 | 100.0 | N/A (reference)

6. Limitations, Extensions, and Future Directions

SolEval+ establishes a multidimensional, contract-level benchmark with dynamic replay as a rigorous correctness criterion, and it highlights persistent challenges for both models and evaluation methodology:

  • Version and context sensitivity: Models struggle with the evolution of Solidity syntax and require precise context extraction to minimize undeclared identifier and type errors.
  • Inter-contract and cross-function dependencies: The leap from function- to contract-level tasks introduces exponential complexity via inheritance, external contracts, and storage side effects.
  • Prospective enhancements (Peng et al., 26 Feb 2025, Ye et al., 28 Sep 2025):
    • Include formal verification (SMT-based checks with tools like VerX, Certora Prover), runtime execution cost (dynamic gas and latency profiling), and more granular code complexity measurements.
    • Broaden linguistic and domain coverage (Vyper code, cross-chain patterns, multi-ecosystem tasks).
    • Integrate security adversary tests and automated continuous integration pipelines for evolving benchmarks.

This suggests that LLM benchmarking for smart contract generation must evolve continuously to capture the complexity and security criticality of modern decentralized application development.

7. Reproducibility and Community Contributions

SolEval+ supports detailed reproduction steps for the research community: exact source selection and annotation rules, explicit context extraction algorithms, hardware setup, prompt templates, and CI-ready test harnesses are provided. Extensions and new task contributions follow strictly defined filtering and validation protocols. Full benchmark code and agent frameworks such as SolAgent are released for end-to-end automated evaluation and model distillation, fostering open research in secure, correct Solidity code generation (Peng et al., 26 Feb 2025, Ye et al., 28 Sep 2025, Chen et al., 30 Jan 2026).
