Beyond Code Reasoning: Specification-Anchored Auditing of Multi-Implementation Distributed Protocols

Published 29 Apr 2026 in cs.CR | (2604.26495v2)

Abstract: Code-driven auditing fails when correctness depends on what the specification requires rather than how the code is written. Production blockchain networks expose this directly: byzantine consensus runs many independent clients of a shared specification, so a specification-divergence defect in one client can fork the network or halt finality. Existing tools reason one repository at a time, with no shared baseline held constant across implementations. We present SPECA, an LLM-driven audit framework that derives explicit, categorized security properties (invariants, pre/postconditions, trust assumptions) from natural-language specifications and reuses them across implementations. SPECA enables controlled cross-implementation comparison, detections grounded in specification invariants no code pattern encodes, and false positives traceable to a specific pipeline phase rather than opaque model errors. On the Sherlock Ethereum Fusaka Audit Contest (10 targets, 366 submissions), SPECA recovers all 15 in-scope H/M/L vulnerabilities expert-augmented (8/15 automated-only) and surfaces 4 fix-confirmed bugs, including a cryptographic-invariant violation missed by every adjudicated finding. On the RepoAudit C/C++ benchmark, SPECA reaches 88.9% precision at 100% recall (F1=0.94) and surfaces 12 author-validated bugs beyond ground truth, two externally validated. SPECA also flags 5 of RepoAudit's 40 published bugs as defensive-coding fixes with no reachable exploit path. False positives trace to three pipeline-pinned root causes; a multi-model study identifies property-generation quality as the binding constraint. End-to-end cost is ~$30 per H/M/L bug (~42 min wall-clock under parallel execution).


Authors (3)

Summary

The paper introduces SPECA, a framework that systematically derives and verifies specification-based security properties for distributed protocols.
It employs a six-phase pipeline combining program graph extraction and LLM-driven proof attempts to achieve high recall and precision.
Empirical evaluations show that SPECA is cost-efficient, matches the best published precision on RepoAudit at 100% recall, and surfaces validated bugs beyond the benchmark ground truth that code-centric auditors did not report.

Specification-Anchored Auditing: SPECA Framework for Multi-Implementation Distributed Protocols

Introduction

Auditing distributed protocols with multiple independent implementations presents a distinct challenge: code-driven auditing tools excel at detecting localized code anomalies but fail to capture vulnerabilities that arise from violations of specification-level invariants not explicitly encoded in code. This limitation is acute in systems like blockchain networks, where protocol correctness depends on all clients faithfully implementing a natural-language specification. The paper "Beyond Code Reasoning: Specification-Anchored Auditing of Multi-Implementation Distributed Protocols" (2604.26495) introduces SPECA, an LLM-driven framework for auditing such specification-governed systems. SPECA derives explicit, categorized security properties from natural-language specifications and verifies them against each implementation, providing a methodology and infrastructure for cross-implementation security analysis.

Framework Architecture

SPECA is architected as a six-phase pipeline, divided into a knowledge structuring stage and a systematic auditing stage.

Figure 1: The SPECA pipeline. Knowledge structuring (Phases 1–3) derives security properties from specifications; systematic auditing (Phases 4–6) verifies these properties against implementations. In multi-implementation settings, the structuring stage executes once, while auditing repeats per implementation.

Pipeline Phases

Phases 1–3 (Knowledge Structuring): Automatically crawl specifications, extract program graphs, and synthesize security properties (invariants, preconditions, postconditions, and trust assumptions) using threat modeling approaches informed by STRIDE and CWE taxonomies.
Phases 4–6 (Systematic Auditing): Link properties to implementation code using structural analysis, perform property-grounded audits via proof-attempt reasoning (LLM-driven structured evidence construction), and filter false positives through three recall-safe gates (dead code, trust boundary, scope check).

This architecture enforces a decoupled, specification-anchored audit regimen, enabling shared property vocabularies across disparate implementations and traceable attribution of both findings and false positives to specific pipeline stages.
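The split between a one-time structuring stage and a per-implementation auditing stage can be sketched as follows. This is a minimal illustration, not SPECA's implementation: `structure_knowledge`, `toy_audit`, `audit_all`, the data classes, and the tag-matching check are all hypothetical stand-ins, and the spec text and client names are toy values.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List


class PropertyKind(Enum):
    # The four property types named in the text.
    INVARIANT = "invariant"
    PRECONDITION = "precondition"
    POSTCONDITION = "postcondition"
    TRUST_ASSUMPTION = "trust_assumption"


@dataclass(frozen=True)
class SecurityProperty:
    prop_id: str
    kind: PropertyKind
    statement: str


@dataclass(frozen=True)
class Finding:
    implementation: str
    prop_id: str


def structure_knowledge(spec_text: str) -> List[SecurityProperty]:
    """Stand-in for Phases 1-3: one toy invariant per non-empty spec line."""
    lines = [ln.strip() for ln in spec_text.splitlines() if ln.strip()]
    return [
        SecurityProperty(f"P{i}", PropertyKind.INVARIANT, text)
        for i, text in enumerate(lines, start=1)
    ]


def toy_audit(prop: SecurityProperty, name: str, code: str) -> List[Finding]:
    """Stand-in for Phases 4-6: flag a property whose id is not tagged in the code."""
    return [] if prop.prop_id in code.split() else [Finding(name, prop.prop_id)]


def audit_all(
    properties: List[SecurityProperty],
    implementations: Dict[str, str],
    audit_one: Callable[[SecurityProperty, str, str], List[Finding]] = toy_audit,
) -> Dict[str, List[Finding]]:
    """Reuse one fixed property list across every implementation."""
    return {
        name: [f for p in properties for f in audit_one(p, name, code)]
        for name, code in implementations.items()
    }


# Structuring runs once; auditing repeats per implementation.
properties = structure_knowledge("requirement one\n\nrequirement two")
report = audit_all(properties, {"client_a": "P1 P2", "client_b": "P1"})
```

Because both toy clients are audited against the same property list, any difference between their reports comes from the implementations rather than from the baseline.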

Property-Grounded Reasoning

The pivotal innovation in SPECA is the audit layer that treats every specification-derived property as a discharge obligation: for each property, SPECA attempts to construct a Hoare-style proof at all relevant program points, and gaps in this proof are flagged as findings.

Figure 2: Phase 5 audit flow for a single property—the property is decomposed, mapped to code, sub-claims are discharged, and plausible attack scenarios are validated.

This method replaces code-local pattern recognition with explicit obligations bound to specific program points. Typing each property (invariant, precondition, postcondition, trust assumption) aligns the proof effort with program semantics and gives finer control over recall and precision. By grounding every finding in a specification-derived obligation, SPECA generates semantically traceable reports and, crucially, explains false positives by their pipeline-phase provenance.
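The idea of treating a property as a set of obligations, and reporting undischarged ones as findings, can be illustrated with a small sketch. Everything here is hypothetical: in SPECA the proof attempt is an LLM-driven reasoning step, whereas below each `check` is an ordinary Python callable, and `toy_withdraw` is an invented function that omits a bounds check.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class SubClaim:
    program_point: str          # hypothetical label for where the claim must hold
    description: str
    check: Callable[[], bool]   # stand-in for one proof attempt


@dataclass(frozen=True)
class Gap:
    program_point: str
    description: str


def discharge(sub_claims: List[SubClaim]) -> List[Gap]:
    """Return one gap per sub-claim whose proof attempt fails."""
    return [Gap(c.program_point, c.description) for c in sub_claims if not c.check()]


def toy_withdraw(balance: int, amount: int) -> int:
    # Deliberately missing the precondition check `amount <= balance`.
    return balance - amount


def enforces_bound() -> bool:
    """Precondition obligation: an over-limit request must be rejected."""
    try:
        toy_withdraw(5, 10)
    except ValueError:
        return True
    return False


def result_is_difference() -> bool:
    """Postcondition obligation: a valid request returns balance - amount."""
    return toy_withdraw(10, 3) == 7


gaps = discharge([
    SubClaim("toy_withdraw:entry", "rejects amount > balance", enforces_bound),
    SubClaim("toy_withdraw:exit", "returns balance - amount", result_is_difference),
])
```

In this toy run the postcondition is discharged and the precondition is not, so the only reported gap is bound to the function's entry point.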

Empirical Evaluation and Results

RQ1: Auditing Effectiveness and Failure Analysis

On the Sherlock Ethereum Fusaka Audit Contest benchmark (10 targets, 366 submissions), SPECA, in its expert-augmented configuration, recovers all 15 in-scope H/M/L vulnerabilities (recall: 100%; automated-only: 8/15, 53%). It also independently surfaces four additional bugs confirmed by developer fix commits, including a cryptographic-invariant violation in KZG proofs that every adjudicated finding missed.

Precision is shaped by the "property neighborhood effect", in which multiple complementary properties flag the same issue:

Figure 3: Findings per issue—most issues are detected by neighborhoods of two or more properties, which explains both high recall and the gap between finding-level and cluster-level precision.

Cluster-level strict precision is 48.7%, nearly double the raw finding-level strict precision, reflecting genuine property redundancy rather than alert duplication.
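The difference between the two precision levels can be made concrete with a small calculation. The sketch below is an assumption-laden illustration: the summary does not state the paper's clustering rule or its definition of "strict", so the code assumes each finding carries a cluster id and a true-positive flag, and counts a cluster as a true positive if any of its findings is one. The toy data is invented and does not reproduce the reported 48.7%.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each finding is (cluster_id, is_true_positive); both fields are hypothetical.
LabeledFinding = Tuple[str, bool]


def finding_precision(findings: List[LabeledFinding]) -> float:
    """Fraction of individual findings that are true positives."""
    if not findings:
        return 0.0
    return sum(1 for _, is_tp in findings if is_tp) / len(findings)


def cluster_precision(findings: List[LabeledFinding]) -> float:
    """Fraction of clusters containing at least one true-positive finding."""
    if not findings:
        return 0.0
    clusters: Dict[str, bool] = defaultdict(bool)
    for cluster_id, is_tp in findings:
        clusters[cluster_id] = clusters[cluster_id] or is_tp
    return sum(clusters.values()) / len(clusters)


# Toy data: two confirmed issues and one non-issue flagged by four properties.
toy_findings = [
    ("A", True),
    ("B", True),
    ("C", False), ("C", False), ("C", False), ("C", False),
]

finding_level = finding_precision(toy_findings)   # 2 of 6 findings
cluster_level = cluster_precision(toy_findings)   # 2 of 3 clusters
```

Grouping by cluster counts each issue once regardless of how many properties flagged it, which is why the two metrics can differ substantially on the same set of findings.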

RQ2: Comparison to Code-Driven Auditors

On the RepoAudit C/C++ benchmark, SPECA with Claude Sonnet 4.5 matches the best published precision (88.9%) at 100% recall and, uniquely, surfaces 12 author-validated, beyond-ground-truth (GT) bugs, with two externally confirmed at developer or maintainer level. In a controlled model-backbone comparison, SPECA maintains competitive cost-efficiency (~$1.69/bug), demonstrating that property-centric auditing enables both parity with and differentiated capabilities over code-centric agents.
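The abstract reports F1=0.94 for this configuration. As a quick consistency check, the standard harmonic-mean formula applied to the stated precision and recall gives the same value; `f1_score` is a helper defined here for illustration, not part of any SPECA tooling.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for SPECA on RepoAudit.
repoaudit_f1 = f1_score(0.889, 1.0)
rounded_f1 = round(repoaudit_f1, 2)  # 0.94
```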

Figure 4: SPECA cost vs detection performance on RepoAudit. SPECA uniquely contributes externally validated, beyond-GT candidates while maintaining competitive precision and cost.

SPECA also produces explicit verdicts for disputed patterns: it flags 5 of RepoAudit's 40 published bugs as defensive-coding fixes with no reachable exploit path, and justifies each verdict with a traceable property analysis, an interpretability advantage over code-only tools.

Cross-Implementation Comparison and Robustness

By fixing the property vocabulary across clients, SPECA enables controlled cross-implementation comparison. The same properties are reused across multiple repositories, producing systematic differences in TP/FP distributions determined solely by implementation divergences from the specification.

Figure 5: Per-repository finding distributions—shared property vocabularies reveal implementation-specific security profiles, which code-local tools cannot produce.

This approach exposes structural misalignments between specification and individual implementations, providing actionable insights for protocol maintainers.

Failure Mode Analysis

All false positives are mapped to one of three phase-localized root causes: trust boundary misclassification (specification vs code-level trust demarcation), code reading errors (incorrect code reasoning by the LLM), and specification misinterpretation (ambiguous or misread requirements). No opaque, unexplainable model failures persist post-analysis.
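A taxonomy like this lends itself to a simple tally that turns labeled false positives into a ranked list of engineering targets. The sketch below assumes each false positive has already been assigned one of the three root causes during triage; the enum, the function name, and the toy labels are hypothetical, and the counts are invented rather than taken from the paper.

```python
from collections import Counter
from enum import Enum
from typing import Iterable, List, Tuple


class RootCause(Enum):
    # The three root causes named in the text.
    TRUST_BOUNDARY = "trust boundary misclassification"
    CODE_READING = "code reading error"
    SPEC_MISINTERPRETATION = "specification misinterpretation"


def rank_root_causes(labels: Iterable[RootCause]) -> List[Tuple[RootCause, int]]:
    """Order root causes by how many false positives they explain."""
    return Counter(labels).most_common()


# Toy triage labels, one per false positive.
toy_labels = [
    RootCause.TRUST_BOUNDARY,
    RootCause.CODE_READING,
    RootCause.TRUST_BOUNDARY,
    RootCause.SPEC_MISINTERPRETATION,
    RootCause.TRUST_BOUNDARY,
    RootCause.CODE_READING,
]

ranking = rank_root_causes(toy_labels)
```

Using a closed enum means a false positive cannot be recorded without being assigned one of the named causes.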

Figure 6: False-positive root-cause taxonomy—distinct origins localized to pipeline phases facilitate targeted improvements.

Notably, trust assumptions are found to be too noisy for reliable automated auditing but valuable for exploratory triage, while invariants and preconditions are the most productive property types, reaching 75% and 100% precision, respectively.

Implications and Future Research

SPECA establishes several key implications:

Specification-as-Oracle: Automated audits using only code-level signals miss vulnerabilities rooted in specification-level requirements. Explicit, reusable, and typed property vocabularies extracted from specifications are necessary for comprehensive auditing in distributed systems.
Interpretability and Auditable FPs: SPECA’s structured, phase-annotated error taxonomy transforms opaque LLM errors into ranked engineering targets, making pipeline improvements tractable and audit outcomes explainable and actionable.
Model Capability and Coverage: As underlying LLMs increase in capability, property-generation quality becomes the binding constraint on vulnerability coverage, not model reasoning per se. Investments in improved natural-language-to-property extraction—especially for cryptographic invariants and protocol-lifecycle properties—will yield future gains.
Cost Structure: Amortizing property-generation effort across implementations, together with a practical per-bug cost (~$30 per H/M/L bug, amortized further at scale), places specification-anchored auditing well within operational feasibility for high-assurance applications.

Open problems include scalable, high-coverage, automated property extraction for mathematical and lifecycle invariants and generalized extension to protocol families beyond Ethereum.

Conclusion

SPECA demonstrates that specification-anchored, LLM-driven auditing is more expressive, robust, and analyzable than code-centric approaches for multi-implementation distributed protocols. By encoding and verifying explicit properties from authoritative specifications, SPECA recovers all 15 in-scope H/M/L vulnerabilities in its expert-augmented configuration (8/15 automated-only), surfaces fix-confirmed bugs that every adjudicated contest finding missed, and traces both true and false positives to specific pipeline phases. This attribution of findings shifts the principal research barrier to high-recall automated property generation. Future progress depends on advances in robust specification-to-property NLP methods and their operational integration with high-capability LLMs. SPECA's architectural principles, evaluation methodology, and empirical evidence provide a foundation for future tools and research in specification-governed security auditing.
