- The paper introduces SPECA, a framework that systematically derives and verifies specification-based security properties for distributed protocols.
- It employs a six-phase pipeline combining program graph extraction and LLM-driven proof attempts to achieve high recall and precision.
- Empirical evaluations show SPECA’s cost-efficiency and superior detection of vulnerabilities, outperforming traditional code-centric auditors.
Specification-Anchored Auditing: SPECA Framework for Multi-Implementation Distributed Protocols
Introduction
Auditing distributed protocols with multiple independent implementations presents unique challenges: code-driven auditing tools excel at detecting localized code anomalies but fail to capture vulnerabilities that arise from violations of specification-level invariants not explicitly encoded in code. This limitation is acute in systems like blockchain networks, where protocol correctness depends on all clients faithfully implementing a natural-language specification. The paper "Beyond Code Reasoning: Specification-Anchored Auditing of Multi-Implementation Distributed Protocols" (2604.26495) introduces SPECA, an LLM-driven framework for auditing such specification-governed systems. SPECA systematically derives, categorizes, and verifies explicit security properties from natural-language specifications, providing a methodology and infrastructure for high-assurance, cross-implementation security analysis.
Framework Architecture
SPECA is architected as a six-phase pipeline, divided into a knowledge structuring stage and a systematic auditing stage.
Figure 1: The SPECA pipeline. Knowledge structuring (Phases 1–3) derives security properties from specifications; systematic auditing (Phases 4–6) verifies these properties against implementations. In multi-implementation settings, the structuring stage executes once, while auditing repeats per implementation.
Pipeline Phases
- Phases 1–3 (Knowledge Structuring): Automatically crawl specifications, extract program graphs, and synthesize security properties (invariants, preconditions, postconditions, and trust assumptions) using threat modeling approaches informed by STRIDE and CWE taxonomies.
- Phases 4–6 (Systematic Auditing): Link properties to implementation code using structural analysis, perform property-grounded audits via proof-attempt reasoning (LLM-driven structured evidence construction), and filter false positives through three recall-safe gates (dead code, trust boundary, scope check).
This architecture enforces a decoupled, specification-anchored audit regimen, enabling shared property vocabularies across disparate implementations and traceable attribution of both findings and false positives to specific pipeline stages.
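The two-stage split described above can be sketched as follows. This is a minimal illustration, not SPECA's actual code: the `Property` record, `run_pipeline`, and the toy `structure`/`audit` callables are hypothetical names, and the example property text is invented for the demo. It only shows the control flow in which structuring runs once and auditing repeats per implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Property:
    pid: str
    kind: str   # "invariant", "precondition", "postcondition" or "trust_assumption"
    text: str


def run_pipeline(
    spec: str,
    implementations: Dict[str, str],
    structure: Callable[[str], List[Property]],
    audit: Callable[[List[Property], str], List[str]],
) -> Dict[str, List[str]]:
    # Phases 1-3: derive the shared property vocabulary once.
    properties = structure(spec)
    # Phases 4-6: repeat the audit for every implementation.
    return {name: audit(properties, code) for name, code in implementations.items()}


# Toy stand-ins (hypothetical): a real structuring stage would process the
# specification, and a real audit stage would map properties to code.
calls = {"structure": 0, "audit": 0}


def toy_structure(spec: str) -> List[Property]:
    calls["structure"] += 1
    return [Property("P1", "precondition", "blob count must not exceed the limit")]


def toy_audit(properties: List[Property], code: str) -> List[str]:
    calls["audit"] += 1
    # Flag a property when the toy "code" string contains no check marker.
    return [p.pid for p in properties if "check_limit" not in code]


report = run_pipeline(
    "spec text",
    {"client_a": "check_limit(); process()", "client_b": "process()"},
    toy_structure,
    toy_audit,
)
print(report)
```

Because both clients are audited against the same property list, a difference in their reports can be attributed to the implementations rather than to the property set.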
Property-Grounded Reasoning
The pivotal innovation in SPECA is the audit layer that treats every specification-derived property as a discharge obligation: for each property, SPECA attempts to construct a Hoare-style proof at all relevant program points, and gaps in this proof are flagged as findings.
Figure 2: Phase 5 audit flow for a single property—the property is decomposed, mapped to code, sub-claims are discharged, and plausible attack scenarios are validated.
This method separates auditing from code-local pattern recognition by binding each check to an explicit obligation at a specific program point. The modular typing of properties (invariant, precondition, postcondition, trust assumption) aligns the proof effort with program semantics, enabling fine control over recall and precision. By grounding every finding in a specification-derived obligation, SPECA generates semantically traceable reports and, crucially, explains false positives by their pipeline-phase provenance.
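One way to picture the "property as discharge obligation" idea is the data-structure sketch below. It assumes a property is decomposed into sub-claims, each tied to a program point, and that any sub-claim without supporting evidence becomes a finding. The class names, fields, and example strings are hypothetical. In SPECA the evidence would come from LLM-driven proof attempts, whereas here it is supplied by hand.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SubClaim:
    description: str
    program_point: str
    evidence: Optional[str] = None  # None means no proof could be constructed

    @property
    def discharged(self) -> bool:
        return self.evidence is not None


@dataclass
class Obligation:
    property_id: str
    kind: str  # invariant, precondition, postcondition, trust_assumption
    sub_claims: List[SubClaim] = field(default_factory=list)

    def gaps(self) -> List[SubClaim]:
        # Sub-claims whose proof attempt failed.
        return [c for c in self.sub_claims if not c.discharged]


def audit_obligation(ob: Obligation) -> List[str]:
    # Every gap in the proof is reported as a finding tied to its program point.
    return [f"{ob.property_id} @ {c.program_point}: {c.description}" for c in ob.gaps()]


ob = Obligation(
    "P1",
    "precondition",
    [
        SubClaim("limit is checked before processing", "handler entry",
                 evidence="guard calls check_limit()"),
        SubClaim("limit is re-checked after decoding", "post-decode"),
    ],
)
findings = audit_obligation(ob)
print(findings)
```

Each finding carries its property identifier and program point, which is what makes a report traceable back to the specification.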
Empirical Evaluation and Results
RQ1: Auditing Effectiveness and Failure Analysis
On the Ethereum Sherlock Fusaka Audit Contest benchmark (10 diverse clients, 366 submissions), SPECA, in its expert-augmented instantiation, recovers all 15 adjudicated High/Medium/Low (H/M/L) severity vulnerabilities (recall: 100%; automated-only: 8/15, 53%). Four additional high-impact bugs (including a critical cryptographic invariant violation in KZG proofs not found by any contest submission) are independently surfaced and confirmed by developer fix commits.
Precision is nuanced by the “property neighborhood effect”—multiple complementary properties per issue:
Figure 3: Findings per issue—most issues are detected by neighborhoods of two or more properties, which explains both high recall and the gap between finding-level and cluster-level precision.
Cluster-level strict precision is 48.7%, nearly double the raw finding-level strict precision, reflecting genuine property redundancy rather than alert duplication.
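The gap between finding-level and cluster-level precision can be illustrated with a toy calculation. The clustering and labeling rules used in the paper are not given here, so this sketch assumes that findings are grouped by the issue they point at and that a cluster counts as correct if any of its findings is a true positive. The data is invented for illustration and does not reproduce the reported numbers.

```python
from collections import defaultdict
from typing import List, Tuple

# (property id, cluster id, is true positive); all values are toy data.
Finding = Tuple[str, str, bool]


def finding_precision(findings: List[Finding]) -> float:
    # Fraction of individual findings that are true positives.
    return sum(tp for _, _, tp in findings) / len(findings)


def cluster_precision(findings: List[Finding]) -> float:
    # Group findings by cluster, then score each cluster once.
    clusters = defaultdict(list)
    for _, cluster_id, tp in findings:
        clusters[cluster_id].append(tp)
    correct = sum(1 for labels in clusters.values() if any(labels))
    return correct / len(clusters)


findings = [
    ("P1", "issue-A", True), ("P2", "issue-A", True), ("P3", "issue-B", True),
    ("P4", "noise-C", False), ("P5", "noise-C", False), ("P6", "noise-C", False),
    ("P7", "noise-D", False), ("P8", "noise-D", False), ("P9", "noise-D", False),
]
print(round(finding_precision(findings), 3), cluster_precision(findings))
```

In this toy set, several properties flag the same underlying location, so collapsing them into clusters changes the denominator more than the numerator.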
RQ2: Comparison to Code-Driven Auditors
On the RepoAudit C/C++ benchmark, SPECA with Claude Sonnet 4.5 matches the best published precision (88.9%) at 100% recall and, uniquely, surfaces 12 author-validated, beyond-ground-truth (GT) bugs, with two externally confirmed at developer or maintainer level. In a controlled model-backbone comparison, SPECA maintains competitive cost-efficiency ($\sim\$1.69$/bug), demonstrating that property-centric auditing enables both parity with and differentiated capabilities over code-centric agents.
Figure 4: SPECA cost vs detection performance on RepoAudit. SPECA uniquely contributes externally validated, beyond-GT candidates while maintaining competitive precision and cost.
SPECA also produces explicit verdicts for disputed patterns, e.g., defensive-coding fixes with no reachable exploit path, and justifies these decisions by explicit traceable property analysis—an interpretability advantage over code-only tools.
Cross-Implementation Comparison and Robustness
By fixing the property vocabulary across clients, SPECA enables controlled cross-implementation comparison. The same properties are reused across multiple repositories, producing systematic differences in TP/FP distributions determined solely by implementation divergences from the specification.
Figure 5: Per-repository finding distributions—shared property vocabularies reveal implementation-specific security profiles, which code-local tools cannot produce.
This approach exposes structural misalignments between specification and individual implementations, providing actionable insights for protocol maintainers.
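A fixed property vocabulary makes per-repository profiles directly comparable. The sketch below tallies verdicts per repository and lists the properties on which implementations disagree. The function names, the verdict labels ("TP", "FP", "OK"), and the repository names are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# (repository, property id, verdict); toy data with hypothetical names.
Verdict = Tuple[str, str, str]


def profile_by_repo(verdicts: Iterable[Verdict]) -> Dict[str, Counter]:
    # Count verdict labels per repository.
    profiles: Dict[str, Counter] = {}
    for repo, _pid, label in verdicts:
        profiles.setdefault(repo, Counter())[label] += 1
    return profiles


def divergent_properties(verdicts: Iterable[Verdict]) -> Dict[str, Dict[str, str]]:
    # Properties whose verdict differs between repositories.
    by_prop: Dict[str, Dict[str, str]] = {}
    for repo, pid, label in verdicts:
        by_prop.setdefault(pid, {})[repo] = label
    return {pid: labels for pid, labels in by_prop.items()
            if len(set(labels.values())) > 1}


verdicts: List[Verdict] = [
    ("client_a", "P1", "OK"), ("client_a", "P2", "TP"),
    ("client_b", "P1", "OK"), ("client_b", "P2", "OK"),
]
profiles = profile_by_repo(verdicts)
diverging = divergent_properties(verdicts)
print(profiles, diverging)
```

Here property `P2` is violated in one client and satisfied in the other, which is the kind of implementation-specific divergence the shared vocabulary is meant to surface.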
Failure Mode Analysis
All false positives are mapped to one of three phase-localized root causes: trust boundary misclassification (specification vs code-level trust demarcation), code reading errors (incorrect code reasoning by the LLM), and specification misinterpretation (ambiguous or misread requirements). No opaque, unexplainable model failures persist post-analysis.
Figure 6: False-positive root-cause taxonomy—distinct origins localized to pipeline phases facilitate targeted improvements.
Notably, trust assumptions are found to be too noisy for reliable automated auditing but valuable for exploratory triage, while invariants and preconditions are the most productive property types, reaching 75% and 100% precision, respectively.
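The failure analysis above amounts to two tallies: precision per property type and a ranking of false-positive root causes. The following sketch shows one way to compute both. The three root-cause names follow the taxonomy in the text, while the record layout and the toy findings are assumptions and do not reproduce the paper's data.

```python
from collections import Counter
from enum import Enum
from typing import Dict, List, NamedTuple, Optional, Tuple


class RootCause(Enum):
    TRUST_BOUNDARY = "trust boundary misclassification"
    CODE_READING = "code reading error"
    SPEC_MISREAD = "specification misinterpretation"


class Finding(NamedTuple):
    kind: str                        # property type
    root_cause: Optional[RootCause]  # None marks a true positive


def precision_by_kind(findings: List[Finding]) -> Dict[str, float]:
    total, true_pos = Counter(), Counter()
    for f in findings:
        total[f.kind] += 1
        true_pos[f.kind] += f.root_cause is None
    return {kind: true_pos[kind] / total[kind] for kind in total}


def ranked_root_causes(findings: List[Finding]) -> List[Tuple[RootCause, int]]:
    # Most frequent false-positive cause first.
    causes = Counter(f.root_cause for f in findings if f.root_cause is not None)
    return causes.most_common()


findings = [
    Finding("invariant", None),
    Finding("invariant", None),
    Finding("invariant", RootCause.CODE_READING),
    Finding("precondition", None),
    Finding("trust_assumption", RootCause.TRUST_BOUNDARY),
    Finding("trust_assumption", RootCause.TRUST_BOUNDARY),
    Finding("trust_assumption", None),
]
by_kind = precision_by_kind(findings)
ranking = ranked_root_causes(findings)
print(by_kind, ranking)
```

Ranking the causes by count gives a simple priority order for which part of the pipeline to improve first.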
Implications and Future Research
SPECA establishes several key implications:
- Specification-as-Oracle: Automated audits using only code-level signals miss vulnerabilities rooted in specification-level requirements. Explicit, reusable, and typed property vocabularies extracted from specifications are necessary for comprehensive auditing in distributed systems.
- Interpretability and Auditable FPs: SPECA’s structured, phase-annotated error taxonomy transforms opaque LLM errors into ranked engineering targets, making pipeline improvements tractable and audit outcomes explainable and actionable.
- Model Capability and Coverage: As underlying LLMs increase in capability, property-generation quality becomes the binding constraint on vulnerability coverage, not model reasoning per se. Investments in improved natural-language-to-property extraction—especially for cryptographic invariants and protocol-lifecycle properties—will yield future gains.
- Cost Structure: Amortizing property-generation effort across implementations, together with a practical per-bug cost of $\sim\$30$ for H/M/L findings (amortized further at scale), places specification-anchored auditing well within operational feasibility for high-assurance applications.
Open problems include scalable, high-coverage, automated property extraction for mathematical and lifecycle invariants and generalized extension to protocol families beyond Ethereum.
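The cost-structure point can be made concrete with a simple amortization model. It assumes the structuring stage is a one-time cost shared by all implementations and the auditing stage is paid per implementation. The function names and the numbers are placeholders, not figures from the paper.

```python
def amortized_cost_per_implementation(
    structuring_cost: float, audit_cost: float, n_implementations: int
) -> float:
    # One-time structuring cost is split across implementations;
    # the audit cost is paid in full for each one.
    if n_implementations < 1:
        raise ValueError("need at least one implementation")
    return structuring_cost / n_implementations + audit_cost


def cost_per_bug(total_cost: float, confirmed_bugs: int) -> float:
    if confirmed_bugs < 1:
        raise ValueError("need at least one confirmed bug")
    return total_cost / confirmed_bugs


# Placeholder units: structuring costs 90, each audit costs 15.
single = amortized_cost_per_implementation(90.0, 15.0, 1)
ten = amortized_cost_per_implementation(90.0, 15.0, 10)
print(single, ten)
```

Under this model the per-implementation cost approaches the audit cost alone as the number of implementations grows, which is why multi-implementation protocols benefit most from a shared property set.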
Conclusion
SPECA demonstrates that specification-anchored, LLM-driven auditing is fundamentally more expressive, robust, and analyzable than code-centric approaches for multi-implementation distributed protocols. By encoding and verifying explicit properties from authoritative specifications, SPECA recovers all adjudicated vulnerabilities, discovers high-severity bugs missed by human auditors, and offers full traceability for both true and false positives. The framework’s formal attribution of findings shifts the principal research barrier to high-recall automated property generation. Future progress depends on advances in robust, specification-to-property NLP methods and their operational integration with high-capability LLMs. SPECA’s architectural principles, evaluation methodology, and empirical evidence provide a rigorous foundation for future tools and research in specification-governed security auditing.