
Invariant Catalog Mining Pipeline

Updated 3 January 2026
  • An invariant catalog mining pipeline is an automated workflow that extracts and categorizes the defensive invariants in Ethereum smart contracts whose violations cause transaction reverts.
  • It employs dynamic execution tracing, semantic embeddings using RavenBERT, and clustering techniques (DBSCAN, K-Means, HDBSCAN) to group invariant predicates.
  • The pipeline maps on-chain revert events to source code, enabling practical security applications like automated vulnerability detection and fuzzing oracle construction.

An invariant catalog mining pipeline is an automated workflow for discovering, embedding, clustering, and semantically categorizing defensive invariants responsible for transaction reverts in Ethereum smart contracts. By mining large-scale blockchain execution traces and aligning reverted transactions to the exact invariants in contract source code, the pipeline extracts and organizes guard conditions that encode successful on-chain defenses. Approaches such as the Raven framework operationalize this pipeline at scale, ultimately constructing catalogs of invariant categories that encapsulate core security patterns, including novel defenses absent from prior literature (Eshghie et al., 27 Dec 2025).

1. Data Acquisition and Invariant Extraction

The pipeline begins with extensive monitoring of on-chain Ethereum transactions, filtering for failure modes indicative of invariant enforcement (excluding out-of-gas or arithmetic errors). Dynamic execution tracing—using platforms such as the Tenderly API—captures stack frames up to the transaction revert point. The relevant Solidity source is disambiguated through binary offset mapping, allowing the extraction of the precise predicate in require(predicate), assert(predicate), or if (predicate) revert constructs. For each reverted transaction, attributes such as predicate string, error message, contract identifier, source file, and transaction metadata are recorded. From 20,000 sampled failures, this process yielded 12,222 revert-by-invariant events and 727 unique invariant predicates (Eshghie et al., 27 Dec 2025).
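
The extraction step can be illustrated with a small sketch. The helper below is hypothetical (the real pipeline locates the guard via the binary instruction offset of the reverting stack frame, not a text scan): it scans Solidity source text for require/assert guards, matches nested parentheses, and strips the optional error-message argument to recover the predicate string.

```python
import re

def extract_guard_predicates(source: str):
    """Extract predicate strings from require(...)/assert(...) guards.

    Simplified text scan; does not handle comments, string literals
    containing parentheses, or `if (p) revert` guards.
    """
    results = []
    for m in re.finditer(r"\b(require|assert)\s*\(", source):
        kind, start = m.group(1), m.end()
        depth, i = 1, start
        while i < len(source) and depth:          # match nested parentheses
            if source[i] == "(":
                depth += 1
            elif source[i] == ")":
                depth -= 1
            i += 1
        inner = source[start:i - 1]
        predicate = split_top_level(inner)[0].strip()   # drop error message
        results.append({"kind": kind, "predicate": predicate})
    return results

def split_top_level(s: str):
    """Split on commas that are not nested inside brackets."""
    parts, cur, depth = [], [], 0
    for ch in s:
        if ch in "([":
            depth += 1
        elif ch in ")]":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(cur))
            cur = []
        else:
            cur.append(ch)
    parts.append("".join(cur))
    return parts
```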

2. Alignment from Transaction Failures to Source Invariants

Accurate mapping of revert events to smart contract source code is essential. The pipeline aligns the final stack frame—providing the binary instruction offset—with verified contract sources via Etherscan. The predicate (logical condition) triggering the revert is extracted and deduplicated at the string level, resulting in a unique set of invariants that are actively enforced and observed to fail on-chain. This alignment stage distinguishes invariant predicates by their actual deployment context, as opposed to static or unused code artifacts.
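
The string-level deduplication step can be sketched as follows (field names are illustrative, not Raven's actual schema): whitespace is normalized so formatting variants of the same guard collapse to one catalog entry, while per-predicate event counts and the set of enforcing contracts are retained.

```python
import re
from collections import OrderedDict

def deduplicate_predicates(records):
    """Deduplicate revert records at the predicate-string level.

    Each record is assumed to carry a `predicate` string and a
    `contract` identifier (hypothetical field names).
    """
    catalog = OrderedDict()
    for rec in records:
        # Normalize whitespace so formatting variants collapse together.
        key = re.sub(r"\s+", " ", rec["predicate"]).strip()
        entry = catalog.setdefault(key, {"count": 0, "contracts": set()})
        entry["count"] += 1
        entry["contracts"].add(rec["contract"])
    return catalog
```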

3. Semantic Embedding of Invariant Predicates

The core of the pipeline is the semantic embedding of invariant predicates. Each unique predicate $i$ is encoded as a vector $\mathbf{e}_i \in \mathbb{R}^d$ using RavenBERT, a contrastively fine-tuned transformer (pretrained on SmartBERT-v2); formally, $\mathbf{e}_i = \mathrm{BERT}_\theta(\text{invariant}_i)$. The final embedding space is L2-normalized so that cosine similarity serves as the natural distance metric for subsequent clustering. Contrastive learning is employed during fine-tuning: predicate pairs with cosine similarity greater than 0.8 are labeled as positives, and the loss function pulls positives together while pushing negatives apart on the unit sphere. Lexical embeddings (TF–IDF, CodeBERT, baseline SmartBERT-v2) serve as non-semantic baselines but did not match the cohesion and interpretability achieved by the fine-tuned model (Eshghie et al., 27 Dec 2025).
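
The normalization and pair-labeling scheme can be sketched numerically (with arbitrary vectors standing in for RavenBERT outputs; the contrastive loss itself is not shown):

```python
import numpy as np

def l2_normalize(E):
    """Project embedding rows onto the unit sphere."""
    return E / np.linalg.norm(E, axis=1, keepdims=True)

def contrastive_pairs(E, threshold=0.8):
    """Label predicate pairs: cosine similarity > threshold => positive."""
    En = l2_normalize(E)
    S = En @ En.T        # after normalization, dot product = cosine similarity
    pairs = []
    for i in range(len(E)):
        for j in range(i + 1, len(E)):
            pairs.append((i, j, bool(S[i, j] > threshold)))
    return pairs
```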

4. Clustering and Catalog Construction

The pipeline performs clustering over the embedded invariant space to group semantically related predicates. Methods implemented include:

  • K-Means (partitioning by within-cluster variance minimization)
  • DBSCAN (density-based clustering; parameters: neighborhood radius ε and minimum samples)
  • HDBSCAN (hierarchical DBSCAN; parameter: minimum cluster size)
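
As a concrete illustration of the best-performing family, a minimal DBSCAN over L2-normalized embeddings with cosine distance might look like the sketch below (a simplified stand-in for the library implementations actually used, not Raven's code):

```python
import numpy as np

def dbscan_cosine(E, eps=0.1, min_samples=3):
    """Minimal DBSCAN with cosine distance; -1 marks noise."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)   # unit sphere
    D = 1.0 - En @ En.T                                 # cosine distance
    n = len(E)
    neighbors = [np.flatnonzero(D[i] <= eps) for i in range(n)]
    core = [len(nb) >= min_samples for nb in neighbors]
    labels = np.full(n, -1)
    cid = 0
    for i in range(n):
        if labels[i] != -1 or not core[i]:
            continue
        labels[i] = cid                  # start a new cluster at a core point
        queue = list(neighbors[i])
        while queue:                     # expand density-reachable points
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid
                if core[j]:
                    queue.extend(neighbors[j])
        cid += 1
    return labels
```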

Cluster validation employs multiple objective metrics:

  • Silhouette Coefficient $s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$: measures cohesion and separation; ranges from $-1$ to $1$.
  • S_Dbw: combines intra-cluster scatter and inter-cluster density overlap.
  • Coverage: proportion of unique invariants assigned to non-noise clusters.
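
The silhouette and coverage metrics can be computed directly from a precomputed distance matrix, as in the sketch below (S_Dbw is omitted for brevity):

```python
import numpy as np

def silhouette_scores(D, labels):
    """Per-point silhouette s(i) = (b(i) - a(i)) / max(a(i), b(i)).

    D is a precomputed distance matrix; points labeled -1 (noise) are
    skipped. Requires at least two non-noise clusters.
    """
    cluster_ids = [c for c in set(labels) if c != -1]
    scores = {}
    for i, li in enumerate(labels):
        if li == -1:
            continue
        same = [j for j, lj in enumerate(labels) if lj == li and j != i]
        if not same:                         # singleton cluster: s(i) := 0
            scores[i] = 0.0
            continue
        a = np.mean([D[i][j] for j in same])                 # cohesion
        b = min(np.mean([D[i][j] for j, lj in enumerate(labels) if lj == c])
                for c in cluster_ids if c != li)             # separation
        scores[i] = (b - a) / max(a, b)
    return scores

def coverage(labels):
    """Proportion of points assigned to non-noise clusters."""
    return sum(l != -1 for l in labels) / len(labels)
```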

Model selection involves a grid search over algorithm parameters and embedding/model configurations, with criteria (admissibility filters) ensuring cluster count between 8 and 100, coverage ≥ 50%, and run-to-run stability. The best configuration for intrinsic quality uses RavenBERT embeddings and DBSCAN on predicate-only representations, achieving Silhouette 0.93, S_Dbw 0.043, and coverage 51.9% (378/727 clustered invariants) (Eshghie et al., 27 Dec 2025).
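
The admissibility filtering and selection step can be sketched as follows (the metric values in the usage example are illustrative; the stability criterion is not modeled):

```python
def admissible(result, min_clusters=8, max_clusters=100, min_coverage=0.5):
    """Admissibility filter for one grid-search configuration's results.

    Run-to-run stability, the third criterion, is not modeled here.
    """
    return (min_clusters <= result["n_clusters"] <= max_clusters
            and result["coverage"] >= min_coverage)

def select_best(results):
    """Among admissible configurations, pick the highest silhouette."""
    candidates = [r for r in results if admissible(r)]
    return max(candidates, key=lambda r: r["silhouette"], default=None)
```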

5. Invariant Category Taxonomy and Novel Discoveries

Resultant clusters are manually labeled via expert review, yielding 19 distinct invariant categories. Six categories are novel with no previous catalog counterparts: Caller-Provided Slippage Thresholds, Feature Toggles, Replay Prevention, Proof/Signature Verification, Allow/Ban/White/Blacklist Gates, and Counters/Nonces. These account for 30.3% of revert-by-invariant events in the test set. Detailed examples and functional descriptions for each cluster illuminate the dominant defensive semantics in DeFi contracts.

| Cluster # | Category | Example Predicate |
|-----------|----------|-------------------|
| 2 | [novel] Caller-Provided Slippage Thresholds | amountOut ≥ amountOutMin |
| 6 | [novel] Feature Toggles | tradeEnabled == true |
| 7 | [novel] Replay Prevention | !usedClaims[claimLeaf] |
| 9 | [novel] Proof/Signature Verification | !MerkleProof.verify(proof, root, leaf) |
| 12 | [novel] Allow/Ban/White/Blacklist Gates | !isBlacklisted[msg.sender] |
| 14 | [novel] Counters/Nonces | allowed.nonce != nonce |

Manual examination confirmed ≥90% semantic homogeneity per cluster, and all clusters were found to be non-overlapping (Eshghie et al., 27 Dec 2025).

6. Downstream Security Applications

The categorization of invariants enables practical data-driven security analysis. Case studies have demonstrated direct pipeline-to-tooling translation. For example, the “Proof Verification” category facilitated the creation of an automated fuzzing oracle that detected the root cause of the 2022 Nomad Bridge exploit. The derived oracle enforces the postcondition $\forall m.\ \texttt{process}(m) \Rightarrow \texttt{proved}(m)$, which, when used in a Foundry-based fuzzing workflow, surfaced a two-step counterexample matching the real-world vulnerability scenario.
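
The shape of such an oracle can be illustrated with a toy model (names and mechanics below are hypothetical stand-ins, not the actual Nomad/Replica contract API; the real oracle was written as a Foundry fuzzing harness in Solidity). The model mirrors the shape of the Nomad bug: an uninitialized zero root is treated as accepted, so messages can be processed without ever being proved, violating the postcondition.

```python
class ToyBridge:
    """Toy message bridge for illustrating process(m) => proved(m)."""

    def __init__(self, buggy: bool):
        self.proved = set()
        self.processed = []
        # The bug: the zero root is pre-accepted in the buggy variant.
        self.accepted_roots = {0} if buggy else set()

    def prove(self, message) -> None:
        """Record a valid Merkle proof for `message` (proof details elided)."""
        self.accepted_roots.add(1)   # root 1 stands in for a real proven root
        self.proved.add(message)

    def process(self, message) -> bool:
        """Process a message whose root is accepted; unproven messages
        map to the zero root."""
        root = 1 if message in self.proved else 0
        if root in self.accepted_roots:
            self.processed.append(message)
            return True
        return False


def oracle_holds(bridge: ToyBridge) -> bool:
    """Fuzzing oracle: every processed message must have been proved."""
    return all(m in bridge.proved for m in bridge.processed)
```

In the buggy variant, a single process call on an unproven message already violates the oracle; a fuzzer checking `oracle_holds` after every call would surface that sequence as a counterexample.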

Beyond fuzzing, the invariant catalog can serve as a foundation for automated vulnerability detection, defense auditing, and mining of emergent on-chain security practice. The combination of semantic clustering and high-coverage invariant extraction provides an empirical basis for constructing security oracles and informally verified guard conditions (Eshghie et al., 27 Dec 2025).
