Cross-Chain Event-Level Datasets
- Cross-chain event-level datasets are systematically curated collections of granular blockchain events linked across multiple networks, enabling high-fidelity tracing and interoperability research.
- They employ standardized schemas—such as Datalog facts, semantic quintuples, and canonical flat tables—to ensure consistent event representation and effective data pairing.
- Robust extraction and preprocessing pipelines, including deduplication and temporal alignment, support scalable real-time analytics and forensic security assessments in DeFi.
A cross-chain event-level dataset is a systematically curated collection of granular blockchain events—such as token deposits, withdrawals, swaps, and contract state transitions—linked across multiple blockchain networks and interoperability protocols. These datasets, typically capturing bridge and DeFi protocol operations, encode structured information at the event (log) level, enabling high-fidelity tracing, security analysis, financial benchmarking, and decentralized systems research. Recent advances have produced open-source pipelines and standardized schemas for extracting, decoding, and pairing such events, supporting rigorous study of interoperability, MEV phenomena, and systemic risk.
1. Dataset Scope and Blockchain Coverage
Contemporary cross-chain event-level datasets typically span multiple chains and bridge protocols, with coverage reflected in both breadth (number of chains/bridges) and depth (temporal and event-type resolution). Example instances:
- XChainWatcher (Augusto et al., 2 Oct 2024): 81,000 cross-chain transactions (cctxs) across Ethereum, Moonbeam, and Ronin; includes Nomad and Ronin bridge events from 2021–2022, aggregating $4.2B in token transfers.
- ConneX (Liang et al., 3 Nov 2025): Over 500,000 paired transactions, spanning Stargate, DLN, Multichain, Celer, Poly bridges, and five major blockchains.
- Bunny Hops and Blockchain Stops (Mancino et al., 24 Oct 2025): a 2.4B-transaction dataset (DEX swaps, bridge deposits/withdrawals) spanning 12 chains including Ethereum, BSC, Polygon, Solana, Arbitrum, Optimism, Avalanche, Base, Gnosis, Near, Osmosis, and Blast; 45 distinct bridge protocols.
- Aave V3 Cross-Chain Event Dataset (Fan et al., 12 Dec 2025): 50M decoded lending protocol event records over Ethereum, Arbitrum, Optimism, Polygon, Avalanche, and Base.
- XChainDataGen (Augusto et al., 17 Mar 2025): 11,285,753 synthetic cctxs produced from raw events collected over 11 blockchains and five bridge protocols.
The temporal resolution is typically strict, often including full block ranges (with configurable finality assurance) and explicit time windows; practical acquisition strategies involve event log polling, Blockdaemon APIs, and JSON-RPC endpoints.
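The log-polling strategy described above can be sketched as the construction of a bounded `eth_getLogs` JSON-RPC request. This is a minimal illustration, not any specific pipeline's code; the contract address and event topic are placeholders.

```python
import json

def build_get_logs_request(address, from_block, to_block, topics=None, request_id=1):
    """Build a JSON-RPC payload for eth_getLogs over a bounded block range.

    Polling in bounded ranges keeps responses under provider limits and lets
    the caller stop the range short of the chain head to enforce a finality lag.
    """
    params = {
        "address": address,
        "fromBlock": hex(from_block),  # JSON-RPC expects hex quantities
        "toBlock": hex(to_block),
    }
    if topics:
        params["topics"] = topics
    return {"jsonrpc": "2.0", "method": "eth_getLogs",
            "params": [params], "id": request_id}

# Example: poll a hypothetical bridge contract for one event topic.
payload = build_get_logs_request(
    address="0x1111111111111111111111111111111111111111",
    from_block=18_000_000,
    to_block=18_000_999,
    topics=["0x" + "ab" * 32],  # stands in for keccak(event signature)
)
body = json.dumps(payload)
```

In practice the window width and the lag behind the chain head are tuned per chain, reflecting the configurable finality assurance noted above.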
2. Schema Design and Event Representation
Cross-chain event-level datasets organize raw blockchain events and linked transaction tuples using a standardized schema. Key approaches include:
- Datalog Fact Model (XChainWatcher): Each event is encoded as a Datalog fact, e.g., sc_deposit(tx_hash, chain_id, event_idx, from, to, amount), representing atomic operations (lock/burn, mint/unlock, withdrawal) and contracts' controlled addresses.
- Semantic Quintuple (ConneX): Each paired cross-chain transfer is distilled to a five-element tuple: amount, token, destination address, counterpart chain, and timestamp. The model supports high-fidelity semantic mapping via ABI-based field extraction and LLM-driven filter modules.
- Canonical Flat Schema (Bunny Hops): Event records capture block and transaction identifiers, contract/event types, participant addresses, input/output token/amounts, USD-normalized values, gas data, and raw log traces, providing compatibility across swaps and bridge events.
- CCTX Unified Table (XChainDataGen): Synthetic cross-chain transaction records (cctx_id, protocol, source/dest chain, tx_hashes, contract/event, participant addresses, token/amount pairs, timestamps, status) join paired deposits/burns and mints/unlocks.
- Aave V3 Lending Protocol Schema: Per-event tables capturing core lending events (Supply, Borrow, Withdraw, Repay, LiquidationCall, FlashLoan, ReserveDataUpdated, MintedToTreasury) with user/asset/amount fields, block/time metadata, and on-chain USD valuations via oracle lookups.
Absence of a corresponding fact or inability to match event pairs flags failed/mismatched attempts, supporting forensic anomaly detection.
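The fact-matching rule described above, where an absent counterpart fact flags a failed or mismatched attempt, can be sketched as follows. This is an illustrative simplification of the Datalog-style model; the record fields beyond those shown in the sc_deposit schema are assumptions.

```python
from collections import namedtuple

# Minimal fact shapes modeled on the sc_deposit(...) pattern from the text;
# the Withdrawal shape mirrors it for illustration.
Deposit = namedtuple("Deposit", "tx_hash chain_id event_idx sender recipient amount")
Withdrawal = namedtuple("Withdrawal", "tx_hash chain_id event_idx sender recipient amount")

def match_facts(deposits, withdrawals):
    """Pair deposits with withdrawals on (recipient, amount); deposits with
    no counterpart fact are flagged for forensic review."""
    pool = list(withdrawals)
    matched, unmatched = [], []
    for d in deposits:
        hit = next((w for w in pool
                    if (w.recipient, w.amount) == (d.recipient, d.amount)), None)
        if hit is not None:
            pool.remove(hit)
            matched.append((d, hit))
        else:
            unmatched.append(d)  # failed/mismatched attempt
    return matched, unmatched

deposits = [Deposit("0xaa", 1, 0, "alice", "bob", 100),
            Deposit("0xbb", 1, 3, "carol", "dave", 50)]
withdrawals = [Withdrawal("0xcc", 1284, 1, "bridge", "bob", 100)]
matched, unmatched = match_facts(deposits, withdrawals)
# the 100-token transfer pairs up; the 50-token deposit has no counterpart
```

Real pipelines match on bridge-specific identifiers rather than raw (recipient, amount) pairs, which this sketch uses only for brevity.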
3. Extraction, Pairing, and Preprocessing Pipelines
Robust pipeline architectures enable the reliable generation and validation of cross-chain event-level datasets:
- Extraction: Raw logs are fetched using RPC APIs (eth_getLogs), complemented by transaction receipts and block metadata. ABIs map encoded topics/data to structured events. Tools like XChainDataGen provide a CLI for extraction/generation and YAML-configurable module designs (Augusto et al., 17 Mar 2025).
- Preprocessing: Includes deduplication by transaction hash/log index, timestamp alignment, token decimal normalization, and USD valuation (via price oracle integration). Datasets maintain strict ordering by block number, timestamp, and log index, supporting deterministic reconstruction and reproducibility (Fan et al., 12 Dec 2025).
- Pairing: Synthetic cross-chain transaction tuples (CCTX) are constructed by linking source and destination chain events using bridge-specific IDs (depositId, messageId) and sanitation checks for participant, token, and amount fields. Semantic pairing of events (e.g., ConneX) employs LLM-based pruning to reduce the field search space from the full combinatorial set of candidate fields to a tractable number of mappings, with examiner modules validating value/field consistency (Liang et al., 3 Nov 2025).
- Latency and Finality: Configurable finality models (hard/soft) enforce the number of block confirmations before marking an event as final, trading off reorg safety vs. data completeness. Aggregators capture events straddling interval edges for buffer accuracy.
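The pairing and finality steps above can be combined into a single sketch: link source deposits to destination events by a shared bridge identifier, apply sanitation checks, and mark each pair final only once both sides are buried under enough confirmations. The dictionary field names (deposit_id, token, amount, block, tx) are illustrative, not any tool's actual schema.

```python
def confirmations(head_block, event_block):
    """Number of blocks mined on top of the event's block."""
    return head_block - event_block

def pair_cctxs(src_events, dst_events, min_confirmations, head_src, head_dst):
    """Link source and destination events by deposit_id, sanity-check
    token/amount agreement, and enforce a hard finality threshold."""
    by_id = {e["deposit_id"]: e for e in dst_events}
    cctxs = []
    for s in src_events:
        d = by_id.get(s["deposit_id"])
        if d is None:
            continue  # unmatched: candidate failed transfer
        if s["token"] != d["token"] or s["amount"] != d["amount"]:
            continue  # sanitation check failed
        final = (confirmations(head_src, s["block"]) >= min_confirmations
                 and confirmations(head_dst, d["block"]) >= min_confirmations)
        cctxs.append({"deposit_id": s["deposit_id"],
                      "src_tx": s["tx"], "dst_tx": d["tx"],
                      "status": "final" if final else "pending"})
    return cctxs

src = [{"deposit_id": 7, "token": "USDC", "amount": 1000, "block": 100, "tx": "0x01"}]
dst = [{"deposit_id": 7, "token": "USDC", "amount": 1000, "block": 40, "tx": "0x02"}]
cctxs = pair_cctxs(src, dst, min_confirmations=12, head_src=120, head_dst=60)
# both sides have 20 confirmations, so the pair is marked final
```

A soft-finality variant would emit the pair immediately with status "pending" and upgrade it on a later sweep, trading reorg safety for completeness as described above.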
4. Quantitative Metrics and Analytical Use Cases
Cross-chain event-level datasets enable diverse empirical and inferential analyses:
- Aggregate Metrics: Total transfer volume, average transaction size, lock–release ratio, and failure rate (Augusto et al., 2 Oct 2024, Augusto et al., 17 Mar 2025).
- Latency Analysis: Round-trip cross-chain latency computed as the difference between destination-chain and source-chain event timestamps (Augusto et al., 17 Mar 2025); event-level timings yield mean/variance statistics per protocol, e.g., for Nomad deposits and Ronin withdrawals (Augusto et al., 2 Oct 2024, Mancino et al., 24 Oct 2025).
- Visualization: Scatter plots (latency vs. value), time series of event counts, and histograms (locked balances) diagnose systematic inefficiencies and "abandoned" or stuck transfers (e.g., $4.8M locked in incomplete Ronin withdrawals) (Augusto et al., 2 Oct 2024).
- Security and Forensics: Datasets support anomaly detection (mismatched/failed attempts, reorg-induced splits), attack tracing (hack chain mapping as in ConneX Bybit/Upbit examples), and MEV arbitrage search (n-hop event graph path-finding, with profit computed as output value minus input value and execution fees) (Liang et al., 3 Nov 2025, Mancino et al., 24 Oct 2025).
- DeFi Risk Modelling: Lending datasets enable analyses of liquidation dynamics, user migration across chains, and systemic risk metrics (e.g., liquidation frequency, Herfindahl decomposition, simulated price-shock health factor recomputation) (Fan et al., 12 Dec 2025).
Access patterns exploit SQL, Python (pandas, pyarrow), and column-store formats (Parquet) for scalable computational querying.
5. Technical Challenges and Engineering Solutions
Dataset construction and operationalization present several technical obstacles:
- Clock Skew and Block Time Variability: Different chains feature distinct block times and timestamp accuracies. Solutions include normalizing all events to a global time key (e.g., UTC block timestamps), with small matching windows for event linkage (Mancino et al., 24 Oct 2025).
- Chain Reorgs and Deduplication: Reorganizations yield “zombie” logs; robust ingestion requires post-confirmation verification, configurable block delay, and periodic “finality sweeps” to repair affected cctx pairs (Augusto et al., 17 Mar 2025).
- Cross-Chain Latency and Bridge Finality: Bridge operations may require destination-side proofs; only observed withdrawals with confirmed status are linked, with modular configuration per protocol (Mancino et al., 24 Oct 2025).
- Semantic Pairing and Field Mapping: High-dimensional field matching is computationally intensive. LLM-based filtering and examiner validation modules address the combinatorial search, reducing the space of candidate field pairings to a small set of practical mappings (Liang et al., 3 Nov 2025).
- Extensibility and Modularization: Registry architectures (YAML, pluggable modules) facilitate new chain/bridge integrations; partitioning (by chain, date, event type) and open-source pipelines (Docker, Git-versioned configs) ensure reproducibility and performance at scale (Augusto et al., 17 Mar 2025).
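The deduplication and "finality sweep" mechanics described above can be sketched together: drop duplicate log deliveries keyed on (tx_hash, log_index), defer events still too shallow to be final, and discard "zombie" logs whose transactions were reorged out of the canonical chain. The canonical-hash set, field names, and thresholds are all illustrative assumptions.

```python
def dedupe_and_sweep(events, canonical_hashes, finality_depth, head_block):
    """Deduplicate logs by (tx_hash, log_index), keep only events buried
    deeper than finality_depth, and drop zombie logs absent from the
    canonical chain. canonical_hashes would come from re-querying
    finalized blocks during a periodic finality sweep."""
    seen, kept = set(), []
    for e in sorted(events, key=lambda e: (e["block"], e["log_index"])):
        key = (e["tx_hash"], e["log_index"])
        if key in seen:
            continue  # duplicate delivery from overlapping poll windows
        seen.add(key)
        if head_block - e["block"] < finality_depth:
            continue  # too shallow: defer to a later sweep
        if e["tx_hash"] not in canonical_hashes:
            continue  # reorged-out zombie log
        kept.append(e)
    return kept

events = [
    {"tx_hash": "0x1", "log_index": 0, "block": 100},
    {"tx_hash": "0x1", "log_index": 0, "block": 100},  # duplicate
    {"tx_hash": "0x2", "log_index": 1, "block": 101},  # reorged out
    {"tx_hash": "0x3", "log_index": 0, "block": 150},  # too shallow
]
kept = dedupe_and_sweep(events, canonical_hashes={"0x1", "0x3"},
                        finality_depth=12, head_block=155)
# only the first event survives all three filters
```

Partitioning the input by chain and date, as the registry architectures above suggest, lets this sweep run incrementally over only the affected partitions.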
6. Limitations, Extension Opportunities, and Research Directions
Current datasets manifest a number of recognized limitations:
- Bridge and Asset Scope: Many datasets focus on fungible (ERC-20, native) tokens; non-fungible assets (ERC-721), novel protocols, and advanced intent-based systems (e.g., Across) remain underrepresented (Augusto et al., 2 Oct 2024).
- Proof Semantics: Off-chain verification paradigms (fraud proofs, ZK-proofs) are often abstracted; formal integration of proof-verification logic would enrich security and audit capacity (Augusto et al., 2 Oct 2024).
- Aggregator and Internal Flow Coverage: Intermediary protocols (aggregators) can obscure transaction flows; deeper trace extraction and advanced block/transaction graph analysis are advised.
- Streaming and Real-Time Analytics: Expanding from backfill archival pipelines to real-time monitoring (Kafka, Flink, kdb+) is recommended for dynamic security and MEV detection (Mancino et al., 24 Oct 2025).
- User Experience and Behavioral Metrics: Datasets increasingly feed into user abandonment, capital migration, and cross-chain behavioral analysis, supporting DeFi, MEV, and protocol-level optimization (Fan et al., 12 Dec 2025).
Planned extensions include support for additional chains (Avalanche, BSC, Solana), enhanced asset standard coverage, and improved interface for empirical and simulation-based research.
7. Summary Table of Representative Dataset Properties
| Dataset / Tool | Chains/Bridges | Events / Volume | Access Format |
|---|---|---|---|
| XChainWatcher | ETH, GLMR, RON; Nomad/Ronin | 81,000 cctxs, $4.2B | CSV, JSONL, Parquet, Datalog (.facts) |
| ConneX | 5 bridges, 5 chains | 503,627 paired tx | Relational, Parquet, JSON |
| Bunny Hops MEV | 12 chains, 45 bridges | 2.4B tx, 34M bridge | Flat Table, Parquet, Partitioned Index |
| XChainDataGen | 11 chains, 5 bridges | 11M cctxs, $28B | SQL/SQLite, Parquet, Custom CLI |
| Aave V3 Analytics | 6 chains (EVM) | 50M lending events | CSV shards, Chron. JSON index |
These datasets provide the empirical substrate for computational finance, blockchain interoperability, DeFi systemic risk research, and cross-chain security auditing at scale.