Bitcoin Transaction Data 2020-2024 Analysis

Updated 12 September 2025

Bitcoin Transaction Data (2020–2024) is a comprehensive on-chain dataset comprising billions of records and intricate graph structures for mapping economic flows.
The data is processed using custom parsers and graph databases to enable temporal analytics, behavioral modeling, and fraud detection with high accuracy.
It reveals key trends in wealth distribution, fee dynamics, and protocol scalability, providing actionable insights for forensic, econometric, and network analysis.

Bitcoin Transaction Data (2020–2024) refers to the aggregate on-chain information derived from all economic and protocol activities within the Bitcoin network over this period. The dataset encompasses billions of transaction records, transfer edges, and address or entity behaviors—each traceable to cryptographically signed block entries. Researchers leverage this granular, append-only structure to conduct network analytics, behavioral modeling, fraud detection, economic analysis, and scalability profiling, enabled by Bitcoin’s public permissionless ledger (McGinn et al., 2018).

1. Structure and Graph Representation

Bitcoin’s transaction data is inherently complex, consisting of binary block files lacking standardized keys and containing variable-length structures. Conversion of this raw binary into a high-fidelity graph model facilitates rigorous analysis of network relationships and economic flows (McGinn et al., 2018). Common practice is to employ a custom C++ parser to deserialize sequential block files and extract all inputs, outputs, and relevant metadata, subsequently storing these elements in a graph database (e.g., Neo4j).

Vertices in the transaction graph correspond to blocks, transactions, inputs, outputs, and addresses, while edges encode their relationships:

Block → transaction (containment)
Transaction input → previous output (spending linkage)
Output → address (recipient)
Input → address (spender)

Recent datasets, notably that of Schnoering et al. (15 Nov 2024), encode more than 250 million nodes and 785 million edges, covering up to 670 million transactions. Each node and edge carries timestamp and value attributes, which supports temporal queries and studies of network evolution.
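
As a minimal sketch of the vertex and edge types listed above, the following Python builds a toy in-memory graph. The `TxGraph` class, the node identifiers, and the relationship labels (`CONTAINS`, `SPENDS`, and so on) are hypothetical names chosen for illustration; they are not the schema of any cited dataset, and a real pipeline would write to a graph database rather than a dictionary.

```python
from collections import defaultdict


class TxGraph:
    """Toy property graph: typed nodes plus labelled, directed edges."""

    def __init__(self):
        self.nodes = {}                     # node id -> attribute dict
        self.edges = defaultdict(list)      # source id -> [(label, target id)]

    def add_node(self, node_id, kind, **attrs):
        self.nodes[node_id] = {"kind": kind, **attrs}

    def add_edge(self, src, label, dst):
        self.edges[src].append((label, dst))

    def neighbors(self, src, label):
        return [dst for lbl, dst in self.edges[src] if lbl == label]


g = TxGraph()

# Block 1 contains a transaction with one output locked to address A.
g.add_node("block:1", "block", timestamp=1_600_000_000)
g.add_node("tx:a", "transaction")
g.add_node("out:a:0", "output", value=50)
g.add_node("addr:A", "address")
g.add_edge("block:1", "CONTAINS", "tx:a")       # block -> transaction
g.add_edge("tx:a", "HAS_OUTPUT", "out:a:0")
g.add_edge("out:a:0", "LOCKED_TO", "addr:A")    # output -> address (recipient)

# Block 2 contains a transaction whose input spends that output.
g.add_node("block:2", "block", timestamp=1_600_000_600)
g.add_node("tx:b", "transaction")
g.add_node("in:b:0", "input")
g.add_edge("block:2", "CONTAINS", "tx:b")
g.add_edge("tx:b", "HAS_INPUT", "in:b:0")
g.add_edge("in:b:0", "SPENDS", "out:a:0")       # input -> previous output
g.add_edge("in:b:0", "FROM_ADDRESS", "addr:A")  # input -> address (spender)

print(g.neighbors("in:b:0", "SPENDS"))  # ['out:a:0']
```

Following `SPENDS` edges backwards from any input is what lets an analyst trace value across blocks.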

2. Transaction Patterns and Behavioral Analytics

Edge-weighted adjacency matrices constructed from transaction graphs reveal repeated, temporally localized patterns (McGinn et al., 2018). The observed “DNA sequence” structures—where a block collects or distributes values along horizontal/vertical strands—permit the linkage of user activities across blocks, highlighting behavioral “fingerprints” even under pseudonymity.

Address behavior datasets such as BABD-13 (Xiang et al., 2022) categorize addresses into 13 types (e.g., Blackmail, Exchange, Ponzi, Mining Pool, and Tumbler) using pure amount, pure degree, pure time, and combination indicators, further enriched by local structural features (clustering, betweenness, PageRank). K-hop subgraph construction enables targeted analysis of the transactional neighborhood around a specific address.
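
K-hop subgraph construction can be sketched as a bounded breadth-first search. The example below is illustrative only: the edge list is made up, the function name `k_hop_nodes` is hypothetical, and treating edges as undirected is an assumption rather than the cited dataset's procedure.

```python
from collections import deque


def k_hop_nodes(adjacency, start, k):
    """Return {node: hop distance} for every node within k hops of `start`."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == k:
            continue                      # do not expand beyond k hops
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                queue.append(nxt)
    return seen


# Hypothetical address-to-address edges, treated as undirected here.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "E")]
adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)

two_hop = k_hop_nodes(adjacency, "A", 2)
print(sorted(two_hop))  # ['A', 'B', 'C', 'E']
```

Node `D` is three hops from `A`, so it falls outside the 2-hop neighborhood.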

Forecasting models such as DLForecast (Wei et al., 2020) exploit both time-decaying reachability and weighted transaction pattern graphs, learning dynamic node embeddings via random walk and skip-gram methods. This enables the prediction of future transaction edges with accuracy improvements of over 50% compared to static baselines.

3. Wealth Distribution, Dynamics, and Decentralization

The distribution of Bitcoin balances among users or entities has been shown to follow a log-normal form, diverging from pure power-law models (Zhang et al., 16 Sep 2024). Deviations from Gibrat’s proportional growth law are evident, with exponent parameters in drift and variance terms distinctly not equal to unity. Analysis divides users into “poor” (small initial balance, rapid accumulation followed by complete sell-off) and “wealthy” (large initial balance, gradual, incomplete liquidation). Regression models introduce balance-specific exponents and time-dependent drift parameters to capture these heterogeneous behaviors.

Studies of network-level asset dispersion (Cheng et al., 19 Nov 2024, Venturini et al., 20 Jan 2025) employ centrality measures (betweenness, closeness, in-degree, PageRank) and concentration indices (Herfindahl-Hirschman Index, custom decentralization degree formulas) to quantify wealth and network centralization. Findings indicate that by 2020–2024 the system has entered a mature phase: the top 1% of addresses retain roughly 50% of network flow, and ranking stability (measured by the Spearman coefficient) is high, especially in the top tiers. While the distribution curve has flattened, large players such as exchanges and pools remain prominent, and concentration ratios stay elevated.
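
The two concentration measures mentioned above are simple to compute. The sketch below uses the standard definition of the Herfindahl-Hirschman Index (sum of squared shares) and a top-fraction share; the flow values are invented for illustration, and the custom decentralization formulas from the cited studies are not reproduced.

```python
def herfindahl_hirschman(values):
    """HHI = sum of squared shares; 1/n for equal shares, 1.0 for a monopoly."""
    total = sum(values)
    return sum((v / total) ** 2 for v in values)


def top_share(values, fraction):
    """Share of the total held by the top `fraction` of holders (at least one)."""
    ranked = sorted(values, reverse=True)
    k = max(1, int(len(ranked) * fraction))
    return sum(ranked[:k]) / sum(ranked)


# Hypothetical flows: one dominant entity and 99 small ones.
flows = [100.0] + [1.0] * 99

print(round(herfindahl_hirschman(flows), 4))
print(round(top_share(flows, 0.01), 4))
```

In this toy example the single largest entity (the top 1%) handles just over half of the total flow, which mirrors the order of magnitude reported in the text.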

4. Statistical Properties and Econometric Insights

The velocity of Bitcoin circulation is empirically estimated via “dwell time”, the weighted age of consumed outputs (McGinn et al., 2018). Dwell time remains nearly constant over the years, implying stable velocity even as halving events slow the growth of the monetary supply. Under the quantity theory of money ($MV = PT$), if $M$ (monetary supply) grows more slowly than $T$ (transaction volume) while $V$ (velocity) is stable, the price level $P = MV/T$ should fall, consistent with observations of increasing purchasing power.
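
A dwell-time calculation can be sketched as follows. The text specifies only a "weighted age"; weighting each spent output by its value is an assumption made here, and the tuples are invented data.

```python
def dwell_time(spent_outputs):
    """Value-weighted mean age (spend time minus creation time) of spent outputs."""
    total_value = sum(value for value, _, _ in spent_outputs)
    weighted_age = sum(value * (spent - created)
                       for value, created, spent in spent_outputs)
    return weighted_age / total_value


# Hypothetical (value, created_at, spent_at) tuples; times in days.
spent_outputs = [
    (1.0, 0, 10),   # 1 unit held for 10 days
    (3.0, 5, 7),    # 3 units held for 2 days
]
print(dwell_time(spent_outputs))  # (1*10 + 3*2) / 4 = 4.0
```

Because large outputs dominate the average, a short dwell time indicates that most value, not merely most outputs, is moving quickly.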

Cohort analysis partitions the spent and unspent output sets into daily birth, death, and age cohorts (Liu et al., 2021). Weighted average lifespan (WAL) of spent outputs peaks during market stress, while more than 80% of outputs are spent within a day, reflecting high turnover and a dominant medium-of-exchange role. In parallel, long-held unspent outputs indicate a store-of-value function, and daily cohort updating facilitates efficient extension of analysis into future blocks.
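
A birth-cohort turnover statistic of the kind described above might be computed as in this sketch. The record layout, the function name, and the one-day threshold expressed in seconds are illustrative assumptions, not the cited paper's implementation.

```python
from collections import defaultdict

SECONDS_PER_DAY = 86_400


def birth_cohort_turnover(outputs):
    """Group outputs by creation day; return the share spent within one day.

    Each output is (created_ts, spent_ts or None); None means still unspent.
    """
    cohorts = defaultdict(lambda: [0, 0])   # day -> [born, spent within a day]
    for created, spent in outputs:
        day = created // SECONDS_PER_DAY
        cohorts[day][0] += 1
        if spent is not None and spent - created < SECONDS_PER_DAY:
            cohorts[day][1] += 1
    return {day: quick / born for day, (born, quick) in cohorts.items()}


# Hypothetical outputs; timestamps in seconds since an arbitrary epoch.
outputs = [
    (100, 4_000),        # day 0, spent within about an hour
    (200, 90_000),       # day 0, spent after more than a day
    (50_000, None),      # day 0, never spent
    (86_500, 86_900),    # day 1, spent quickly
]
turnover = birth_cohort_turnover(outputs)
print(turnover)
```

Because each day's cohort is independent, new blocks only require appending or updating the most recent cohorts rather than recomputing the full history.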

5. Criminal Activity Detection and Forensics

Robust detection of abnormal addresses or illicit activity depends on high-dimensional feature engineering. Moment-based time features, network statistics, and local subgraph topology improve classifier performance in address type prediction—LightGBM yields Micro-F1 scores up to 87% (Lin et al., 2019) and XGBoost up to 97.13% (Xiang et al., 2022).

Graph-based fraud analysis using the Elliptic++ dataset (Elmougy et al., 2023) constructs four interlinked graph types: transaction-to-transaction, address-to-address, address-transaction, and user entity graphs. Random Forests with ensemble refinement yield precision near 99% for transactions and over 92% for address-level actor detection. Temporal splitting of data ensures generalization; feature importance analysis and subgraph visualization support explainable anti-money laundering and anomaly investigations.
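
Temporal splitting can be sketched as sorting labelled records by time and cutting once, so that no test record is earlier than any training record. The field name `time_step`, the labels, and the split fraction below are assumptions for illustration, not the cited dataset's exact protocol.

```python
def temporal_split(records, train_fraction=0.7):
    """Split records in time order: earliest go to training, latest to testing."""
    ordered = sorted(records, key=lambda r: r["time_step"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Hypothetical labelled transactions.
records = [
    {"tx": "t3", "time_step": 3, "label": "licit"},
    {"tx": "t1", "time_step": 1, "label": "illicit"},
    {"tx": "t4", "time_step": 4, "label": "licit"},
    {"tx": "t2", "time_step": 2, "label": "licit"},
]
train, test = temporal_split(records, train_fraction=0.5)
print([r["tx"] for r in train], [r["tx"] for r in test])  # ['t1', 't2'] ['t3', 't4']
```

A random split would let the classifier see transactions from the same period it is tested on, which tends to overstate how well it generalizes to future activity.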

6. Scalability, Fee Estimation, and Protocol Features

Transaction size estimation libraries, notably libtxsize (Hofmann, 2021), analytically model the byte, vbyte, and weight requirements for all input, output, and witness types, supporting legacy, SegWit, and Taproot formats (the last still anticipated when the library was published). Empirical validation demonstrates close alignment between modeled and observed sizes, and formulas such as $W_\text{size} = 72.5m + 34n + 6$ (for multisig SegWit) or 66 bytes for Taproot key-path provide inputs for block simulation models.
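
The sketch below evaluates the multisig formula quoted above and converts a size estimate into weight units and virtual bytes using the BIP 141 rules (non-witness bytes count four weight units, witness bytes one; virtual size is weight divided by four, rounded up). Reading $W_\text{size}$ as witness bytes is an assumption, the function names are hypothetical and are not the libtxsize API, and the 100-byte non-witness figure is invented.

```python
import math


def multisig_witness_size(m, n):
    """Size estimate for m-of-n SegWit multisig, using the formula quoted in the text."""
    return 72.5 * m + 34 * n + 6


def weight_units(non_witness_bytes, witness_bytes):
    """BIP 141: non-witness bytes weigh four units each, witness bytes one."""
    return 4 * non_witness_bytes + witness_bytes


def virtual_size(weight):
    """BIP 141: virtual size is weight / 4, rounded up."""
    return math.ceil(weight / 4)


witness = multisig_witness_size(2, 3)   # 72.5*2 + 34*3 + 6 = 253.0
weight = weight_units(non_witness_bytes=100, witness_bytes=witness)  # 100 is made up
print(witness, weight, virtual_size(weight))  # 253.0 653.0 164
```

The fractional coefficient reflects an average over variable-length signatures, so the result is an expected size rather than an exact one.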

Block size, transaction count, and average fee per transaction vary over time, with clear weekday/weekend effects and sensitivity to mempool congestion (Gebraselase et al., 2021). ARIMA/NARX models can predict aggregate block properties with low MAE/RMSE, but confirmation times and inter-block generation intervals follow exponential (memoryless) distributions, which limits how well they can be predicted. Miner classification is feasible for unique pools (e.g., F2Pool) via boosted decision trees.
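
The memoryless property is the reason elapsed waiting time carries no predictive information. The sketch below checks it numerically for an exponential distribution; the 600-second mean is Bitcoin's target block interval and is an assumption added here, not a figure from the cited study.

```python
import math

MEAN_INTERVAL = 600.0   # seconds; assumed mean, equal to the 10-minute target


def survival(t, mean=MEAN_INTERVAL):
    """P(T > t) for an exponential waiting time with the given mean."""
    return math.exp(-t / mean)


def conditional_survival(waited, extra, mean=MEAN_INTERVAL):
    """P(T > waited + extra | T > waited)."""
    return survival(waited + extra, mean) / survival(waited, mean)


# Having already waited 20 minutes does not change the odds for the next 10.
fresh = survival(600)
after_wait = conditional_survival(waited=1200, extra=600)
print(round(fresh, 4), round(after_wait, 4))  # 0.3679 0.3679
```

Both probabilities equal $e^{-1}$, so a model that conditions on how long the current interval has lasted gains nothing.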

7. Comparative Data Models and Systemic Security

Bitcoin’s public permissionless model distinguishes itself from private permissioned ledgers by enabling unrestricted read-write access, cryptographic and economic consensus, and native auditability (McGinn et al., 2018). The open data architecture underpins coordinated community defenses against DoS and transaction-spam anomalies—atypical structures (worm signatures, high out-degree/in-degree nodes) are detected via graph-based anomaly metrics.
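
One simple graph-based anomaly signal is an unusually high in-degree or out-degree. The sketch below flags nodes against a fixed threshold; the threshold, the edge list, and the function name are illustrative assumptions, since the text does not specify the exact metrics used.

```python
from collections import Counter


def high_degree_nodes(edges, threshold):
    """Return {node: (in_degree, out_degree)} where either degree >= threshold."""
    out_degree = Counter(src for src, _ in edges)
    in_degree = Counter(dst for _, dst in edges)
    flagged = {}
    for node in set(out_degree) | set(in_degree):
        if out_degree[node] >= threshold or in_degree[node] >= threshold:
            flagged[node] = (in_degree[node], out_degree[node])
    return flagged


# Hypothetical edges: "hub" fans out to five addresses; the rest are ordinary.
edges = [("hub", f"a{i}") for i in range(5)] + [("a0", "a1"), ("a1", "a2")]
print(high_degree_nodes(edges, threshold=5))  # {'hub': (0, 5)}
```

In practice a threshold would be set relative to the degree distribution of the period under study rather than fixed in advance.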

Scaling remains inherently challenging: every participant must validate all transactions across the network, and block size limits cap raw throughput. Permissioned chains can optimize performance and confidentiality at the expense of analytical transparency and forensic capability.

Conclusion

Between 2020 and 2024, Bitcoin transaction data exhibits:

Persistent centralization and log-normal wealth distributions with heterogeneous behavioral regimes.
Stable, mature network metrics alongside heavy-tailed degree distributions and core-periphery structural patterns.
Reproducible, open datasets and high-fidelity graph models enabling forensic, econometric, and scalability analyses.
Enduring trade-offs between transparency, scalability, and confidentiality in system design.

These findings drive ongoing research in blockchain analytics, institutional adoption, forensic investigation, and the long-term sustainability and decentralization of the Bitcoin ecosystem.