Anti-Money Laundering in Bitcoin: Experimenting with Graph Convolutional Networks for Financial Forensics (1908.02591v1)

Published 31 Jul 2019 in cs.SI, cs.CY, cs.LG, and q-fin.GN

Abstract: Anti-money laundering (AML) regulations play a critical role in safeguarding financial systems, but bear high costs for institutions and drive financial exclusion for those on the socioeconomic and international margins. The advent of cryptocurrency has introduced an intriguing paradox: pseudonymity allows criminals to hide in plain sight, but open data gives more power to investigators and enables the crowdsourcing of forensic analysis. Meanwhile advances in learning algorithms show great promise for the AML toolkit. In this workshop tutorial, we motivate the opportunity to reconcile the cause of safety with that of financial inclusion. We contribute the Elliptic Data Set, a time series graph of over 200K Bitcoin transactions (nodes), 234K directed payment flows (edges), and 166 node features, including ones based on non-public data; to our knowledge, this is the largest labelled transaction data set publicly available in any cryptocurrency. We share results from a binary classification task predicting illicit transactions using variations of Logistic Regression (LR), Random Forest (RF), Multilayer Perceptrons (MLP), and Graph Convolutional Networks (GCN), with GCN being of special interest as an emergent new method for capturing relational information. The results show the superiority of Random Forest (RF), but also invite algorithmic work to combine the respective powers of RF and graph methods. Lastly, we consider visualization for analysis and explainability, which is difficult given the size and dynamism of real-world transaction graphs, and we offer a simple prototype capable of navigating the graph and observing model performance on illicit activity over time. With this tutorial and data set, we hope to a) invite feedback in support of our ongoing inquiry, and b) inspire others to work on this societally important challenge.

Summary

The paper introduces the Elliptic Data Set, a large labeled graph of Bitcoin transactions, and uses it to experiment with Graph Convolutional Networks for AML.
It benchmarks several models, showing that Random Forest achieves the highest F1 score while GCNs capture relational information from the transaction graph.
The findings point to complementary strengths in ensemble and graph-based methods and invite work on combining them for financial forensics.

Anti-Money Laundering in Bitcoin: Utilization of Graph Convolutional Networks

This paper investigates machine learning models, and Graph Convolutional Networks (GCNs) in particular, for Anti-Money Laundering (AML) in cryptocurrency, focusing on Bitcoin transaction analysis. It frames the work around two goals: curbing financial crime that exploits cryptocurrency pseudonymity, and improving financial inclusion for marginalized groups.

Introduction to the Problem

The authors frame AML as a tension between guarding against illicit financial activity and promoting financial inclusion. Traditional AML regulations help safeguard the financial system, yet they impose significant compliance costs on institutions and inadvertently exclude socioeconomically disadvantaged groups from it.

Bitcoin, as a pseudonymous system, is a double-edged sword: pseudonymity lets criminals hide in plain sight, while the open transaction data gives investigators more power and enables crowdsourced forensic analysis.

The Elliptic Data Set

To tackle these challenges, the authors introduce the Elliptic Data Set, a time series graph of over 200,000 Bitcoin transactions (nodes), 234K directed payment flows (edges), and 166 node features. To the authors' knowledge, it is the largest labelled transaction data set publicly available in any cryptocurrency. It supports the development of machine learning models that distinguish licit from illicit transactions using features derived from transaction data.
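To make the structure concrete, here is a minimal Python sketch of how such a time series transaction graph could be held in memory. It is illustrative only: the class names, the `tx_id` values, the two-element feature vectors, and the use of `None` for an unlabelled transaction are all assumptions, not the data set's actual schema or file format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Transaction:
    tx_id: str                    # hypothetical identifier, not the real ID format
    time_step: int                # position of the transaction in the time series
    features: List[float]         # per-node feature vector (166 values in the data set)
    label: Optional[str] = None   # "licit", "illicit", or None if unlabelled (assumed)


@dataclass
class TransactionGraph:
    nodes: Dict[str, Transaction] = field(default_factory=dict)
    edges: List[Tuple[str, str]] = field(default_factory=list)

    def add_transaction(self, tx: Transaction) -> None:
        self.nodes[tx.tx_id] = tx

    def add_flow(self, src: str, dst: str) -> None:
        # A directed payment flow from one transaction to another.
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added before the edge")
        self.edges.append((src, dst))

    def labelled(self) -> List[Transaction]:
        # Only labelled transactions can be used for supervised training.
        return [tx for tx in self.nodes.values() if tx.label is not None]


# Toy example with made-up values.
graph = TransactionGraph()
graph.add_transaction(Transaction("tx_a", time_step=1, features=[0.2, 1.5], label="illicit"))
graph.add_transaction(Transaction("tx_b", time_step=1, features=[0.9, 0.1]))
graph.add_flow("tx_a", "tx_b")
```

Flat models such as LR or RF would consume only the `features` of each node, whereas graph methods also use `edges`.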

Methodologies

The paper benchmarks various machine learning techniques for predicting illicit activities in Bitcoin transactions. The approaches include Logistic Regression (LR), Multilayer Perceptrons (MLP), Random Forest (RF), and Graph Convolutional Networks (GCNs).

Random Forest and Logistic Regression: RF achieved the highest performance, likely because ensemble learning is robust at modeling complex decision boundaries. LR served as a baseline against which the more sophisticated models were compared.
Graph Convolutional Networks: GCNs exploit the graph structure inherent in blockchain transaction data, extracting relational information that flat, feature-based models cannot access. The paper found GCN performance competitive, although RF outperformed it on several metrics.
Temporal Extensions - EvolveGCN: Because transaction data is temporal, the paper also evaluates EvolveGCN, which uses recurrent neural networks to model how the graph changes over time. It showed a slight advantage over the static GCN.
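The graph convolution described above can be sketched in a few lines. The example below assumes the widely used formulation $H' = \mathrm{ReLU}(\hat{D}^{-1/2}(A + I)\hat{D}^{-1/2} H W)$, where $A$ is the adjacency matrix, $\hat{D}$ is the degree matrix of $A + I$, $H$ holds the node features, and $W$ is a learned weight matrix. It treats edges as undirected and uses a toy three-node graph with a fixed weight. These choices are assumptions made for illustration and are not taken from the paper's implementation.

```python
import math


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def gcn_layer(adj, feats, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 H W)."""
    n = len(adj)
    # Add self-loops so each node keeps part of its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    # Node degrees in the self-looped graph.
    deg = [sum(row) for row in a_hat]
    # Symmetric normalisation of the adjacency matrix.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    # Aggregate neighbour features, apply the linear map, then ReLU.
    out = matmul(matmul(norm, feats), weight)
    return [[max(0.0, v) for v in row] for row in out]


# Toy path graph 0 - 1 - 2; only node 0 carries a non-zero feature.
adj = [[0.0, 1.0, 0.0],
       [1.0, 0.0, 1.0],
       [0.0, 1.0, 0.0]]
feats = [[1.0], [0.0], [0.0]]
weight = [[1.0]]  # fixed for illustration; in practice this is learned

hidden = gcn_layer(adj, feats, weight)
```

After one layer, node 1 has picked up signal from node 0, while node 2, two hops away, is still zero. Stacking layers widens the neighbourhood each node can see, which is the relational information a flat model lacks.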

Results

In the experiments, Random Forest achieved the highest $F_1$ score for illicit transaction detection. Augmenting the input features with graph embeddings also improved performance, which suggests that the relational information captured by the GCN complements the flat features. The paper also highlights a robustness problem: model performance deteriorates after significant network events, such as the shutdown of a major illicit operation.
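For reference, the $F_1$ score for the illicit class is the harmonic mean of precision and recall, $F_1 = 2PR / (P + R)$. The sketch below computes it in plain Python. The function name and the toy labels are made up for illustration and are not results from the paper.

```python
def f1_for_class(y_true, y_pred, positive="illicit"):
    """F1 score treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy labels: two true positives, one false positive, one false negative.
y_true = ["illicit", "licit", "illicit", "licit", "illicit"]
y_pred = ["illicit", "illicit", "licit", "licit", "illicit"]
score = f1_for_class(y_true, y_pred)  # precision = recall = 2/3, so F1 = 2/3
```

Scoring the illicit class alone matters when illicit transactions are a small minority, since overall accuracy can look high even when most of them are missed.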

Implications and Future Directions

The findings emphasize the complementary strengths of RF and GCN methodologies, suggesting potential avenues for integrating these approaches to harness their respective advantages in AML systems. Future investigations could explore methods for effective post-event model retraining and potential architectural innovations combining ensemble methods with graph-based deep learning.

The provision of the Elliptic Data Set to the wider research community is a pivotal contribution, facilitating further exploration and development of robust AML strategies within the domain of financial forensics. The use of visualization tools like Chronograph also underscores the crucial role of explainability in compliance and law enforcement settings, balancing algorithmic transparency with model complexity.

Overall, the paper is an early effort to connect advances in machine learning with practical needs in financial security, addressing the multi-faceted challenges of AML in the increasingly complex landscape of cryptocurrencies.
