SCARFF: a Scalable Framework for Streaming Credit Card Fraud Detection with Spark (1709.08920v1)

Published 26 Sep 2017 in cs.DC

Abstract: The expansion of electronic commerce, together with the increasing confidence of customers in electronic payments, makes fraud detection a critical factor. Detecting fraud in a (nearly) real-time setting demands the design and implementation of scalable learning techniques able to ingest and analyse massive amounts of streaming data. Recent advances in analytics and the availability of open source solutions for Big Data storage and processing open new perspectives for the fraud detection field. In this paper we present a SCAlable Real-time Fraud Finder (SCARFF), which integrates Big Data tools (Kafka, Spark and Cassandra) with a machine learning approach that deals with imbalance, nonstationarity and feedback latency. Experimental results on a massive dataset of real credit card transactions show that this framework is scalable, efficient and accurate over a big stream of transactions.

Authors (6)

Fabrizio Carcillo
Andrea Dal Pozzolo
Yann-Aël Le Borgne
Olivier Caelen
Yannis Mazzer
Gianluca Bontempi

Citations (172)

Summary

Overview of SCARFF: A Scalable Streaming Credit Card Fraud Detection Framework

The paper introduces SCARFF, a framework for near-real-time credit card fraud detection on streaming data. It combines Apache Kafka, Spark, and Cassandra into a scalable infrastructure capable of ingesting, processing, and analyzing large volumes of streaming transaction data. Its machine learning component addresses three challenges inherent in fraud detection systems: class imbalance, concept drift (nonstationarity), and feedback latency.

Key Contributions

The paper's main contributions can be summarized as follows:

Integration of Big Data Tools: The paper integrates tools from the Apache ecosystem so that data ingestion, processing, feature engineering, and classification run in a single architecture. The pipeline relies on Spark Streaming, which processes the stream in mini-batches, so the batch duration has to be chosen to keep processing within the latency target.
Scalable Learning Approach: A core element of SCARFF is its scalable, distributed machine learning component, which handles nonstationary settings and class imbalance. A Feedback Random Forest classifier is used alongside a Delayed ensemble of Balanced Random Trees, so that the system can cope with feedback latency by also learning from labels that arrive late.
Feature Engineering: The framework performs online feature engineering, enriching incoming transactions with statistics derived from historical transaction data using MapReduce-style operations, which helps expose fraud patterns that develop over time.
Scalability and Performance: Evaluations on a dataset of over 8 million transactions test the framework's ability to sustain high input rates under realistic conditions while remaining stable and computationally efficient.
Implementation and Reproducibility: The complete workflow is available as a Docker container, which supports reproducibility and further research and development.
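
As an illustration of the online feature engineering step listed above, the following is a minimal plain-Python sketch, not the paper's Spark code. It assumes that history is aggregated per card over a fixed time window; the `Transaction` fields, the 7-day window, and the feature names are hypothetical choices made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    card_id: str   # hypothetical card identifier
    amount: float
    day: int       # hypothetical integer day index


def card_aggregates(history, current_day, window_days=7):
    """Filter history to the window, then reduce (amount, 1) pairs by card."""
    acc = defaultdict(lambda: [0.0, 0])
    for tx in history:
        if 0 < current_day - tx.day <= window_days:
            acc[tx.card_id][0] += tx.amount  # running sum of amounts
            acc[tx.card_id][1] += 1          # running transaction count
    return {card: (total, count) for card, (total, count) in acc.items()}


def enrich(batch, aggregates):
    """Attach historical statistics to every transaction of a mini-batch."""
    rows = []
    for tx in batch:
        total, count = aggregates.get(tx.card_id, (0.0, 0))
        avg = total / count if count else 0.0
        rows.append({
            "card_id": tx.card_id,
            "amount": tx.amount,
            "hist_count": count,
            "hist_avg_amount": avg,
            "amount_minus_avg": tx.amount - avg,
        })
    return rows


history = [
    Transaction("A", 10.0, 1),
    Transaction("A", 30.0, 3),
    Transaction("B", 5.0, 2),
    Transaction("A", 100.0, -10),  # outside the 7-day window, ignored
]
batch = [Transaction("A", 200.0, 5), Transaction("C", 7.0, 5)]
rows = enrich(batch, card_aggregates(history, current_day=5))
for row in rows:
    print(row)
```

In a distributed setting the same filter, key-by-card, and reduce steps would be expressed with the engine's own operators; the sketch only shows the shape of the computation.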

Machine Learning Strategy

The fraud detection mechanism combines two classifiers, Feedback and Delayed. The Feedback classifier is refreshed frequently with recent transactions for which feedback from human investigators is available. The Delayed classifier aggregates balanced random trees trained on older transactions, where those not reported as fraudulent within the verification delay are presumed genuine. This dual-model approach lets the system adapt to changing transaction patterns while still exploiting labels that arrive late.
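
The two ideas in this strategy, class balancing and the combination of two scores, can be sketched as follows. This is a hedged illustration rather than the paper's implementation: it assumes that each tree of the Delayed ensemble is trained on a randomly undersampled, class-balanced subset, and that the two classifiers' fraud probabilities are merged by a weighted average. The summary states neither detail, and the function names and the default weight of 0.5 are hypothetical.

```python
import random


def balanced_sample(transactions, labels, rng):
    """Undersample the genuine class (label 0) to the size of the fraud class (label 1).

    Assumption: this is how a 'balanced' training set is built for each tree.
    """
    frauds = [i for i, y in enumerate(labels) if y == 1]
    genuine = [i for i, y in enumerate(labels) if y == 0]
    kept = rng.sample(genuine, k=min(len(frauds), len(genuine)))
    idx = frauds + kept
    return [transactions[i] for i in idx], [labels[i] for i in idx]


def combined_score(p_feedback, p_delayed, weight=0.5):
    """Weighted average of the two fraud probabilities (assumed combination rule)."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * p_feedback + (1.0 - weight) * p_delayed


rng = random.Random(0)
transactions = list(range(10))          # placeholder feature vectors
labels = [1, 0, 0, 0, 0, 0, 0, 1, 0, 0]  # two frauds among ten transactions
sample, sample_labels = balanced_sample(transactions, labels, rng)
print(sample_labels.count(1), sample_labels.count(0))
print(combined_score(p_feedback=0.8, p_delayed=0.2))
```

Drawing a fresh balanced subset for every tree lets the ensemble see many different genuine transactions overall while each individual tree still trains on a balanced set.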

Scalability and Resource Management

Performance depends on how resources are allocated to Spark executors and on the choice of batch duration. The experiments show that the system remains stable even under high input rates, and they illustrate the trade-offs among the number of executors, the batch duration, and the complexity of feature engineering. Sizing these resources deliberately is therefore a necessary part of deploying a Big Data solution for fraud detection.
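
The trade-off between executors and batch duration can be made concrete with a toy capacity model. A mini-batch streaming job is stable only if each batch is processed in no more time than the batch duration, otherwise batches queue up. The sketch below assumes throughput scales linearly with the number of executors plus a fixed per-batch overhead, which is a simplification and not a result from the paper; all parameter names and numbers are made up.

```python
def batch_processing_time(input_rate, batch_duration, executors,
                          per_executor_rate, overhead=0.0):
    """Estimated seconds needed to process one mini-batch.

    Assumption: work divides evenly across executors (linear scaling).
    """
    records = input_rate * batch_duration
    return overhead + records / (executors * per_executor_rate)


def is_stable(input_rate, batch_duration, executors,
              per_executor_rate, overhead=0.0):
    """A batch must finish before the next one is due."""
    return batch_processing_time(
        input_rate, batch_duration, executors, per_executor_rate, overhead
    ) <= batch_duration


def min_executors(input_rate, batch_duration, per_executor_rate,
                  overhead=0.0, max_executors=64):
    """Smallest executor count that keeps the job stable, or None if none does."""
    for n in range(1, max_executors + 1):
        if is_stable(input_rate, batch_duration, n, per_executor_rate, overhead):
            return n
    return None


# 1000 transactions/s, 10 s batches, 300 transactions/s per executor, 1 s overhead.
print(min_executors(1000, 10, 300, overhead=1.0))  # 4 under these made-up numbers
# With a 1 s batch the fixed overhead alone fills the batch, so no count suffices.
print(min_executors(1000, 1, 300, overhead=1.0))   # None
```

Heavier feature engineering lowers the effective per-executor rate in this model, which is why it has to be traded off against more executors or a longer batch duration.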

Implications and Future Directions

Practically, SCARFF supports fraud detection in high-volume transaction settings, aligning well with contemporary needs in electronic commerce. Theoretically, it offers a model that underscores the importance of ensemble learning techniques in managing nonstationarity and verification latency challenges. The paper suggests future exploration into semi-supervised and active learning paradigms to further enhance detection capabilities.

In conclusion, SCARFF shows that open source Big Data tools can be combined with an ensemble learning approach to detect fraud on a large stream of transactions. Tighter integration of distributed systems and more efficient classifiers remain open areas for further work.
