
Customer churn prediction in telecom using machine learning and social network analysis in big data platform (1904.00690v1)

Published 1 Apr 2019 in cs.CY, cs.DC, cs.LG, cs.SI, and stat.ML

Abstract: Customer churn is a major problem and one of the most important concerns for large companies. Due to the direct effect on the revenues of the companies, especially in the telecom field, companies are seeking to develop means to predict potential customer to churn. Therefore, finding factors that increase customer churn is important to take necessary actions to reduce this churn. The main contribution of our work is to develop a churn prediction model which assists telecom operators to predict customers who are most likely subject to churn. The model developed in this work uses machine learning techniques on big data platform and builds a new way of features' engineering and selection. In order to measure the performance of the model, the Area Under Curve (AUC) standard measure is adopted, and the AUC value obtained is 93.3%. Another main contribution is to use customer social network in the prediction model by extracting Social Network Analysis (SNA) features. The use of SNA enhanced the performance of the model from 84 to 93.3% against AUC standard. The model was prepared and tested through Spark environment by working on a large dataset created by transforming big raw data provided by SyriaTel telecom company. The dataset contained all customers' information over 9 months, and was used to train, test, and evaluate the system at SyriaTel. The model experimented four algorithms: Decision Tree, Random Forest, Gradient Boosted Machine Tree "GBM" and Extreme Gradient Boosting "XGBOOST". However, the best results were obtained by applying XGBOOST algorithm. This algorithm was used for classification in this churn predictive model.

Authors (3)
  1. Abdelrahim Kasem Ahmad (1 paper)
  2. Assef Jafar (6 papers)
  3. Kadan Aljoumaa (2 papers)
Citations (287)

Summary

Machine Learning-Based Customer Churn Prediction in Telecommunications Using Big Data Platforms

The paper "Customer churn prediction in telecom using machine learning and social network analysis in big data platform" by Ahmad et al. presents a study on predicting customer churn for telecom operators using machine learning techniques within a big data framework. The research is conducted in the context of SyriaTel, a telecommunications company, where the authors use large datasets to improve predictive accuracy in churn management.

Overview

The central aim of the research is to develop a predictive model that assists telecom companies in identifying customers at risk of churning. The model employs machine learning algorithms and emphasizes feature engineering, specifically incorporating Social Network Analysis (SNA) to enhance predictive performance. Performance is measured with the Area Under Curve (AUC) metric: the authors report an AUC of 93.3% for their best model configuration, compared with 84% without the SNA features.
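To make the AUC metric concrete, here is a minimal sketch of how it can be computed from predicted churn scores using the rank-sum (Mann-Whitney U) identity. The function name `auc` and the toy labels and scores are illustrative assumptions, not data or code from the paper.

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    labels: list of 0/1 values (1 = churner); scores: predicted churn scores.
    Tied scores share the average of the ranks they span.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(label for _, label in pairs)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one churner and one non-churner")

    rank_sum_pos = 0.0
    i = 0
    while i < n:
        j = i
        # Group tied scores; the group occupies ranks i+1 .. j (1-based).
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0
        rank_sum_pos += avg_rank * sum(label for _, label in pairs[i:j])
        i = j

    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_statistic / (n_pos * n_neg)


# Toy example (made-up values): two non-churners, two churners.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc(labels, scores)
print(toy_auc)
```

An AUC of 1.0 means every churner is scored above every non-churner, while 0.5 corresponds to random ranking, so the reported move from 84% to 93.3% reflects a better ordering of customers by churn risk.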

Dataset and Methodology

The dataset used encompasses nine months of customer data from SyriaTel, stored on HDFS with a volume exceeding 70 terabytes. This diverse dataset includes structured, semi-structured, and unstructured data, covering customer service usage, complaints, network logs, call detail records (CDRs), and mobile device information. The authors encountered challenges typical in big data contexts, such as data variety, volume, and class imbalance, which were addressed through a robust big data platform using Spark for processing and feature extraction.
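The summary does not say which technique was used to handle class imbalance. As one common option, the sketch below shows random undersampling of the majority class in plain Python; the helper name `undersample`, the row layout, and the toy counts are assumptions for illustration only.

```python
import random


def undersample(rows, label_of, seed=0):
    """Balance a binary dataset by randomly dropping majority-class rows.

    rows: list of records; label_of: function returning 0 or 1 for a record.
    """
    churners = [r for r in rows if label_of(r) == 1]
    others = [r for r in rows if label_of(r) == 0]
    minority, majority = sorted((churners, others), key=len)

    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    kept_majority = rng.sample(majority, len(minority))

    balanced = minority + kept_majority
    rng.shuffle(balanced)
    return balanced


# Toy data (made up): 95 non-churners and 5 churners as (customer_id, label).
customers = [(i, 0) for i in range(95)] + [(i, 1) for i in range(95, 100)]
balanced = undersample(customers, label_of=lambda row: row[1])
print(len(balanced), sum(label for _, label in balanced))
```

Undersampling discards data, so on a real dataset it is often compared against alternatives such as class weighting before a choice is made.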

Feature Engineering

A notable contribution of the paper is the use of SNA features extracted from a large-scale social network graph composed of billions of connections between millions of nodes. These features include centrality measures, degree distribution, PageRank, and SenderRank, which capture the social interactions and influence of subscribers within the network. The authors highlight that integrating SNA features improved the model's AUC from 84% to 93.3%, demonstrating the value of social context in understanding churn behavior.
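As a rough illustration of two of these SNA features, the following sketch computes in-degree and PageRank on a tiny directed call graph with plain power iteration. The edge list, the damping factor of 0.85, and the function name `pagerank` are assumptions; the paper's graph is far larger and was processed on a big data platform, so this shows only the idea, not the authors' implementation.

```python
from collections import Counter, defaultdict


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank on a directed graph of (caller, callee) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing calls is spread over all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {node: base for node in nodes}
        for src in nodes:
            targets = out_links[src]
            for dst in targets:
                new_rank[dst] += damping * rank[src] / len(targets)
        rank = new_rank
    return rank


# Toy call graph (made up): a, b and d all call c; c calls a back.
calls = [("a", "c"), ("b", "c"), ("d", "c"), ("c", "a")]
in_degree = Counter(dst for _, dst in calls)
ranks = pagerank(calls)
most_central = max(ranks, key=ranks.get)
print(most_central, in_degree[most_central])
```

Per-subscriber values such as `ranks[node]` and `in_degree[node]` could then be joined to the other customer features as extra columns for the classifier.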

Predictive Modeling

The paper evaluates four tree-based machine learning algorithms: Decision Tree, Random Forest, Gradient Boosted Machine Tree (GBM), and Extreme Gradient Boosting (XGBoost). XGBoost achieved the highest AUC and was selected as the classifier for the churn model, a result the summary attributes to its handling of non-linear relationships and feature interactions. The model was built and tested in a Spark environment, which allowed it to scale to the large dataset.
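To show the idea behind the boosted-tree family, here is a minimal sketch of gradient boosting with one-split trees (stumps) and logistic loss in plain Python. It is not XGBoost, which adds second-order gradients and regularization, and it is not the authors' Spark pipeline; the feature names, toy data, and hyperparameters are assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Pick the feature/threshold split that minimizes squared error on residuals."""
    best = None
    for f in range(len(X[0])):
        for threshold in sorted({row[f] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[f] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[f] > threshold]
            if not left or not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            err = (sum((r - left_mean) ** 2 for r in left)
                   + sum((r - right_mean) ** 2 for r in right))
            if best is None or err < best[0]:
                best = (err, f, threshold, left_mean, right_mean)
    if best is None:
        raise ValueError("no valid split: all feature values are identical")
    return best[1:]


def fit_boosted_stumps(X, y, rounds=20, learning_rate=0.5):
    """Each round fits a stump to the negative gradient of the logistic loss."""
    positive_rate = sum(y) / len(y)  # assumes both classes are present
    bias = math.log(positive_rate / (1.0 - positive_rate))
    margins = [bias] * len(X)
    stumps = []
    for _ in range(rounds):
        residuals = [yi - sigmoid(m) for yi, m in zip(y, margins)]
        f, t, left_value, right_value = fit_stump(X, residuals)
        stumps.append((f, t, left_value, right_value))
        margins = [m + learning_rate * (left_value if row[f] <= t else right_value)
                   for row, m in zip(X, margins)]
    return bias, stumps, learning_rate


def predict_proba(model, row):
    bias, stumps, learning_rate = model
    margin = bias + sum(learning_rate * (lv if row[f] <= t else rv)
                        for f, t, lv, rv in stumps)
    return sigmoid(margin)


# Toy data (made up): [monthly_calls, complaints] -> churned (1) or stayed (0).
X = [[120, 0], [95, 1], [130, 0], [40, 3], [35, 4], [110, 1], [50, 2], [20, 5]]
y = [0, 0, 0, 1, 1, 0, 1, 1]
model = fit_boosted_stumps(X, y)
print(predict_proba(model, [30, 4]), predict_proba(model, [125, 0]))
```

In a comparison like the one described, each candidate model would be scored on held-out data with the same metric (AUC in the paper) and the best-scoring one kept.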

Discussion and Implications

The research provides strong empirical evidence supporting the integration of SNA features in churn prediction models. The paper's findings have significant implications for telecom operators, offering a method to proactively identify at-risk customers, thereby reducing churn rates and increasing customer retention. Practically, the adoption of such models can lead to tailored marketing strategies and customer engagement initiatives based on predictive insights.

Theoretically, the successful implementation of complex SNA metrics in predictive analytics opens new avenues for exploring social dynamics in various business domains. Future work could explore the extension of such models to other industries and investigate the interplay between different types of customer interactions within predictive frameworks.

In conclusion, this paper effectively demonstrates the power and flexibility of machine learning and big data platforms in solving complex business challenges, such as customer churn in telecommunications. The integration of comprehensive data engineering and advanced analytical techniques, as presented, offers a promising toolkit for enhancing customer retention strategies in competitive and data-rich environments.