A Survey of Predictive Modelling under Imbalanced Distributions (1505.01658v2)

Published 7 May 2015 in cs.LG

Abstract: Many real world data mining applications involve obtaining predictive models using data sets with strongly imbalanced distributions of the target variable. Frequently, the least common values of this target variable are associated with events that are highly relevant for end users (e.g. fraud detection, unusual returns on stock markets, anticipation of catastrophes, etc.). Moreover, the events may have different costs and benefits, which when associated with the rarity of some of them on the available training data creates serious problems to predictive modelling techniques. This paper presents a survey of existing techniques for handling these important applications of predictive analytics. Although most of the existing work addresses classification tasks (nominal target variables), we also describe methods designed to handle similar problems within regression tasks (numeric target variables). In this survey we discuss the main challenges raised by imbalanced distributions, describe the main approaches to these problems, propose a taxonomy of these methods and refer to some related problems within predictive modelling.

Summary

The paper presents a comprehensive review of approaches that enhance minority class prediction using adapted metrics and algorithm modifications.
It categorizes predictive techniques into data pre-processing, algorithm adaptation, and post-processing to address data imbalance.
The study outlines future research avenues, highlighting the need for hybrid methods and refined metrics for both classification and regression tasks.

An Overview of Predictive Modelling Techniques in Imbalanced Data Environments

The paper "A Survey of Predictive Modelling under Imbalanced Distributions" by Paula Branco, Luís Torgo, and Rita P. Ribeiro examines methods for predictive modelling under imbalanced data distributions, which occur frequently across real-world applications. An imbalanced data set has a skewed distribution of the class or target variable, in which some values (the minority class, in classification) are significantly underrepresented. This is a substantial challenge because the events associated with these rare values are often the most important ones, as in fraud detection, diagnosis of rare medical conditions, or prediction of financial market anomalies.

Problem Definition and Performance Metrics

The authors define imbalanced domains in terms of the need for models that accurately predict the minority class despite its scarcity in the training data. This calls for evaluation metrics that emphasize minority class performance, unlike standard metrics such as accuracy or mean squared error, which do not account for the imbalance.

For classification tasks, the paper reviews metrics such as precision-recall curves and the area under the ROC curve (ROC-AUC), which capture the performance trade-offs that matter in imbalanced domains. The F-measure and the geometric mean are analyzed as scalar metrics that balance performance across classes. For regression tasks, traditional metrics like mean squared error are likewise insufficient, because they weight all errors equally regardless of how relevant the target value is. The paper explores ROC curves for regression (RROC), utility-based measures, and precision and recall adapted to regression as ways to evaluate models beyond the magnitude of the numeric error.
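To make the contrast between accuracy and the imbalance-aware metrics concrete, the following minimal sketch computes accuracy, the F-measure, and the geometric mean of recall and specificity from confusion counts. The function names and the toy data (95 negatives, 5 positives) are illustrative assumptions, not taken from the paper.

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn), treating `positive` as the minority class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def imbalance_metrics(y_true, y_pred, positive=1, beta=1.0):
    """Accuracy, precision, recall, F-beta and G-mean for a binary task."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0        # true positive rate
    specificity = tn / (tn + fp) if tn + fp else 0.0   # true negative rate
    b2 = beta * beta
    denom = b2 * precision + recall
    f_beta = (1 + b2) * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f_beta": f_beta,
        "g_mean": math.sqrt(recall * specificity),
    }

# Hypothetical toy data: 95 negatives followed by 5 positives.
y_true = [0] * 95 + [1] * 5

# Model A always predicts the majority class.
always_negative = [0] * 100
# Model B flags 10 negatives by mistake but finds 4 of the 5 positives.
detector = [1] * 10 + [0] * 85 + [1] * 4 + [0] * 1

metrics_a = imbalance_metrics(y_true, always_negative)
metrics_b = imbalance_metrics(y_true, detector)
print(metrics_a)
print(metrics_b)
```

In this toy setting the trivial model A has the higher accuracy, yet its F-measure and geometric mean are both zero, while model B scores lower on accuracy but clearly higher on both imbalance-aware metrics.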

Modelling Strategies

The survey categorizes existing approaches for handling imbalanced data into data pre-processing, special-purpose learning methods, and prediction post-processing.

Data Pre-processing involves adjusting the given data before applying learning algorithms. Techniques such as re-sampling (under-sampling and over-sampling), synthetic example generation (e.g., SMOTE), and evolutionary algorithms are discussed. These methods aim to balance the representation of classes or important cases without altering learning algorithms themselves.
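As a rough illustration of these pre-processing ideas, the sketch below implements random under-sampling, random over-sampling, and a simplified SMOTE-style generator that interpolates between a minority example and one of its nearest minority neighbours. It is a teaching sketch under stated assumptions (numeric features stored as tuples, Euclidean distance, hypothetical function names and toy data), not the reference SMOTE implementation.

```python
import random

def random_undersample(majority, n_keep, rng):
    """Keep a random subset of the majority class (without replacement)."""
    return rng.sample(majority, n_keep)

def random_oversample(minority, n_total, rng):
    """Replicate random minority examples until there are n_total of them."""
    extra = [rng.choice(minority) for _ in range(n_total - len(minority))]
    return list(minority) + extra

def smote_like(minority, n_new, k, rng):
    """Create n_new synthetic points by interpolating towards near neighbours."""
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Hypothetical toy data: 40 majority points and 4 minority points in 2-D.
rng = random.Random(0)
majority = [(float(i), float(i % 7)) for i in range(40)]
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]

undersampled_majority = random_undersample(majority, len(minority), rng)
oversampled_minority = random_oversample(minority, len(majority), rng)
synthetic_minority = smote_like(minority, n_new=36, k=3, rng=rng)

print(len(undersampled_majority), len(oversampled_minority), len(synthetic_minority))
```

Because every synthetic point lies on a segment between two existing minority examples, the generator fills in the region the minority class already occupies rather than duplicating examples exactly, which is the main difference from plain over-sampling.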

Special-purpose Learning Methods modify the learning algorithm itself so that it becomes sensitive to imbalance. Cost-sensitive learning is one such strategy: classes receive different weights or misclassification costs inside the algorithm to counteract the imbalance. The paper covers modifications tailored to decision trees, SVMs, and neural networks under this heading.
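One common way to make a learner cost-sensitive is to weight each example's loss by its class. The sketch below assumes an inverse-frequency ("balanced") weighting heuristic and a class-weighted log loss; both the heuristic and the function names are illustrative assumptions rather than a method prescribed by the paper.

```python
import math
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n / (n_classes * count), an assumed heuristic."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

def weighted_log_loss(y_true, p_pos, weights, eps=1e-12):
    """Class-weighted average of the binary log loss (label 1 is positive)."""
    total = 0.0
    weight_sum = 0.0
    for y, p in zip(y_true, p_pos):
        p = min(max(p, eps), 1 - eps)  # guard against log(0)
        loss = -math.log(p) if y == 1 else -math.log(1 - p)
        total += weights[y] * loss
        weight_sum += weights[y]
    return total / weight_sum

# Hypothetical toy data: 90 negatives and 10 positives.
labels = [0] * 90 + [1] * 10
uniform = {0: 1.0, 1: 1.0}
balanced = balanced_class_weights(labels)

# Two constant predictors: the base rate (0.1) and an uninformative 0.5.
base_rate = [0.1] * 100
coin_flip = [0.5] * 100

print("weights:", balanced)
print("uniform :", weighted_log_loss(labels, base_rate, uniform),
      weighted_log_loss(labels, coin_flip, uniform))
print("balanced:", weighted_log_loss(labels, base_rate, balanced),
      weighted_log_loss(labels, coin_flip, balanced))
```

With uniform weights, predicting the base rate gives the lower loss. With balanced weights the ordering flips, so a learner minimizing the weighted loss is pushed away from solutions that effectively ignore the minority class.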

Prediction Post-processing techniques adjust model predictions after training to align results more closely with user-defined costs or preferences. Threshold methods and reframing are examples of post hoc adjustments examined in the paper.
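A threshold method can be sketched as follows. Assuming calibrated probabilities and zero cost for correct predictions, predicting the positive class minimizes expected cost whenever p >= C_FP / (C_FP + C_FN), a standard result from cost-sensitive decision theory. The cost values, data, and function names below are hypothetical.

```python
def cost_optimal_threshold(cost_fp, cost_fn):
    """Threshold on P(positive) that minimizes expected cost.

    Assumes calibrated probabilities and zero cost for correct predictions.
    """
    return cost_fp / (cost_fp + cost_fn)

def apply_threshold(p_pos, threshold):
    """Turn predicted probabilities into 0/1 labels."""
    return [1 if p >= threshold else 0 for p in p_pos]

def total_cost(y_true, y_pred, cost_fp, cost_fn):
    """Sum the misclassification costs over all examples."""
    cost = 0.0
    for t, p in zip(y_true, y_pred):
        if p == 1 and t == 0:
            cost += cost_fp
        elif p == 0 and t == 1:
            cost += cost_fn
    return cost

# Hypothetical scores from an already trained model, with true labels.
p_pos = [0.05, 0.2, 0.6, 0.15, 0.4]
y_true = [0, 1, 1, 0, 0]

# Assumed costs: missing a positive is nine times worse than a false alarm.
cost_fp, cost_fn = 1.0, 9.0
threshold = cost_optimal_threshold(cost_fp, cost_fn)

default_pred = apply_threshold(p_pos, 0.5)
shifted_pred = apply_threshold(p_pos, threshold)

print("threshold:", threshold)
print("cost at 0.5:", total_cost(y_true, default_pred, cost_fp, cost_fn))
print("cost at shifted threshold:", total_cost(y_true, shifted_pred, cost_fp, cost_fn))
```

The model itself is untouched. Only the decision rule applied to its outputs changes, which is what distinguishes post-processing from the pre-processing and special-purpose learning strategies.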

Implications and Future Research

The survey outlines the practical implications of predictive modelling under imbalanced domains, emphasizing that models must predict minority events accurately because of their real-world impact. It highlights regression as an area needing further exploration, given that existing work concentrates on classification. The paper also notes that related problems such as class overlap, small disjuncts, and noise interact with imbalance and further complicate the development of predictive models in these domains.

Future research may focus on exploring hybrid methods that combine elements from pre-processing, algorithm adaptation, and post-processing to achieve superior performance. Enhanced methodologies catering to regression tasks and ongoing efforts to integrate complex domain-specific cost structures will continue to advance this field.

In summary, the paper by Branco et al. provides a comprehensive survey of existing methodologies and unresolved challenges in predictive modelling under imbalanced distributions, laying the groundwork for ongoing research and development in both classification and regression contexts.
