- The paper systematically reviews over 130 studies, establishing a comprehensive framework for drift detection, understanding, and adaptation in machine learning.
- It categorizes detection techniques into error rate-based, distribution-based, and multiple hypothesis tests to robustly identify changes in data streams.
- The review evaluates adaptation strategies, including retraining, ensemble methods, and adaptive models, and surveys evaluation methodologies and metrics suited to evolving data streams.
Learning Under Concept Drift: An Overview
The paper "Learning under Concept Drift: A Review" by Jie Lu et al., endeavors to systematically synthesize research advancements in the domain of concept drift within machine learning. Concept drift, referring to unforeseeable changes in the statistical properties of data streams over time, poses significant challenges to conventional machine learning models. The paper meticulously reviews over 130 publications to construct a comprehensive framework for understanding and adapting to concept drift.
Framework for Concept Drift
The reviewed literature acknowledges three core components essential for handling concept drift:
- Drift Detection: Identifying the occurrence of drift in streaming data.
- Drift Understanding: Quantifying and localizing the drift.
- Drift Adaptation: Adjusting the learning model to maintain or improve performance post-drift.
Drift Detection Techniques
The paper categorizes drift detection algorithms into three broad genres:
- Error Rate-Based Detection: These methods, including the Drift Detection Method (DDM) and Early Drift Detection Method (EDDM), monitor the online error rate of the predictive model. A significant rise in the error rate signals potential drift and prompts model re-evaluation or update (a minimal sketch follows this list).
- Data Distribution-Based Detection: Methods such as the Information-Theoretic Approach (ITA) detect distributional discrepancies between historical and recent data via metrics like Kullback-Leibler divergence.
- Multiple Hypothesis Tests: Employed in algorithms like Hierarchical Change-Detection Tests (HCDTs), these techniques leverage multiple statistical tests to provide robust drift verification.
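To make the error rate-based family concrete, below is a minimal sketch of a DDM-style detector. The warning and drift conditions (roughly two and three standard deviations above the historical minimum error rate) follow the commonly cited DDM rule, but the class name, streaming interface, and `min_samples` parameter are illustrative assumptions rather than the paper's notation.

```python
import math

class DDMStyleDetector:
    """Minimal DDM-style drift detector sketch: tracks the online error rate
    and flags a warning/drift when it rises well above its historical minimum.
    Illustrative only, not the reference DDM implementation."""

    def __init__(self, min_samples=30):
        self.min_samples = min_samples
        self.reset()

    def reset(self):
        self.n = 0
        self.p = 1.0                 # running error rate estimate
        self.s = 0.0                 # its standard deviation
        self.p_min = float("inf")
        self.s_min = float("inf")

    def update(self, error):
        """error: 1 if the model misclassified the latest instance, else 0.
        Returns 'drift', 'warning', or 'stable'."""
        self.n += 1
        self.p += (error - self.p) / self.n                  # incremental mean
        self.s = math.sqrt(self.p * (1.0 - self.p) / self.n)
        if self.n < self.min_samples:
            return "stable"
        if self.p + self.s < self.p_min + self.s_min:        # new best performance
            self.p_min, self.s_min = self.p, self.s
        if self.p + self.s >= self.p_min + 3.0 * self.s_min: # ~3 std devs above minimum
            self.reset()
            return "drift"
        if self.p + self.s >= self.p_min + 2.0 * self.s_min: # ~2 std devs above minimum
            return "warning"
        return "stable"
```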
Drift Understanding
While virtually all drift detection algorithms address the "when" of concept drift, only a subset also answers the "how severe" and "where":
- Severity: The magnitude of the difference between the historical and new data distributions. Techniques that use direct distribution measures (e.g., Kullback-Leibler divergence) give the clearest view of drift severity (a sketch follows this list).
- Localization: Most data distribution-based methods can potentially highlight segments of the data stream where the drift is most pronounced.
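As an illustration, drift severity on a single numeric feature can be approximated by binning the historical and recent windows over a shared range and computing the Kullback-Leibler divergence between the two empirical distributions. The bin count and smoothing constant below are illustrative choices, not values prescribed by the paper.

```python
import numpy as np

def drift_severity_kl(old_window, new_window, bins=20, eps=1e-9):
    """Estimate drift severity as KL(new || old) over a shared histogram.
    old_window, new_window: 1-D arrays of a numeric feature.
    Larger values indicate a more severe distributional change."""
    lo = min(old_window.min(), new_window.min())
    hi = max(old_window.max(), new_window.max())
    edges = np.linspace(lo, hi, bins + 1)
    p_old, _ = np.histogram(old_window, bins=edges)
    p_new, _ = np.histogram(new_window, bins=edges)
    # normalize to probabilities, adding a small epsilon to avoid log(0)
    p_old = (p_old + eps) / (p_old.sum() + eps * bins)
    p_new = (p_new + eps) / (p_new.sum() + eps * bins)
    return float(np.sum(p_new * np.log(p_new / p_old)))
```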
Drift Adaptation Strategies
Given the identified drift, adaptation methods can be broadly classified into:
- Retraining Models: The simplest approach discards the outdated model and retrains on recent data; window-management strategies such as ADWIN adaptively decide how much recent data to retrain on, ensuring the new model reflects the most recent data distribution.
- Ensemble Methods: Techniques such as Dynamic Weighted Majority (DWM) and Learn++.NSE maintain a pool of learned models, adaptively weighting and selecting them based on current relevance; they are particularly effective when drifts recur (a DWM-style sketch follows this list).
- Adaptive Models: These methods incrementally adjust parts of the model. Decision tree-based methods like CVFDT adapt by updating relevant sub-trees rather than the entire model, making them computationally efficient for regional drifts.
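The ensemble idea can be sketched with a minimal weighted-majority voter in the spirit of DWM: experts that err have their weights decayed, and negligible experts are pruned. This sketch omits DWM's creation of new experts, assumes base learners with scikit-learn-like predict methods, and uses an arbitrary decay factor and pruning threshold; it is an illustration, not the published algorithm.

```python
class WeightedMajorityEnsemble:
    """DWM-style ensemble sketch: decay the weight of each expert that errs,
    then prune experts whose weight has become negligible (illustrative only)."""

    def __init__(self, experts, beta=0.5, prune_below=0.01):
        self.experts = list(experts)          # pre-trained base learners
        self.weights = [1.0] * len(self.experts)
        self.beta = beta                      # weight-decay factor (assumption)
        self.prune_below = prune_below        # pruning threshold (assumption)

    def predict(self, x):
        """Weighted vote over the experts' predictions for one instance."""
        votes = {}
        for expert, w in zip(self.experts, self.weights):
            y = expert.predict([x])[0]
            votes[y] = votes.get(y, 0.0) + w
        return max(votes, key=votes.get)

    def update(self, x, y_true):
        """Decay the weight of every expert that misclassified (x, y_true),
        then normalize and drop weak experts."""
        for i, expert in enumerate(self.experts):
            if expert.predict([x])[0] != y_true:
                self.weights[i] *= self.beta
        total = sum(self.weights) or 1.0
        self.weights = [w / total for w in self.weights]
        keep = [i for i, w in enumerate(self.weights) if w >= self.prune_below]
        self.experts = [self.experts[i] for i in keep]
        self.weights = [self.weights[i] for i in keep]
```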
Evaluation Systems
To rigorously evaluate algorithms designed for concept drift, three key facets of evaluation are proposed:
- Validation Methodology: Techniques including holdout, prequential (test-then-train), and controlled permutation evaluation address the temporal nature of streaming data (a prequential sketch follows this list).
- Evaluation Metrics: Metrics such as RAM-hours, Kappa-Temporal statistics, and Prequential AUC extend traditional evaluation measures to handle the continuous and evolving nature of data streams.
- Statistical Significance: Methods such as the McNemar test and the Wilcoxon signed-rank test provide statistically sound comparisons between models handling concept drift.
Datasets and Benchmarks
Both synthetic and real-world datasets play pivotal roles in evaluating concept drift solutions:
- Synthetic Datasets: Provide controlled environments, enabling precise analysis across drift types and parameters (a toy drift-stream generator is sketched below).
- Real-World Datasets: Offer insight into practical performances across diverse, real-world drifting conditions.
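As a simple illustration of the synthetic approach, a stream with one abrupt drift can be generated by switching the labeling concept at a chosen point. The two linear concepts and the drift point below are arbitrary choices for demonstration, not benchmarks from the paper.

```python
import numpy as np

def synthetic_drift_stream(n=2000, drift_point=1000, seed=0):
    """Generate a 2-D binary-classification stream with an abrupt concept drift:
    the decision boundary changes at drift_point (illustrative toy generator)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(0, 1, size=(n, 2))
    y = np.empty(n, dtype=int)
    # concept 1: label depends on x0 + x1; concept 2: label depends on x0 - x1
    y[:drift_point] = (X[:drift_point, 0] + X[:drift_point, 1] > 1.0).astype(int)
    y[drift_point:] = (X[drift_point:, 0] - X[drift_point:, 1] > 0.0).astype(int)
    return X, y
```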
Current Developments and Future Directions
Drift detection has advanced to include multiple hypothesis testing methods, yet a gap remains in comprehensively addressing how severe drifts are and where they occur, with few algorithms providing such specifics. In drift adaptation, adaptive and hybrid ensemble methods dominate recent research, while single-model retraining strategies have declined.
Limited Label Availability: A significant challenge in real-world scenarios is the timely acquisition of true labels, making unsupervised and semi-supervised drift detection and adaptation promising areas for future research.
Framework and Integration: Establishing a standardized framework for selecting and evaluating real-world data streams remains imperative. Enhanced integration strategies combining concept drift handling techniques with broader machine learning methodologies, particularly within the big data context, are also essential avenues for progress.
In conclusion, this paper provides a robust stepping stone for researchers, delineating the state-of-the-art frameworks, methodologies, and evaluation systems essential for progressing concept drift research. The insights and directions suggested lay the groundwork for further explorations and refinements in this dynamic field.