- The paper proposes a standardized preprocessing method that prevents data leakage and reduces biases in predictive process monitoring datasets.
- It demonstrates that strict temporal splitting and debiasing techniques lead to significant improvements in model predictive accuracy.
- The study emphasizes the need for publicly available, unbiased benchmark datasets to ensure fair comparisons in process mining research.
Creating Unbiased Public Benchmark Datasets with Data Leakage Prevention for Predictive Process Monitoring
The paper focuses on improving the methodology for constructing benchmark datasets in the field of predictive process monitoring (PPM). It identifies critical issues in existing datasets, such as data leakage and biases, which hinder scientific progress and reproducibility. The authors propose a principled approach to dataset preprocessing to standardize and improve the validity of comparisons among various predictive models.
Introduction to Predictive Process Monitoring
Predictive process monitoring is a sub-field of process mining that utilizes event logs to predict future events, outcomes, and remaining execution times of processes. Despite the rise in interest and research in this area, there are substantial inconsistencies in the preprocessing of datasets. These inconsistencies arise due to different dataset selection and splitting methods, often exacerbated by undocumented domain knowledge, leading to challenges in reproducibility and fair comparison of models.
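To make this concrete, the following minimal sketch (not taken from the paper) shows how an event log is commonly turned into remaining-time prediction instances: every prefix of a case becomes one sample whose target is the time until that case's final event. Column names such as `case_id`, `activity`, and `timestamp` are illustrative assumptions.

```python
import pandas as pd

# Toy event log: one row per event (column names are illustrative assumptions).
log = pd.DataFrame({
    "case_id":   ["A", "A", "A", "B", "B"],
    "activity":  ["register", "review", "close", "register", "close"],
    "timestamp": pd.to_datetime([
        "2017-01-02 09:00", "2017-01-03 14:00", "2017-01-05 10:00",
        "2017-01-04 08:00", "2017-01-06 12:00",
    ]),
})

def prefixes_with_remaining_time(log: pd.DataFrame) -> pd.DataFrame:
    """Turn every event into a prediction instance: the prefix up to that event
    is the input, and the time until the case's last event is the target."""
    log = log.sort_values(["case_id", "timestamp"])
    case_end = log.groupby("case_id")["timestamp"].transform("max")
    out = log.copy()
    out["prefix_length"] = log.groupby("case_id").cumcount() + 1
    out["remaining_seconds"] = (case_end - log["timestamp"]).dt.total_seconds()
    return out

print(prefixes_with_remaining_time(log))
```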
Major Obstacles in Fair Benchmarking
Dataset Diversification and Preprocessing Biases
The use of diverse datasets and preprocessing techniques complicates the comparison of predictive models. In addition, some preprocessing methods introduce biases related to case durations and to cases still running at the chronological extremes of datasets. These biases skew reported model performance and make predictions unreliable.
Data Leakage
Data leakage is a significant obstacle in the temporal splitting methods currently employed, where training and test sets are not properly isolated: cases that span the split point contribute events to both sides, so information is inadvertently shared between the training and testing phases and the measured predictive performance is distorted. A simple check for this situation is sketched below.
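As an illustration, such a leakage check could look like the following sketch, assuming a pandas event log with `case_id` and `timestamp` columns; any case whose events fall on both sides of the separation time would, under a naive event-level split, contribute information to both training and test data.

```python
import pandas as pd

def cases_spanning_split(log: pd.DataFrame, separation_time: pd.Timestamp) -> pd.Index:
    """Return the ids of cases that start before and end after the separation
    time; under a naive event-level split these cases leak information from
    the test period into training."""
    bounds = log.groupby("case_id")["timestamp"].agg(["min", "max"])
    spanning = bounds[(bounds["min"] < separation_time) & (bounds["max"] >= separation_time)]
    return spanning.index

# Example usage (column names and the chosen separation time are assumptions):
# leaky = cases_spanning_split(log, pd.Timestamp("2017-06-01"))
# print(f"{len(leaky)} cases span the separation time")
```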
Lack of Standardized Benchmarks
The absence of standardized benchmarks in PPM equivalent to those in other AI domains (like MNIST for image processing) impedes progress. The creation of unbiased and publicly accessible benchmark datasets is crucial for advancing the field.
Proposed Method for Benchmark Dataset Construction
Principled Approach
The authors suggest a standard preprocessing method to establish unbiased benchmark datasets. They advocate for the isolation of test sets using strict temporal splitting to prevent data leakage and ensure an unbiased evaluation environment.
Debiasing Techniques
The paper outlines techniques to remove chronological outliers and to mitigate the biases that overlong cases and still-running cases introduce, particularly near the start and end of a dataset. Figure 1 illustrates the presence of remote outliers, while Figure 2 highlights biases at the dataset ends.
Figure 1: BPIC_2015 contains remote, faulty outliers at the beginning of the dataset.
Figure 2: Grey bars represent test set events (20%). Green bars are events of training set cases ending before the separation time (strict temporal splitting); red bars belong to cases ending after that separation time (regular temporal splitting). Because debiasing was skipped here in order to retain sufficient samples, the bias at the datasets' ends is clearly visible.
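One simplified reading of these debiasing steps, expressed as code, is sketched below: drop cases whose duration exceeds a chosen percentile cut-off (remote or overlong outliers) and discard cases that start or end too close to the edges of the recorded time window. The column names and thresholds are illustrative assumptions, not the paper's exact values.

```python
import pandas as pd

def debias_log(log: pd.DataFrame,
               duration_quantile: float = 0.99,
               trim: pd.Timedelta = pd.Timedelta(days=30)) -> pd.DataFrame:
    """Remove duration outliers and trim the chronologically biased edges.
    Thresholds are illustrative; the paper derives its own cut-offs."""
    bounds = log.groupby("case_id")["timestamp"].agg(start="min", end="max")
    durations = bounds["end"] - bounds["start"]

    # 1. Drop cases with extreme durations (e.g. remote, faulty outliers).
    keep = durations[durations <= durations.quantile(duration_quantile)].index
    log = log[log["case_id"].isin(keep)]

    # 2. Drop cases whose lifespan falls too close to the edges of the recorded
    #    window, where the mix of truncated/running cases biases the statistics.
    bounds = bounds.loc[keep]
    window_ok = bounds[(bounds["start"] >= log["timestamp"].min() + trim) &
                       (bounds["end"] <= log["timestamp"].max() - trim)].index
    return log[log["case_id"].isin(window_ok)]
```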
Training and Testing Set Configuration
Under the proposed strict temporal splitting strategy, the training set contains only cases that have completed before a chosen separation time, ensuring no overlap with test cases; a minimal sketch of this split follows.
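The sketch below illustrates this configuration, again assuming a pandas event log with `case_id` and `timestamp` columns: training cases must end before the separation time, test cases must start at or after it, and cases spanning the boundary are discarded.

```python
import pandas as pd

def strict_temporal_split(log: pd.DataFrame, separation_time: pd.Timestamp):
    """Split an event log so that training cases end before the separation time
    and test cases start at or after it; cases spanning it are discarded."""
    bounds = log.groupby("case_id")["timestamp"].agg(start="min", end="max")

    train_ids = bounds[bounds["end"] < separation_time].index
    test_ids = bounds[bounds["start"] >= separation_time].index

    train = log[log["case_id"].isin(train_ids)]
    test = log[log["case_id"].isin(test_ids)]
    return train, test
```

The separation time itself can be chosen so that the test set covers roughly the last 20% of events, as in Figure 2.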
The authors evaluated their preprocessing approach using convolutional neural networks (CNNs) to predict remaining times across various datasets. The results indicate that preprocessing choices significantly affect predictive accuracy, as shown in Figure 3.
Figure 3: Prediction results (MAE) vary significantly as a function of the chosen training and test datasets.
Implementing the proposed preprocessing steps improved model performance by reducing biases and preventing data leakage. In addition, removing extremely long cases helped maintain dataset integrity and usability.
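The summary does not spell out the network architecture, so the following is only a generic, hypothetical sketch of a 1D CNN regressor over fixed-length activity prefixes, included to make the evaluation setup concrete; the layer sizes, `MAX_LEN`, and `N_ACTIVITIES` are assumptions rather than the paper's settings.

```python
import tensorflow as tf

# Hypothetical setup: prefixes are padded to MAX_LEN activity indices and the
# target is the remaining time. This is a generic sketch, not the paper's model.
MAX_LEN = 20        # maximum prefix length (assumption)
N_ACTIVITIES = 30   # activity vocabulary size, index 0 reserved for padding (assumption)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(MAX_LEN,)),
    tf.keras.layers.Embedding(N_ACTIVITIES, 16),
    tf.keras.layers.Conv1D(32, kernel_size=3, activation="relu"),
    tf.keras.layers.GlobalMaxPooling1D(),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1),                   # remaining time (regression output)
])
model.compile(optimizer="adam", loss="mae")     # MAE, the metric reported in Figure 3

# x_train: int array of shape (n_samples, MAX_LEN); y_train: remaining times
# model.fit(x_train, y_train, epochs=10, validation_split=0.1)
```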
Conclusion
This paper identifies crucial challenges within predictive process monitoring related to dataset preprocessing and highlights the need for unbiased benchmarks. By offering specific preprocessing recommendations and making datasets publicly available, the authors aim to streamline research efforts, enable fair comparisons, and accelerate progress in PPM. Future work includes applying these methodologies to new datasets and exploring how the proposed benchmarks influence the development of next-generation predictive models.