Distributed and parallel time series feature extraction for industrial big data applications (1610.07717v3)

Published 25 Oct 2016 in cs.LG

Abstract: The all-relevant problem of feature selection is the identification of all strongly and weakly relevant attributes. This problem is especially hard to solve for time series classification and regression in industrial applications such as predictive maintenance or production line optimization, for which each label or regression target is associated with several time series and meta-information simultaneously. Here, we are proposing an efficient, scalable feature extraction algorithm for time series, which filters the available features in an early stage of the machine learning pipeline with respect to their significance for the classification or regression task, while controlling the expected percentage of selected but irrelevant features. The proposed algorithm combines established feature extraction methods with a feature importance filter. It has a low computational complexity, allows to start on a problem with only limited domain knowledge available, can be trivially parallelized, is highly scalable and based on well studied non-parametric hypothesis tests. We benchmark our proposed algorithm on all binary classification problems of the UCR time series classification archive as well as time series from a production line optimization project and simulated stochastic processes with underlying qualitative change of dynamics.

Summary

The paper introduces the FRESH algorithm that uses scalable hypothesis tests to efficiently select both strongly and weakly relevant features.
Each feature is calculated and tested independently, so the algorithm parallelizes trivially and scales to industrial-scale data.
The algorithm is benchmarked on the binary classification problems of the UCR archive, production line data, and simulated stochastic processes; an open-source Python package, tsfresh, implements it.

Distributed and Parallel Time Series Feature Extraction for Industrial Big Data Applications

The paper "Distributed and Parallel Time Series Feature Extraction for Industrial Big Data Applications" introduces an algorithm designed for efficient and scalable feature extraction from time series data. This is particularly pertinent for industrial applications such as predictive maintenance and production line optimization, where each regression target or classification label is linked to multiple time series and meta-information simultaneously. The methodology focuses on identifying relevant features early in the machine learning pipeline, while controlling the proportion of irrelevant features selected.

Overview

The proposed technique combines well-established feature extraction methods with a feature importance filter, offering a robust solution to the challenging "all-relevant" feature selection problem. Specifically, the framework aims to isolate both strongly and weakly relevant attributes within large datasets, enabling improved time series classification and regression performance.
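The extraction step treats each time series independently, which is why the abstract describes the algorithm as trivially parallelizable. The following is a minimal sketch of that idea, not the paper's or tsfresh's implementation: the function names (`extract_features`, `extract_all`), the chosen summary features, and the sensor data are all illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, pstdev


def extract_features(series):
    """Map one time series (a list of floats) to a dict of simple summary features."""
    return {
        "mean": mean(series),
        "std": pstdev(series),
        "min": min(series),
        "max": max(series),
        # Sum of absolute differences between consecutive values.
        "abs_sum_of_changes": sum(abs(b - a) for a, b in zip(series, series[1:])),
    }


def extract_all(series_by_id, max_workers=4):
    """Extract features for every series; each series is processed independently."""
    ids = list(series_by_id)
    # Threads keep this sketch portable. CPU-bound extraction would normally
    # be distributed over processes or cluster nodes instead (assumption).
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        rows = list(pool.map(extract_features, (series_by_id[i] for i in ids)))
    return dict(zip(ids, rows))


# Hypothetical sensor readings, one list per series id.
feature_table = extract_all({
    "sensor_a": [1.0, 2.0, 3.0, 4.0],
    "sensor_b": [5.0, 5.0, 5.0, 5.0],
})
```

Because no series depends on another, the same `map` pattern carries over to a process pool or a distributed job without changing `extract_features`.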

Technical Contributions

FRESH Algorithm: The core of the paper is the FRESH (FeatuRe Extraction based on Scalable Hypothesis tests) algorithm. It applies a statistical hypothesis test to each extracted feature individually to assess its significance for the target, then applies the Benjamini-Yekutieli procedure to the resulting p-values to decide which features to keep while controlling the false discovery rate.
Scalability and Parallelization: The algorithm's design supports parallel computation, making it highly scalable for extensive datasets. This characteristic is crucial for big data scenarios encountered in industrial settings, where data is often collected from numerous sensors and must be processed in real-time or near real-time.
Benchmarking: The FRESH algorithm is benchmarked on all binary classification problems from the UCR time series classification archive, on production line optimization data, and on simulated time series generated from stochastic processes with a qualitative change of dynamics. The evaluation compares it against baselines including Dynamic Time Warping (a distance-based classification approach rather than a feature selector) and the Boruta feature selection algorithm, and the summary reports favorable results, particularly on large datasets.
Implementation: A practical contribution is the development of an open-source Python package, tsfresh, which implements the described algorithms. The package is designed to work with common Python data and machine learning tooling such as pandas and scikit-learn, so it can be integrated into existing workflows.
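The selection step named above can be sketched as follows. This is a minimal pure-Python illustration of the Benjamini-Yekutieli step-up procedure, assuming one p-value per feature has already been produced by a per-feature hypothesis test. The function name `benjamini_yekutieli`, the `fdr_level` parameter, and the example p-values are illustrative assumptions, not the tsfresh API.

```python
def benjamini_yekutieli(p_values, fdr_level=0.05):
    """Return one boolean per p-value: True if the feature is kept as relevant.

    Implements the Benjamini-Yekutieli step-up procedure, which controls the
    false discovery rate under arbitrary dependence between the tests.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Correction factor c(m) = 1 + 1/2 + ... + 1/m.
    c_m = sum(1.0 / i for i in range(1, m + 1))
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= k * q / (m * c(m)).
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * fdr_level / (m * c_m):
            largest_k = rank
    # Keep every feature ranked at or below that k.
    keep = [False] * m
    for idx in order[:largest_k]:
        keep[idx] = True
    return keep


# Hypothetical p-values for four extracted features.
selected = benjamini_yekutieli([0.0001, 0.04, 0.003, 0.6], fdr_level=0.05)
```

With these illustrative inputs, the first and third features are kept and the other two are filtered out. Because the procedure is step-up, a feature whose p-value misses its own threshold is still kept if a larger p-value further down the ranking meets its threshold.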

Implications and Future Directions

The implications of this research are significant for the field of industrial data analytics. Efficient feature extraction methods like FRESH can enhance the predictive capabilities of machine learning models in IoT and Industry 4.0 environments. In practice, the algorithm can support timely and accurate decision-making in critical industrial applications, helping to reduce downtime and optimize operational efficiency.

Theoretically, the work lays a foundation for further exploration into scalable and distributed feature extraction methodologies. Future research could explore extending this framework to non-binary classification tasks or incorporating deep learning techniques to enhance feature extraction capabilities. Another area of potential development is the exploration of dynamic feature mappings that adapt based on the evolving characteristics of time series data.

Conclusion

In summary, the paper presents a compelling approach to tackling the challenging problem of time series feature extraction in industrial applications. The combination of scalability, robustness, and empirical validation makes the FRESH algorithm a valuable tool in the arsenal of data scientists working with large-scale time series data. Its practical implementation through the tsfresh package further underscores its utility and accessibility for real-world applications.

GitHub

blue-yonder/tsfresh: Automatic extraction of relevant features from time series (8,835 stars)