Feature-Based Time-Series Analysis in R using the theft Package (2208.06146v4)
Abstract: Time series are measured and analyzed across the sciences. One method of quantifying the structure of time series is by calculating a set of summary statistics or 'features', and then representing a time series in terms of its properties as a feature vector. The resulting feature space is interpretable and informative, and enables conventional statistical learning approaches, including clustering, regression, and classification, to be applied to time-series datasets. Many open-source software packages for computing sets of time-series features exist across multiple programming languages, including catch22 (22 features: MATLAB, R, Python, Julia), feasts (42 features: R), tsfeatures (63 features: R), Kats (40 features: Python), tsfresh (779 features: Python), and TSFEL (390 features: Python). However, there are several issues: (i) no single access point to these packages is currently available; (ii) to access all feature sets, users must be fluent in multiple languages; and (iii) these feature-extraction packages lack extensive accompanying methodological pipelines for performing feature-based time-series analysis, such as applications to time-series classification. Here we introduce a solution to these issues in an R software package called theft: Tools for Handling Extraction of Features from Time series. theft is a unified and extendable framework for computing features from the six open-source time-series feature sets listed above. It also includes a suite of functions for processing and interpreting the performance of extracted features, including extensive data-visualization templates, low-dimensional projections, and time-series classification operations. With an increasing volume and complexity of time-series datasets in the sciences and industry, theft provides a standardized framework for comprehensively quantifying and interpreting informative structure in time series.
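To make the idea of a feature vector concrete, the following is a minimal sketch in plain Python. It is not theft's API or any of the six feature sets named above. The three features (mean, standard deviation, lag-1 autocorrelation), the helper names `lag1_autocorr` and `feature_vector`, and the two toy series are all assumptions chosen for illustration.

```python
import math
from statistics import mean, pstdev


def lag1_autocorr(x):
    """Lag-1 autocorrelation of a sequence (0.0 if the series is constant)."""
    m = mean(x)
    denom = sum((v - m) ** 2 for v in x)
    if denom == 0:
        return 0.0
    num = sum((x[i] - m) * (x[i + 1] - m) for i in range(len(x) - 1))
    return num / denom


def feature_vector(x):
    """Map one time series to a small, fixed-length set of named features.

    Hypothetical three-feature set, far smaller than e.g. catch22.
    """
    return {
        "mean": mean(x),
        "sd": pstdev(x),
        "acf_lag1": lag1_autocorr(x),
    }


# Two toy series: a smooth sine wave and a rapidly alternating signal.
series = {
    "sine": [math.sin(2 * math.pi * t / 20) for t in range(100)],
    "alternating": [(-1) ** t for t in range(100)],
}

# Feature "matrix": one row per time series, one column per feature.
feature_matrix = {name: feature_vector(x) for name, x in series.items()}

for name, feats in feature_matrix.items():
    print(name, {k: round(v, 3) for k, v in feats.items()})
```

Both series have roughly zero mean, yet the lag-1 autocorrelation separates them: it is close to 1 for the smooth sine wave and close to -1 for the alternating signal. Once every series is reduced to a row like this, standard clustering, regression, or classification methods can be applied to the rows.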