Iterative Forgetting: Online Data Stream Regression Using Database-Inspired Adaptive Granulation (2403.09588v1)
Abstract: Many modern systems, such as financial, transportation, and telecommunications systems, are time-sensitive in that they demand low-latency predictions for real-time decision-making. Such systems must often contend with continuous, unbounded data streams as well as concept drift, challenging requirements that traditional regression techniques cannot meet. There is a need for novel data stream regression methods that can handle these scenarios. We present a database-inspired data stream regression model that (a) draws inspiration from R*-trees to create granules from incoming data streams such that relevant information is retained, (b) iteratively forgets granules whose information is deemed outdated, thus maintaining a list of only recent, relevant granules, and (c) uses the recent data and granules to provide low-latency predictions. The R*-tree-inspired approach also makes the algorithm amenable to integration with database systems. Our experiments demonstrate that this method's ability to discard data yields an order-of-magnitude improvement in latency and training time relative to the most accurate state-of-the-art algorithms, while the R*-tree-inspired granulation technique provides competitively accurate predictions.
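The three components described in the abstract (granulation of incoming points, iterative forgetting of stale granules, and prediction from recent granules) can be illustrated with a minimal sketch. This is not the paper's R*-tree-based algorithm; it is a simplified stand-in in which a granule is a running summary (centroid, mean target, count, last-update time), points within a fixed `radius` are merged into an existing granule, granules idle longer than `max_age` are forgotten, and predictions are inverse-distance-weighted means over surviving granules. All class and parameter names here are illustrative assumptions.

```python
import math

class Granule:
    """Running summary of nearby points: centroid, mean target, count, last update."""
    def __init__(self, x, y, t):
        self.centroid = list(x)
        self.mean_y = y
        self.count = 1
        self.last_seen = t

    def absorb(self, x, y, t):
        # Incremental (Welford-style) updates of centroid and mean target.
        self.count += 1
        for i, xi in enumerate(x):
            self.centroid[i] += (xi - self.centroid[i]) / self.count
        self.mean_y += (y - self.mean_y) / self.count
        self.last_seen = t

class GranularStreamRegressor:
    """Illustrative sketch only (not the paper's R*-tree algorithm):
    merge points into granules within `radius`, forget granules idle
    longer than `max_age`, predict via inverse-distance weighting."""
    def __init__(self, radius=1.0, max_age=100.0):
        self.radius = radius
        self.max_age = max_age
        self.granules = []

    @staticmethod
    def _dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    def learn_one(self, x, y, t):
        # Iterative forgetting: drop granules not updated recently.
        self.granules = [g for g in self.granules
                         if t - g.last_seen <= self.max_age]
        # Granulation: absorb into the first granule within `radius`,
        # otherwise start a new granule.
        for g in self.granules:
            if self._dist(x, g.centroid) <= self.radius:
                g.absorb(x, y, t)
                return
        self.granules.append(Granule(x, y, t))

    def predict_one(self, x):
        # Low-latency prediction from the small set of surviving granules.
        if not self.granules:
            return 0.0
        weights = [(1.0 / (self._dist(x, g.centroid) + 1e-9), g.mean_y)
                   for g in self.granules]
        total = sum(w for w, _ in weights)
        return sum(w * y for w, y in weights) / total
```

Because predictions touch only the compact granule list rather than the full stream history, both memory and per-query latency stay bounded, which is the core trade-off the abstract highlights.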