Distance Functions and Normalization Under Stream Scenarios
Abstract: Data normalization is an essential step when building a classification system. With data streams, normalization becomes especially challenging because the properties of the features, such as their minimum and maximum values, may be unknown in advance and may change over time. We compare the accuracies obtained with eight well-known distance functions on data streams under three settings: no normalization, normalization using the statistics of the first batch received, and normalization using the statistics of the previous batch. We argue that experimental protocols that treat the full stream as already normalized are unrealistic and can lead to biased, poor results. Our results indicate that combining the original, unnormalized data stream with the Canberra distance can be a good choice when no information about the stream is known beforehand.
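To make the compared setups concrete, below is a minimal sketch (not the paper's code) of the Canberra distance and of min-max normalization driven by batch statistics. The helper names and the NumPy-based implementation are illustrative assumptions; the key point is that the reference batch for the min/max can be the first batch or the previous batch of the stream, and drifted values may fall outside [0, 1].

```python
import numpy as np

def canberra(x, y):
    """Canberra distance: sum of |x_i - y_i| / (|x_i| + |y_i|),
    skipping terms where both coordinates are zero (0/0 is taken as 0)."""
    num = np.abs(x - y)
    den = np.abs(x) + np.abs(y)
    mask = den != 0
    return float(np.sum(num[mask] / den[mask]))

def batch_minmax(batch):
    """Per-feature min/max estimated on a reference batch
    (e.g., the first batch or the previous batch of the stream)."""
    return batch.min(axis=0), batch.max(axis=0)

def minmax_scale(x, lo, hi):
    """Min-max scale with statistics from an earlier batch; under concept
    drift, new values can land outside the [0, 1] range."""
    rng = np.where(hi - lo == 0, 1.0, hi - lo)  # avoid division by zero
    return (x - lo) / rng

# Example: normalize a new sample with the previous batch's statistics.
prev_batch = np.array([[0.0, 0.0], [2.0, 4.0]])
lo, hi = batch_minmax(prev_batch)
sample = minmax_scale(np.array([1.0, 2.0]), lo, hi)  # -> [0.5, 0.5]
```

Whether normalization helps at all then depends on how stable these batch statistics are across the stream, which is exactly what the experiments probe.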