
Distance Functions and Normalization Under Stream Scenarios

Published 30 Jun 2023 in cs.LG (arXiv:2307.00106v2)

Abstract: Data normalization is an essential task when modeling a classification system. When dealing with data streams, data normalization becomes especially challenging since we may not know in advance the properties of the features, such as their minimum/maximum values, and these properties may change over time. We compare the accuracies generated by eight well-known distance functions in data streams without normalization, normalized considering the statistics of the first batch of data received, and considering the previous batch received. We argue that experimental protocols for streams that consider the full stream as normalized are unrealistic and can lead to biased and poor results. Our results indicate that using the original data stream without applying normalization, and the Canberra distance, can be a good combination when no information about the data stream is known beforehand.


Summary

  • The paper demonstrates how normalization policies and specific distance functions affect k-NN classification accuracy in data streams, with Canberra distance excelling.
  • Methodical evaluation using eight distance functions on both synthetic and real datasets highlights trade-offs in achieving optimal stream classification.
  • Findings advise using original data with Canberra distance to ensure reliable, unbiased performance in dynamic stream conditions.

Analysis of Distance Functions and Normalization Techniques in Data Stream Scenarios

The paper "Distance Functions and Normalization Under Stream Scenarios" by Barboza et al. offers an in-depth exploration of data normalization and its impact on classification systems that operate on data streams. The work centers on the complexities introduced by stream scenarios, emphasizing the challenge of normalizing dynamic, potentially unbounded data streams whose feature properties are unknown in advance and may change over time.

Key Research Questions

The paper interrogates two fundamental research questions:

  1. Does the normalization policy influence the classifier’s competence in data streams?
  2. Does the choice of distance function matter when classifying data streams?

These inquiries are evaluated using a thorough experimental protocol encompassing synthetic and real-world datasets. By comparing multiple distance functions under varied stream scenarios, the paper seeks to illuminate the effects of normalization on classification accuracy within data streams.

Methodology

The authors assess the accuracy of eight distance functions, including Euclidean, Manhattan, Cosine, Chebyshev, Mahalanobis, Standardized Euclidean, Minkowski, and Canberra, across different normalization scenarios. These scenarios incorporate:

  • Original data streams without normalization
  • Streams normalized with statistics from the first batch
  • Streams normalized with statistics from the previous batch
  • Unrealistic normalization using the entire stream
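
The batch-wise policies above can be sketched with simple min-max scaling. This is an illustrative reconstruction, not the paper's code; the drifting synthetic batches are made up for demonstration:

```python
import numpy as np

def minmax_fit(batch):
    """Record per-feature min/max statistics from one batch."""
    return batch.min(axis=0), batch.max(axis=0)

def minmax_apply(batch, stats):
    """Scale a batch using previously recorded statistics."""
    lo, hi = stats
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against zero-range features
    return (batch - lo) / span

rng = np.random.default_rng(0)
# Five batches whose feature means drift upward over time.
stream = [rng.normal(loc=i, size=(100, 3)) for i in range(5)]

# Policy: freeze statistics from the first batch.
first_stats = minmax_fit(stream[0])
policy_first = [minmax_apply(b, first_stats) for b in stream]

# Policy: use statistics from the previous batch.
policy_prev = [minmax_apply(stream[i], minmax_fit(stream[i - 1]))
               for i in range(1, len(stream))]
```

Note how, under drift, later batches scaled with first-batch statistics leave the nominal [0, 1] range, which is exactly why frozen statistics can mislead a distance-based classifier.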

A k-NN classifier with k=3 is employed for all classification tasks, a deliberate choice given its direct reliance on distance computations. The evaluation covers both synthetic data (e.g., SEA Concepts) and diverse real-world datasets, including Electricity, Airlines, Pokerhand, Forest Covertype, and Gas Sensor.
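
The core classifier can be sketched as follows. This is a minimal illustration of k-NN with the Canberra distance, not the authors' implementation; the toy training data is invented for the example:

```python
import numpy as np

def canberra(a, b):
    """Canberra distance: sum_i |a_i - b_i| / (|a_i| + |b_i|)."""
    num = np.abs(a - b)
    den = np.abs(a) + np.abs(b)
    mask = den > 0          # terms with a_i = b_i = 0 contribute 0 by convention
    return np.sum(num[mask] / den[mask])

def knn_predict(x, X_train, y_train, k=3, dist=canberra):
    """Majority vote among the k nearest training points under `dist`."""
    d = np.array([dist(x, row) for row in X_train])
    nearest = np.argsort(d)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Illustrative data: two well-separated classes.
X_train = np.array([[1.0, 1.0], [1.2, 0.9], [10.0, 10.0], [9.0, 11.0]])
y_train = np.array([0, 0, 1, 1])
pred = knn_predict(np.array([1.1, 1.0]), X_train, y_train, k=3)
```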

Numerical Results

The study presents compelling results regarding the impact of normalization policies and distance metrics. Notably, the Canberra distance consistently achieved high accuracy without any prior normalization, positioning it as a robust choice across varied conditions. Conversely, the Cosine and Standardized Euclidean distances frequently underperformed, particularly in streams with concept drift, where feature ranges vary substantially. The results also indicate that normalizing with full-stream statistics, despite being infeasible in practice, biases the evaluation and can distort reported performance.
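
One intuition for Canberra's robustness without normalization (a plausible reading, not a claim from the paper) is that each of its terms is a ratio, so it is unaffected by a positive per-feature rescaling, whereas Euclidean distance is dominated by whichever feature has the largest range:

```python
import numpy as np
from scipy.spatial.distance import canberra, euclidean

a = np.array([1.0, 2.0])
b = np.array([2.0, 3.0])

# Rescale the second feature by 1000, as if its raw range were much larger.
scale = np.array([1.0, 1000.0])
a_s, b_s = a * scale, b * scale

d_euc, d_euc_s = euclidean(a, b), euclidean(a_s, b_s)  # blows up after rescaling
d_can, d_can_s = canberra(a, b), canberra(a_s, b_s)    # unchanged: each term is a ratio
```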

Implications and Future Directions

The research delineates significant implications for data stream handling in machine learning. It argues for the judicious selection of normalization strategies and distance functions to optimize classification performance under real-time data flow conditions. Practically, maintaining original, non-normalized data and employing the Canberra distance emerges as a prudent strategy, offering a balance between computational efficiency and classification accuracy.

Looking ahead, future research could explore alternative scaling techniques, such as z-score normalization, and extend evaluations to continuous streams of instances. Additionally, investigating the synergy between distance functions and more advanced machine learning models could uncover deeper insights into stream processing.
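
As a hypothetical sketch of the z-score direction mentioned above (not something evaluated in the paper), per-feature statistics could be maintained incrementally with Welford's algorithm, so each arriving instance is standardized against everything seen so far:

```python
import math

class RunningZScore:
    """Incrementally track mean/variance (Welford) to z-score a stream."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def transform(self, x):
        if self.n < 2:
            return 0.0  # not enough data to estimate spread
        std = math.sqrt(self.m2 / (self.n - 1))
        return (x - self.mean) / std if std > 0 else 0.0

stats = RunningZScore()
for v in [1.0, 2.0, 3.0, 4.0, 5.0]:
    stats.update(v)
z = stats.transform(5.0)  # ≈ 1.2649
```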

In conclusion, this paper contributes substantially to the understanding of how normalization policies and distance functions interplay in data stream scenarios, providing a foundation for more effective stream-based machine learning practices. By challenging the conventional reliance on static normalization, it encourages further exploration of adaptive methodologies to cater to evolving data landscapes.
