
Distance Functions and Normalization Under Stream Scenarios

Published 30 Jun 2023 in cs.LG (arXiv:2307.00106v2)

Abstract: Data normalization is an essential task when modeling a classification system. When dealing with data streams, data normalization becomes especially challenging since we may not know in advance the properties of the features, such as their minimum/maximum values, and these properties may change over time. We compare the accuracies generated by eight well-known distance functions in data streams without normalization, normalized considering the statistics of the first batch of data received, and considering the previous batch received. We argue that experimental protocols for streams that consider the full stream as normalized are unrealistic and can lead to biased and poor results. Our results indicate that using the original data stream without applying normalization, and the Canberra distance, can be a good combination when no information about the data stream is known beforehand.


Summary

  • The paper demonstrates how normalization policies and specific distance functions affect k-NN classification accuracy in data streams, with Canberra distance excelling.
  • Methodical evaluation using eight distance functions on both synthetic and real datasets highlights trade-offs in achieving optimal stream classification.
  • Findings advise using original data with Canberra distance to ensure reliable, unbiased performance in dynamic stream conditions.

Analysis of Distance Functions and Normalization Techniques in Data Stream Scenarios

The paper "Distance Functions and Normalization Under Stream Scenarios" by Barboza et al. offers an in-depth exploration of data normalization and its impact on classification systems that operate on data streams. The work centers on the complexities introduced by stream scenarios, emphasizing the challenge of normalizing dynamic, potentially unbounded data streams whose feature properties are unknown in advance and may change over time.

Key Research Questions

The paper interrogates two fundamental research questions:

  1. Does the normalization policy influence the classifier’s competence in data streams?
  2. Does the choice of distance function matter when classifying data streams?

These inquiries are evaluated using a thorough experimental protocol encompassing synthetic and real-world datasets. By comparing multiple distance functions under varied stream scenarios, the paper seeks to illuminate the effects of normalization on classification accuracy within data streams.

Methodology

The authors assess the accuracy of eight distance functions, including Euclidean, Manhattan, Cosine, Chebyshev, Mahalanobis, Standardized Euclidean, Minkowski, and Canberra, across different normalization scenarios. These scenarios incorporate:

  • Original data streams without normalization
  • Streams normalized with statistics from the first batch
  • Streams normalized with statistics from the previous batch
  • Unrealistic normalization using the entire stream
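
The batch-wise policies above can be sketched with simple min-max scaling. This is an illustrative reconstruction, not the paper's code; the drifting synthetic batches are made up for demonstration:

```python
import numpy as np

def minmax_fit(batch):
    """Record per-feature min/max statistics from one batch."""
    return batch.min(axis=0), batch.max(axis=0)

def minmax_apply(batch, stats):
    """Scale a batch using previously recorded statistics."""
    lo, hi = stats
    span = np.where(hi > lo, hi - lo, 1.0)  # guard against zero-range features
    return (batch - lo) / span

rng = np.random.default_rng(0)
# Five batches whose feature means drift upward over time.
stream = [rng.normal(loc=i, size=(100, 3)) for i in range(5)]

# Policy: freeze statistics from the first batch.
first_stats = minmax_fit(stream[0])
policy_first = [minmax_apply(b, first_stats) for b in stream]

# Policy: use statistics from the previous batch.
policy_prev = [minmax_apply(stream[i], minmax_fit(stream[i - 1]))
               for i in range(1, len(stream))]
```

Note how, under drift, later batches scaled with first-batch statistics leave the nominal [0, 1] range, which is exactly why frozen statistics can mislead a distance-based classifier.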

A k-NN classifier with k=3 is employed for all classification tasks, a deliberate choice given its direct reliance on distance computations. The evaluation covers both synthetic data (e.g., SEA Concepts) and diverse real-world datasets, including Electricity, Airlines, Pokerhand, Forest Covertype, and Gas Sensor.
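
The core classifier can be sketched as follows. This is a minimal illustration of k-NN with the Canberra distance, not the authors' implementation; the toy training data is invented for the example:

```python
import numpy as np

def canberra(a, b):
    """Canberra distance: sum_i |a_i - b_i| / (|a_i| + |b_i|)."""
    num = np.abs(a - b)
    den = np.abs(a) + np.abs(b)
    mask = den > 0          # terms with a_i = b_i = 0 contribute 0 by convention
    return np.sum(num[mask] / den[mask])

def knn_predict(x, X_train, y_train, k=3, dist=canberra):
    """Majority vote among the k nearest training points under `dist`."""
    d = np.array([dist(x, row) for row in X_train])
    nearest = np.argsort(d)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Illustrative data: two well-separated classes.
X_train = np.array([[1.0, 1.0], [1.2, 0.9], [10.0, 10.0], [9.0, 11.0]])
y_train = np.array([0, 0, 1, 1])
pred = knn_predict(np.array([1.1, 1.0]), X_train, y_train, k=3)
```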

Numerical Results

The study presents compelling results regarding the impact of normalization policies and distance metrics. Notably, the Canberra distance consistently achieved high accuracy without any prior normalization, positioning it as a robust choice across varied conditions. Conversely, the Cosine and Standardized Euclidean distances frequently underperformed, particularly in streams with concept drift, where feature ranges vary substantially. The results also indicate that normalizing with full-stream statistics, despite being infeasible in practice, biases the evaluation and can distort reported performance.
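
One intuition for Canberra's robustness without normalization (a plausible reading, not a claim from the paper) is that each of its terms is a ratio, so it is unaffected by a positive per-feature rescaling, whereas Euclidean distance is dominated by whichever feature has the largest range:

```python
import numpy as np
from scipy.spatial.distance import canberra, euclidean

a = np.array([1.0, 2.0])
b = np.array([2.0, 3.0])

# Rescale the second feature by 1000, as if its raw range were much larger.
scale = np.array([1.0, 1000.0])
a_s, b_s = a * scale, b * scale

d_euc, d_euc_s = euclidean(a, b), euclidean(a_s, b_s)  # blows up after rescaling
d_can, d_can_s = canberra(a, b), canberra(a_s, b_s)    # unchanged: each term is a ratio
```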

Implications and Future Directions

The research delineates significant implications for data stream handling in machine learning. It argues for the judicious selection of normalization strategies and distance functions to optimize classification performance under real-time data flow conditions. Practically, maintaining original, non-normalized data and employing the Canberra distance emerges as a prudent strategy, offering a balance between computational efficiency and classification accuracy.

Looking ahead, future research could explore alternative scaling techniques, such as z-score normalization, and extend evaluations to continuous streams of instances. Additionally, investigating the synergy between distance functions and more advanced machine learning models could uncover deeper insights into stream processing.
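
As a hypothetical sketch of the z-score direction mentioned above (not something evaluated in the paper), per-feature statistics could be maintained incrementally with Welford's algorithm, so each arriving instance is standardized against everything seen so far:

```python
import math

class RunningZScore:
    """Incrementally track mean/variance (Welford) to z-score a stream."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def transform(self, x):
        if self.n < 2:
            return 0.0  # not enough data to estimate spread
        std = math.sqrt(self.m2 / (self.n - 1))
        return (x - self.mean) / std if std > 0 else 0.0

stats = RunningZScore()
for v in [1.0, 2.0, 3.0, 4.0, 5.0]:
    stats.update(v)
z = stats.transform(5.0)  # ≈ 1.2649
```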

In conclusion, this paper contributes substantially to the understanding of how normalization policies and distance functions interplay in data stream scenarios, providing a foundation for more effective stream-based machine learning practices. By challenging the conventional reliance on static normalization, it encourages further exploration of adaptive methodologies to cater to evolving data landscapes.
