
Distribution Density, Tails, and Outliers in Machine Learning: Metrics and Applications (1910.13427v1)

Published 29 Oct 2019 in cs.LG and stat.ML

Abstract: We develop techniques to quantify the degree to which a given (training or testing) example is an outlier in the underlying distribution. We evaluate five methods to score examples in a dataset by how well-represented the examples are, for different plausible definitions of "well-represented", and apply these to four common datasets: MNIST, Fashion-MNIST, CIFAR-10, and ImageNet. Despite being independent approaches, we find all five are highly correlated, suggesting that the notion of being well-represented can be quantified. Among other uses, we find these methods can be combined to identify (a) prototypical examples (that match human expectations); (b) memorized training examples; and, (c) uncommon submodes of the dataset. Further, we show how we can utilize our metrics to determine an improved ordering for curriculum learning, and impact adversarial robustness. We release all metric values on training and test sets we studied.

Citations (57)

Summary

  • The paper introduces five metrics (adv, ret, agr, conf, priv) to evaluate the representativeness and outlier status of examples in datasets.
  • It demonstrates strong correlations among the metrics, notably between adversarial robustness and retraining stability, validating retraining as a proxy for robustness.
  • The framework offers actionable insights to refine curriculum learning and data curation by isolating ambiguous instances and enhancing model interpretability.

Quantifying Outliers and Tails in Machine Learning Datasets

The paper "Distribution Density, Tails, and Outliers in Machine Learning: Metrics and Applications" addresses a significant challenge in ML: understanding and quantifying how representative or outlier-like each example is within a dataset. The authors introduce five distinct metrics to evaluate the degree to which examples in datasets such as MNIST, Fashion-MNIST, CIFAR-10, and ImageNet are well-represented or outliers. This paper provides an analytical approach to measuring dataset variability, which is crucial for practitioners interested in model training, interpretability, and adversarial robustness.

Methodology and Metrics

The authors propose five metrics:

  1. Adversarial Robustness (adv): This metric assesses how susceptible an example is to adversarial perturbations, with the hypothesis that well-represented examples exhibit greater robustness.
  2. Holdout Retraining (ret): This metric measures the variability in model predictions when a particular example is excluded from training. The stability of the model's output signals how prototypical the instance is.
  3. Ensemble Agreement (agr): This metric evaluates how consistently various models classify an example; a high level of consensus among an ensemble of models signals a representative example.
  4. Model Confidence (conf): This metric uses model confidence as a proxy for representativeness, with higher confidence typically indicating a more typical example.
  5. Privacy-preserving Training (priv): This metric leverages a model trained with differential privacy to determine representativeness. A well-represented example should be classified correctly even when privacy-related noise is introduced during training.
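Two of these metrics are simple enough to sketch directly. The following Python is illustrative only (the function names, array layouts, and the toy data are assumptions, not the paper's code): a conf-style score reads off the probability assigned to the true label, and an agr-style score is the fraction of ensemble members that agree with the majority vote.

```python
import numpy as np

def model_confidence(probs, labels):
    """conf-style score: probability the model assigns to the true label.
    probs: (n_examples, n_classes) softmax outputs; labels: true classes."""
    return probs[np.arange(len(labels)), labels]

def ensemble_agreement(preds):
    """agr-style score: fraction of ensemble members agreeing with the
    majority vote. preds: (n_models, n_examples) predicted labels."""
    n_models, n_examples = preds.shape
    scores = np.empty(n_examples)
    for i in range(n_examples):
        scores[i] = np.bincount(preds[:, i]).max() / n_models
    return scores

# Toy example: 3 models, 4 examples; columns with full consensus score 1.0
preds = np.array([[0, 1, 2, 1],
                  [0, 1, 2, 2],
                  [0, 2, 2, 1]])
print(ensemble_agreement(preds))  # [1.0, 0.667, 1.0, 0.667]
```

High scores on either function mark examples as well-represented under that metric; the paper's actual implementations may differ in normalization and model details.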

Key Findings

The research uncovered strong correlations among the proposed metrics across all four datasets. Particularly noteworthy is the correlation between adversarial robustness (adv) and holdout retraining stability (ret), which positions retraining stability as a practical proxy for adversarial distance, a useful substitute in settings where adversarial examples are difficult to define or compute.

Further, the paper identifies unique types of examples, distinguished via metric disagreements:

  • Memorized Exceptions: instances that models classify confidently despite being rare or mislabeled, suggesting memorization rather than generalization.
  • Uncommon Submodes: examples that score poorly on the privacy metric (priv) yet align with coherent subpopulations of the data distribution.
  • Canonical Prototypes: examples on which all metrics agree, marking them as quintessential representatives of the dataset's underlying distribution.
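The metric-disagreement idea can be sketched as a simple filter. This is an illustrative rule, not the paper's procedure, and the threshold values are hypothetical: examples with high model confidence but low ensemble agreement become candidate memorized exceptions.

```python
import numpy as np

def flag_memorized_exceptions(conf, agr, conf_hi=0.9, agr_lo=0.6):
    """Hypothetical disagreement rule: an example a single model classifies
    confidently (conf >= conf_hi) but the ensemble disputes (agr <= agr_lo)
    may be memorized or mislabeled. Returns indices of flagged examples."""
    conf, agr = np.asarray(conf), np.asarray(agr)
    return np.flatnonzero((conf >= conf_hi) & (agr <= agr_lo))

# Example 0 is confident but contested; examples 1 and 2 are not flagged.
print(flag_memorized_exceptions([0.99, 0.50, 0.95], [0.4, 0.9, 0.8]))  # [0]
```

Analogous rules (e.g. low priv but high agr for uncommon submodes) follow the same pattern of intersecting thresholded metric scores.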

Implications and Future Directions

The implications of this paper are manifold. Practically, the metrics can refine curriculum learning strategies, optimize data curation by isolating ambiguous or mislabeled instances, and improve model robustness by focusing on specific data substructures. Theoretically, the work establishes a framework for interpreting datasets with greater nuance, beyond simply maximizing accuracy.
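One hedged sketch of the curriculum-learning use case: assuming a single representativeness score per example where higher means more prototypical (the ordering criterion here is mine, not a prescription from the paper), training examples can be presented from most to least well-represented.

```python
import numpy as np

def curriculum_order(scores, easy_first=True):
    """Order example indices by a representativeness score.
    easy_first=True presents well-represented examples before outliers."""
    idx = np.argsort(scores)            # ascending: least represented first
    return idx[::-1] if easy_first else idx

scores = np.array([0.2, 0.9, 0.5])      # higher = more prototypical (assumed)
print(curriculum_order(scores))         # [1 2 0]
```

Whether easy-first or hard-first helps depends on the task; the point is that the metrics supply a principled ordering to experiment with.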

The introduction of a robust framework to analyze outliers in datasets has prospective applications in fine-tuning models for adversarial robustness, achieving more interpretable ML methodologies, and constructing curriculum learning paths that align with data complexity and representativeness. Future research could pivot towards expanding these metrics to more complex, unsupervised learning tasks and exploring their potential in novel domains such as anomaly detection or fairness in AI.

In summary, this paper presents a structured approach to understanding dataset distributions within the ML context. It highlights the importance of evaluating representativeness from multiple perspectives, offering a rich set of tools for researchers and practitioners to enhance their models' training and evaluation processes.
