
Identifying Mislabeled Training Data (1106.0219v1)

Published 1 Jun 2011 in cs.AI

Abstract: This paper presents a new approach to identifying and eliminating mislabeled training instances for supervised learning. The goal of this approach is to improve classification accuracies produced by learning algorithms by improving the quality of the training data. Our approach uses a set of learning algorithms to create classifiers that serve as noise filters for the training data. We evaluate single algorithm, majority vote and consensus filters on five datasets that are prone to labeling errors. Our experiments illustrate that filtering significantly improves classification accuracy for noise levels up to 30 percent. An analytical and empirical evaluation of the precision of our approach shows that consensus filters are conservative at throwing away good data at the expense of retaining bad data and that majority filters are better at detecting bad data at the expense of throwing away good data. This suggests that for situations in which there is a paucity of data, consensus filters are preferable, whereas majority vote filters are preferable for situations with an abundance of data.

Citations (954)

Summary

  • The paper introduces a multi-algorithm filtering procedure using n-fold cross-validation to pinpoint mislabeled instances.
  • Filtering improves classification accuracy for noise levels up to 30% and reduces decision tree complexity.
  • The study analyzes the trade-off between Type I errors (discarding good data) and Type II errors (retaining bad data), offering practical guidance for managing label noise in training data.

Identifying Mislabeled Training Data: A Comprehensive Analysis

The paper "Identifying Mislabeled Training Data" by Carla E. Brodley and Mark A. Friedl introduces a methodical approach to improving the quality of training data in supervised learning by detecting and eliminating mislabeled instances. This paper is significant as it directly addresses the challenge of label noise, which can notably hinder the performance of learning algorithms. The authors explore various filtering techniques, including single algorithm filters, majority vote filters, and consensus filters, and evaluate their effectiveness across several datasets prone to labeling errors.

Methodology

The authors propose a general procedure for identifying mislabeled instances. A set of learning algorithms, referred to as filter algorithms, tags each instance as correctly or incorrectly labeled through an n-fold cross-validation process. The core idea is to build multiple classifiers from subsets of the training data and use their collective prediction errors to flag potentially mislabeled instances. The approach draws inspiration from outlier detection in regression analysis, but differs in that errors in the class labels are treated as independent of any particular model of the data.
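
The sketch below illustrates this cross-validated filtering procedure. It is a minimal reconstruction assuming scikit-learn-style estimators and NumPy arrays; the function name, parameters, and voting thresholds are illustrative rather than the authors' original implementation.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

def flag_mislabeled(X, y, filter_algorithms, n_folds=10, scheme="majority"):
    """Flag potentially mislabeled instances via n-fold cross-validated filtering.

    scheme="majority": flag an instance if more than half of the filter
    algorithms misclassify it; scheme="consensus": flag it only if every
    filter algorithm misclassifies it. X and y are NumPy arrays.
    """
    votes = np.zeros(len(y), dtype=int)  # per-instance misclassification count
    skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
    for train_idx, test_idx in skf.split(X, y):
        for algo in filter_algorithms:
            # Train each filter on the other folds, then vote on the held-out fold.
            clf = clone(algo).fit(X[train_idx], y[train_idx])
            votes[test_idx] += (clf.predict(X[test_idx]) != y[test_idx])

    m = len(filter_algorithms)
    if scheme == "consensus":
        return votes == m   # all filters disagree with the given label
    return votes > m / 2    # a majority of filters disagree
```

A single-algorithm filter corresponds to a one-element filter_algorithms list; instances flagged by the chosen scheme are removed before training the final classifier.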

The paper evaluates the filtering algorithms on five datasets: Automated Land Cover Mapping, Credit Approval, Scene Segmentation, Road Segmentation, and Fire Danger Prediction. Each dataset has known sources of labeling noise, arising from subjective judgment, data-entry errors, or inherent ambiguity in the data itself.

Empirical Evaluation

The empirical evaluation examines the effects of filtering on classification accuracy and decision tree size. Across different noise levels, filtering substantially improves classification accuracy, particularly for noise levels up to 20%. For the land cover dataset, for example, classification accuracy with majority filtering remains close to the baseline accuracy (measured with no artificially introduced noise) for noise levels up to 20%, whereas accuracy without filtering drops significantly beyond 10% noise.
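
Experiments of this kind corrupt a known fraction of the training labels before filtering. The helper below is a hypothetical illustration of such a corruption step; the paper's exact protocol may differ in detail.

```python
import numpy as np

def inject_label_noise(y, noise_level, seed=0):
    """Corrupt a fraction `noise_level` of labels, replacing each selected
    label with a different class drawn uniformly at random. Returns the
    corrupted labels and the indices that were changed."""
    rng = np.random.default_rng(seed)
    y_noisy = np.asarray(y).copy()
    classes = np.unique(y_noisy)
    n_flip = int(round(noise_level * len(y_noisy)))
    noisy_idx = rng.choice(len(y_noisy), size=n_flip, replace=False)
    for i in noisy_idx:
        y_noisy[i] = rng.choice(classes[classes != y_noisy[i]])
    return y_noisy, noisy_idx
```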

A notable trend is the conservative nature of consensus filters: they retain more data, making fewer Type I errors (E1, discarding good data) but more Type II errors (E2, retaining bad data). Majority vote filters detect more of the mislabeled data at the cost of discarding some good instances, and tend to perform better in high-noise conditions, where the cost of retaining mislabeled data outweighs the occasional loss of good instances.
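
Because the corrupted indices are known in such experiments, both error types can be measured directly. The helper below is an illustrative sketch; flagged and noisy_idx refer to the hypothetical outputs of the earlier sketches.

```python
import numpy as np

def filter_error_rates(flagged, noisy_idx, n_total):
    """E1 (Type I): fraction of correctly labeled instances that were discarded.
       E2 (Type II): fraction of mislabeled instances that were retained."""
    is_noisy = np.zeros(n_total, dtype=bool)
    is_noisy[noisy_idx] = True
    flagged = np.asarray(flagged, dtype=bool)
    e1 = np.mean(flagged[~is_noisy])   # good data thrown away
    e2 = np.mean(~flagged[is_noisy])   # bad data kept
    return e1, e2
```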

When the final classifier is a decision tree, the effect of filtering is also evident in reduced tree size. Removing mislabeled instances simplifies the learned decision boundaries, yielding smaller trees that are easier to interpret and typically less prone to overfitting. For instance, on the road segmentation dataset, filtering at the 20% noise level substantially reduced the number of leaves in the resulting tree.
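
As a rough illustration of this effect, the toy run below ties together the sketches above on synthetic data; scikit-learn's CART trees stand in for the decision-tree learner used in the paper, so the numbers are purely illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy end-to-end run: inject 20% label noise, filter with a majority vote over
# three filter algorithms, and compare decision-tree size before and after.
X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6,
                           random_state=0)
y_noisy, noisy_idx = inject_label_noise(y, noise_level=0.20)

filters = [DecisionTreeClassifier(random_state=0),
           KNeighborsClassifier(),
           LogisticRegression(max_iter=1000)]
flagged = flag_mislabeled(X, y_noisy, filters, scheme="majority")
print("E1, E2:", filter_error_rates(flagged, noisy_idx, len(y_noisy)))

X_clean, y_clean = X[~flagged], y_noisy[~flagged]
tree_unfiltered = DecisionTreeClassifier(random_state=0).fit(X, y_noisy)
tree_filtered = DecisionTreeClassifier(random_state=0).fit(X_clean, y_clean)
print("leaves without filtering:", tree_unfiltered.get_n_leaves())
print("leaves with filtering:   ", tree_filtered.get_n_leaves())
```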

Practical Implications and Future Directions

Practically, this research suggests that integrating filtering methods into preprocessing pipelines can substantially improve the robustness and accuracy of machine learning models in real-world settings where data is often noisy. In domains with limited training data, where the cost of losing good instances is high, choosing an appropriate filtering approach becomes especially important.

The paper lays the groundwork for numerous future research directions. One potential area is the automatic correction of mislabeled instances rather than their elimination. This involves refining the filtering approach to not only identify noise but also suggest corrections. Another exciting direction is the differentiation between mislabeled instances and true exceptions, as automatic filtering might mistakenly eliminate rare but important cases. Developing diagnostic tools to distinguish these cases based on classification behavior and input features could further enhance data quality.

Conclusion

Brodley and Friedl's approach provides a robust method for improving training data quality by identifying and eliminating mislabeled instances. Their meticulous empirical analysis demonstrates the efficacy of different filtering methodologies, particularly highlighting the trade-offs between retaining good data and eliminating noise. This work underscores the importance of preprocessing in supervised learning and opens avenues for further innovation in dealing with noisy datasets.