Evaluating AI systems under uncertain ground truth: a case study in dermatology (2307.02191v1)

Published 5 Jul 2023 in cs.LG, cs.CV, stat.ME, and stat.ML

Abstract: For safety, AI systems in health undergo thorough evaluations before deployment, validating their predictions against a ground truth that is assumed certain. However, this is actually not the case and the ground truth may be uncertain. Unfortunately, this is largely ignored in standard evaluation of AI models but can have severe consequences such as overestimating the future performance. To avoid this, we measure the effects of ground truth uncertainty, which we assume decomposes into two main components: annotation uncertainty which stems from the lack of reliable annotations, and inherent uncertainty due to limited observational information. This ground truth uncertainty is ignored when estimating the ground truth by deterministically aggregating annotations, e.g., by majority voting or averaging. In contrast, we propose a framework where aggregation is done using a statistical model. Specifically, we frame aggregation of annotations as posterior inference of so-called plausibilities, representing distributions over classes in a classification setting, subject to a hyper-parameter encoding annotator reliability. Based on this model, we propose a metric for measuring annotation uncertainty and provide uncertainty-adjusted metrics for performance evaluation. We present a case study applying our framework to skin condition classification from images where annotations are provided in the form of differential diagnoses. The deterministic adjudication process called inverse rank normalization (IRN) from previous work ignores ground truth uncertainty in evaluation. Instead, we present two alternative statistical models: a probabilistic version of IRN and a Plackett-Luce-based model. We find that a large portion of the dataset exhibits significant ground truth uncertainty and standard IRN-based evaluation severely over-estimates performance without providing uncertainty estimates.

Summary

  • The paper introduces a novel framework that models ground truth uncertainty using statistical inference.
  • It applies uncertainty-adjusted metrics in dermatology to show that standard evaluations can overestimate AI performance.
  • The framework offers practical guidance for safer AI deployment by quantifying annotation uncertainty and adjusting performance metrics accordingly.

Evaluating AI Systems under Uncertain Ground Truth: A Case Study in Dermatology

This paper addresses a critical issue in the evaluation of AI systems within health contexts—uncertainty in the ground truth used for validation. Traditional evaluation methods often assume a certain and fixed ground truth derived by deterministically aggregating annotations, such as majority voting. However, this assumption does not hold true in many health-related scenarios, where the ground truth can be inherently uncertain due to factors like annotator disagreement or insufficient observational data.

Ground Truth Uncertainty

The authors dissect ground truth uncertainty into two primary components: annotation uncertainty and inherent uncertainty. Annotation uncertainty arises from imperfections in the labeling process, even when experts provide the annotations. Inherent uncertainty, in contrast, stems from cases where the observational information is limited or the task is partly subjective. For example, two dermatologists may rank different conditions first for the same image (annotation uncertainty), whereas a low-quality or genuinely ambiguous image may simply not contain enough information to distinguish visually similar conditions (inherent uncertainty).

Ignoring these forms of uncertainty can lead to an overestimation of AI system performance. To tackle this, the paper proposes a statistical framework that models the aggregation of annotations as a posterior inference problem. This framework evaluates AI systems while explicitly accounting for the uncertainty inherent in the ground truth.
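Schematically, and using notation that is only illustrative (the paper's exact formulation differs), the shift is from scoring a model against a single aggregated label to taking an expectation over a posterior on the unknown per-case class distribution (the "plausibilities" discussed below), given a case's annotations $b_1, \dots, b_R$:

$$
p(\lambda \mid b_{1:R}) \;\propto\; p(\lambda) \prod_{r=1}^{R} p(b_r \mid \lambda; \theta),
\qquad
\widehat{M} \;=\; \mathbb{E}_{\lambda \sim p(\lambda \mid b_{1:R})}\!\left[ m\big(f(x), \lambda\big) \right],
$$

where $\theta$ is a hyper-parameter encoding annotator reliability, $f(x)$ is the model's prediction for the case, and $m$ is a base metric such as top-$k$ accuracy. The symbol names here are ours, not the paper's.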

Proposed Framework

The framework introduces the concept of "plausibilities," which are distributions over possible classes in a classification task, derived through a statistical model of annotator reliability. The authors develop uncertainty-adjusted metrics to better evaluate AI systems, offering a more nuanced understanding of their performance.
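As a minimal sketch of how an uncertainty-adjusted metric can be computed in practice, the snippet below Monte Carlo samples plausibility vectors and averages a top-k accuracy over them. The Dirichlet posterior and all function and variable names are assumptions made for illustration, not the paper's actual statistical model.

```python
import numpy as np

rng = np.random.default_rng(0)

def uncertainty_adjusted_topk(pred_probs, annotation_counts, k=3, n_samples=1000):
    """Monte Carlo sketch of an uncertainty-adjusted top-k accuracy.

    pred_probs:        (num_cases, num_classes) model probabilities.
    annotation_counts: (num_cases, num_classes) how often annotators chose each class.
                       The Dirichlet posterior below is an illustrative assumption,
                       not the paper's model.
    """
    num_cases, num_classes = pred_probs.shape
    topk = np.argsort(-pred_probs, axis=1)[:, :k]           # model's top-k classes per case
    accs = []
    for _ in range(n_samples):
        # Sample one plausibility vector per case from a simple Dirichlet posterior.
        plaus = np.stack([rng.dirichlet(1.0 + c) for c in annotation_counts])
        sampled_truth = np.array([rng.choice(num_classes, p=p) for p in plaus])
        hit = (topk == sampled_truth[:, None]).any(axis=1)  # sampled label in the top-k?
        accs.append(hit.mean())
    accs = np.asarray(accs)
    return accs.mean(), np.percentile(accs, [2.5, 97.5])    # point estimate + interval

# Toy usage: 3 cases, 4 conditions.
pred = np.array([[0.70, 0.20, 0.05, 0.05],
                 [0.10, 0.60, 0.20, 0.10],
                 [0.25, 0.25, 0.25, 0.25]])
counts = np.array([[3, 1, 0, 0],
                   [0, 2, 2, 0],
                   [1, 1, 1, 1]])
mean_acc, interval = uncertainty_adjusted_topk(pred, counts, k=2)
print(mean_acc, interval)
```

Because the metric is averaged over posterior samples, it naturally comes with an uncertainty interval rather than a single point estimate.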

For the dermatology case study, the authors focus on skin condition classification from images, where annotations are given as differential diagnoses, i.e., ranked lists of candidate conditions. Two statistical models are introduced for aggregation: a probabilistic version of inverse rank normalization (IRN) and a Plackett–Luce-based model. Both models reveal substantial ground truth uncertainty that the deterministic IRN adjudication from prior work overlooks.
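To make the second model family concrete, the snippet below evaluates the textbook Plackett-Luce likelihood of a (possibly partial) ranking, here standing in for one annotator's differential diagnosis ordered from most to least plausible. The paper's parameterization and inference procedure differ in detail; this is only the standard likelihood.

```python
import numpy as np

def plackett_luce_log_likelihood(ranking, worths):
    """Log-likelihood of a (possibly partial) ranking under a Plackett-Luce model.

    ranking: list of class indices ordered from most to least plausible
             (e.g., one annotator's differential diagnosis).
    worths:  non-negative "worth" parameters, one per class; larger means more plausible.
    """
    worths = np.asarray(worths, dtype=float)
    remaining = list(range(len(worths)))   # classes still available at each stage
    log_lik = 0.0
    for cls in ranking:
        denom = worths[remaining].sum()    # PL: pick proportionally to worth among remaining
        log_lik += np.log(worths[cls]) - np.log(denom)
        remaining.remove(cls)
    return log_lik

# Toy usage: 4 possible conditions; one annotator ranked condition 2 first, then condition 0.
print(plackett_luce_log_likelihood([2, 0], worths=[1.0, 0.5, 3.0, 0.2]))
```

In the paper's framework, per-case worth parameters of this kind (the plausibilities) would be inferred from all annotators' rankings rather than fixed by hand.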

Numerical Results and Implications

The paper finds significant ground truth uncertainty in a large portion of the dataset, as revealed by uncertainty-adjusted metrics such as top-k accuracy and average overlap. The analysis shows that standard IRN-based evaluation considerably overestimates classifier performance while providing no estimate of that uncertainty, masking the true variability and reliability of the AI system.
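For reference, average overlap compares two ranked lists by the size of their intersecting top-d prefixes, averaged over prefix depths. The sketch below follows the standard definition and may differ in detail from the exact variant used in the paper.

```python
def average_overlap(ranking_a, ranking_b, depth):
    """Average overlap of two ranked lists up to a given depth.

    For each prefix length d = 1..depth, compute |top-d(a) ∩ top-d(b)| / d,
    then average over d. Returns a value in [0, 1].
    """
    total = 0.0
    for d in range(1, depth + 1):
        overlap = len(set(ranking_a[:d]) & set(ranking_b[:d]))
        total += overlap / d
    return total / depth

# Toy usage: the model's ranked conditions vs. a plausibility-derived ranking.
print(average_overlap([3, 1, 0, 2], [3, 0, 1, 2], depth=3))  # ≈ 0.83
```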

Practical and Theoretical Implications

This research has implications for both practice and theory. Practically, the findings advocate for more cautious deployment of AI systems in medical settings, where decisions often carry high stakes, and give model developers a more robust basis for model selection and performance evaluation.

Theoretically, this approach enriches our understanding of uncertainty in machine learning evaluation. It offers a pathway to a more rigorous statistical treatment of annotations, which can support better-founded trust in AI predictions.

Future Directions

The proposed framework can be seen as a stepping stone for further research into evaluation methodologies that respect the complexity of real-world data. Future work could refine the statistical models used for aggregation or examine other types of uncertainty in other domains, broadening the approach's applicability to settings with varying levels of uncertainty.

In summary, this paper provides valuable insights into handling uncertain ground truth in AI evaluations, particularly within healthcare, and offers a concrete approach to integrating uncertainty into performance metrics, thus contributing to the development of safer and more reliable AI systems.