Evaluating AI systems under uncertain ground truth: a case study in dermatology (2307.02191v1)

Published 5 Jul 2023 in cs.LG, cs.CV, stat.ME, and stat.ML

Abstract: For safety, AI systems in health undergo thorough evaluations before deployment, validating their predictions against a ground truth that is assumed certain. However, this is actually not the case and the ground truth may be uncertain. Unfortunately, this is largely ignored in standard evaluation of AI models but can have severe consequences such as overestimating the future performance. To avoid this, we measure the effects of ground truth uncertainty, which we assume decomposes into two main components: annotation uncertainty which stems from the lack of reliable annotations, and inherent uncertainty due to limited observational information. This ground truth uncertainty is ignored when estimating the ground truth by deterministically aggregating annotations, e.g., by majority voting or averaging. In contrast, we propose a framework where aggregation is done using a statistical model. Specifically, we frame aggregation of annotations as posterior inference of so-called plausibilities, representing distributions over classes in a classification setting, subject to a hyper-parameter encoding annotator reliability. Based on this model, we propose a metric for measuring annotation uncertainty and provide uncertainty-adjusted metrics for performance evaluation. We present a case study applying our framework to skin condition classification from images where annotations are provided in the form of differential diagnoses. The deterministic adjudication process called inverse rank normalization (IRN) from previous work ignores ground truth uncertainty in evaluation. Instead, we present two alternative statistical models: a probabilistic version of IRN and a Plackett-Luce-based model. We find that a large portion of the dataset exhibits significant ground truth uncertainty and standard IRN-based evaluation severely over-estimates performance without providing uncertainty estimates.

Summary

  • The paper introduces a novel framework that models ground truth uncertainty using statistical inference.
  • It applies uncertainty-adjusted metrics in dermatology to show that standard evaluations can overestimate AI performance.
  • The framework offers practical guidance for safer AI deployment by quantifying annotation uncertainty and adjusting performance metrics accordingly.

Evaluating AI Systems under Uncertain Ground Truth: A Case Study in Dermatology

This paper addresses a critical issue in the evaluation of AI systems within health contexts—uncertainty in the ground truth used for validation. Traditional evaluation methods often assume a certain and fixed ground truth derived by deterministically aggregating annotations, such as majority voting. However, this assumption does not hold true in many health-related scenarios, where the ground truth can be inherently uncertain due to factors like annotator disagreement or insufficient observational data.

Ground Truth Uncertainty

The authors dissect ground truth uncertainty into two primary components: annotation uncertainty and inherent uncertainty. Annotation uncertainty arises from imperfections in the labeling process, even when experts provide the annotations. Inherent uncertainty, in contrast, stems from cases where the observational information is limited or the task is partly subjective. For example, two dermatologists may rank different conditions first for the same image (annotation uncertainty), whereas a low-quality or genuinely ambiguous image may simply not contain enough information to distinguish visually similar conditions (inherent uncertainty).

Ignoring these forms of uncertainty can lead to an overestimation of AI system performance. To tackle this, the paper proposes a statistical framework that models the aggregation of annotations as a posterior inference problem. This framework evaluates AI systems while explicitly accounting for the uncertainty inherent in the ground truth.
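Schematically, and using notation that is only illustrative (the paper's exact formulation differs), the shift is from scoring a model against a single aggregated label to taking an expectation over a posterior on the unknown per-case class distribution (the "plausibilities" discussed below), given a case's annotations $b_1, \dots, b_R$:

$$
p(\lambda \mid b_{1:R}) \;\propto\; p(\lambda) \prod_{r=1}^{R} p(b_r \mid \lambda; \theta),
\qquad
\widehat{M} \;=\; \mathbb{E}_{\lambda \sim p(\lambda \mid b_{1:R})}\!\left[ m\big(f(x), \lambda\big) \right],
$$

where $\theta$ is a hyper-parameter encoding annotator reliability, $f(x)$ is the model's prediction for the case, and $m$ is a base metric such as top-$k$ accuracy. The symbol names here are ours, not the paper's.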

Proposed Framework

The framework introduces the concept of "plausibilities," which are distributions over possible classes in a classification task, derived through a statistical model of annotator reliability. The authors develop uncertainty-adjusted metrics to better evaluate AI systems, offering a more nuanced understanding of their performance.
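As a minimal sketch of how an uncertainty-adjusted metric can be computed in practice, the snippet below Monte Carlo samples plausibility vectors and averages a top-k accuracy over them. The Dirichlet posterior and all function and variable names are assumptions made for illustration, not the paper's actual statistical model.

```python
import numpy as np

rng = np.random.default_rng(0)

def uncertainty_adjusted_topk(pred_probs, annotation_counts, k=3, n_samples=1000):
    """Monte Carlo sketch of an uncertainty-adjusted top-k accuracy.

    pred_probs:        (num_cases, num_classes) model probabilities.
    annotation_counts: (num_cases, num_classes) how often annotators chose each class.
                       The Dirichlet posterior below is an illustrative assumption,
                       not the paper's model.
    """
    num_cases, num_classes = pred_probs.shape
    topk = np.argsort(-pred_probs, axis=1)[:, :k]           # model's top-k classes per case
    accs = []
    for _ in range(n_samples):
        # Sample one plausibility vector per case from a simple Dirichlet posterior.
        plaus = np.stack([rng.dirichlet(1.0 + c) for c in annotation_counts])
        sampled_truth = np.array([rng.choice(num_classes, p=p) for p in plaus])
        hit = (topk == sampled_truth[:, None]).any(axis=1)  # sampled label in the top-k?
        accs.append(hit.mean())
    accs = np.asarray(accs)
    return accs.mean(), np.percentile(accs, [2.5, 97.5])    # point estimate + interval

# Toy usage: 3 cases, 4 conditions.
pred = np.array([[0.70, 0.20, 0.05, 0.05],
                 [0.10, 0.60, 0.20, 0.10],
                 [0.25, 0.25, 0.25, 0.25]])
counts = np.array([[3, 1, 0, 0],
                   [0, 2, 2, 0],
                   [1, 1, 1, 1]])
mean_acc, interval = uncertainty_adjusted_topk(pred, counts, k=2)
print(mean_acc, interval)
```

Because the metric is averaged over posterior samples, it naturally comes with an uncertainty interval rather than a single point estimate.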

For the dermatology case study, the authors focus on skin condition classification from images, where annotations are given as differential diagnoses, i.e., ranked lists of candidate conditions. Two statistical models are introduced for aggregation: a probabilistic version of inverse rank normalization (IRN) and a Plackett–Luce-based model. Both models reveal substantial ground truth uncertainty that the deterministic IRN adjudication from prior work overlooks.
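To make the second model family concrete, the snippet below evaluates the textbook Plackett-Luce likelihood of a (possibly partial) ranking, here standing in for one annotator's differential diagnosis ordered from most to least plausible. The paper's parameterization and inference procedure differ in detail; this is only the standard likelihood.

```python
import numpy as np

def plackett_luce_log_likelihood(ranking, worths):
    """Log-likelihood of a (possibly partial) ranking under a Plackett-Luce model.

    ranking: list of class indices ordered from most to least plausible
             (e.g., one annotator's differential diagnosis).
    worths:  non-negative "worth" parameters, one per class; larger means more plausible.
    """
    worths = np.asarray(worths, dtype=float)
    remaining = list(range(len(worths)))   # classes still available at each stage
    log_lik = 0.0
    for cls in ranking:
        denom = worths[remaining].sum()    # PL: pick proportionally to worth among remaining
        log_lik += np.log(worths[cls]) - np.log(denom)
        remaining.remove(cls)
    return log_lik

# Toy usage: 4 possible conditions; one annotator ranked condition 2 first, then condition 0.
print(plackett_luce_log_likelihood([2, 0], worths=[1.0, 0.5, 3.0, 0.2]))
```

In the paper's framework, per-case worth parameters of this kind (the plausibilities) would be inferred from all annotators' rankings rather than fixed by hand.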

Numerical Results and Implications

The paper finds significant ground truth uncertainty in a large portion of the dataset, as revealed by uncertainty-adjusted metrics such as top-k accuracy and average overlap. The analysis shows that standard IRN-based evaluation considerably overestimates classifier performance while providing no estimate of that uncertainty, masking the true variability and reliability of the AI system.
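For reference, average overlap compares two ranked lists by the size of their intersecting top-d prefixes, averaged over prefix depths. The sketch below follows the standard definition and may differ in detail from the exact variant used in the paper.

```python
def average_overlap(ranking_a, ranking_b, depth):
    """Average overlap of two ranked lists up to a given depth.

    For each prefix length d = 1..depth, compute |top-d(a) ∩ top-d(b)| / d,
    then average over d. Returns a value in [0, 1].
    """
    total = 0.0
    for d in range(1, depth + 1):
        overlap = len(set(ranking_a[:d]) & set(ranking_b[:d]))
        total += overlap / d
    return total / depth

# Toy usage: the model's ranked conditions vs. a plausibility-derived ranking.
print(average_overlap([3, 1, 0, 2], [3, 0, 1, 2], depth=3))  # ≈ 0.83
```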

Practical and Theoretical Implications

This research has implications for both practice and theory. Practically, the findings advocate for more cautious deployment of AI systems in medical settings, where decisions often carry high stakes, and give model developers a more robust basis for model selection and performance evaluation.

Theoretically, this approach enriches our understanding of uncertainty in machine learning evaluation. It offers a pathway to a more rigorous statistical treatment of annotations, which can support better-founded trust in AI predictions.

Future Directions

The proposed framework can be seen as a stepping stone for further research into evaluation methodologies that respect the complexity of real-world data. Future work could refine the statistical models used for aggregation or examine other types of uncertainty in other domains, broadening the approach's applicability to settings with varying levels of uncertainty.

In summary, this paper provides valuable insights into handling uncertain ground truth in AI evaluations, particularly within healthcare, and offers a concrete approach to integrating uncertainty into performance metrics, thus contributing to the development of safer and more reliable AI systems.