
Evaluating the Utility of Conformal Prediction Sets for AI-Advised Image Labeling (2401.08876v7)

Published 16 Jan 2024 in cs.HC, cs.CV, and cs.LG

Abstract: As deep neural networks are more commonly deployed in high-stakes domains, their black-box nature makes uncertainty quantification challenging. We investigate the presentation of conformal prediction sets--a distribution-free class of methods for generating prediction sets with specified coverage--to express uncertainty in AI-advised decision-making. Through a large online experiment, we compare the utility of conformal prediction sets to displays of Top-1 and Top-k predictions for AI-advised image labeling. In a pre-registered analysis, we find that the utility of prediction sets for accuracy varies with the difficulty of the task: while they result in accuracy on par with or less than Top-1 and Top-k displays for easy images, prediction sets offer some advantage in assisting humans in labeling out-of-distribution (OOD) images in the setting that we studied, especially when the set size is small. Our results empirically pinpoint practical challenges of conformal prediction sets and provide implications for how to incorporate them into real-world decision-making.
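For readers unfamiliar with the technique named in the abstract: conformal prediction wraps any pretrained classifier and, instead of a single label, returns a set of labels guaranteed to contain the true label with a user-chosen probability (e.g., 95%), assuming only that calibration and test data are exchangeable. The sketch below shows split conformal classification with the simple "1 minus softmax probability of the true class" nonconformity score; it is an illustrative assumption, not necessarily the exact score or implementation used in the paper, and all names (conformal_prediction_sets, cal_softmax, etc.) are hypothetical.

```python
import numpy as np

def conformal_prediction_sets(cal_softmax, cal_labels, test_softmax, alpha=0.1):
    """Split conformal prediction for classification (illustrative sketch).

    cal_softmax:  (n, K) softmax outputs of any classifier on a held-out calibration set
    cal_labels:   (n,)   true labels for the calibration examples
    test_softmax: (m, K) softmax outputs on new test inputs
    alpha:        target miscoverage; each returned set contains the true label
                  with probability >= 1 - alpha (marginally, under exchangeability)
    """
    n = len(cal_labels)
    # Nonconformity score: 1 minus the probability the model assigns to the true class.
    cal_scores = 1.0 - cal_softmax[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile of the calibration scores.
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    qhat = np.quantile(cal_scores, q_level, method="higher")
    # A test label enters the prediction set whenever its score is within the threshold.
    return [np.where(1.0 - probs <= qhat)[0] for probs in test_softmax]
```

Lowering alpha raises coverage but inflates set size; the abstract's observation that prediction sets helped most on OOD images when sets were small is exactly where this trade-off bites in practice.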

Authors (4)
  1. Dongping Zhang (9 papers)
  2. Angelos Chatzimparmpas (16 papers)
  3. Negar Kamali (3 papers)
  4. Jessica Hullman (46 papers)
Citations (4)