Interpreting Predictive Probabilities: Model Confidence or Human Label Variation? (2402.16102v1)
Abstract: With the rise of increasingly powerful and user-facing NLP systems, there is growing interest in assessing whether they have a good representation of uncertainty by evaluating the quality of their predictive distribution over outcomes. We identify two main perspectives that drive starkly different evaluation protocols. The first treats predictive probability as an indication of model confidence; the second as an indication of human label variation. We discuss their merits and limitations, and take the position that both are crucial for trustworthy and fair NLP systems, but that exploiting a single predictive distribution is limiting. We recommend tools and highlight exciting directions towards models with disentangled representations of uncertainty about predictions and uncertainty about human labels.
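The two evaluation protocols contrasted in the abstract can be sketched concretely. Below is a minimal illustration (not code from the paper; the function names, metric choices, and toy numbers are our own): the "model confidence" view scores the predictive distribution by its calibration against a single gold label (here via expected calibration error over majority-vote labels), while the "human label variation" view scores it by its distance to the empirical distribution of human annotations (here via mean total-variation distance).

```python
import numpy as np

def ece(probs, majority_labels, n_bins=10):
    """Expected Calibration Error against majority-vote gold labels
    (the 'model confidence' perspective)."""
    probs = np.asarray(probs, dtype=float)
    conf = probs.max(axis=1)                      # winning-class probability
    pred = probs.argmax(axis=1)                   # predicted class
    correct = (pred == np.asarray(majority_labels)).astype(float)
    bins = np.clip((conf * n_bins).astype(int), 0, n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            # weight each bin's |accuracy - confidence| gap by its size
            total += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return total

def mean_tv_distance(probs, human_dists):
    """Mean total-variation distance between the predictive distribution
    and the empirical human label distribution
    (the 'human label variation' perspective)."""
    probs = np.asarray(probs, dtype=float)
    human = np.asarray(human_dists, dtype=float)
    return 0.5 * np.abs(probs - human).sum(axis=1).mean()

# Toy example: two items, two classes.
probs = [[0.6, 0.4], [0.9, 0.1]]          # model's predictive distributions
majority = [0, 0]                          # majority-vote gold labels
human = [[0.5, 0.5], [1.0, 0.0]]           # empirical annotator distributions

print(ece(probs, majority))                # calibration against single labels
print(mean_tv_distance(probs, human))      # distance to human label variation
```

Note how the same predictive distribution can look poor under one protocol and good under the other: the first item is "overconfident" relative to a split 50/50 human vote yet counts as a correct, moderately confident prediction against the majority label. This tension is exactly why the paper argues a single predictive distribution cannot serve both roles at once.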