
Predict the Next Word: Humans exhibit uncertainty in this task and language models _____ (2402.17527v2)

Published 27 Feb 2024 in cs.CL and cs.AI

Abstract: Language models (LMs) are statistical models trained to assign probability to human-generated text. As such, it is reasonable to question whether they approximate linguistic variability exhibited by humans well. This form of statistical assessment is difficult to perform at the passage level, for it requires acceptability judgements (i.e., human evaluation) or a robust automated proxy (which is non-trivial). At the word level, however, given some context, samples from an LM can be assessed via exact matching against a prerecorded dataset of alternative single-word continuations of the available context. We exploit this fact and evaluate the LM's ability to reproduce variability that humans (in particular, a population of English speakers) exhibit in the 'next word prediction' task. This can be seen as assessing a form of calibration, which, in the context of text classification, Baan et al. (2022) termed calibration to human uncertainty. We assess GPT2, BLOOM and ChatGPT and find that they exhibit fairly low calibration to human uncertainty. We also verify the failure of expected calibration error (ECE) to reflect this, and as such, advise the community against relying on it in this setting.
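The following is a minimal sketch (not the authors' code) of the word-level comparison the abstract describes: for a fixed context, the distribution of single-word continuations produced by a population of human speakers is compared against the distribution over an LM's sampled continuations, and this distribution-level view is contrasted with a standard expected calibration error (ECE) computation. All words, counts, and confidence values below are illustrative placeholders, not data from the paper; total variation distance is used here only as one plausible way to quantify the mismatch.

```python
# Sketch of word-level "calibration to human uncertainty" vs. ECE.
# All data are hypothetical placeholders, not values from the paper.
from collections import Counter

def empirical_dist(samples):
    """Turn a list of single-word continuations into a probability distribution."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def total_variation(p, q):
    """Total variation distance between two distributions over words."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(w, 0.0) - q.get(w, 0.0)) for w in support)

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: bin predictions by confidence, average |accuracy - confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece, n = 0.0, len(confidences)
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        acc = sum(h for _, h in b) / len(b)
        ece += (len(b) / n) * abs(avg_conf - acc)
    return ece

# Hypothetical human continuations for one context ("The children went outside to ___").
human_words = ["play"] * 12 + ["run"] * 5 + ["swim"] * 2 + ["eat"]
# Hypothetical samples from an LM given the same context.
lm_words = ["play"] * 18 + ["run"] + ["see"]

p_human = empirical_dist(human_words)
p_lm = empirical_dist(lm_words)
print("TVD(human, LM):", round(total_variation(p_human, p_lm), 3))  # 0.35

# ECE only checks whether top-choice confidence matches hit rate against a
# single reference word, so it can look fine even when the model collapses
# the spread of human responses captured by the distance above.
confs = [0.9, 0.9, 0.6, 0.3]
hits = [1, 1, 1, 0]
print("ECE:", round(expected_calibration_error(confs, hits, n_bins=5), 3))
```

In this toy example the LM concentrates almost all of its mass on the single most popular human continuation, which a per-example confidence/accuracy metric like ECE does not penalize; a distribution-level comparison against the population of human responses does.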

References (56)
  1. On the calibration of massively multilingual language models. arXiv e-prints, pages arXiv–2210.
  2. Crowdsourcing subjective tasks: The case study of understanding toxicity in online discussions. In Companion Proceedings of The 2019 World Wide Web Conference, WWW ’19, page 1100–1105, New York, NY, USA. Association for Computing Machinery.
  3. Stop measuring calibration when humans disagree. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1892–1915, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  4. Valerio Basile et al. 2020. It’s the end of the gold standard as we know it: On the impact of pre-aggregation on the evaluation of highly subjective tasks. In CEUR WORKSHOP PROCEEDINGS, volume 2776, pages 31–40. CEUR-WS.
  5. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642, Lisbon, Portugal. Association for Computational Linguistics.
  6. Kris Cao and Laura Rimell. 2021. You should evaluate your language model on marginal likelihood over tokenisations. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2104–2114, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  7. A close look into the calibration of pre-trained language models. arXiv e-prints, pages arXiv–2211.
  8. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30.
  9. Selectively answering ambiguous questions. arXiv preprint arXiv:2305.14613.
  10. Hillary Dawkins and Isar Nejadgholi. 2022. Region-dependent temperature scaling for certainty calibration and application to class-imbalanced token classification. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 538–544, Dublin, Ireland. Association for Computational Linguistics.
  11. Cloze distillation: Improving neural language models with human next-word prediction. In Proceedings of the 24th Conference on Computational Natural Language Learning, pages 609–619.
  12. What comes next? evaluating uncertainty in neural text generators against human production variability. In The 2023 Conference on Empirical Methods in Natural Language Processing.
  13. Beyond hard labels: investigating data label distributions. arXiv preprint arXiv:2207.06224.
  14. On calibration of modern neural networks. In International conference on machine learning, pages 1321–1330. PMLR.
  15. Calibration of neural networks using splines. In International Conference on Learning Representations.
  16. Preserving pre-trained features helps calibrate fine-tuned language models. arXiv preprint arXiv:2305.19249.
  17. Reward learning from human preferences and demonstrations in Atari. Advances in neural information processing systems, 31.
  18. Tailoring language generation models under total variation distance. In The Eleventh International Conference on Learning Representations.
  19. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221.
  20. CLAM: Selective clarification for ambiguous questions with generative language models.
  21. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. arXiv preprint arXiv:2302.09664.
  22. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration. In NeurIPS, pages 12295–12305.
  23. Aviral Kumar and Sunita Sarawagi. 2019. Calibration of encoder decoder models for neural machine translation. arXiv preprint arXiv:1903.00802.
  24. Trainable calibration measures for neural networks from kernel mean embeddings. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2805–2814. PMLR.
  25. Matthieu Labeau and Shay B. Cohen. 2019. Experimenting with power divergences for language modeling. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4104–4114, Hong Kong, China. Association for Computational Linguistics.
  26. Evaluating distributional distortion in neural language modeling. arXiv preprint arXiv:2203.12788.
  27. Adaptive label smoothing with self-knowledge in natural language generation. arXiv preprint arXiv:2210.13459.
  28. Can large language models capture dissenting human voices? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4569–4585.
  29. Calibration meets explanation: A simple and effective approach for model confidence estimates. arXiv preprint arXiv:2211.03041.
  30. Steven G Luke and Kiel Christianson. 2018. The Provo Corpus: A large eye-tracking corpus with predictability norms. Behavior research methods, 50:826–833.
  31. Clara Meister and Ryan Cotterell. 2021. Language model evaluation beyond perplexity. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 5328–5339, Online. Association for Computational Linguistics.
  32. Recurrent neural network based language model. In Interspeech, volume 2, pages 1045–1048. Makuhari.
  33. When does label smoothing help? Advances in neural information processing systems, 32.
  34. Obtaining well calibrated probabilities using Bayesian binning. In Proceedings of the AAAI conference on artificial intelligence, volume 29.
  35. Radford M Neal. 2012. Bayesian learning for neural networks, volume 118. Springer Science & Business Media.
  36. Alexandru Niculescu-Mizil and Rich Caruana. 2005. Predicting good probabilities with supervised learning. In Proceedings of the 22nd international conference on Machine learning, pages 625–632.
  37. What can we learn from collective human opinions on natural language inference data? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9131–9143, Online. Association for Computational Linguistics.
  38. Measuring calibration in deep learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops.
  39. OpenAI. 2022. Introducing ChatGPT. Available at https://openai.com/blog/chatgpt.
  40. Barbara Plank. 2022. The “problem” of human label variation: On ground truth in data, modeling and evaluation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10671–10682, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  41. On releasing annotator-level labels and information in datasets. arXiv preprint arXiv:2110.05699.
  42. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
  43. Whose opinions do language models reflect? arXiv preprint arXiv:2303.17548.
  44. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
  45. Modeling statistical properties of written text. PloS one, 4(4):e5372.
  46. Re-examining calibration: The case of question answering. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 2814–2829, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  47. Shuntaro Takahashi and Kumiko Tanaka-Ishii. 2019. Evaluating computational language models with scaling properties of natural language. Computational Linguistics, 45(3):481–513.
  48. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tuned with human feedback. arXiv preprint arXiv:2305.14975.
  49. Evaluating model calibration in classification. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3459–3467. PMLR.
  50. On the inference calibration of neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3070–3079, Online. Association for Computational Linguistics.
  51. Calibration tests in multi-class classification: A unifying framework. Advances in Neural Information Processing Systems, 32.
  52. Sandra Williams and Ehud Reiter. 2008. Generating basic skills reports for low-skilled readers. Natural Language Engineering, 14(4):495–525.
  53. Yijun Xiao and William Yang Wang. 2021. On hallucination and predictive uncertainty in conditional language generation. arXiv preprint arXiv:2103.15025.
  54. Uncertainty quantification with pre-trained language models: A large-scale empirical analysis. arXiv e-prints, pages arXiv–2210.
  55. MixCE: Training autoregressive language models by mixing forward and reverse cross-entropies. arXiv preprint arXiv:2305.16958.
  56. Navigating the grey area: Expressions of overconfidence and uncertainty in language models. arXiv preprint arXiv:2302.13439.
