Benchmarking LLMs via Uncertainty Quantification (2401.12794v3)

Published 23 Jan 2024 in cs.CL

Abstract: The proliferation of open-source LLMs from various institutions has highlighted the urgent need for comprehensive evaluation methods. However, current evaluation platforms, such as the widely recognized HuggingFace open LLM leaderboard, neglect a crucial aspect -- uncertainty, which is vital for thoroughly assessing LLMs. To bridge this gap, we introduce a new benchmarking approach for LLMs that integrates uncertainty quantification. Our examination involves nine LLMs (LLM series) spanning five representative natural language processing tasks. Our findings reveal that: I) LLMs with higher accuracy may exhibit lower certainty; II) Larger-scale LLMs may display greater uncertainty compared to their smaller counterparts; and III) Instruction-finetuning tends to increase the uncertainty of LLMs. These results underscore the significance of incorporating uncertainty in the evaluation of LLMs.

Introduction

LLMs have attracted significant attention for their ability to perform a wide array of NLP tasks. Nevertheless, conventional evaluation metrics overlook one vital aspect: uncertainty. Current benchmarking platforms, such as the HuggingFace open LLM leaderboard, report only accuracy and neglect the confidence the models have in their outputs.

Uncertainty Quantification in LLMs

In response to this oversight, the authors propose incorporating uncertainty quantification into the evaluation of LLMs. They employ conformal prediction, a method that offers several advantages over alternatives such as Bayesian variational inference: simplicity of implementation, computational efficiency, and statistically rigorous coverage guarantees for its uncertainty estimates. Through this lens, the paper benchmarks nine open-source LLM series across five NLP tasks, demonstrating that higher accuracy does not always entail lower uncertainty, that larger LLMs can show greater uncertainty than smaller ones, and that instruction-finetuning tends to increase uncertainty.
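For readers unfamiliar with the mechanics, the sketch below illustrates split conformal prediction on multiple-choice outputs, assuming per-option softmax scores are available from the model. The score function (an LAC-style nonconformity score) and all names in the snippet are illustrative assumptions, not the paper's exact setup.

```python
# A minimal sketch of split conformal prediction for multiple-choice LLM outputs,
# assuming per-option softmax scores are available. The nonconformity score here
# is LAC-style (one minus the probability of the true option); the paper may use
# additional or different score functions.
import numpy as np

def calibrate_threshold(cal_probs, cal_labels, alpha=0.1):
    """Compute the conformal quantile on a held-out calibration split.

    cal_probs:  (n, k) softmax scores over k answer options
    cal_labels: (n,) indices of the correct options
    alpha:      target error rate (prediction sets cover the truth with prob >= 1 - alpha)
    """
    n = len(cal_labels)
    # Nonconformity score: one minus the probability assigned to the true option.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample corrected quantile level.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, q_level, method="higher")

def prediction_sets(test_probs, threshold):
    """Return, per example, the options whose score falls below the calibrated threshold."""
    return [np.where(1.0 - p <= threshold)[0].tolist() for p in test_probs]

# Toy usage with random scores standing in for real LLM output probabilities.
rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(6), size=500)   # 6 answer options (e.g. A-F)
cal_labels = rng.integers(0, 6, size=500)
threshold = calibrate_threshold(cal_probs, cal_labels, alpha=0.1)
test_probs = rng.dirichlet(np.ones(6), size=5)
print(prediction_sets(test_probs, threshold))     # larger sets signal higher uncertainty
```

The average size of these prediction sets is the uncertainty signal: a model that needs many options to reach the target coverage is less certain than one that usually covers the answer with a single option.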

Evaluation Tasks, Prompts, and Metrics

The tasks range from question answering to document summarization, all standardized into a multiple-choice format so that output uncertainty can be measured uniformly. Several prompting strategies were employed to reduce the LLMs' sensitivity to prompt variation and ensure a fair comparison. The paper also introduces a novel metric, Uncertainty-aware Accuracy (UAcc), which complements standard accuracy with an uncertainty measure: UAcc adjusts the perceived performance gap between models by factoring in their respective certainty levels.
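To make the idea behind UAcc concrete, the following sketch discounts accuracy by the average conformal prediction-set size. The exact scaling used here (accuracy divided by average set size, rescaled by the square root of the number of options) is an assumption for illustration; consult the paper for the precise definition.

```python
# A hedged sketch of an uncertainty-aware accuracy score in the spirit of UAcc:
# accuracy is rewarded when prediction sets are small (high certainty) and
# penalized when they are large. The formula is assumed for illustration only.
import math

def uncertainty_aware_accuracy(correct, set_sizes, num_options):
    """correct:     list of booleans, whether the top prediction was right
    set_sizes:      list of conformal prediction set sizes, one per example
    num_options:    number of answer options in the multiple-choice format
    """
    acc = sum(correct) / len(correct)
    avg_set_size = sum(set_sizes) / len(set_sizes)
    return acc / avg_set_size * math.sqrt(num_options)

# Two models with identical accuracy but different certainty levels:
print(uncertainty_aware_accuracy([True, True, False, True], [1, 2, 1, 1], 6))
print(uncertainty_aware_accuracy([True, True, False, True], [4, 5, 3, 4], 6))
```

Under any scaling of this kind, two models with equal accuracy are separated by how confidently they reach it, which is exactly the adjustment UAcc is designed to capture.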

Key Findings and Implications

Surprisingly, larger-scale LLMs manifested greater uncertainty than their smaller counterparts. Moreover, instruction-finetuning, a method aimed at enhancing model performance on downstream tasks, tended to increase model uncertainty. These findings have implications not only for LLM development but also for deployment, as practitioners must weigh model accuracy against the consistency and reliability signaled by the model's uncertainty.

Coupled with the demonstration that conformal prediction efficiently quantifies uncertainty in LLM outputs across a range of tasks, these insights pave the way for more informed usage and continued improvement of LLMs. Incorporating uncertainty quantification as suggested yields a deeper understanding of the models and enables better trust calibration in their applications. The authors acknowledge limitations, such as the current inability to apply conformal prediction to models like ChatGPT or to open-ended generative tasks, but they envisage future work that could address these gaps.

Conclusion and Future Directions

In conclusion, this paper emphasizes the significant role of uncertainty quantification in evaluating LLMs and argues for a shift in benchmarking standards. As multimodal foundation models emerge, the work also hints at how such evaluation might extend beyond language into other modalities. The overarching goal is to enhance the safety, reliability, and usefulness of LLMs in practical scenarios.

Authors (8)
  1. Fanghua Ye
  2. Mingming Yang
  3. Jianhui Pang
  4. Longyue Wang
  5. Derek F. Wong
  6. Emine Yilmaz
  7. Shuming Shi
  8. Zhaopeng Tu