Benchmarking LLMs via Uncertainty Quantification (2401.12794v3)

Published 23 Jan 2024 in cs.CL

Abstract: The proliferation of open-source LLMs from various institutions has highlighted the urgent need for comprehensive evaluation methods. However, current evaluation platforms, such as the widely recognized HuggingFace open LLM leaderboard, neglect a crucial aspect -- uncertainty, which is vital for thoroughly assessing LLMs. To bridge this gap, we introduce a new benchmarking approach for LLMs that integrates uncertainty quantification. Our examination involves nine LLMs (LLM series) spanning five representative natural language processing tasks. Our findings reveal that: I) LLMs with higher accuracy may exhibit lower certainty; II) Larger-scale LLMs may display greater uncertainty compared to their smaller counterparts; and III) Instruction-finetuning tends to increase the uncertainty of LLMs. These results underscore the significance of incorporating uncertainty in the evaluation of LLMs.

Introduction

LLMs have attracted significant attention for their ability to perform a wide array of NLP tasks. Nevertheless, conventional evaluation metrics overlook one vital aspect: uncertainty. Current benchmarking platforms, such as the HuggingFace open LLM leaderboard, report only accuracy and neglect the confidence the models have in their outputs.

Uncertainty Quantification in LLMs

In response to this oversight, the authors propose incorporating uncertainty quantification into the evaluation of LLMs. They employ conformal prediction, a method that offers several advantages over alternatives such as Bayesian variational inference: simplicity of implementation, computational efficiency, and statistically rigorous coverage guarantees for its uncertainty estimates. Through this lens, the paper benchmarks nine open-source LLM series across five NLP tasks, demonstrating that higher accuracy does not always entail lower uncertainty, that larger LLMs can show greater uncertainty than smaller ones, and that instruction-finetuning tends to increase uncertainty.
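For readers unfamiliar with the mechanics, the sketch below illustrates split conformal prediction on multiple-choice outputs, assuming per-option softmax scores are available from the model. The score function (an LAC-style nonconformity score) and all names in the snippet are illustrative assumptions, not the paper's exact setup.

```python
# A minimal sketch of split conformal prediction for multiple-choice LLM outputs,
# assuming per-option softmax scores are available. The nonconformity score here
# is LAC-style (one minus the probability of the true option); the paper may use
# additional or different score functions.
import numpy as np

def calibrate_threshold(cal_probs, cal_labels, alpha=0.1):
    """Compute the conformal quantile on a held-out calibration split.

    cal_probs:  (n, k) softmax scores over k answer options
    cal_labels: (n,) indices of the correct options
    alpha:      target error rate (prediction sets cover the truth with prob >= 1 - alpha)
    """
    n = len(cal_labels)
    # Nonconformity score: one minus the probability assigned to the true option.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample corrected quantile level.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, q_level, method="higher")

def prediction_sets(test_probs, threshold):
    """Return, per example, the options whose score falls below the calibrated threshold."""
    return [np.where(1.0 - p <= threshold)[0].tolist() for p in test_probs]

# Toy usage with random scores standing in for real LLM output probabilities.
rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(6), size=500)   # 6 answer options (e.g. A-F)
cal_labels = rng.integers(0, 6, size=500)
threshold = calibrate_threshold(cal_probs, cal_labels, alpha=0.1)
test_probs = rng.dirichlet(np.ones(6), size=5)
print(prediction_sets(test_probs, threshold))     # larger sets signal higher uncertainty
```

The average size of these prediction sets is the uncertainty signal: a model that needs many options to reach the target coverage is less certain than one that usually covers the answer with a single option.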

Evaluation Tasks, Prompts, and Metrics

The tasks range from question answering to document summarization, all standardized into a multiple-choice format so that output uncertainty can be measured uniformly. Several prompting strategies were employed to reduce the LLMs' sensitivity to prompt variation and ensure a fair comparison. The paper also introduces a novel metric, Uncertainty-aware Accuracy (UAcc), which complements standard accuracy with an uncertainty measure: UAcc adjusts the perceived performance gap between models by factoring in their respective certainty levels.
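To make the idea behind UAcc concrete, the following sketch discounts accuracy by the average conformal prediction-set size. The exact scaling used here (accuracy divided by average set size, rescaled by the square root of the number of options) is an assumption for illustration; consult the paper for the precise definition.

```python
# A hedged sketch of an uncertainty-aware accuracy score in the spirit of UAcc:
# accuracy is rewarded when prediction sets are small (high certainty) and
# penalized when they are large. The formula is assumed for illustration only.
import math

def uncertainty_aware_accuracy(correct, set_sizes, num_options):
    """correct:     list of booleans, whether the top prediction was right
    set_sizes:      list of conformal prediction set sizes, one per example
    num_options:    number of answer options in the multiple-choice format
    """
    acc = sum(correct) / len(correct)
    avg_set_size = sum(set_sizes) / len(set_sizes)
    return acc / avg_set_size * math.sqrt(num_options)

# Two models with identical accuracy but different certainty levels:
print(uncertainty_aware_accuracy([True, True, False, True], [1, 2, 1, 1], 6))
print(uncertainty_aware_accuracy([True, True, False, True], [4, 5, 3, 4], 6))
```

Under any scaling of this kind, two models with equal accuracy are separated by how confidently they reach it, which is exactly the adjustment UAcc is designed to capture.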

Key Findings and Implications

Surprisingly, larger-scale LLMs manifested greater uncertainty than their smaller counterparts. Moreover, instruction-finetuning, a method aimed at enhancing model performance on downstream tasks, tended to increase model uncertainty. These findings have implications not only for LLM development but also for deployment, as practitioners must weigh model accuracy against the consistency and reliability signaled by the model's uncertainty.

Coupled with the demonstration that conformal prediction efficiently quantifies uncertainty in LLM outputs across a range of tasks, these insights pave the way for more informed usage and continued improvement of LLMs. Incorporating uncertainty quantification as suggested yields a deeper understanding of the models and enables better trust calibration in their applications. The authors acknowledge limitations, such as the current inability to apply conformal prediction to models like ChatGPT or to open-ended generative tasks, but they envisage future work that could address these gaps.

Conclusion and Future Directions

In conclusion, this paper emphasizes the significant role of uncertainty quantification in evaluating LLMs and argues for a shift in benchmarking standards. As multimodal foundation models emerge, the work also hints at how such evaluation might extend beyond language into other modalities. The overarching goal is to enhance the safety, reliability, and usefulness of LLMs in practical scenarios.

Authors (8)
  1. Fanghua Ye
  2. Mingming Yang
  3. Jianhui Pang
  4. Longyue Wang
  5. Derek F. Wong
  6. Emine Yilmaz
  7. Shuming Shi
  8. Zhaopeng Tu