Introduction
Large language models (LLMs) have attracted significant attention for their ability to perform a wide array of NLP tasks. Conventional evaluation metrics, however, overlook one vital aspect: uncertainty. Benchmarking platforms such as Hugging Face's leaderboard report accuracy alone, neglecting how confident the model actually is in its outputs.
Uncertainty Quantification in LLMs
In response to this oversight, the authors propose incorporating uncertainty quantification into the evaluation of LLMs. They use conformal prediction, a method that offers several advantages over alternatives such as Bayesian variational inference: simplicity of implementation, computational efficiency, and a statistically rigorous (rather than heuristic) estimate of uncertainty. Through this lens, the paper benchmarks eight open-source LLMs across five NLP tasks, demonstrating that higher accuracy does not always entail lower uncertainty, that larger LLMs can exhibit greater uncertainty, and that instruction-finetuning tends to increase uncertainty.
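To make the mechanics concrete, here is a minimal sketch of split conformal prediction applied to the softmax scores a model assigns to multiple-choice options. It illustrates the general technique rather than the paper's exact implementation; the six-option setup, the random toy data, and the nonconformity score (one minus the probability of the true option) are assumptions made for the example.

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction over multiple-choice options.

    cal_probs:  (n, k) softmax probabilities on a held-out calibration set
    cal_labels: (n,)   index of the correct option for each calibration item
    test_probs: (m, k) softmax probabilities on the test set
    Returns a list of prediction sets (arrays of option indices); each set
    contains the true option with probability >= 1 - alpha on average.
    """
    n = len(cal_labels)
    # Nonconformity score: 1 - probability assigned to the true option.
    cal_scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile of the calibration scores.
    q_level = np.ceil((n + 1) * (1 - alpha)) / n
    q_hat = np.quantile(cal_scores, q_level, method="higher")
    # Include every option whose score falls at or below the threshold.
    return [np.where(1.0 - p <= q_hat)[0] for p in test_probs]

# Toy usage with random "model" probabilities over k = 6 options.
rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(6), size=500)
cal_labels = rng.integers(0, 6, size=500)
test_probs = rng.dirichlet(np.ones(6), size=3)
for s in conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    print(sorted(s.tolist()))
```

The average size of these prediction sets serves as the uncertainty measure: a confident, well-calibrated model produces small sets, while an uncertain model must include more options to maintain coverage.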
Evaluation Tasks, Prompts, and Metrics
The evaluated tasks range from question answering to document summarization, all standardized into a multiple-choice format so that output uncertainty can be measured uniformly. Multiple prompting strategies were employed to reduce the effect of LLMs' sensitivity to prompt phrasing. The paper also introduces a novel metric, Uncertainty-aware Accuracy (UAcc), which complements standard accuracy with an uncertainty measure; UAcc can shrink or enlarge the perceived performance gap between models once their respective certainty levels are factored in.
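As a rough illustration of how accuracy and prediction-set size can be combined, the sketch below computes standard accuracy, average set size, and an uncertainty-aware score that scales accuracy by sqrt(k) divided by the mean set size. The exact definition of UAcc in the paper may differ; this formula, the helper name `uncertainty_aware_accuracy`, and the toy inputs are assumptions for illustration only.

```python
import numpy as np

def uncertainty_aware_accuracy(pred_sets, point_preds, labels, num_options):
    """Combine accuracy with prediction-set size (illustrative; not
    necessarily the paper's exact UAcc definition).

    pred_sets:   list of arrays of option indices from conformal prediction
    point_preds: (m,) argmax option index per test item
    labels:      (m,) true option index per test item
    num_options: k, the number of answer options per question
    """
    acc = float(np.mean(point_preds == labels))
    avg_set_size = float(np.mean([len(s) for s in pred_sets]))
    # Reward small (confident) sets and penalize large (uncertain) ones;
    # sqrt(k) keeps scores comparable across tasks with different k.
    uacc = acc * np.sqrt(num_options) / avg_set_size
    return acc, avg_set_size, uacc

# Toy usage with three 6-option questions.
pred_sets = [np.array([1, 2]), np.array([0]), np.array([0, 3, 4, 5])]
point_preds = np.array([2, 0, 3])
labels = np.array([2, 0, 5])
print(uncertainty_aware_accuracy(pred_sets, point_preds, labels, num_options=6))
```

Under a score of this kind, a model whose prediction sets shrink toward a single option can be rewarded beyond its plain accuracy, while an indecisive model is penalized even if its top-1 accuracy is identical.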
Key Findings and Implications
Surprisingly, larger LLMs sometimes exhibited greater uncertainty than their smaller counterparts. Moreover, instruction-finetuning, a technique aimed at improving performance on downstream tasks, tended to increase model uncertainty. These findings carry implications not only for LLM development but also for deployment: practitioners must weigh a model's accuracy against the consistency and reliability signaled by its uncertainty.
Coupled with the demonstration that conformal prediction can efficiently quantify uncertainty in LLM outputs across varied tasks, these insights pave the way for more informed use and continued improvement of LLMs. Incorporating uncertainty quantification as suggested yields a deeper understanding of the models and enables better trust calibration in their applications. The authors acknowledge limitations, such as the current inability to apply conformal prediction to models like ChatGPT or to open-ended generative tasks, but they anticipate future advances that may close these gaps.
Conclusion and Future Directions
In conclusion, this paper emphasizes the significant role of uncertainty quantification in evaluating LLMs and argues for a shift in benchmarking standards. As multimodal foundation models emerge, the work also points to evaluation considerations that may extend beyond language into other modalities. Ultimately, its overarching goal is to make LLMs safer, more reliable, and more useful in practical scenarios.