Can Large Language Models Express Uncertainty Like Human? (2509.24202v1)

Published 29 Sep 2025 in cs.CL and cs.AI

Abstract: LLMs are increasingly used in high-stakes settings, where overconfident responses can mislead users. Reliable confidence estimation has been shown to enhance trust and task accuracy. Yet existing methods face practical barriers: logits are often hidden, multi-sampling is computationally expensive, and verbalized numerical uncertainty (e.g., giving a 0-100 score) deviates from natural communication. We revisit linguistic confidence (LC), where models express uncertainty through hedging language (e.g., probably, might), offering a lightweight and human-centered alternative. To advance this direction, we (1) release the first diverse, large-scale dataset of hedging expressions with human-annotated confidence scores, and (2) propose a lightweight mapper that converts hedges into confidence scores at near-zero cost. Building on these resources, we (3) conduct the first systematic study of LC across modern LLMs and QA benchmarks, revealing that while most LLMs underperform in expressing reliable LC, carefully designed prompting achieves competitive calibration and discriminability. Finally, we (4) introduce a fine-tuning framework that further improves LC reliability. Taken together, our work positions linguistic confidence as a scalable, efficient, and human-aligned approach to LLM uncertainty estimation, and calls for deeper exploration of this promising yet underexplored direction.

Summary

  • The paper introduces linguistic confidence (LC) by employing hedging language to naturally express uncertainty and proposes a lightweight mapper to convert these expressions into confidence scores.
  • The methodology features a novel dataset and evaluation through metrics like ECE and AUROC, demonstrating improved calibration and discriminability compared to traditional approaches.
  • Fine-tuning with LoRA further refines LC, underscoring its potential to enhance trust and reliability in human-AI interactions.

Can LLMs Express Uncertainty Like Humans?

Introduction

The paper "Can LLMs Express Uncertainty Like Human?" (2509.24202) explores the critical aspect of expressing uncertainty in LLMs. Given their widespread application in domains such as education, healthcare, and law, the importance of reliable confidence estimation is paramount. The paper addresses the challenges posed by current methods: hidden logits, computational expense of multi-sampling, and the unnatural delivery of numerical uncertainty. It advocates for 'Linguistic Confidence' (LC), where models utilize hedging language, aligning closer with human communication.

Background and Motivation

Existing approaches to LLM uncertainty estimation either require inaccessible internal logits or resort to computationally intensive sampling techniques. Verbalized numerical scoring, although direct, deviates from natural communication. The paper revisits LC as a promising alternative. It introduces a diverse dataset and a lightweight mapping technique for converting hedging language into numerical confidence scores, a strategy with minimal computational cost and strong scalability.
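
To make the idea concrete, here is a minimal lexicon-lookup sketch of mapping hedges to confidence scores. The hedge terms and numeric values are illustrative placeholders, not the paper's human-annotated scores, and the paper's actual lightweight mapper is more sophisticated than a fixed lookup table.

```python
import re

# Hypothetical hedge lexicon: terms and scores are illustrative placeholders,
# not the paper's human-annotated values.
HEDGE_SCORES = {
    "certainly": 0.95,
    "definitely": 0.95,
    "probably": 0.75,
    "likely": 0.70,
    "possibly": 0.45,
    "perhaps": 0.40,
    "might": 0.40,
    "unlikely": 0.20,
}

def hedge_to_confidence(response: str, default: float = 0.5) -> float:
    """Average the scores of all hedge terms found in a response.

    Falls back to `default` when the response contains no known hedge.
    """
    tokens = re.findall(r"[a-z]+", response.lower())
    scores = [HEDGE_SCORES[t] for t in tokens if t in HEDGE_SCORES]
    return sum(scores) / len(scores) if scores else default

print(hedge_to_confidence("The capital is probably Canberra."))      # 0.75
print(hedge_to_confidence("It might be Sydney, but I'm not sure."))  # 0.40
```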

The paper emphasizes the significance of calibrating machine responses through LC, thereby facilitating enhanced trust and accuracy in human-AI interactions.

Methodology

The research consists of several key contributions:

  1. Dataset Construction: The authors compile a novel dataset of hedging expressions with human-annotated confidence scores, allowing for robust evaluation of LLM confidence mapping capabilities (Figure 1).

    Figure 1: An illustration of the benchmark building process.

  2. Lightweight Mapper: They propose a near-zero-cost mapper to translate hedging language into confidence scores, bypassing the expensive API costs associated with commercial LLMs.
  3. Comprehensive Evaluation: The paper systematically analyzes LC across state-of-the-art LLMs and various QA benchmarks, measuring both calibration (ECE) and discriminability (AUROC); a minimal metric sketch follows this list.
  4. Fine-tuning Framework: A supervised fine-tuning framework is proposed to enhance LC reliability, further refining LLMs' capacity to convey uncertainty naturally.
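
The sketch below shows, under simple assumptions, how calibration (ECE) and discriminability (AUROC) can be computed from mapped confidence scores and per-question correctness. The binning scheme and toy numbers are illustrative only, not the paper's evaluation code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean |accuracy - mean confidence| over bins."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece

# Toy data: mapped confidence per answer and whether the answer was correct.
conf = [0.95, 0.75, 0.70, 0.40, 0.20]
hit  = [1,    1,    0,    1,    0]
print("ECE:  ", round(expected_calibration_error(conf, hit), 3))
print("AUROC:", round(roc_auc_score(hit, conf), 3))  # correct vs. incorrect separation
```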

Results

The empirical results reveal several insights:

  • While the inherent LC capabilities of most LLMs remain suboptimal, strategic prompting significantly enhances performance.
  • The proposed mapper exhibits superior calibration and discriminability compared to existing methods, especially under carefully engineered prompts (LC+) that encourage hedging (Figure 2).

    Figure 2: Left: Performance of GPT-5 on the SimpleQA dataset under reasoning vs. without reasoning. Right: Average ECE and AUROC over multiple models on SimpleQA.

  • Fine-tuning with LoRA on targeted datasets markedly improves LC, as evidenced by better alignment with human uncertainty assessments across diverse benchmarks (a minimal LoRA sketch follows below).
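
As a rough illustration of the fine-tuning setup, the skeleton below applies LoRA via Hugging Face transformers and peft. The base model, hyperparameters, and data format are assumptions chosen for illustration, not the paper's exact configuration.

```python
# Illustrative LoRA fine-tuning skeleton for linguistic confidence, using
# Hugging Face transformers + peft. Base model, hyperparameters, and data
# format are assumptions, not the paper's exact setup.
from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments
from peft import LoraConfig, get_peft_model

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)

# Training pairs would couple questions with hedged answers whose wording
# reflects how often the model actually gets such questions right, e.g.
#   Q: "Who wrote The Trial?"  ->  "It was probably Franz Kafka."
# train_dataset = ...  # tokenized (prompt, hedged answer) pairs

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="lc-lora",
        per_device_train_batch_size=4,
        num_train_epochs=2,
        learning_rate=2e-4,
    ),
    # train_dataset=train_dataset,
)
# trainer.train()
```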

Discussion and Implications

The paper underscores the potential of LC to align LLM responses with human expectations of uncertainty. It suggests that hedging language not only enhances user trust but also offers a computationally efficient alternative to traditional confidence estimation methods. The research sets the stage for future work on more nuanced LC mappings and on extensions to multimodal and reasoning settings.

Ultimately, the findings advocate for increased attention to human-centered approaches in AI system design, with LC serving as a viable path towards more transparent and trustworthy AI interactions.

Conclusion

This paper presents strong evidence that LLMs, when guided appropriately, can express uncertainty in ways that resonate with human communication habits. The advances in dataset construction, hedge-to-confidence mapping, and fine-tuning bolster the effectiveness of LC, paving the way for more human-aligned AI systems that can be deployed in sensitive and decision-critical applications. Further research is anticipated to expand these findings, enriching the dialogue between AI systems and their human users.
