The paper "UNCLE: Uncertainty Expressions in Long-Form Generation" addresses a significant gap in evaluating the capabilities of LLMs regarding their ability to express uncertainty in text generation. Recognized for their strong generative abilities, LLMs frequently hallucinate, generating false or fabricated information, most prominently observed during long-form generation tasks. The paper introduces UNCLE, a benchmark specifically designed to evaluate uncertainty expression in both long-form and short-form question answering (QA). The dataset spans multiple domains, including biographies, companies, movies, astronomical objects, and diseases, comprising 4,000 long-form QA instances and over 20,000 short-form QA pairs.
UNCLE is a pioneering benchmark that explicitly bridges the evaluation of long- and short-form QA through paired questions. It also introduces new metrics for comprehensively evaluating whether models express uncertainty selectively, i.e., only where they lack knowledge. The authors find that current LLMs struggle to convey uncertainty about facts they do not know, even though they can answer accurately about facts they do know. The results also reveal a split between closed-source models, which use uncertainty expressions more frequently, and open-source models, which use them more accurately.
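To make the notion of selective uncertainty expression concrete, the following is a minimal sketch, not the paper's actual metrics: it assumes each fact has been labeled as known or unknown to the model (e.g., via short-form probing) and uses a crude lexical check for hedging. All names and the hedge-phrase list are hypothetical.

```python
# A minimal sketch, not the paper's metrics: a model should hedge on facts it
# does not know and avoid hedging on facts it does know. Names are hypothetical.

HEDGE_PHRASES = (
    "i'm not sure", "i am not sure", "uncertain",
    "i don't know", "i do not know", "cannot confirm",
)  # hypothetical lexical markers of uncertainty

def expresses_uncertainty(answer: str) -> bool:
    """Crude lexical check for an uncertainty expression in a model answer."""
    text = answer.lower()
    return any(phrase in text for phrase in HEDGE_PHRASES)

def selective_uncertainty_scores(records: list[dict]) -> dict:
    """records: dicts with 'answer' (model output) and 'known' (bool: the model
    demonstrably knows the underlying fact, e.g., from short-form probing)."""
    unknown = [r for r in records if not r["known"]]
    known = [r for r in records if r["known"]]

    hedged_unknown = sum(expresses_uncertainty(r["answer"]) for r in unknown)
    hedged_known = sum(expresses_uncertainty(r["answer"]) for r in known)

    return {
        # Fraction of unknown facts on which the model correctly hedges.
        "uncertainty_recall": hedged_unknown / len(unknown) if unknown else 0.0,
        # Fraction of known facts on which the model hedges unnecessarily.
        "false_hedge_rate": hedged_known / len(known) if known else 0.0,
    }
```

A well-calibrated model would score high on the first quantity and low on the second; the paper's own metrics may weight or define these behaviors differently.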
Contributions and Findings
The introduction of UNCLE is a major contribution, establishing a methodological baseline for uncertainty assessment across QA formats. A key finding is that although models can answer correctly when they possess the relevant knowledge, they fail to express uncertainty accurately when they do not. The paper also examines the alignment gap in uncertainty expression between short- and long-form QA, highlighting where improvement is most needed.
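One simple way to quantify such an alignment gap is sketched below, under an assumed data layout rather than the paper's protocol: each fact is paired with the model's short-form answer and the corresponding claim from its long-form output, and we check whether the model hedges consistently in both settings.

```python
# Sketch under an assumed data layout (not the paper's protocol): each fact pairs
# the model's short-form answer with the matching claim from its long-form output.

def expresses_uncertainty(answer: str) -> bool:
    """Same crude lexical check as in the earlier sketch."""
    text = answer.lower()
    return any(p in text for p in ("not sure", "uncertain", "don't know", "do not know"))

def alignment_rate(paired_outputs: list[tuple[str, str]]) -> float:
    """Fraction of facts on which the model hedges consistently in both formats."""
    if not paired_outputs:
        return 0.0
    consistent = sum(
        expresses_uncertainty(short_ans) == expresses_uncertainty(long_ans)
        for short_ans, long_ans in paired_outputs
    )
    return consistent / len(paired_outputs)
```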
Methodologies for Enhancing Uncertainty Expression
The paper investigates both prompt-based and training-based methods for improving models' ability to express uncertainty. The findings suggest that training-based methods outperform prompt-based ones, indicating a promising path for improving LLM performance. Training on long-form tasks also benefits short-form tasks by giving LLMs a more nuanced handling of uncertainty in simpler contexts.
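For illustration, a prompt-based approach can be as simple as adding an explicit hedging instruction to the long-form request; the wording below is hypothetical and may differ from the prompts evaluated in the paper.

```python
# Illustrative sketch of a prompt-based approach (hypothetical wording): instruct
# the model to flag details it is unsure about instead of asserting them as fact.

UNCERTAINTY_INSTRUCTION = (
    "Write a detailed biography of {subject}. "
    "For any detail you are not confident about, say so explicitly "
    "(e.g., 'I am not sure about ...') rather than stating it as fact."
)

def build_prompt(subject: str) -> str:
    """Fill the instruction template for one long-form QA instance."""
    return UNCERTAINTY_INSTRUCTION.format(subject=subject)

if __name__ == "__main__":
    # The resulting string would be sent to whichever LLM is under evaluation.
    print(build_prompt("Marie Curie"))
```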
Implications and Future Prospects
The implications of this research extend to both practical and theoretical domains. Practically, improving LLMs' ability to express uncertainty can significantly increase the reliability of automated systems in real-world applications such as medical diagnosis, legal advice, and other critical decision-making processes. Theoretically, the work establishes an analytical framework for studying uncertainty in AI models, providing a foundation for future research in this area.
Given the paper's findings, future work should focus on improving the consistency of uncertainty expressions across QA formats, reducing misalignment, and exploring novel training methodologies that target this aspect of performance. The UNCLE benchmark offers a valuable tool for advancing AI's capacity to express uncertainty, reducing the risk of misinformation and increasing trust in automated systems.
By addressing uncertainty expression in long-form generation, this paper contributes to the robustness of LLMs, supporting more accurate and reliable AI-generated content across diverse applications.