Linguistic Calibration of Long-Form Generations (2404.00474v2)

Published 30 Mar 2024 in cs.LG, cs.AI, cs.CL, and stat.ML

Abstract: Language models (LMs) may lead their users to make suboptimal downstream decisions when they confidently hallucinate. This issue can be mitigated by having the LM verbally convey the probability that its claims are correct, but existing models cannot produce long-form text with calibrated confidence statements. Through the lens of decision-making, we define linguistic calibration for long-form generations: an LM is linguistically calibrated if its generations enable its users to make calibrated probabilistic predictions. This definition enables a training framework where a supervised finetuning step bootstraps an LM to emit long-form generations with confidence statements such as "I estimate a 30% chance of..." or "I am certain that...", followed by a reinforcement learning step which rewards generations that enable a user to provide calibrated answers to related questions. We linguistically calibrate Llama 2 7B and find in automated and human evaluations of long-form generations that it is significantly more calibrated than strong finetuned factuality baselines with comparable accuracy. These findings generalize under significant domain shifts to scientific and biomedical questions and to an entirely held-out person biography generation task. Our results demonstrate that long-form generations may be calibrated end-to-end by constructing an objective in the space of the predictions that users make in downstream decision-making.

Overview of the Linguistic Calibration of LLMs

The paper "Linguistic Calibration of Long-Form Generations" addresses the pivotal problem of language models (LMs) leading users to suboptimal decisions due to confident hallucinations. This issue arises when an LM presents information with apparent certainty that is, in fact, incorrect. The paper introduces the concept of "linguistic calibration," which targets aligning the confidence expressed in LM outputs with the actual probability of correctness, especially in long-form text that informs downstream decision-making.

Definition and Framework for Linguistic Calibration

The authors propose a formal definition of linguistic calibration centered on enabling users to make probabilistic forecasts, based on LM outputs, that are aligned with the model's true likelihood of being correct. Training is structured so that the LM conveys confidence levels in natural language, using statements such as "I estimate a 30% chance of...", whose stated probabilities match the actual likelihood of correctness.
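Stated loosely (a simplified formalization for illustration, not the paper's exact notation): if a reader who has seen the LM's long-form generation $z$ forms a probabilistic forecast $f(z, q)$ over candidate answers to a related question $q$, the LM is linguistically calibrated when the forecasts it induces are calibrated in the usual sense,

$$
\Pr\bigl(Y = y \;\big|\; f(z, q)_y = p\bigr) = p \quad \text{for every answer } y \text{ and probability } p \in [0, 1],
$$

where $Y$ denotes the correct answer to $q$.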

The training framework designed to achieve this involves a two-step process:

  1. Summary Distillation: A supervised finetuning step that samples many long-form generations and consolidates them into a single summary whose confidence statements reflect how consistently each claim appears across the samples.
  2. Decision-Based Reinforcement Learning (RL): An RL step that rewards generations enabling users (or a simulated reader) to give calibrated answers to related questions in downstream decision tasks. The reward is defined with proper scoring rules from decision theory, so the LM's output is optimized end-to-end for the quality of the forecasts it induces (a minimal reward sketch follows this list).
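To make the RL reward concrete, the sketch below computes a proper-scoring-rule reward (here, the negative Brier score) from a simulated user forecast. This is a minimal illustration under stated assumptions, not the authors' implementation; `simulate_user_forecast` is a hypothetical stand-in for the surrogate reader that turns a generation and a related question into a distribution over candidate answers.

```python
import numpy as np

def brier_reward(user_forecast, correct_index):
    """Reward = negative Brier score of the user's forecast over answer options.

    The Brier score is a strictly proper scoring rule, so the expected reward is
    maximized only when the forecast matches the true answer probabilities --
    this is what pushes the RL step toward calibrated confidence statements.
    """
    forecast = np.asarray(user_forecast, dtype=float)
    target = np.zeros_like(forecast)
    target[correct_index] = 1.0
    return -float(np.sum((forecast - target) ** 2))

# Hypothetical usage (simulate_user_forecast is assumed, not from the paper):
# forecast = simulate_user_forecast(generation, question, options)
# reward = brier_reward(forecast, options.index(true_answer))
```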

Evaluation and Results

The empirical evaluation finetunes Llama 2 7B with this framework and shows significant calibration improvements over strong baselines finetuned for factuality, in both automated and human evaluations, without sacrificing accuracy. The model also transfers zero-shot: it performs well on in-domain and out-of-distribution question-answering datasets, as well as on a held-out person biography generation task.

Notably, the linguistically calibrated model achieves lower (better) forecast expected calibration error (ECE) while maintaining competitive prediction accuracy. This supports the efficacy of the proposed framework in practical settings where LM outputs inform decision-making.
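For readers unfamiliar with the metric, forecast ECE is typically approximated by binning forecast probabilities and comparing average confidence to empirical accuracy within each bin. The sketch below is a generic binned-ECE computation for binary correctness labels, not the paper's evaluation code:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average gap between mean confidence and accuracy per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        # Include the left edge only for the first bin so every forecast lands in a bin.
        mask = (confidences >= lo if lo == 0.0 else confidences > lo) & (confidences <= hi)
        if not mask.any():
            continue
        gap = abs(confidences[mask].mean() - correct[mask].mean())
        ece += (mask.sum() / len(confidences)) * gap
    return float(ece)

# Example: forecasts of 0.9 that are right only half the time contribute a large gap.
print(expected_calibration_error([0.9, 0.9, 0.2, 0.7], [1, 0, 0, 1]))
```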

Implications and Future Directions

This research has relevant practical implications. By aligning the model's stated confidence levels with its actual correctness probabilities, LMs can foster better trust and reliability in applications ranging from medical and legal decision support systems to everyday queries. The paper opens the possibility of widespread adoption of linguistic calibration in enhancing the interpretability and trustworthiness of LMs, especially for end-users who rely on models for critical information and decisions.

Looking forward, future developments could enhance user-specific calibrations, allowing adjustments tailored to individual user profiles or situational contexts. Moreover, refining the understanding of human interpretations of linguistic confidence could inform more nuanced calibrations, fostering better LM-user interactions.

In summary, this paper advances the field by addressing the interpretability of LMs through linguistic calibration, promoting an integrative approach that aligns LM outputs more closely with reality, and thereby fostering informed and reliable decision-making processes.

Authors (4)
  1. Neil Band (9 papers)
  2. Xuechen Li (35 papers)
  3. Tengyu Ma (117 papers)
  4. Tatsunori Hashimoto (80 papers)
Citations (13)