- The paper shows that RLCR, by incorporating the Brier score, effectively balances answer correctness with calibrated confidence estimation.
- Empirical results demonstrate that RLCR significantly improves calibration metrics (e.g., in-domain ECE reduction from 0.37 to 0.03) while maintaining accuracy across multiple datasets.
- The study highlights robust performance on OOD benchmarks and proposes structured reasoning outputs with explicit confidence tags for enhanced uncertainty reasoning.
Training LLMs to Reason About Their Uncertainty: RLCR
This paper introduces RLCR (Reinforcement Learning with Calibration Rewards), a method for training LMs to jointly optimize for both answer correctness and calibrated confidence estimation during chain-of-thought (CoT) reasoning. The approach addresses the well-documented issue that standard RL-based reasoning training with binary correctness rewards (RLVR) leads to overconfident and poorly calibrated models, especially in out-of-distribution (OOD) settings. RLCR augments the reward function with a proper scoring rule (specifically, the Brier score), incentivizing models to output both an answer and a calibrated confidence estimate. Theoretical analysis and extensive empirical results demonstrate that RLCR achieves strong accuracy while substantially improving calibration, outperforming both standard RL and post-hoc confidence estimation baselines.
Theoretical Framework
The RLCR objective is defined as:
RLCR(y, q, y*) = 𝟙_{y ≡ y*} - (q - 𝟙_{y ≡ y*})^2

where y is the model's answer, q is its verbalized confidence, and y* is the ground truth. The first term rewards correctness, while the second penalizes miscalibration via the Brier score. The paper proves that, for any bounded proper scoring rule, this reward is maximized when the model outputs the most likely correct answer and a confidence matching the true probability of correctness. Notably, the log-loss, while a proper scoring rule, is unbounded and does not satisfy the correctness incentive in this context.
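A minimal sketch of this reward in Python, assuming an exact-match correctness check and a verbalized confidence already parsed and clipped to [0, 1]; the names are illustrative, not the paper's implementation:

```python
def rlcr_reward(answer: str, confidence: float, ground_truth: str) -> float:
    """RLCR-style reward: correctness indicator minus a Brier penalty.

    `confidence` is the model's verbalized probability that its answer is
    correct, assumed to be parsed and clipped to [0, 1] upstream.
    """
    correct = 1.0 if answer.strip() == ground_truth.strip() else 0.0  # 1[y == y*]
    brier_penalty = (confidence - correct) ** 2                       # (q - 1[y == y*])^2
    return correct - brier_penalty
```

Because the penalty is bounded in [0, 1], a confidently wrong answer (q = 1) scores -1, an honestly uncertain wrong answer (q near 0) scores near 0, and a confidently correct answer scores near 1, so maximizing the reward still requires answering correctly.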
Empirical Evaluation
Experimental Setup
- Base Model: Qwen2.5-7B, a strong open-source LM.
- Training: RL with GRPO, no KL regularization, and format rewards that enforce structured outputs with <think>, <answer>, <analysis>, and <confidence> tags (see the parsing sketch after this list).
- Datasets: HotPotQA (multi-hop QA with distractors), Big-Math (math reasoning), and a suite of OOD benchmarks (SimpleQA, TriviaQA, CommonsenseQA, GPQA, MATH500, GSM8K).
- Baselines: RLVR, RLVR with post-hoc confidence classifiers (BCE and Brier), linear probes, answer token probabilities, and SFT+RLCR (SFT warmup before RLCR).
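To make the structured output format concrete, the following hedged sketch parses a tagged rollout and gates the reward on well-formedness, reusing rlcr_reward from the sketch above; the tag order, the regex, and the zero-reward convention for malformed outputs are assumptions rather than the paper's exact recipe:

```python
import re

# Assumed tag order; the setup specifies these four tags, but the exact layout may differ.
TAG_PATTERN = re.compile(
    r"<think>(?P<think>.*?)</think>\s*"
    r"<answer>(?P<answer>.*?)</answer>\s*"
    r"<analysis>(?P<analysis>.*?)</analysis>\s*"
    r"<confidence>(?P<confidence>.*?)</confidence>",
    re.DOTALL,
)

def parse_rollout(text: str):
    """Extract (answer, confidence) from a tagged rollout, or None if malformed."""
    m = TAG_PATTERN.search(text)
    if m is None:
        return None
    try:
        confidence = float(m.group("confidence").strip())
    except ValueError:
        return None
    return m.group("answer").strip(), min(max(confidence, 0.0), 1.0)

def total_reward(text: str, ground_truth: str) -> float:
    """Format-gated reward: malformed rollouts get 0, well-formed ones get RLCR."""
    parsed = parse_rollout(text)
    if parsed is None:
        return 0.0  # assumed convention: format reward not earned
    answer, confidence = parsed
    return rlcr_reward(answer, confidence, ground_truth)
```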
Main Results
| Method | In-Domain Acc. | In-Domain ECE | OOD Acc. | OOD ECE |
| --- | --- | --- | --- | --- |
| Base | 39.7% | 0.53 | 53.3% | 0.40 |
| RLVR | 63.0% | 0.37 | 53.9% | 0.46 |
| RLVR + BCE Classifier | 63.0% | 0.07 | 53.9% | 0.24 |
| RLVR + Brier | 63.0% | 0.09 | 53.9% | 0.33 |
| RLVR + Probe | 63.0% | 0.10 | 53.9% | 0.38 |
| Answer Prob | 63.0% | 0.36 | 53.9% | 0.42 |
| RLCR (ours) | 62.1% | 0.03 | 56.2% | 0.21 |
- Calibration: RLCR reduces in-domain ECE from 0.37 (RLVR) to 0.03 and OOD ECE from 0.46 to 0.21, with essentially no change in in-domain accuracy (62.1% vs. 63.0%) and a gain OOD (56.2% vs. 53.9%).
- OOD Generalization: RLVR degrades calibration OOD, while RLCR improves it, outperforming both the base model and all post-hoc classifier baselines.
- Math Reasoning: On Big-Math, RLCR and SFT+RLCR achieve the best calibration, with SFT+RLCR slightly reducing OOD accuracy due to catastrophic forgetting.
- Test-Time Scaling: Confidence-weighted majority voting and ensembling verbalized confidences further improve both accuracy and calibration, leveraging the model's own uncertainty estimates.
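As a concrete illustration of confidence-weighted voting (referenced in the last bullet above), here is a minimal sketch that aggregates sampled rollouts by summing verbalized confidences per candidate answer; this is one natural reading of the strategy, and the paper's exact aggregation may differ:

```python
from collections import defaultdict

def confidence_weighted_vote(samples: list[tuple[str, float]]) -> tuple[str, float]:
    """Pick the answer whose rollouts carry the most total verbalized confidence.

    `samples` holds (answer, confidence) pairs from independent rollouts.
    Returns the winning answer and its normalized confidence share.
    """
    weights: dict[str, float] = defaultdict(float)
    for answer, confidence in samples:
        weights[answer] += confidence
    best_answer = max(weights, key=weights.get)
    total = sum(weights.values())
    share = weights[best_answer] / total if total > 0 else 0.0
    return best_answer, share

# Three moderately confident votes for "42" outweigh one very confident vote for "17".
print(confidence_weighted_vote([("42", 0.7), ("42", 0.6), ("42", 0.65), ("17", 0.9)]))
```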
Analysis of Reasoning and Calibration
- Reasoning Chains: RLCR-trained models generate explicit uncertainty analyses, leading to more informative and calibrated confidence scores, especially for smaller models where classifier capacity is limited.
- Self-Consistency: RLCR models exhibit low variance in confidence estimates across multiple reasoning chains for the same answer, and distribute confidence more appropriately across mutually exclusive answers, though some overconfidence persists OOD.
- Failure Modes: Despite improvements, RLCR models can still assign high confidence to multiple contradictory answers in OOD settings, indicating remaining challenges in robust uncertainty estimation.
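These last two observations suggest simple diagnostics, sketched below as illustrative code rather than the paper's evaluation procedure: low per-answer spread across rollouts indicates self-consistent confidences, while mean confidences over mutually exclusive answers summing well above 1 flags the residual overconfidence noted above.

```python
import statistics
from collections import defaultdict

def consistency_report(samples: list[tuple[str, float]]) -> dict:
    """Diagnostics over (answer, confidence) pairs from repeated rollouts on one question."""
    by_answer: dict[str, list[float]] = defaultdict(list)
    for answer, confidence in samples:
        by_answer[answer].append(confidence)
    # Spread of confidence per distinct answer: low values indicate self-consistency.
    per_answer_std = {a: statistics.pstdev(cs) for a, cs in by_answer.items()}
    # Mutually exclusive answers should share probability mass; a sum far above 1
    # means the model is overcommitting to contradictory answers.
    sum_of_mean_confidences = sum(statistics.mean(cs) for cs in by_answer.values())
    return {
        "per_answer_confidence_std": per_answer_std,
        "sum_of_mean_confidences": sum_of_mean_confidences,
    }
```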
Implementation Considerations
- Reward Design: The calibration term must use a bounded proper scoring rule (e.g., Brier score) to ensure joint optimization of accuracy and calibration. Unbounded rules (e.g., log-loss) can incentivize degenerate solutions.
- Prompt Engineering: Structured output formats with explicit tags for reasoning, answer, analysis, and confidence are essential for reliable extraction and evaluation.
- Training Dynamics: RLCR requires careful balancing of reward terms and may benefit from SFT warmup for domains where uncertainty analysis is complex (e.g., math).
- Evaluation: Calibration metrics (ECE, Brier score, AUROC) should be reported alongside accuracy, both in-domain and OOD, to fully characterize model reliability.
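For reference, the ECE numbers reported in the results table can be computed with a standard equal-width binning scheme; the sketch below uses common defaults (10 bins) that are not necessarily the paper's exact choices:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """ECE with equal-width bins: weighted average |accuracy - confidence| gap per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        in_bin = (confidences > lo) & (confidences <= hi)
        if i == 0:  # put confidences exactly equal to 0 in the first bin
            in_bin |= confidences == 0.0
        if not in_bin.any():
            continue
        gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
        ece += in_bin.mean() * gap
    return float(ece)
```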
Implications and Future Directions
RLCR demonstrates that calibration can be directly optimized during RL-based reasoning training, yielding models that are both accurate and reliable in their uncertainty estimates. This is particularly relevant for high-stakes applications (e.g., healthcare, law) where overconfident errors are unacceptable. The approach is compatible with existing RL pipelines and can be extended to other proper scoring rules, provided they are bounded.
Open questions and future research directions include:
- Improving OOD Calibration: While RLCR improves OOD calibration, absolute errors remain high. Further work is needed on regularization, data augmentation, or meta-learning approaches to enhance robustness.
- Scaling to Larger Models: Investigating the scaling behavior of RLCR with larger LMs and more complex tasks.
- Integration with Abstention and Selective Prediction: Combining RLCR with abstention mechanisms to allow models to defer when uncertain.
- Faithfulness of Uncertainty Reasoning: Ensuring that generated uncertainty analyses are causally linked to confidence estimates and not merely post-hoc rationalizations.
In summary, RLCR provides a theoretically principled and empirically validated framework for training LMs to reason about their own uncertainty, setting a new standard for reliable, calibrated LLM reasoning.