- The paper shows that RLCR, by incorporating the Brier score, effectively balances answer correctness with calibrated confidence estimation.
- Empirical results demonstrate that RLCR significantly improves calibration metrics (e.g., in-domain ECE reduction from 0.37 to 0.03) while maintaining accuracy across multiple datasets.
- The study highlights robust performance on OOD benchmarks and proposes structured reasoning outputs with explicit confidence tags for enhanced uncertainty reasoning.
Training LLMs to Reason About Their Uncertainty: RLCR
This paper introduces RLCR (Reinforcement Learning with Calibration Rewards), a method for training LMs to jointly optimize for both answer correctness and calibrated confidence estimation during chain-of-thought (CoT) reasoning. The approach addresses the well-documented issue that standard RL-based reasoning training with binary correctness rewards (RLVR) leads to overconfident and poorly calibrated models, especially in out-of-distribution (OOD) settings. RLCR augments the reward function with a proper scoring rule (specifically, the Brier score), incentivizing models to output both an answer and a calibrated confidence estimate. Theoretical analysis and extensive empirical results demonstrate that RLCR achieves strong accuracy while substantially improving calibration, outperforming both standard RL and post-hoc confidence estimation baselines.
Theoretical Framework
The RLCR objective is defined as:
RLCR(y, q, y*) = 𝟙_{y ≡ y*} - (q - 𝟙_{y ≡ y*})^2

where y is the model's answer, q is its verbalized confidence, and y* is the ground truth. The first term rewards correctness, while the second penalizes miscalibration via the Brier score. The paper proves that, for any bounded proper scoring rule, this reward is maximized when the model outputs the most likely correct answer and a confidence matching the true probability of correctness. Notably, the log-loss, while a proper scoring rule, is unbounded and does not satisfy the correctness incentive in this context.
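A minimal sketch of this reward in Python, assuming an exact-match correctness check and a verbalized confidence already parsed and clipped to [0, 1]; the names are illustrative, not the paper's implementation:

```python
def rlcr_reward(answer: str, confidence: float, ground_truth: str) -> float:
    """RLCR-style reward: correctness indicator minus a Brier penalty.

    `confidence` is the model's verbalized probability that its answer is
    correct, assumed to be parsed and clipped to [0, 1] upstream.
    """
    correct = 1.0 if answer.strip() == ground_truth.strip() else 0.0  # 1[y == y*]
    brier_penalty = (confidence - correct) ** 2                       # (q - 1[y == y*])^2
    return correct - brier_penalty
```

Because the penalty is bounded in [0, 1], a confidently wrong answer (q = 1) scores -1, an honestly uncertain wrong answer (q near 0) scores near 0, and a confidently correct answer scores near 1, so maximizing the reward still requires answering correctly.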
Empirical Evaluation
Experimental Setup
- Base Model: Qwen2.5-7B, a strong open-source LM.
- Training: RL with GRPO, no KL regularization, and format rewards that enforce structured outputs with <think>, <answer>, <analysis>, and <confidence> tags (see the parsing sketch after this list).
- Datasets: HotPotQA (multi-hop QA with distractors), Big-Math (math reasoning), and a suite of OOD benchmarks (SimpleQA, TriviaQA, CommonsenseQA, GPQA, MATH500, GSM8K).
- Baselines: RLVR, RLVR with post-hoc confidence classifiers (BCE and Brier), linear probes, answer token probabilities, and SFT+RLCR (SFT warmup before RLCR).
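To make the structured output format concrete, the following hedged sketch parses a tagged rollout and gates the reward on well-formedness, reusing rlcr_reward from the sketch above; the tag order, the regex, and the zero-reward convention for malformed outputs are assumptions rather than the paper's exact recipe:

```python
import re

# Assumed tag order; the setup specifies these four tags, but the exact layout may differ.
TAG_PATTERN = re.compile(
    r"<think>(?P<think>.*?)</think>\s*"
    r"<answer>(?P<answer>.*?)</answer>\s*"
    r"<analysis>(?P<analysis>.*?)</analysis>\s*"
    r"<confidence>(?P<confidence>.*?)</confidence>",
    re.DOTALL,
)

def parse_rollout(text: str):
    """Extract (answer, confidence) from a tagged rollout, or None if malformed."""
    m = TAG_PATTERN.search(text)
    if m is None:
        return None
    try:
        confidence = float(m.group("confidence").strip())
    except ValueError:
        return None
    return m.group("answer").strip(), min(max(confidence, 0.0), 1.0)

def total_reward(text: str, ground_truth: str) -> float:
    """Format-gated reward: malformed rollouts get 0, well-formed ones get RLCR."""
    parsed = parse_rollout(text)
    if parsed is None:
        return 0.0  # assumed convention: format reward not earned
    answer, confidence = parsed
    return rlcr_reward(answer, confidence, ground_truth)
```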
Main Results
| Method | In-Domain Acc. | In-Domain ECE | OOD Acc. | OOD ECE |
| --- | --- | --- | --- | --- |
| Base | 39.7% | 0.53 | 53.3% | 0.40 |
| RLVR | 63.0% | 0.37 | 53.9% | 0.46 |
| RLVR + BCE Classifier | 63.0% | 0.07 | 53.9% | 0.24 |
| RLVR + Brier | 63.0% | 0.09 | 53.9% | 0.33 |
| RLVR + Probe | 63.0% | 0.10 | 53.9% | 0.38 |
| Answer Prob | 63.0% | 0.36 | 53.9% | 0.42 |
| RLCR (ours) | 62.1% | 0.03 | 56.2% | 0.21 |
- Calibration: RLCR reduces in-domain ECE from 0.37 (RLVR) to 0.03 and OOD ECE from 0.46 to 0.21, with essentially no change in in-domain accuracy (62.1% vs. 63.0%) and a gain OOD (56.2% vs. 53.9%).
- OOD Generalization: RLVR degrades calibration OOD, while RLCR improves it, outperforming both the base model and all post-hoc classifier baselines.
- Math Reasoning: On Big-Math, RLCR and SFT+RLCR achieve the best calibration, with SFT+RLCR slightly reducing OOD accuracy due to catastrophic forgetting.
- Test-Time Scaling: Confidence-weighted majority voting and ensembling verbalized confidences further improve both accuracy and calibration, leveraging the model's own uncertainty estimates.
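As a concrete illustration of confidence-weighted voting (referenced in the last bullet above), here is a minimal sketch that aggregates sampled rollouts by summing verbalized confidences per candidate answer; this is one natural reading of the strategy, and the paper's exact aggregation may differ:

```python
from collections import defaultdict

def confidence_weighted_vote(samples: list[tuple[str, float]]) -> tuple[str, float]:
    """Pick the answer whose rollouts carry the most total verbalized confidence.

    `samples` holds (answer, confidence) pairs from independent rollouts.
    Returns the winning answer and its normalized confidence share.
    """
    weights: dict[str, float] = defaultdict(float)
    for answer, confidence in samples:
        weights[answer] += confidence
    best_answer = max(weights, key=weights.get)
    total = sum(weights.values())
    share = weights[best_answer] / total if total > 0 else 0.0
    return best_answer, share

# Three moderately confident votes for "42" outweigh one very confident vote for "17".
print(confidence_weighted_vote([("42", 0.7), ("42", 0.6), ("42", 0.65), ("17", 0.9)]))
```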
Analysis of Reasoning and Calibration
- Reasoning Chains: RLCR-trained models generate explicit uncertainty analyses, leading to more informative and calibrated confidence scores, especially for smaller models where classifier capacity is limited.
- Self-Consistency: RLCR models exhibit low variance in confidence estimates across multiple reasoning chains for the same answer, and distribute confidence more appropriately across mutually exclusive answers, though some overconfidence persists OOD.
- Failure Modes: Despite improvements, RLCR models can still assign high confidence to multiple contradictory answers in OOD settings, indicating remaining challenges in robust uncertainty estimation.
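These last two observations suggest simple diagnostics, sketched below as illustrative code rather than the paper's evaluation procedure: low per-answer spread across rollouts indicates self-consistent confidences, while mean confidences over mutually exclusive answers summing well above 1 flags the residual overconfidence noted above.

```python
import statistics
from collections import defaultdict

def consistency_report(samples: list[tuple[str, float]]) -> dict:
    """Diagnostics over (answer, confidence) pairs from repeated rollouts on one question."""
    by_answer: dict[str, list[float]] = defaultdict(list)
    for answer, confidence in samples:
        by_answer[answer].append(confidence)
    # Spread of confidence per distinct answer: low values indicate self-consistency.
    per_answer_std = {a: statistics.pstdev(cs) for a, cs in by_answer.items()}
    # Mutually exclusive answers should share probability mass; a sum far above 1
    # means the model is overcommitting to contradictory answers.
    sum_of_mean_confidences = sum(statistics.mean(cs) for cs in by_answer.values())
    return {
        "per_answer_confidence_std": per_answer_std,
        "sum_of_mean_confidences": sum_of_mean_confidences,
    }
```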
Implementation Considerations
- Reward Design: The calibration term must use a bounded proper scoring rule (e.g., Brier score) to ensure joint optimization of accuracy and calibration. Unbounded rules (e.g., log-loss) can incentivize degenerate solutions.
- Prompt Engineering: Structured output formats with explicit tags for reasoning, answer, analysis, and confidence are essential for reliable extraction and evaluation.
- Training Dynamics: RLCR requires careful balancing of reward terms and may benefit from SFT warmup for domains where uncertainty analysis is complex (e.g., math).
- Evaluation: Calibration metrics (ECE, Brier score, AUROC) should be reported alongside accuracy, both in-domain and OOD, to fully characterize model reliability.
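For reference, the ECE numbers reported in the results table can be computed with a standard equal-width binning scheme; the sketch below uses common defaults (10 bins) that are not necessarily the paper's exact choices:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """ECE with equal-width bins: weighted average |accuracy - confidence| gap per bin."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        in_bin = (confidences > lo) & (confidences <= hi)
        if i == 0:  # put confidences exactly equal to 0 in the first bin
            in_bin |= confidences == 0.0
        if not in_bin.any():
            continue
        gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
        ece += in_bin.mean() * gap
    return float(ece)
```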
Implications and Future Directions
RLCR demonstrates that calibration can be directly optimized during RL-based reasoning training, yielding models that are both accurate and reliable in their uncertainty estimates. This is particularly relevant for high-stakes applications (e.g., healthcare, law) where overconfident errors are unacceptable. The approach is compatible with existing RL pipelines and can be extended to other proper scoring rules, provided they are bounded.
Open questions and future research directions include:
- Improving OOD Calibration: While RLCR improves OOD calibration, absolute errors remain high. Further work is needed on regularization, data augmentation, or meta-learning approaches to enhance robustness.
- Scaling to Larger Models: Investigating the scaling behavior of RLCR with larger LMs and more complex tasks.
- Integration with Abstention and Selective Prediction: Combining RLCR with abstention mechanisms to allow models to defer when uncertain.
- Faithfulness of Uncertainty Reasoning: Ensuring that generated uncertainty analyses are causally linked to confidence estimates and not merely post-hoc rationalizations.
In summary, RLCR provides a theoretically principled and empirically validated framework for training LMs to reason about their own uncertainty, setting a new standard for reliable, calibrated LLM reasoning.