Confidence-Supervised Fine-Tuning
- Confidence-Supervised Fine-Tuning is a method that uses model-derived confidence metrics—such as probabilities, verbal tokens, or set-valued predictions—to regulate fine-tuning objectives and improve calibration.
- CSFT integrates self-referential confidence, tokenized Brier scoring, and conformal prediction to guide loss functions, thereby enhancing robustness and generalization in domains like language modeling, vision, and recommendation.
- Empirical evaluations show that CSFT techniques yield significant gains in accuracy, certified robustness, and interpretability, with measurable improvements in metrics like ECE, AUROC, and Recall@K.
Confidence-Supervised Fine-Tuning (CSFT) encompasses a class of methods in which a model’s own confidence estimates or explicit confidence signals are incorporated directly into the fine-tuning objective, with the goal of improving calibration, interpretability, robustness, and generalization without requiring additional external supervision. CSFT techniques formalize confidence either as (1) model-internal probabilities (e.g., softmax outputs or predictive uncertainties), (2) verbalized expressions of uncertainty, or (3) confidence-calibrated set-valued predictions. These signals are then used to regularize, supervise, or entirely drive the fine-tuning loss, spanning applications in language modeling, recommendation, robustness, and beyond. Theoretical justification for these methods is provided by frameworks such as reward-weighted regression, proper scoring rules, and conformal prediction.
1. Core Principles of Confidence-Supervised Fine-Tuning
The defining characteristic of CSFT is the use of confidence information—derived either from the model itself or generated via proxy—as a direct supervisory target or as a weighting factor in the loss function:
- Self-referential confidence: Techniques such as Reinforcement Learning via Self-Confidence (RLSC) utilize the model’s own generation probability to score the confidence of outputs, thus enabling label-free reinforcement-type updates (Li et al., 5 Jun 2025).
- Verbalized confidence calibration: ConfTuner and other verbal calibration methods require models to output an explicit confidence token or phrase, which is then matched against correctness using a proper scoring rule such as the tokenized Brier score (Li et al., 26 Aug 2025).
- Confidence-weighted losses: Dynamic Fine-Tuning (DFT) and Anchored SFT weigh demonstration log-likelihoods by the model's own predictive probability, yielding a tighter lower bound on the RL objective; without regularization, however, they can suffer from distributional drift (Zhu et al., 28 Sep 2025).
- Confidence-driven selection: In vision and recommendation domains, CSFT methods apply or mask losses according to confidence thresholds, filtering out hallucinated or low-value predictions during training (Jang et al., 2024, Wang et al., 2024).
The motivation across these approaches is to align the learning signal—usually via RL, SFT, or their hybrids—with the model’s internal epistemic or aleatoric uncertainty, thus improving trustworthiness and actionable calibration.
2. Methodological Variants and Theoretical Frameworks
2.1 Internal Confidence as Reward or Weight
- RLSC: Uses the model's own generation probability $p_{\theta_{\text{old}}}(y \mid x)$ as a reward for REINFORCE updates or, equivalently, as sampling weights in a surrogate cross-entropy loss:

$$\mathcal{L}_{\text{RLSC}}(\theta) = -\,\mathbb{E}_{y \sim p_{\theta_{\text{old}}}(\cdot \mid x)}\!\left[\, p_{\theta_{\text{old}}}(y \mid x)\, \log p_{\theta}(y \mid x) \,\right].$$

This sharpens the output distribution around high-confidence modes (Li et al., 5 Jun 2025).
- Dynamic Fine-Tuning / Reward-Weighted Regression: The demonstration log-likelihood is reweighted by the model's own predictive probability (with a stop-gradient), leading to a tight lower bound on the RL objective:

$$\mathcal{L}_{\text{DFT}}(\theta) = -\,\mathbb{E}_{(x,y)}\!\left[\, \operatorname{sg}\!\big(p_{\theta}(y \mid x)\big)\, \log p_{\theta}(y \mid x) \,\right].$$

While theoretically appealing, DFT can cause distributional drift, which Anchored SFT counteracts by adding a KL regularizer $\mathrm{KL}(\pi_{\theta} \,\|\, \pi_{\text{ref}})$ to a reference model (Zhu et al., 28 Sep 2025). A minimal sketch combining the confidence weighting with a KL anchor follows this list.
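The following PyTorch sketch makes the confidence weighting concrete, pairing a stop-gradient sequence-confidence weight (in the spirit of RLSC and DFT) with an ASFT-style KL anchor to a frozen reference model. Function and tensor names are illustrative assumptions, and the cited methods differ in details such as token- versus sequence-level weighting and baseline subtraction.

```python
import torch
import torch.nn.functional as F

def confidence_weighted_loss(logits, ref_logits, target_ids, kl_coef=0.1):
    """Confidence-weighted cross-entropy with a KL anchor to a reference model.

    logits:      (B, T, V) current-policy logits over the target tokens
    ref_logits:  (B, T, V) frozen reference-model logits (no grad)
    target_ids:  (B, T)    sampled or demonstration token ids
    """
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)  # (B, T)

    # Sequence-level self-confidence used as a stop-gradient reward weight.
    # In practice a length-normalized or token-level weight avoids underflow
    # on long sequences.
    seq_conf = token_logp.detach().sum(dim=-1).exp()  # p_theta(y | x)
    weighted_nll = -(seq_conf * token_logp.sum(dim=-1)).mean()

    # ASFT-style anchor: per-token KL(pi_theta || pi_ref), averaged.
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    kl = (logp.exp() * (logp - ref_logp)).sum(dim=-1).mean()

    return weighted_nll + kl_coef * kl
```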
2.2 Calibration by Proper Scoring Rules
- ConfTuner: Introduces a tokenized Brier score over discretized confidence levels $c_1, \dots, c_K$:

$$\mathcal{L}_{\text{Brier}} = \mathbb{E}\!\left[\, \sum_{k=1}^{K} q_k \,\big(c_k - \mathbb{1}[\text{correct}]\big)^2 \,\right],$$

where $q_k$ is the softmax probability over the confidence tokens and $\mathbb{1}[\text{correct}]$ indicates correctness. This loss is provably proper in the sense that the optimal model outputs the true probability of correctness rounded to the closest quantization level (Li et al., 26 Aug 2025).
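A minimal sketch of this loss, under the assumption that the $K$ confidence levels map to dedicated vocabulary tokens whose logits are sliced out at the confidence position (names are illustrative, not ConfTuner's actual implementation):

```python
import torch
import torch.nn.functional as F

def tokenized_brier_loss(conf_logits, conf_levels, is_correct):
    """Tokenized Brier score over K discretized confidence levels.

    conf_logits: (B, K) logits at the confidence-token position, restricted
                 to the K confidence-level tokens (e.g. "0.1" ... "1.0")
    conf_levels: (K,)   numeric value of each level, e.g. torch.linspace(0.1, 1.0, K)
    is_correct:  (B,)   1.0 if the answer was judged correct, else 0.0
    """
    q = F.softmax(conf_logits, dim=-1)                                   # (B, K)
    sq_err = (conf_levels.unsqueeze(0) - is_correct.unsqueeze(1)) ** 2   # (B, K)
    return (q * sq_err).sum(dim=-1).mean()   # expected Brier score
```

Because the Brier score is a proper scoring rule, this loss is minimized when the probability mass on each level matches the true correctness rate, which is the propriety property cited above.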
2.3 Conformal and Set-Valued Calibration
- CPFT (Recommendation): Integrates conformal prediction by regularizing the size and proximity of prediction sets, thus aligning confidence with empirical set coverage:

$$\mathcal{L}_{\text{CPFT}} = \mathcal{L}_{\text{CE}} + \lambda_{\text{size}}\, \mathcal{L}_{\text{size}} + \lambda_{\text{dist}}\, \mathcal{L}_{\text{dist}},$$

where the set-size and distance terms are differentiable proxies for predictive confidence and coverage (Wang et al., 2024).
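One way to make these terms differentiable is a soft threshold on the ranking scores, as in the sketch below; the sigmoid relaxation and all names are assumptions for illustration, not necessarily CPFT's exact construction.

```python
import torch

def conformal_set_losses(scores, target, tau, temp=0.1):
    """Differentiable proxies for prediction-set size and coverage distance.

    scores: (B, N) normalized ranking scores over N candidate items
    target: (B,)   index of the ground-truth item
    tau:    scalar conformal threshold calibrated on a held-out split
    """
    # Soft membership of each item in the set {i : score_i >= tau}.
    membership = torch.sigmoid((scores - tau) / temp)   # (B, N)
    size_loss = membership.sum(dim=-1).mean()           # favor small sets

    # Penalize the ground-truth item falling outside the set
    # (its score sitting below the threshold).
    true_scores = scores.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    dist_loss = torch.relu(tau - true_scores).mean()

    return size_loss, dist_loss
```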
3. Application Domains
3.1 LLMs and Reasoning
CSFT in LLMs spans synthetic confidence supervision, self-referential reward, and explicit verbal calibration:
- Self-confidence as RL reward: RLSC applied to Qwen2.5-Math-7B, with only a small number of sampled completions per prompt and 10–20 training steps, yielded accuracy gains of +13.4 to +21.7 points across AIME2024, MATH500, Minerva Math, OlympiadBench, and AMC23 (Li et al., 5 Jun 2025).
- Verbal calibration with emergent behaviors: CSFT supervision on output confidence tokens elicits self-verification and adaptive chain-of-thought behaviors (e.g., longer reasoning traces for low-confidence queries), with large improvements in AUROC and ECE on GSM8K, ARC-Challenge, and MATH-500 (Jang et al., 4 Jun 2025).
- Calibration for self-correction and cascading: ConfTuner enhances calibration of LLMs’ verbal confidence, with substantial improvements in ECE (e.g., LLaMA ECE reduced from 0.4803 to 0.0405) and measured gains in downstream self-correction and model cascading (Li et al., 26 Aug 2025).
3.2 Vision and Robustness
- FT-CADIS for certified robustness: Applies confidence-aware sample selection and loss masking to denoised inputs, achieving state-of-the-art certified accuracy and average certified radius on CIFAR-10 and ImageNet-1K; for example, certified accuracy on ImageNet improved by +6.2 percentage points over the baseline (Jang et al., 2024).
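A schematic version of confidence-masked training on denoised inputs might look as follows. The single fixed threshold is a simplification of FT-CADIS's selection rule, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def confidence_masked_ce(logits, labels, conf_threshold=0.5):
    """Cross-entropy masked by per-sample confidence on denoised inputs.

    logits: (B, C) classifier logits for B denoised images over C classes
    labels: (B,)   ground-truth labels
    """
    probs = F.softmax(logits, dim=-1)
    conf = probs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # p(label | x)
    mask = (conf >= conf_threshold).float()                    # keep confident samples
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    return (mask * per_sample).sum() / mask.sum().clamp(min=1.0)
```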
3.3 Recommendation Systems
- CPFT: Confidence calibration and set-regularization in sequential recommender models result in an average lift of +4.55% in Recall@K and similar improvements in NDCG across five Amazon datasets (Wang et al., 2024).
4. Practical Protocols and Training Recipes
Representative CSFT procedures share several common elements while diverging in the definition and use of confidence:
| Method | Confidence Signal | Loss Construction | Domain |
|---|---|---|---|
| RLSC | Model's own generation probability | REINFORCE / weighted CE | LLM, math reasoning |
| ConfTuner | Verbal token (discrete) | Tokenized Brier score | LLM, reasoning QA |
| Anchored SFT | Model's predictive probability | Weighted log-likelihood + KL | LLM, code, med QA |
| FT-CADIS | Classifier confidence on denoised samples | Masked CE on selected samples | Vision, robustness |
| CPFT | Normalized ranking score | CE + set-size/distance losses | Recommendation |
| CSFT (CoT) | Synthetic scalar (ensemble) | CE on <confidence> tokens | LLM, chain-of-thought |
Each protocol maintains explicit implementation detail, e.g., masking the loss to confidence-token positions only in LLMs (Jang et al., 4 Jun 2025), or updating only a low-rank (LoRA) adapter subspace for efficient fine-tuning (Jang et al., 2024, Li et al., 26 Aug 2025).
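Masking the language-modeling loss to confidence-token positions can be implemented as in the generic sketch below; the boolean-mask interface is an assumption for illustration, not the cited papers' code.

```python
import torch
import torch.nn.functional as F

def confidence_token_loss(logits, input_ids, conf_positions):
    """Next-token cross-entropy applied only at <confidence> token positions.

    logits:         (B, T, V) language-model logits
    input_ids:      (B, T)    token ids, including the confidence tokens
    conf_positions: (B, T)    boolean mask, True where input_ids is a confidence token
    """
    logp = F.log_softmax(logits[:, :-1], dim=-1)   # predict token t+1 from position t
    targets = input_ids[:, 1:]
    mask = conf_positions[:, 1:].float()           # keep loss only on confidence targets

    nll = -logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)   # (B, T-1)
    return (nll * mask).sum() / mask.sum().clamp(min=1.0)
```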
5. Empirical Outcomes and Evaluation Metrics
Systematic evaluation relies on both task-oriented metrics (accuracy, recall, NDCG) and calibration-specific metrics (ECE, Brier score, AUROC):
- LLMs: RLSC yields sizable accuracy gains, e.g., +21.2 points on MATH500 (Li et al., 5 Jun 2025). ConfTuner reduces ECE (LLaMA-3.1-8B-Instruct: 0.4803 → 0.0405) and raises AUROC (LLaMA: 0.5884 → 0.7383) (Li et al., 26 Aug 2025).
- Robustness: FT-CADIS increases certified accuracy on ImageNet from 60.0% to 66.2% and average certified radius from 0.743 to 1.001 (Jang et al., 2024).
- Recommendation: CPFT increases Recall@10 by +5.0% and Recall@50 by +6.2% (Wang et al., 2024).
- Emergent properties: Self-verification rates in LLMs increase from <1.5% to ~20% overall (and nearly 100% in the lowest-confidence bin) after CSFT (Jang et al., 4 Jun 2025).
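For reference, the ECE figures above are typically computed with equal-width confidence bins; the following generic sketch shows the standard estimator (not tied to any one paper's evaluation code).

```python
import torch

def expected_calibration_error(conf, correct, n_bins=10):
    """Equal-width-binned Expected Calibration Error.

    conf:    (N,) predicted confidence in [0, 1]
    correct: (N,) 1.0 if the prediction was correct, else 0.0
    """
    edges = torch.linspace(0.0, 1.0, n_bins + 1)
    edges[0] = -1e-8  # include conf == 0 in the first bin
    ece = torch.tensor(0.0)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = (correct[in_bin].mean() - conf[in_bin].mean()).abs()
            ece = ece + in_bin.float().mean() * gap  # bin weight x |acc - conf|
    return ece
```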
6. Limitations, Open Problems, and Future Directions
Current CSFT methods demonstrate consistent improvements across domains but exhibit several limitations and open research questions:
- Domain specificity: Most protocols have been evaluated on structured reasoning, mathematics, or specific vision tasks; generalization to open-ended or conversational domains remains open (Li et al., 5 Jun 2025).
- Sensitivity to base calibration: Label-free methods such as RLSC depend on a well-calibrated pretrained base; bootstrapping from weaker models may be less effective (Li et al., 5 Jun 2025).
- Potential over-sharpening: Mode-sharpening objectives can cause overconfidence or loss of diversity if not properly regularized (e.g., via dynamic annealing of the sharpening coefficient or of the per-prompt sample count) (Li et al., 5 Jun 2025).
- Reliance on synthetic or proxy labels: Verbalized confidence CSFT and ConfTuner depend on either model-sampled or automated correctness labels, which may propagate bias (Li et al., 26 Aug 2025).
- Theoretical–practical gap: Although proper scoring rules and RWR provide strong guarantees, true real-world calibration depends on factors including optimization, data quality, and architectural variability (Li et al., 26 Aug 2025).
- Richer expression: Extending CSFT to free-form, conversational, or multi-modal uncertainty expression is an ongoing challenge (Li et al., 26 Aug 2025).
A plausible implication is that continued progress in CSFT will depend on grounded theoretical analysis (e.g., RWR, proper scoring rules), scalable implementations (e.g., LoRA, batch recalibration), and comprehensive evaluation on diverse, real-world data distributions.
7. Connections and Synthesis Across Domains
CSFT methods unify concepts across reinforcement learning, Bayesian calibration, and supervised adaptation:
- RL–Supervised hybridization: Anchored SFT (ASFT) demonstrates that a reward-weighted regression lower-bound, tightly linked to RL principles, can be stably enforced within a supervised framework by anchoring to a reference model via KL divergence, closing much of the gap to full RL at low compute (Zhu et al., 28 Sep 2025).
- Trustworthy deployment: ConfTuner and related methods position confidence calibration as essential for safe LLM deployment in high-stakes applications (science, law, healthcare), with improved self-correction and model cascading enabled by accurately verbalized confidence (Li et al., 26 Aug 2025).
- Interpretability: CSFT techniques that supervise confidence expression induce not only better-calibrated outputs but also emergent behaviors (self-verification, reasoning-length correlation) that enhance model interpretability (Jang et al., 4 Jun 2025).
Taken together, Confidence-Supervised Fine-Tuning provides a principled and empirically validated set of tools for aligning model predictions with their true epistemic certainty, improving reliability, accuracy, and insight across a spectrum of AI tasks.