Bridge-Guided Evidence Calibration (BGEC)
- The paper introduces a double-calibration mechanism that integrates Bayesian-calibrated KG evidence with LLM predictions to enhance trustworthiness.
- BGEC leverages a lightweight proxy model with supervised fine-tuning and reinforcement learning to generate evidence paths and calibrated confidence scores.
- Experimental results on WebQSP and CWQ demonstrate that BGEC significantly reduces Expected Calibration Error while maintaining competitive accuracy and token efficiency.
Bridge-Guided Evidence Calibration (BGEC) is a core component within the DoublyCal framework for enhancing the trustworthiness of LLM reasoning, particularly in knowledge graph (KG)-augmented settings. BGEC enables quantification and propagation of epistemic uncertainty from retrieved or generated KG evidence into black-box LLM predictions. The mechanism operates by generating calibrated evidence—supplemented with principled confidence scores—via a proxy model, then bridging this external calibration directly into the LLM's context to anchor its self-reported confidence. This yields a system in which both the evidence and the final answer are accompanied by rigorously calibrated uncertainty estimates, significantly improving both accuracy and Expected Calibration Error (ECE) in knowledge-intensive tasks (Lu et al., 17 Jan 2026).
1. Double-Calibration Principle and Evidence Confidence Estimation
BGEC is anchored in the double-calibration paradigm. The first calibration stage quantifies the confidence of each piece of retrieved or generated KG evidence $z$. Given a question $q$ and true answer set $\mathcal{A}$, let $z$ denote supporting evidence represented as a constrained relational path, with grounded candidates $\mathcal{E}_z$. The confidence that this evidence “contains” the true answer is computed by:
$$C(z) = \frac{|\mathcal{E}_z \cap \mathcal{A}| + \tfrac{1}{2}}{|\mathcal{E}_z| + 1}$$
This Bayesian posterior mean results from a Beta-Bernoulli model with Jeffreys prior ($\mathrm{Beta}(\tfrac{1}{2}, \tfrac{1}{2})$), balancing observed evidence quality with prior uncertainty. The resulting evidence confidence is directly interpretable as the estimated probability that the provided KG path leads to the correct answer.
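As an illustration of the posterior-mean computation, the following is a minimal sketch rather than the authors' implementation. It assumes that each grounded candidate counts as one Bernoulli trial that succeeds when the candidate belongs to the true answer set; the function name `evidence_confidence` and its parameters are hypothetical.

```python
def evidence_confidence(candidates, answers, alpha=0.5, beta=0.5):
    """Posterior mean of a Beta-Bernoulli model over grounded candidates.

    alpha = beta = 0.5 is the Jeffreys prior Beta(1/2, 1/2).
    Assumption: a "success" is a grounded candidate found in the answer set.
    """
    candidates = set(candidates)
    hits = len(candidates & set(answers))
    return (hits + alpha) / (len(candidates) + alpha + beta)


# One correct entity among four grounded candidates.
print(evidence_confidence({"e1", "e2", "e3", "e4"}, {"e1"}))  # (1 + 0.5) / (4 + 1) = 0.3
```

With no grounded candidates at all, the estimate falls back to the prior mean of 0.5, which is the sense in which the prior tempers sparse observations.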
2. Lightweight Proxy Model for Inference-Time Evidence Calibration
The BGEC pipeline relies on a lightweight proxy model to produce both evidence candidates and their calibrated confidence at inference time, in contrast to retrospective or post-hoc calibration. The proxy, instantiated as a Llama2-7B model, undergoes the following two-phase training regimen:
- Supervised Fine-Tuning (SFT): The model is trained to output constrained KG paths together with their associated evidence confidence scores in a fixed textual format. The SFT objective is standard autoregressive cross-entropy.
- Reinforcement Learning (RL) Refinement: Proxy outputs are further optimized with a compound reward that combines an inferential-quality term, based on a path-matching measure, with a calibration-alignment term; the reward also includes a tunable hyperparameter. Optimization is performed via Group Relative Policy Optimization (GRPO), maintaining consistency with the SFT initialization.
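The RL stage can be pictured with the sketch below, which is illustrative only. The group-relative advantage (each sampled output's reward standardized against the other samples for the same question) is the general idea behind GRPO. The reward shape in `compound_reward`, the weight `lam`, and the use of a population standard deviation are assumptions made for the example, not the paper's definitions.

```python
from statistics import mean, pstdev


def compound_reward(path_match, pred_conf, target_conf, lam=0.5):
    """Hypothetical compound reward: quality term plus weighted calibration term.

    path_match  -- score in [0, 1] for how well the generated path matches a reference
    pred_conf   -- confidence emitted by the proxy
    target_conf -- calibrated confidence the proxy should reproduce
    lam         -- assumed trade-off weight (hyperparameter)
    """
    calibration = 1.0 - abs(pred_conf - target_conf)
    return path_match + lam * calibration


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled outputs for one question: (path_match, predicted confidence).
samples = [(1.0, 0.80), (0.5, 0.95), (0.0, 0.10)]
rewards = [compound_reward(m, c, target_conf=0.75) for m, c in samples]
print(group_relative_advantages(rewards))
```

Outputs that both match the reference path and report a confidence close to the target receive positive advantages, while the others are pushed down relative to the group.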
3. Bridging Calibrated Evidence into Black-Box LLMs
After proxy inference, BGEC "bridges" the top-$k$ pieces of evidence and their calibrated confidences into the LLM via prompt engineering. Each evidence-confidence pair is rendered into a text block that states the evidence path together with its calibrated confidence score.
These blocks are prepended to the LLM input, along with a verbalized uncertainty quantification (UQ) template instructing the LLM to answer and then provide its confidence within [0.0, 1.0]. This prompt design ensures the LLM observes and conditions on the proxy model’s explicit confidence scores for every evidence snippet it receives.
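A prompt of this kind could be assembled as in the sketch below. The block layout, the instruction wording, and the helper names `render_evidence` and `build_prompt` are hypothetical; the source does not preserve the actual template.

```python
def render_evidence(path, confidence):
    """Render one evidence path and its calibrated confidence as a text block."""
    return (
        f"Evidence path: {' -> '.join(path)}\n"
        f"Evidence confidence: {confidence:.2f}"
    )


def build_prompt(question, evidence, k=3):
    """Prepend the k most confident evidence blocks, then ask for answer and confidence.

    evidence -- list of (path, confidence) pairs, where path is a list of relation names
    """
    top = sorted(evidence, key=lambda pc: pc[1], reverse=True)[:k]
    blocks = "\n\n".join(render_evidence(p, c) for p, c in top)
    instruction = (
        "Answer the question using the evidence above, then state your "
        "confidence as a number in [0.0, 1.0]."
    )
    return f"{blocks}\n\nQuestion: {question}\n{instruction}"


evidence = [
    (["people.person.place_of_birth"], 0.90),
    (["people.person.nationality", "location.country.capital"], 0.35),
]
print(build_prompt("Where was the author born?", evidence, k=2))
```

Keeping the numeric confidence on its own line next to each path makes it easy for the LLM to associate a score with the specific evidence it qualifies.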
4. Final Prediction Step and Confidence Propagation
In contrast to attempts at explicit arithmetic fusion of external and LLM confidences, BGEC leverages a conditioning scheme: the LLM’s own self-reported confidence $\hat{c}$ is implicitly anchored by the externally calibrated evidence confidences $\{C(z_i)\}$. Final prediction is thus a tuple $(\hat{a}, \hat{c})$, with $\hat{c}$ reflecting both model uncertainty and the traceable reliability of the supporting evidence. Expected Calibration Error (ECE) over these tuples is computed in the standard manner.
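For reference, the usual binned ECE estimate can be sketched as follows. The choice of 10 equal-width bins and the half-open bin boundaries are assumptions of this example; the source does not state the binning used.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted mean of |accuracy - mean confidence| over bins.

    confidences -- self-reported confidences in [0, 1]
    correct     -- 1 if the corresponding answer was right, else 0
    """
    n = len(confidences)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes to the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


# Two answers reported at 0.9 confidence, only one of them correct.
print(expected_calibration_error([0.9, 0.9], [1, 0]))
```

In the example both predictions land in the same bin with mean confidence 0.9 and accuracy 0.5, so the gap of roughly 0.4 is the ECE.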
Empirically, this double-calibration chain achieves an ECE below 5% on the WebQSP dataset, as opposed to above 10% for single-stage approaches, with F1 and recall comparable or superior to prior KG-augmented LLM baselines.
| Reasoner | F1 (WebQSP) | ECE ↓ (WebQSP) |
|---|---|---|
| SubgraphRAG | 77.3% | 11.1% |
| SFT-DoublyCal | 72.6% | 3.1% |
| RL-DoublyCal | 76.7% | 4.5% |
Reliability diagrams indicate that BGEC-achieved confidences align more closely with the ideal calibration line.
5. End-to-End Pipeline and Token Efficiency
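The overall BGEC-empowered DoublyCal pipeline can be summarized as follows: the proxy model generates candidate evidence paths with calibrated confidences for the question; the top-ranked evidence-confidence pairs are rendered as text blocks and prepended to the LLM prompt together with the verbalized UQ template; and the LLM returns an answer along with its self-reported confidence.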
This structure enables token-efficient inference: BGEC matches or surpasses the F1 of relational-path RAG (RoG) with only ~10% more tokens and utilizes approximately 39% of the tokens used by SubgraphRAG.
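The end-to-end flow can be sketched with the proxy and the LLM passed in as plain callables. Everything here is hypothetical scaffolding: the reply format `Confidence: <number>`, the function names, and the stub models stand in for components the source does not specify.

```python
import re


def parse_answer_and_confidence(reply):
    """Split an LLM reply into (answer, confidence); confidence is None if absent."""
    match = re.search(r"confidence\s*[:=]\s*([01](?:\.\d+)?)", reply, re.IGNORECASE)
    if match is None:
        return reply.strip(), None
    answer = reply[: match.start()].strip()
    confidence = min(max(float(match.group(1)), 0.0), 1.0)
    return answer, confidence


def doubly_calibrated_answer(question, proxy, llm, k=3):
    """Proxy evidence -> top-k bridging prompt -> LLM answer with confidence.

    proxy -- callable: question -> list of (path_text, confidence)
    llm   -- callable: prompt -> reply text
    """
    evidence = sorted(proxy(question), key=lambda pc: pc[1], reverse=True)[:k]
    context = "\n".join(f"{path} (confidence {conf:.2f})" for path, conf in evidence)
    prompt = (
        f"{context}\nQuestion: {question}\n"
        "Answer, then report 'Confidence: <number in [0.0, 1.0]>'."
    )
    return parse_answer_and_confidence(llm(prompt))


# Stub models used only to exercise the control flow.
stub_proxy = lambda q: [("person -> place_of_birth", 0.9), ("person -> nationality", 0.2)]
stub_llm = lambda prompt: "Honolulu\nConfidence: 0.8"
print(doubly_calibrated_answer("Where was the author born?", stub_proxy, stub_llm))
```

Because only the top-k short path strings enter the prompt, the context stays small compared with approaches that serialize a whole subgraph.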
6. Experimental Results and Significance
BGEC’s performance has been validated on the WebQSP and CWQ benchmarks using Hits@1, Recall, macro-F1, and ECE. BGEC-driven DoublyCal achieves state-of-the-art calibration and highly competitive accuracy while drastically reducing ECE relative to all single-stage and prior multi-hop KG-augmented methods. The empirical reduction of ECE by more than half, along with token efficiency, substantiates the utility of bridging calibrated evidence confidence into LLM reasoning.
A plausible implication is that BGEC establishes a robust template for uncertainty quantification in future KG-augmented reasoning systems. By ensuring that LLM confidence is explicitly driven by and traceable to evidence quality, BGEC enhances trustworthiness, interpretability, and practical reliability of automated knowledge-intensive systems (Lu et al., 17 Jan 2026).