Bayesian Coherence Coefficient (BCC)
- BCC is a metric that quantifies the alignment between expected and observed log-odds updates in LLMs, serving as a measure of Bayesian coherence.
- It isolates the belief-updating mechanism by comparing model-generated evidential updates to ideal Bayesian log-odds derived from prior and likelihood assessments.
- Empirical results show that larger model capacities yield higher BCC values, highlighting improvements in model alignment and implications for AI safety.
The Bayesian Coherence Coefficient (BCC) is a quantitative metric introduced to assess the degree to which LLMs update their in-context credences in a manner consistent with Bayes’ theorem. Unlike conventional performance metrics that benchmark zero-shot accuracy or calibration, the BCC isolates the core mechanism of belief-updating under new evidence, quantifying the coherence between expected and observed log-odds updates for model predictions. BCC provides a rigorous, scale-sensitive lens on whether LLMs’ conditional probability assignments and sequential updates approximate Bayesian rationality, with implications for the predictability, alignment, and safety of next-generation AI systems (Imran et al., 23 Jul 2025).
1. Mathematical Formalism and Definition
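The BCC is grounded in Bayes’ theorem over a set of mutually exclusive and exhaustive classes $C$. Given evidence $e$ and prior information encapsulated in a conversation history $h$, an ideal Bayesian agent forms updated credences via:
$$P(c \mid e, h) = \frac{P(e \mid c, h)\, P(c \mid h)}{\sum_{c' \in C} P(e \mid c', h)\, P(c' \mid h)}$$
For two classes $c_1, c_2 \in C$, it is expedient to work with log-odds ratios, because the normalizing sum cancels:
$$\log \frac{P(c_1 \mid e, h)}{P(c_2 \mid e, h)} = \log \frac{P(c_1 \mid h)}{P(c_2 \mid h)} + \log \frac{P(e \mid c_1, h)}{P(e \mid c_2, h)}$$
The Bayesian Coherence Coefficient compares:
- Expected log-odds update (Δ_expected): $\Delta_{\text{expected}} = \log \dfrac{P_M(e \mid c_1, h)}{P_M(e \mid c_2, h)}$
- Observed log-odds update (Δ_observed): $\Delta_{\text{observed}} = \log \dfrac{P_M(c_1 \mid e, h)}{P_M(c_2 \mid e, h)} - \log \dfrac{P_M(c_1 \mid h)}{P_M(c_2 \mid h)}$
Here, $P_M$ denotes the LLM's conditional token probabilities. The BCC is then defined as the Pearson correlation across an evaluation dataset $D$:
$$\mathrm{BCC} = \rho_{(h, e, c_1, c_2) \in D}\left(\Delta_{\text{expected}},\ \Delta_{\text{observed}}\right)$$
A BCC of 1 corresponds to perfect linear coherence with Bayes’ rule, while a BCC of 0 denotes random or uncorrelated updates (Imran et al., 23 Jul 2025).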
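As a rough illustration of the definition above, the following sketch computes a BCC from already-elicited probabilities. The record layout (`prior`, `likelihood`, `posterior` as two-element pairs for the two classes) and the helper `ideal_record` are assumptions made for this example, not the authors' code. The synthetic records come from an exactly Bayesian agent, so the resulting coefficient should be close to 1.

```python
import math

def log_odds(p1, p2):
    """Log-odds of class 1 against class 2."""
    return math.log(p1) - math.log(p2)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def bcc(records):
    """Correlate expected (likelihood-ratio) and observed (posterior minus prior) updates."""
    expected, observed = [], []
    for r in records:
        expected.append(log_odds(*r["likelihood"]))
        observed.append(log_odds(*r["posterior"]) - log_odds(*r["prior"]))
    return pearson(expected, observed)

def ideal_record(prior, likelihood):
    """Hypothetical helper: posterior of an exact Bayesian agent (left unnormalised)."""
    joint = tuple(p * l for p, l in zip(prior, likelihood))
    return {"prior": prior, "likelihood": likelihood, "posterior": joint}

records = [
    ideal_record((0.6, 0.4), (0.2, 0.1)),
    ideal_record((0.3, 0.7), (0.05, 0.4)),
    ideal_record((0.5, 0.5), (0.3, 0.3)),
]
print(bcc(records))  # close to 1 for an exactly Bayesian agent
```

Only ratios between the two classes enter the computation, so the probabilities do not need to be normalised over the full class set.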
2. Theoretical Foundations and Significance
Bayesian rationality requires that posterior log-odds shifts match the evidence’s log likelihood ratio. The BCC operationalizes this principle by quantifying alignment between the theoretically required and empirically inferred log-odds updates in LLMs. Unlike calibration metrics or accuracy scores, BCC directly measures the pure updating step, disregarding baseline credence distributions or marginal likelihoods.
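If observed and expected updates align perfectly (points fall on the 45° line in a scatter plot of observed against expected updates), the model updates exactly as Bayes’ rule prescribes. Deviations, where updates are systematically weaker (under-updating) or stronger (over-updating) than expected, show up in the fitted gradient and in the correlation strength. Because Pearson correlation is invariant to rescaling, the gradient carries information that the BCC alone does not: a model that uniformly under-updates can still attain a high BCC. A high BCC thus indicates that the model’s credence-shifting mechanism is internally consistent and tracks the normative update, whereas a low BCC reveals incoherence or heuristic-driven updating.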
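The distinction between correlation and gradient can be made concrete with a small sketch. The article reports an update gradient and a direction-agreement percentage alongside the BCC but does not say how they are computed. The code below assumes an ordinary least-squares slope (with intercept) of observed on expected updates, and the fraction of pairs whose updates share a sign. Both definitions are assumptions for illustration.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

def ols_slope(expected, observed):
    """Assumed definition of the update gradient: OLS slope of observed on expected."""
    n = len(expected)
    mx, my = sum(expected) / n, sum(observed) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(expected, observed))
    var = sum((x - mx) ** 2 for x in expected)
    return cov / var

def direction_agreement(expected, observed):
    """Assumed definition: share of pairs (with non-zero expected update) that agree in sign."""
    pairs = [(x, y) for x, y in zip(expected, observed) if x != 0]
    return sum(1 for x, y in pairs if x * y > 0) / len(pairs)

# A model that always moves in the right direction but only 40% as far as it should.
expected = [-2.0, -1.0, 1.0, 2.0]
observed = [0.4 * x for x in expected]

print(pearson(expected, observed))              # correlation stays at (approximately) 1
print(ols_slope(expected, observed))            # gradient is (approximately) 0.4
print(direction_agreement(expected, observed))  # every pair agrees in sign
```

In this toy case the correlation is perfect even though every update is too small, which is why a gradient is worth reporting next to the coefficient.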
3. Dataset and Evaluation Procedure
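Evaluation of BCC requires a dataset $D$ comprising four-tuples $(h, e, c_1, c_2)$, each consisting of a conversation history, an evidence string, and a pair of candidate classes, systematically constructed for discriminative coverage: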
- Ten semantic categories (e.g., schools of philosophy, genres of music), each with at least five candidate classes of equal token length (≤3 tokens).
- For each category, 20 evidence strings, labeled for which classes they support.
- Three conversation histories per category, varying in relevance.
- Evidence and class labels are generated via GPT-4o with controlled prompts to ensure linguistic richness and comparability.
This design yields approximately 6460 distinct four-tuples, sufficient for robust statistical analysis of correlation metrics like BCC (Imran et al., 23 Jul 2025).
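The combinatorics of this design can be sketched as follows. The source does not state how class pairs are formed, so the code assumes unordered pairs within each category. The dictionary layout and all placeholder strings are hypothetical.

```python
from itertools import combinations, product

def build_tuples(categories):
    """Enumerate (history, evidence, class_1, class_2) tuples per category.

    Assumes unordered class pairs; the source does not specify the pairing scheme.
    """
    tuples = []
    for cat in categories.values():
        pairs = combinations(cat["classes"], 2)
        for (c1, c2), e, h in product(pairs, cat["evidence"], cat["histories"]):
            tuples.append((h, e, c1, c2))
    return tuples

# Placeholder data matching the stated minimums: 10 categories, 5 classes,
# 20 evidence strings, and 3 histories per category.
categories = {
    f"category_{i}": {
        "classes": [f"class_{i}_{j}" for j in range(5)],
        "evidence": [f"evidence_{i}_{k}" for k in range(20)],
        "histories": [f"history_{i}_{m}" for m in range(3)],
    }
    for i in range(10)
}

tuples = build_tuples(categories)
print(len(tuples))  # 10 categories * 10 pairs * 20 evidence * 3 histories = 6000
```

With exactly five classes per category this yields 6000 tuples. The reported figure of approximately 6460 is consistent with some categories having more than five classes, though the exact construction is not given in the text.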
4. Experimental Protocol
For each pre-trained LLM $M$ and evaluation tuple (history $h$, evidence $e$, candidate class $c$):
- Prior elicitation: Concatenate a fixed class_elicitation string and class label $c$ with history $h$ and record $P_M(c \mid h)$.
- Likelihood elicitation: Present $h$ and the phrase “Given that the correct class is $c$,” then add the evidence string $e$; record $P_M(e \mid c, h)$.
- Posterior elicitation: Provide $h$, the class_elicitation string, $c$, and the evidence string $e$; record $P_M(c \mid e, h)$.
Each probability is elicited via separate, stateless API calls at a fixed temperature to preclude state leakage. $\Delta_{\text{expected}}$ and $\Delta_{\text{observed}}$ are calculated for all pairs and pooled for BCC computation via Pearson’s $r$ over the dataset.
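The three elicitation steps might be wired together roughly as below. This is a sketch under several assumptions: `score` is a hypothetical callable returning the summed token log-probability of a continuation given a context; the wording of `CLASS_ELICITATION` is a placeholder; and the posterior prompt places the evidence before the class so that a left-to-right model can condition on it. Only the phrase "Given that the correct class is ..." comes from the protocol description. The toy scorer stands in for a real model and behaves as an exact Bayesian agent.

```python
import math
from typing import Callable, Tuple

# Hypothetical interface: (context, continuation) -> summed token log-probability.
Scorer = Callable[[str, str], float]

CLASS_ELICITATION = "The correct class is"  # placeholder wording, not from the source

def elicit_updates(score: Scorer, history: str, evidence: str,
                   c1: str, c2: str) -> Tuple[float, float]:
    """Return (expected, observed) log-odds updates for one evaluation tuple."""
    def prior(c):
        return score(f"{history}\n{CLASS_ELICITATION}", f" {c}")

    def likelihood(c):
        return score(f"{history}\nGiven that the correct class is {c},", f" {evidence}")

    def posterior(c):
        # Assumed prompt order: evidence precedes the class being scored.
        return score(f"{history}\n{evidence}\n{CLASS_ELICITATION}", f" {c}")

    expected = likelihood(c1) - likelihood(c2)
    observed = (posterior(c1) - posterior(c2)) - (prior(c1) - prior(c2))
    return expected, observed

# Toy stand-in for a model: looks up fixed probabilities instead of running an LLM.
PRIOR = {"jazz": 0.6, "folk": 0.4}
LIKELIHOOD = {"jazz": 0.2, "folk": 0.1}
EVIDENCE = "The piece features an improvised saxophone solo."

def toy_scorer(context: str, continuation: str) -> float:
    if "Given that the correct class is" in context:
        c = context.rsplit("is ", 1)[1].rstrip(",")
        return math.log(LIKELIHOOD[c])
    c = continuation.strip()
    if EVIDENCE in context:
        return math.log(PRIOR[c] * LIKELIHOOD[c])  # unnormalised posterior
    return math.log(PRIOR[c])

expected, observed = elicit_updates(
    toy_scorer, "We were discussing music.", EVIDENCE, "jazz", "folk"
)
print(expected, observed)  # the two updates coincide for this exactly Bayesian toy scorer
```

Because each call to `score` receives its full context as an argument, no state is carried between calls, mirroring the stateless elicitation described above.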
5. Empirical Findings
Main results highlight strong positive scaling of BCC with model size and benchmark capability:
- All evaluated pre-trained models (GPT-2, Pythia, Llama 3, Falcon 3, Qwen 2.5) achieve a positive BCC, indicating systematic, if imperfect, Bayesian updating.
- BCC correlates strongly with log(model parameter count) and improves with capacity within model families. For instance, Pythia 0.21B yields a BCC of 0.505, while Pythia 12B attains 0.681.
- Models consistently under-update: the observed–expected update gradient is below 1, but it approaches unity at larger scales.
- Positive but statistically non-significant increases in BCC with training step count are observed within the Pythia family.
- BCC correlates positively with performance on most standard benchmarks (BIG-Bench Hard, GPQA, MMLU-PRO, MATH Level 5), with weaker trends on others (IFEval, MuSR).
Table: Exemplary BCC Statistics for Pythia Models
| Model | BCC | Update Gradient | Direction Agreement (%) |
|---|---|---|---|
| Pythia 0.21B | 0.505 | 0.340 | 63.7 |
| Pythia 12B | 0.681 | 0.396 | 73.7 |
6. Interpretation, Implications, and Limitations
The results support the hypothesis that larger, more capable pre-trained LLMs exhibit belief updates increasingly consistent with Bayesian theory. Specifically, scaling model size enhances the internal coherence of probabilistic inference steps. Persistent under-updating—posterior shifts smaller than the ideal—arises particularly for evidence with low likelihood, suggesting cautious adaptation to rare or surprising information.
From an AI safety and alignment perspective, higher Bayesian coherence implies that models become more predictable as belief updaters, improving their steerability and the transparency of their latent-state inferences. However, a plausible implication is that increased coherence also enables more sophisticated or strategic value pursuit, and could harden resistance to external attempts at correction, with mixed implications for alignment and governance.
BCC’s correlation-based nature distinguishes these positive scaling results from previous findings based on absolute error metrics, which can penalize confidently correct or incorrect updates.
Acknowledged limitations include:
- All evaluated models are pre-trained only, with at most 14B parameters; findings may not transfer to instruction-fine-tuned or RL-fine-tuned models, or to the largest frontier models.
- BCC depends on cumulative token log-probabilities as proxies for credence, whose fidelity remains empirically undetermined.
- Only one notion (Pearson correlation of log-odds updates) and one axiom of coherence are tested; alternatives such as Dutch-book resistance or infra-Bayesian consistency remain open for further evaluation.
- Dataset biases in prompt phrasing or evidence distribution could subtly affect measured coherence.
- The mechanistic origins of under-updating and the weak dependency on training steps invite deeper theoretical exploration (Imran et al., 23 Jul 2025).
7. Future Research Directions
Natural extensions include:
- Evaluating BCC for models beyond the pre-trained-only regime, including those subject to instruction or RL fine-tuning.
- Testing alternative proxies for internal model credence (e.g., probing representations or comparing different ways of marginalizing token sequences).
- Developing orthogonal Bayesian coherence metrics, such as those based on decision-theoretic or Dutch-book analyses.
- Augmenting the dataset with human-curated or real-world evidence-class histories, and systematically exploring sensitivity to prompt engineering details.
- Clarifying the causes of persistent under-updating, especially for outlier evidence, by linking empirical and mechanistic analysis.
Such work would further delineate the alignment between high-capacity LLMs’ internal updating machinery and the principles of normative Bayesian rationality, deepening the empirical basis for their behavioral predictability, steerability, and governance (Imran et al., 23 Jul 2025).