Bayesian Coherence Coefficient (BCC)

Updated 7 June 2026

BCC is a metric that quantifies the alignment between expected and observed log-odds updates in LLMs, serving as a measure of Bayesian coherence.
It isolates the belief-updating mechanism by comparing model-generated evidential updates to ideal Bayesian log-odds derived from prior and likelihood assessments.
Empirical results show that larger models achieve higher BCC values, with implications for model predictability, alignment, and AI safety.

The Bayesian Coherence Coefficient (BCC) is a quantitative metric introduced to assess the degree to which LLMs update their in-context credences in a manner consistent with Bayes’ theorem. Unlike conventional performance metrics that benchmark zero-shot accuracy or calibration, the BCC isolates the core mechanism of belief-updating under new evidence, quantifying the coherence between expected and observed log-odds updates for model predictions. BCC provides a rigorous, scale-sensitive lens on whether LLMs’ conditional probability assignments and sequential updates approximate Bayesian rationality, with implications for the predictability, alignment, and safety of next-generation AI systems (Imran et al., 23 Jul 2025).

1. Mathematical Formalism and Definition

The BCC is grounded in Bayes’ theorem over a set of mutually exclusive and exhaustive classes $\mathcal{C}$. Given evidence $x$ and prior information encapsulated in a conversation history $h$, an ideal Bayesian agent forms updated credences via:

$P(c|x) = \frac{P(x|c)P(c)}{\sum_{c' \in \mathcal{C}} P(x|c') P(c')}$

For two classes $c_1, c_2 \in \mathcal{C}$, it is convenient to work with log-odds ratios, because the normalizing sum cancels:

$\log \frac{P(c_1|x)}{P(c_2|x)} = \log \frac{P(x|c_1)P(c_1)}{P(x|c_2)P(c_2)}$
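As a quick numerical check of the log-odds form, the sketch below uses made-up priors and likelihoods for three classes (the class names and numbers are illustrative assumptions, not values from the paper). It verifies that the posterior log-odds equal the log likelihood ratio plus the prior log-odds.

```python
import math

# Illustrative (hypothetical) priors P(c) and likelihoods P(x | c).
prior = {"a": 0.5, "b": 0.3, "c": 0.2}
likelihood = {"a": 0.10, "b": 0.40, "c": 0.25}

def posterior(prior, likelihood):
    """Bayes' theorem over mutually exclusive, exhaustive classes."""
    evidence = sum(likelihood[c] * prior[c] for c in prior)
    return {c: likelihood[c] * prior[c] / evidence for c in prior}

post = posterior(prior, likelihood)

# Log-odds between classes "a" and "b": the normalizer cancels out.
posterior_log_odds = math.log(post["a"] / post["b"])
log_likelihood_ratio = math.log(likelihood["a"] / likelihood["b"])
prior_log_odds = math.log(prior["a"] / prior["b"])

assert math.isclose(posterior_log_odds, log_likelihood_ratio + prior_log_odds)
```

The difference between posterior and prior log-odds is therefore exactly the log likelihood ratio, which is the quantity an ideal updater should reproduce.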

The Bayesian Coherence Coefficient compares:

Expected log-odds update (Δ_expected):

$\Delta_\text{expected} = \log \frac{P_\theta(x|c_1, h)}{P_\theta(x|c_2, h)}$

Observed log-odds update (Δ_observed):

$\Delta_\text{observed} = [\log P_\theta(c_1|x,h) - \log P_\theta(c_2|x,h)] - [\log P_\theta(c_1|h) - \log P_\theta(c_2|h)]$

Here, $P_\theta(\cdot|\cdot)$ denotes the LLM's conditional token probabilities. The BCC is then defined as the Pearson correlation $r$ between the two updates across an evaluation dataset $D$:

$\mathrm{BCC} = r_{D}\left(\Delta_\text{expected}, \Delta_\text{observed}\right)$

A BCC of $1$ corresponds to perfect linear coherence with Bayes’ rule, while $0$ denotes random or uncorrelated updates (Imran et al., 23 Jul 2025).
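The definition can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function name `bayesian_coherence_coefficient` and the synthetic inputs are assumptions, and the inputs are unnormalized log-scores whose shared normalizer cancels in every log-odds difference.

```python
import numpy as np

def bayesian_coherence_coefficient(lp_x_c1, lp_x_c2,
                                   lp_c1_post, lp_c2_post,
                                   lp_c1_prior, lp_c2_prior):
    """Pearson correlation between expected and observed log-odds updates.

    Each argument is a 1-D array of log-probabilities, one entry per tuple.
    """
    lx1, lx2, q1, q2, p1, p2 = (
        np.asarray(a, dtype=float)
        for a in (lp_x_c1, lp_x_c2, lp_c1_post, lp_c2_post, lp_c1_prior, lp_c2_prior)
    )
    delta_expected = lx1 - lx2                  # log likelihood ratio
    delta_observed = (q1 - q2) - (p1 - p2)      # posterior minus prior log-odds
    return float(np.corrcoef(delta_expected, delta_observed)[0, 1])

# Synthetic data (hypothetical): an ideal updater and a noisy one.
rng = np.random.default_rng(0)
n = 200
lp_c1_prior, lp_c2_prior = rng.normal(size=n), rng.normal(size=n)
lp_x_c1, lp_x_c2 = rng.normal(size=n), rng.normal(size=n)

# Ideal updater: log-posterior = log-likelihood + log-prior (up to a constant).
ideal = bayesian_coherence_coefficient(
    lp_x_c1, lp_x_c2,
    lp_x_c1 + lp_c1_prior, lp_x_c2 + lp_c2_prior,
    lp_c1_prior, lp_c2_prior,
)

# Noisy updater: same posterior for c1 plus independent noise.
noise = rng.normal(size=n)
noisy = bayesian_coherence_coefficient(
    lp_x_c1, lp_x_c2,
    lp_x_c1 + lp_c1_prior + noise, lp_x_c2 + lp_c2_prior,
    lp_c1_prior, lp_c2_prior,
)
```

In this toy setting the ideal updater reaches a correlation of essentially 1, while the noisy updater scores lower.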

2. Theoretical Foundations and Significance

Bayesian rationality requires that posterior log-odds shifts match the evidence’s log likelihood ratio. The BCC operationalizes this principle by quantifying alignment between the theoretically required and empirically inferred log-odds updates in LLMs. Unlike calibration metrics or accuracy scores, BCC directly measures the pure updating step, disregarding baseline credence distributions or marginal likelihoods.

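If observed and expected updates agree exactly (points fall on the 45° line in a scatter plot of observed against expected updates), the model's updates are Bayes-optimal. Deviations, where updates are systematically weaker (under-updating) or stronger (over-updating) than required, show up in the fitted gradient (slope below or above 1), while noisy or inconsistent updating lowers the correlation. Because Pearson correlation is unchanged by positive rescaling, BCC alone cannot distinguish exact updating from uniformly damped updating, so the update gradient is a useful companion statistic. High BCC thus indicates that the model’s credence-shifting mechanism is internally consistent and normatively sound in direction and relative size, whereas low BCC reveals incoherence or heuristic-driven updating.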
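To make the distinction between correlation and gradient concrete, the following sketch computes both, plus a sign-agreement rate, on synthetic updates. The helper name `update_diagnostics` is hypothetical, and treating "direction agreement" as the share of tuples where both updates have the same sign is an assumption, since the text does not define it.

```python
import numpy as np

def update_diagnostics(delta_expected, delta_observed):
    """Return (Pearson r, least-squares slope, sign-agreement rate)."""
    e = np.asarray(delta_expected, dtype=float)
    o = np.asarray(delta_observed, dtype=float)
    r = float(np.corrcoef(e, o)[0, 1])
    # Regress observed updates on expected updates; slope < 1 means under-updating.
    slope, _intercept = np.polyfit(e, o, deg=1)
    # Assumed definition: fraction of tuples updated in the expected direction.
    agreement = float(np.mean(np.sign(e) == np.sign(o)))
    return r, float(slope), agreement

rng = np.random.default_rng(0)
expected = rng.normal(size=500)   # synthetic log likelihood ratios
damped = 0.4 * expected           # uniformly under-updating model

r, slope, agreement = update_diagnostics(expected, damped)
```

Here the correlation is essentially 1 even though the slope is 0.4, which shows that a uniformly damped updater can score a high BCC while still under-updating.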

3. Dataset and Evaluation Procedure

Evaluation of BCC requires a dataset $D$ comprising tuples $(c_1, c_2, x, h)$, systematically constructed for discriminative coverage:

Ten semantic categories (e.g., schools of philosophy, genres of music), each with at least five candidate classes of equal token length (≤3 tokens).
For each category, 20 evidence strings, labeled for which classes they support.
Three conversation histories per category, varying in relevance.
Evidence and class labels are generated via GPT-4o with controlled prompts to ensure linguistic richness and comparability.

This design yields approximately 6460 distinct four-tuples, sufficient for robust statistical analysis of correlation metrics like BCC (Imran et al., 23 Jul 2025).
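One way such tuples could be enumerated is sketched below. It assumes each tuple has the form (class 1, class 2, evidence, history) and that class pairs are unordered; the dictionary layout, the helper name `build_tuples`, and the toy entries are all hypothetical. It does not attempt to reproduce the reported count, which depends on construction details not given here.

```python
from itertools import combinations, product

def build_tuples(categories):
    """Enumerate (c1, c2, x, h) tuples for every category.

    `categories` is a list of dicts with keys "classes", "evidence",
    and "histories" (a hypothetical layout chosen for this sketch).
    """
    tuples = []
    for cat in categories:
        pairs = combinations(cat["classes"], 2)  # unordered class pairs (assumption)
        for (c1, c2), x, h in product(pairs, cat["evidence"], cat["histories"]):
            tuples.append((c1, c2, x, h))
    return tuples

# Toy category, far smaller than the real dataset.
toy = [{
    "classes": ["stoicism", "empiricism", "idealism"],
    "evidence": ["evidence A", "evidence B"],
    "histories": ["history 1", "history 2", "history 3"],
}]

dataset = build_tuples(toy)
# 3 class pairs x 2 evidence strings x 3 histories = 18 tuples
```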

4. Experimental Protocol

For each pre-trained LLM $P_\theta$ and evaluation tuple:

Prior elicitation: Concatenate a fixed class_elicitation string and class label $c$ with history $h$ and record $P_\theta(c|h)$.
Likelihood elicitation: Present $h$ and the phrase “Given that the correct class is $c$,” then add the evidence string $x$; record $P_\theta(x|c,h)$.
Posterior elicitation: Provide $h$, the class_elicitation string, $c$, and the evidence string $x$; record $P_\theta(c|x,h)$.

Each probability is elicited via separate, stateless API calls at a fixed temperature to preclude state leakage. $\Delta_\text{expected}$ and $\Delta_\text{observed}$ are calculated for all pairs and pooled for BCC computation via Pearson’s $r$ over the dataset.
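The three elicitation steps can be sketched as follows. Everything model-specific is a stand-in: `LogProbFn` is a hypothetical scorer interface, `CLASS_ELICITATION` is placeholder wording, `toy_logprob` is not a language model, and placing the evidence before the class query in the posterior prompt is an assumption about prompt ordering.

```python
import math
from typing import Callable, Tuple

# Hypothetical scorer: summed log-probability of `continuation` given `context`.
# In a real run this would wrap one stateless LLM API call.
LogProbFn = Callable[[str, str], float]

# Placeholder wording; the actual elicitation string is not given in the text.
CLASS_ELICITATION = "The correct class is"

def elicit(logprob: LogProbFn, h: str, c: str, x: str) -> Tuple[float, float, float]:
    """Return (log P(c|h), log P(x|c,h), log P(c|x,h)) from three separate calls."""
    log_prior = logprob(f"{h} {CLASS_ELICITATION}", f" {c}")
    log_likelihood = logprob(f"{h} Given that the correct class is {c},", f" {x}")
    # Assumed ordering: evidence precedes the class query so c is scored given x.
    log_posterior = logprob(f"{h} {x} {CLASS_ELICITATION}", f" {c}")
    return log_prior, log_likelihood, log_posterior

def log_odds_updates(logprob: LogProbFn, h: str, c1: str, c2: str, x: str):
    """Compute (delta_expected, delta_observed) for one evaluation tuple."""
    p1, l1, q1 = elicit(logprob, h, c1, x)
    p2, l2, q2 = elicit(logprob, h, c2, x)
    delta_expected = l1 - l2
    delta_observed = (q1 - q2) - (p1 - p2)
    return delta_expected, delta_observed

def toy_logprob(context: str, continuation: str) -> float:
    """Deterministic stand-in for a model; NOT a real language model."""
    ctx_words = set(context.lower().split())
    cont_words = continuation.lower().split()
    overlap = len(ctx_words & set(cont_words))
    return -0.5 * len(cont_words) + 0.1 * overlap

d_exp, d_obs = log_odds_updates(
    toy_logprob,
    "We discussed music.",
    "jazz",
    "punk",
    "The band improvised over swing rhythms.",
)
```

Because each call receives its full context as an argument and keeps no state, the sketch mirrors the stateless elicitation described above; swapping `toy_logprob` for a real scorer would leave the rest unchanged.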

5. Empirical Findings

Main results highlight strong positive scaling of BCC with model size and benchmark capability:

All evaluated pre-trained models (GPT-2, Pythia, Llama 3, Falcon 3, Qwen 2.5) achieve positive BCC, indicating systematic, if imperfect, Bayesian updating.
BCC correlates strongly with log(model parameter count), with the fitted trend reported in Imran et al. (23 Jul 2025), and improves with capacity within model families. For instance, Pythia 0.21B yields BCC $= 0.505$, while Pythia 12B attains $0.681$.
Models consistently under-update: the observed–expected update gradient is $< 1$, but it moves toward unity at larger scales.
Positive but statistically non-significant increases in BCC with training step count are observed within the Pythia family.
BCC correlates positively with performance on most standard benchmarks (BIG-Bench Hard, GPQA, MMLU-PRO, MATH Lvl 5), with weaker trends on others (IFEval, MuSR).

Table: Exemplary BCC Statistics for Pythia Models

Model	BCC	Update Gradient	Direction Agreement (%)
Pythia 0.21B	0.505	0.340	63.7
Pythia 12B	0.681	0.396	73.7
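As a rough illustration of the within-family scaling, the snippet below computes the BCC gain per tenfold increase in parameters from the two Pythia rows in the table. With only two points this is simple arithmetic, not a regression, and it is not the fit reported in the paper.

```python
import math

# Rows copied from the table above: (parameter count, BCC).
pythia = {
    "Pythia 0.21B": (0.21e9, 0.505),
    "Pythia 12B": (12e9, 0.681),
}

(n_small, bcc_small), (n_large, bcc_large) = pythia.values()

# Number of factors of ten separating the two model sizes.
decades = math.log10(n_large / n_small)

# BCC gained per tenfold increase in parameters between these two models.
gain_per_decade = (bcc_large - bcc_small) / decades
```

On these two rows the gain comes out at roughly 0.1 BCC per decade of parameters.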

6. Interpretation, Implications, and Limitations

The results support the hypothesis that larger, more capable pre-trained LLMs exhibit belief updates increasingly consistent with Bayesian theory. Specifically, scaling model size enhances the internal coherence of probabilistic inference steps. Persistent under-updating—posterior shifts smaller than the ideal—arises particularly for evidence with low likelihood, suggesting cautious adaptation to rare or surprising information.

From an AI safety and alignment perspective, higher Bayesian coherence implies that models become more predictable as belief updaters, improving their steerability and the transparency of their latent-state inferences. However, increased coherence plausibly also enables more sophisticated or strategic value pursuit and could harden resistance to external attempts at correction, so the net implications for alignment and governance are mixed.

BCC’s correlation-based nature distinguishes these positive scaling results from previous findings based on absolute error metrics, which can penalize confidently correct or incorrect updates.

Acknowledged limitations include:

All evaluated models are pre-trained only, with at most 14B parameters; findings may not transfer to instruction-fine-tuned, RL-fine-tuned, or the largest frontier models.
BCC depends on cumulative token log-probabilities as proxies for credence, whose fidelity remains empirically undetermined.
Only one notion (Pearson correlation of log-odds updates) and one axiom of coherence are tested; alternatives such as Dutch-book resistance or infra-Bayesian consistency remain open for further evaluation.
Dataset biases in prompt phrasing or evidence distribution could subtly affect measured coherence.
The mechanistic origins of under-updating and the weak dependency on training steps invite deeper theoretical exploration (Imran et al., 23 Jul 2025).

7. Future Research Directions

Natural extensions include:

Evaluating BCC for models beyond the pre-trained-only regime, including those subject to instruction or RL fine-tuning.
Testing alternative proxies for internal model credence (e.g., probing representations or comparing different ways of marginalizing token sequences).
Developing orthogonal Bayesian coherence metrics, such as those based on decision-theoretic or Dutch-book analyses.
Augmenting the dataset with human-curated or real-world evidence-class histories, and systematically exploring sensitivity to prompt engineering details.
Clarifying the causes of persistent under-updating, especially for outlier evidence, by linking empirical and mechanistic analysis.

Such work would further delineate the alignment between high-capacity LLMs’ internal updating machinery and the principles of normative Bayesian rationality, deepening the empirical basis for their behavioral predictability, steerability, and governance (Imran et al., 23 Jul 2025).

References (1)

Are LLM Belief Updates Consistent with Bayes' Theorem? (2025)
