
Bayesian Coherence Coefficient (BCC)

Updated 7 June 2026
  • BCC is a metric that quantifies the alignment between expected and observed log-odds updates in LLMs, serving as a measure of Bayesian coherence.
  • It isolates the belief-updating mechanism by comparing model-generated evidential updates to ideal Bayesian log-odds derived from prior and likelihood assessments.
  • Empirical results show that larger model capacities yield higher BCC values, highlighting improvements in model alignment and implications for AI safety.

The Bayesian Coherence Coefficient (BCC) is a quantitative metric introduced to assess the degree to which LLMs update their in-context credences in a manner consistent with Bayes’ theorem. Unlike conventional performance metrics that benchmark zero-shot accuracy or calibration, the BCC isolates the core mechanism of belief-updating under new evidence, quantifying the coherence between expected and observed log-odds updates for model predictions. BCC provides a rigorous, scale-sensitive lens on whether LLMs’ conditional probability assignments and sequential updates approximate Bayesian rationality, with implications for the predictability, alignment, and safety of next-generation AI systems (Imran et al., 23 Jul 2025).

1. Mathematical Formalism and Definition

The BCC is grounded in Bayes’ theorem over a set of mutually exclusive and exhaustive classes $\mathcal{C}$. Given evidence $x$ and prior information encapsulated in a conversation history $h$, an ideal Bayesian agent forms updated credences via:

$$P(c|x) = \frac{P(x|c)P(c)}{\sum_{c' \in \mathcal{C}} P(x|c') P(c')}$$

For two classes $c_1, c_2 \in \mathcal{C}$, it is expedient to work with log-odds ratios:

$$\log \frac{P(c_1|x)}{P(c_2|x)} = \log \frac{P(x|c_1)P(c_1)}{P(x|c_2)P(c_2)}$$

The Bayesian Coherence Coefficient compares:

  • Expected log-odds update (Δ_expected):

$$\Delta_\text{expected} = \log \frac{P_\theta(x|c_1, h)}{P_\theta(x|c_2, h)}$$

  • Observed log-odds update (Δ_observed):

$$\Delta_\text{observed} = [\log P_\theta(c_1|x,h) - \log P_\theta(c_2|x,h)] - [\log P_\theta(c_1|h) - \log P_\theta(c_2|h)]$$

Here, $P_\theta(\cdot|\cdot)$ denotes the LLM's conditional token probabilities. The BCC is then defined as the Pearson correlation across an evaluation dataset $D$:

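$$\mathrm{BCC} = \mathrm{corr}_{D}\left(\Delta_\text{expected}, \Delta_\text{observed}\right) = \frac{\mathrm{Cov}_{D}\left(\Delta_\text{expected}, \Delta_\text{observed}\right)}{\sigma_{D}\left(\Delta_\text{expected}\right)\,\sigma_{D}\left(\Delta_\text{observed}\right)}$$

A BCC of $1$ corresponds to perfect linear coherence with Bayes’ rule, while $0$ denotes random or uncorrelated updates (Imran et al., 23 Jul 2025).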
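The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (`expected_update`, `observed_update`, `bcc`) are hypothetical, and the inputs are assumed to be log-probabilities that have already been elicited from the model.

```python
import math
from typing import Sequence


def expected_update(logp_x_c1: float, logp_x_c2: float) -> float:
    """Log likelihood ratio: log P(x|c1,h) - log P(x|c2,h)."""
    return logp_x_c1 - logp_x_c2


def observed_update(logp_c1_post: float, logp_c2_post: float,
                    logp_c1_prior: float, logp_c2_prior: float) -> float:
    """Posterior log-odds minus prior log-odds for the pair (c1, c2)."""
    return (logp_c1_post - logp_c2_post) - (logp_c1_prior - logp_c2_prior)


def bcc(expected: Sequence[float], observed: Sequence[float]) -> float:
    """Pearson correlation between expected and observed log-odds updates."""
    n = len(expected)
    if n != len(observed) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_e = sum(expected) / n
    mean_o = sum(observed) / n
    cov = sum((e - mean_e) * (o - mean_o) for e, o in zip(expected, observed))
    var_e = sum((e - mean_e) ** 2 for e in expected)
    var_o = sum((o - mean_o) ** 2 for o in observed)
    if var_e == 0 or var_o == 0:
        raise ValueError("correlation is undefined for constant inputs")
    return cov / math.sqrt(var_e * var_o)


# Toy data (not from the paper): a model that moves exactly half as far
# as the likelihood ratio demands on every item.
expected = [-2.0, -1.0, 1.0, 2.0]
observed = [0.5 * e for e in expected]
print(bcc(expected, observed))  # 1.0
```

The toy run shows a property of any correlation-based score: uniform under-updating leaves the BCC at 1, because Pearson correlation is invariant to rescaling. The size of the updates therefore has to be read from a separate quantity such as a regression gradient.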

2. Theoretical Foundations and Significance

Bayesian rationality requires that posterior log-odds shifts match the evidence’s log likelihood ratio. The BCC operationalizes this principle by quantifying alignment between the theoretically required and empirically inferred log-odds updates in LLMs. Unlike calibration metrics or accuracy scores, BCC directly measures the pure updating step, disregarding baseline credence distributions or marginal likelihoods.
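The requirement follows in one step from the log-odds form of Bayes’ theorem. The derivation below conditions every term on the history $h$, which is an assumption about how $h$ enters as background information; the marginal likelihood $P(x|h)$ cancels in the ratio, which is why it plays no role in the metric.

$$
\begin{aligned}
\log \frac{P(c_1|x,h)}{P(c_2|x,h)}
  &= \log \frac{P(x|c_1,h)\,P(c_1|h)\,/\,P(x|h)}{P(x|c_2,h)\,P(c_2|h)\,/\,P(x|h)}
   = \log \frac{P(x|c_1,h)}{P(x|c_2,h)} + \log \frac{P(c_1|h)}{P(c_2|h)} \\
\underbrace{\log \frac{P(c_1|x,h)}{P(c_2|x,h)} - \log \frac{P(c_1|h)}{P(c_2|h)}}_{\Delta_\text{observed}}
  &= \underbrace{\log \frac{P(x|c_1,h)}{P(x|c_2,h)}}_{\Delta_\text{expected}}
\end{aligned}
$$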

If observed and expected updates align perfectly (points fall on the 45° line in scatter plots), the model is Bayes-optimal. Systematic deviations appear as a regression gradient below 1 (under-updating) or above 1 (over-updating), while unsystematic scatter around the line lowers the correlation. High BCC thus indicates that the model’s credence-shifting mechanism is internally consistent and normatively sound, whereas low BCC reveals incoherence or heuristic-driven updating.

3. Dataset and Evaluation Procedure

Evaluation of BCC requires a dataset $D$ comprising four-tuples of two classes $c_1, c_2$, an evidence string $x$, and a conversation history $h$, systematically constructed for discriminative coverage:

  • Ten semantic categories (e.g., schools of philosophy, genres of music), each with at least five candidate classes of equal token length (≤3 tokens).
  • For each category, 20 evidence strings, labeled for which classes they support.
  • Three conversation histories per category, varying in relevance.
  • Evidence and class labels are generated via GPT-4o with controlled prompts to ensure linguistic richness and comparability.

This design yields approximately 6460 distinct four-tuples, sufficient for robust statistical analysis of correlation metrics like BCC (Imran et al., 23 Jul 2025).

4. Experimental Protocol

For each pre-trained LLM $P_\theta$ and evaluation tuple:

  • Prior elicitation: Concatenate a fixed class_elicitation string and class label $c$ with history $h$ and record $P_\theta(c|h)$.
  • Likelihood elicitation: Present $h$ and the phrase “Given that the correct class is $c$,” then add the evidence string $x$; record $P_\theta(x|c,h)$.
  • Posterior elicitation: Provide $h$, the class_elicitation string, $c$, and the evidence string $x$; record $P_\theta(c|x,h)$.

Each probability is elicited via a separate, stateless API call at a fixed temperature to preclude state leakage. $\Delta_\text{expected}$ and $\Delta_\text{observed}$ are calculated for all pairs and pooled for BCC computation via Pearson’s $r$ over the dataset.
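The three elicitation steps can be sketched as follows. Everything here is illustrative rather than the authors' code: `LogProbFn` is a hypothetical scorer that returns the summed log-probability of a continuation given a context, the wording of `CLASS_ELICITATION` is a placeholder, and the placement of the evidence before the class label in the posterior prompt is an assumption made so that the class label can be scored conditional on the evidence.

```python
from typing import Callable, Dict, Tuple

# Hypothetical scorer: (context, continuation) -> summed token log-probability
# of `continuation` given `context`. Each call is assumed to be stateless.
LogProbFn = Callable[[str, str], float]

CLASS_ELICITATION = "The correct class is"  # placeholder wording (assumption)


def elicit(logprob: LogProbFn, h: str, c: str, x: str) -> Dict[str, float]:
    """Return log P(c|h), log P(x|c,h) and log P(c|x,h) for one class label."""
    prior = logprob(f"{h}\n{CLASS_ELICITATION}", f" {c}")
    likelihood = logprob(f"{h}\nGiven that the correct class is {c},", f" {x}")
    # Assumed ordering: evidence first, then the elicitation string and label.
    posterior = logprob(f"{h}\n{x}\n{CLASS_ELICITATION}", f" {c}")
    return {"prior": prior, "likelihood": likelihood, "posterior": posterior}


def updates_for_pair(logprob: LogProbFn, h: str, c1: str, c2: str,
                     x: str) -> Tuple[float, float]:
    """Return (expected, observed) log-odds updates for the pair (c1, c2)."""
    a = elicit(logprob, h, c1, x)
    b = elicit(logprob, h, c2, x)
    expected = a["likelihood"] - b["likelihood"]
    observed = (a["posterior"] - b["posterior"]) - (a["prior"] - b["prior"])
    return expected, observed


def toy_scorer(context: str, continuation: str) -> float:
    """Stand-in scorer for demonstration only; not a language model."""
    return -float(len(context) + len(continuation))


print(updates_for_pair(toy_scorer, "hist", "jazz", "blues", "evidence"))
```

In practice `toy_scorer` would be replaced by a function that queries a model and sums the log-probabilities of the continuation tokens. Keeping the scorer as a parameter makes the stateless, one-call-per-probability structure explicit.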

5. Empirical Findings

Main results highlight strong positive scaling of BCC with model size and benchmark capability:

  • All evaluated pre-trained models (GPT-2, Pythia, Llama 3, Falcon 3, Qwen 2.5) achieve BCC values indicating systematic, if imperfect, Bayesian updating.
  • BCC correlates strongly with log(model parameter count) and improves with capacity within model families. For instance, Pythia 0.21B yields BCC 0.505, while Pythia 12B attains 0.681.
  • Models consistently under-update: the observed–expected update gradient is below 1, but this approaches unity at larger scales.
  • Positive but statistically non-significant increases in BCC with training step count are observed within the Pythia family.
  • BCC correlates positively with performance on most standard benchmarks (BIG-Bench Hard, GPQA, MMLU-PRO, Math 5), with weaker trends on others (IFEval, MuSR).

Table: Exemplary BCC Statistics for Pythia Models

| Model        | BCC   | Update Gradient | Direction Agreement (%) |
|--------------|-------|-----------------|-------------------------|
| Pythia 0.21B | 0.505 | 0.340           | 63.7                    |
| Pythia 12B   | 0.681 | 0.396           | 73.7                    |
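The two auxiliary columns can be computed from the same pooled updates as the BCC. The sketch below rests on assumed definitions that the text does not spell out: the update gradient is taken to be the ordinary least-squares slope of observed on expected updates, and direction agreement is taken to be the percentage of items whose observed and expected updates share a sign. The function names and the toy numbers are hypothetical.

```python
from typing import Sequence


def update_gradient(expected: Sequence[float], observed: Sequence[float]) -> float:
    """OLS slope of observed updates regressed on expected updates (assumed definition)."""
    n = len(expected)
    mean_e = sum(expected) / n
    mean_o = sum(observed) / n
    sxx = sum((e - mean_e) ** 2 for e in expected)
    sxy = sum((e - mean_e) * (o - mean_o) for e, o in zip(expected, observed))
    return sxy / sxx


def direction_agreement(expected: Sequence[float], observed: Sequence[float]) -> float:
    """Percentage of items whose updates share a sign (assumed definition).

    Items where either update is exactly zero are skipped, which is a
    choice made for this sketch.
    """
    pairs = [(e, o) for e, o in zip(expected, observed) if e != 0 and o != 0]
    agree = sum(1 for e, o in pairs if (e > 0) == (o > 0))
    return 100.0 * agree / len(pairs)


# Toy data (not from the paper): updates are too small, and one has the wrong sign.
expected = [-2.0, -1.0, 1.0, 2.0]
observed = [-0.8, 0.1, 0.4, 0.8]
print(round(update_gradient(expected, observed), 3))  # 0.35
print(direction_agreement(expected, observed))        # 75.0
```

A gradient below 1, as in both table rows, corresponds to under-updating under this definition: the model moves in the right direction on average but by less than the likelihood ratio warrants.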

6. Interpretation, Implications, and Limitations

The results support the hypothesis that larger, more capable pre-trained LLMs exhibit belief updates increasingly consistent with Bayesian theory. Specifically, scaling model size enhances the internal coherence of probabilistic inference steps. Persistent under-updating—posterior shifts smaller than the ideal—arises particularly for evidence with low likelihood, suggesting cautious adaptation to rare or surprising information.

From an AI safety and alignment perspective, higher Bayesian coherence implies that models become more predictable as belief updaters, improving their steerability and the transparency of their latent-state inferences. However, a plausible implication is that increased coherence also enables more sophisticated or strategic value pursuit, and could harden resistance to external attempts at correction, with mixed implications for alignment and governance.

BCC’s correlation-based nature distinguishes these positive scaling results from previous findings based on absolute error metrics, which can penalize confidently correct or incorrect updates.

Acknowledged limitations include:

  • All evaluated models are pre-trained only, with at most 14B parameters; findings may not transfer to instruction-fine-tuned, RL-fine-tuned, or the largest frontier models.
  • BCC depends on cumulative token log-probabilities as proxies for credence, whose fidelity remains empirically undetermined.
  • Only one notion (Pearson correlation of log-odds updates) and one axiom of coherence are tested; alternatives such as Dutch-book resistance or infra-Bayesian consistency remain open for further evaluation.
  • Dataset biases in prompt phrasing or evidence distribution could subtly affect measured coherence.
  • The mechanistic origins of under-updating and the weak dependency on training steps invite deeper theoretical exploration (Imran et al., 23 Jul 2025).

7. Future Research Directions

Natural extensions include:

  • Evaluating BCC for models beyond the pre-trained-only regime, including those subject to instruction or RL fine-tuning.
  • Testing alternative proxies for internal model credence (e.g., probing representations or comparing different ways of marginalizing token sequences).
  • Developing orthogonal Bayesian coherence metrics, such as those based on decision-theoretic or Dutch-book analyses.
  • Augmenting the dataset with human-curated or real-world evidence-class histories, and systematically exploring sensitivity to prompt engineering details.
  • Clarifying the causes of persistent under-updating, especially for outlier evidence, by linking empirical and mechanistic analysis.

Such work would further delineate the alignment between high-capacity LLMs’ internal updating machinery and the principles of normative Bayesian rationality, deepening the empirical basis for their behavioral predictability, steerability, and governance (Imran et al., 23 Jul 2025).
