
Martingale Score in Bayesian Reasoning

Updated 9 December 2025
  • Martingale Score is a metric that quantifies deviation from the martingale property, the condition that characterizes Bayesian rationality in belief updates.
  • It is computed via ordinary least squares regression, where the slope indicates systematic bias or belief entrenchment in iterative reasoning.
  • Empirical studies reveal that higher absolute martingale scores correlate with poorer probabilistic forecasts, validating its unsupervised assessment approach.

A martingale score is a quantitative metric derived from the martingale property in probability theory and statistics, primarily used to evaluate whether the evolution of probabilistic beliefs, predictions, or model outputs is consistent with principles of Bayesian rationality or statistical exchangeability. Martingale scores are utilized in multiple domains, including the assessment of iterative belief updating in LLMs, score-driven martingale posteriors in Bayesian inference, and online testing for data exchangeability. Recent work, notably "Martingale Score: An Unsupervised Metric for Bayesian Rationality in LLM Reasoning" (He et al., 2 Dec 2025), formalizes the martingale score as an unsupervised regression-based metric to detect violations of Bayesian rationality in LLMs.

1. Foundations: The Martingale Property in Bayesian Reasoning

Let $\{B_t\}_{t=0}^T$ be a sequence of probabilistic beliefs or predicted probabilities output by a model (e.g., an LLM) across $T$ reasoning steps. The martingale property, foundational in Bayesian statistics, stipulates that under rational updating, the conditional expectation of the next belief given the current belief is equal to the current belief:

$$\mathbb{E}[B_{t+1} \mid B_t] = B_t$$

Defining the one-step update as $\Delta B_t := B_{t+1} - B_t$, this property requires

$$\mathbb{E}[\Delta B_t \mid B_t = p] = 0 \quad \forall \, p \in [0,1]$$

A rational Bayesian reasoner thus cannot systematically predict the direction or magnitude of belief updates solely from the present belief; any systematic deviation from unpredictability signals a violation of this principle.
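As an illustrative check (not taken from the cited paper), the sketch below simulates a textbook Bayesian updater: a Beta(1, 1) prior on a coin's bias, with the belief $B_t$ taken to be the posterior mean after $t$ flips. The function names, the choice of model, and the 0.5 bin threshold are all assumptions made for the example. The coin's true bias is drawn from the same prior the reasoner uses, which is the setting in which the martingale property is expected to hold.

```python
import random

def bayes_belief_path(rng, steps):
    """Posterior-mean beliefs of a Beta-Bernoulli updater over `steps` coin flips."""
    theta = rng.random()            # true bias, drawn from the Uniform(0, 1) prior
    heads, tails = 1.0, 1.0         # Beta(1, 1) pseudo-counts
    path = [heads / (heads + tails)]
    for _ in range(steps):
        if rng.random() < theta:
            heads += 1.0
        else:
            tails += 1.0
        path.append(heads / (heads + tails))
    return path

def mean_update_by_bin(paths, threshold=0.5):
    """Average one-step update, split by whether the current belief is below the threshold."""
    low, high = [], []
    for path in paths:
        for b, b_next in zip(path, path[1:]):
            (low if b < threshold else high).append(b_next - b)
    return sum(low) / len(low), sum(high) / len(high)

rng = random.Random(0)
paths = [bayes_belief_path(rng, steps=10) for _ in range(5000)]
low_mean, high_mean = mean_update_by_bin(paths)
print(low_mean, high_mean)  # both should be close to zero
```

Both bin averages should come out near zero: knowing whether the current belief is low or high does not help predict the direction of the next update.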

2. Regression-Based Definition and Computation of the Martingale Score

The martingale score $M$ operationalizes the detection of violations of the martingale property via a linear regression:

Given $N$ independent samples $(b_1, \Delta b_1), \dots, (b_N, \Delta b_N)$, where $b_i$ is the prior belief and $\Delta b_i$ the subsequent update, fit

$$\Delta b_i = \beta_0 + \beta_1 b_i + \varepsilon_i$$

by ordinary least squares (OLS). The martingale score is defined as the fitted slope:

$$M = \hat{\beta}_1 = \frac{\sum_{i=1}^N (b_i - \bar{b})(\Delta b_i - \overline{\Delta b})}{\sum_{i=1}^N (b_i - \bar{b})^2}$$

where $\bar{b}$ and $\overline{\Delta b}$ denote sample means. A two-sided t-test of $M$ versus zero yields a p-value assessing statistical significance.
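As a minimal sketch of the slope formula above, the following pure-Python function computes $M$ directly from paired priors and updates. The function name `martingale_slope` and the four sample pairs are hypothetical, chosen only to make the arithmetic easy to follow; they are not taken from He et al.

```python
def martingale_slope(priors, updates):
    """OLS slope of updates regressed on priors (intercept included)."""
    n = len(priors)
    if n != len(updates) or n < 2:
        raise ValueError("need at least two matched (prior, update) pairs")
    b_bar = sum(priors) / n
    d_bar = sum(updates) / n
    sxx = sum((b - b_bar) ** 2 for b in priors)
    if sxx == 0.0:
        raise ValueError("priors must not all be identical")
    sxy = sum((b - b_bar) * (d - d_bar) for b, d in zip(priors, updates))
    return sxy / sxx

# Hypothetical toy data: low priors drift down, high priors drift up.
priors = [0.2, 0.4, 0.6, 0.8]
updates = [-0.05, -0.02, 0.02, 0.05]
M = martingale_slope(priors, updates)
print(round(M, 4))  # approximately 0.17
```

Here $\bar{b} = 0.5$, $\overline{\Delta b} = 0$, the numerator is $0.034$ and the denominator is $0.2$, so $M \approx 0.17$.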

The following table summarizes the key elements:

| Element | Definition | Implementation |
|---|---|---|
| $\Delta b_i$ | $b_{i,t+1} - b_{i,t}$ | Per step |
| Regression | $\Delta b_i = \beta_0 + \beta_1 b_i + \varepsilon_i$ | OLS fit |
| Martingale score $M$ | $\hat{\beta}_1$ from regression | Slope |

3. Interpretation: Detection of Belief Entrenchment and Bayesian Rationality

Interpretation of $M$:

  • $M \approx 0$: Updates are on average unpredictable from priors, consistent with Bayesian rationality.
  • $M \gg 0$: Positive dependence, where high priors increase further and low priors decrease; this is termed belief entrenchment (confirmation bias).
  • $M < 0$: Anti-entrenchment, systematic reversals of prior beliefs.

Empirically, for $b \in [0,1]$, $M$ typically resides in $[-1, 1]$, with larger $|M|$ denoting more severe deviation from the martingale property.
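The three regimes can be illustrated with synthetic update rules. In the sketch below, the rules, the 0.2 strength, and the 0.05 step size are assumptions invented for the example: an entrenching rule pushes beliefs away from 0.5, a reverting rule pulls them back, and a balanced rule moves each prior up and down equally often.

```python
def ols_slope(xs, ys):
    """Slope of the least-squares line of ys on xs (intercept included)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx

def score(pairs):
    """Martingale score of a list of (prior, update) pairs."""
    priors, updates = zip(*pairs)
    return ols_slope(priors, updates)

grid = [i / 10 for i in range(1, 10)]  # priors 0.1, 0.2, ..., 0.9

# Hypothetical update rules, for illustration only.
entrenching = [(b, 0.2 * (b - 0.5)) for b in grid]    # moves away from 0.5
reverting = [(b, -0.2 * (b - 0.5)) for b in grid]     # moves back toward 0.5
balanced = [(b, s * 0.05) for b in grid for s in (1, -1)]  # up and down equally

m_entrenching = score(entrenching)
m_reverting = score(reverting)
m_balanced = score(balanced)
print(m_entrenching, m_reverting, m_balanced)
```

Under these rules the score recovers the sign and strength of the dependence: roughly $0.2$ for entrenchment, roughly $-0.2$ for anti-entrenchment, and roughly $0$ for the balanced case.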

4. Statistical and Practical Framework

Algorithmically, computing $M$ over a batch of $K$ trajectories, each a sequence of beliefs $\{B_0^{(k)}, \dots, B_{T_k}^{(k)}\}$, involves extracting pairs $(b, \Delta b)$, running the described OLS regression, and conducting hypothesis testing on $M$ using the analytical standard error:

  1. Collate all $(b, \Delta b)$ from all trajectories.
  2. Fit OLS and estimate $M$, its variance, and standard error.
  3. Calculate the t-statistic $t = M/\mathrm{SE}(M)$, and corresponding p-value.
  4. Report $(M, \text{p-value})$; large, significant $M$ denotes Bayesian irrationality in belief updates (He et al., 2 Dec 2025).

For empirical stability and reproducibility, on the order of 100 or more belief-update events are recommended.
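The four steps above can be sketched in pure Python as follows. The function names are hypothetical, and two simplifications are assumed: the standard error is the textbook OLS formula $\mathrm{SE}(M) = \sqrt{\tfrac{1}{N-2}\sum_i \hat{\varepsilon}_i^2 \,/\, \sum_i (b_i - \bar{b})^2}$, and the p-value uses a normal approximation to the t distribution with $N - 2$ degrees of freedom, which is only reasonable for large $N$. Pairs pooled from the same trajectory are also treated as independent, which may not hold in practice.

```python
import math

def extract_pairs(trajectories):
    """Step 1: pool (prior, update) pairs from consecutive beliefs in every trajectory."""
    return [(b, b_next - b)
            for traj in trajectories
            for b, b_next in zip(traj, traj[1:])]

def martingale_test(trajectories):
    """Steps 2-4: return (M, SE, t, approximate two-sided p-value)."""
    pairs = extract_pairs(trajectories)
    n = len(pairs)
    if n < 3:
        raise ValueError("need at least three belief-update pairs")
    b_bar = sum(b for b, _ in pairs) / n
    d_bar = sum(d for _, d in pairs) / n
    sxx = sum((b - b_bar) ** 2 for b, _ in pairs)
    if sxx == 0.0:
        raise ValueError("prior beliefs must vary across pairs")
    m = sum((b - b_bar) * (d - d_bar) for b, d in pairs) / sxx
    intercept = d_bar - m * b_bar
    rss = sum((d - intercept - m * b) ** 2 for b, d in pairs)
    se = math.sqrt(rss / (n - 2) / sxx)
    if se == 0.0:
        t = 0.0 if m == 0.0 else math.copysign(math.inf, m)
    else:
        t = m / se
    # ASSUMPTION: normal approximation to the t distribution (large-N regime).
    p = math.erfc(abs(t) / math.sqrt(2.0))
    return m, se, t, p

# Hypothetical toy trajectories (far fewer events than the ~100 recommended).
toy = [
    [0.6, 0.7, 0.8],
    [0.4, 0.3, 0.2],
    [0.5, 0.5, 0.5],
    [0.7, 0.8, 0.9],
    [0.3, 0.2, 0.1],
]
m, se, t, p = martingale_test(toy)
print(m, se, t, p)
```

On this toy batch the slope is positive, reflecting the entrenching pattern built into the trajectories. For small samples, an exact t-distribution p-value (for example from a statistics library) would be the safer choice.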

5. Experimental Validation and Correlation with Accuracy Metrics

In forecasting tasks where ground-truth labels $y_i \in \{0, 1\}$ are available, the martingale score $M$ is paired with accuracy metrics such as the Brier score:

$$\mathrm{BS} = \frac{1}{N} \sum_{i=1}^N \left(B^{(i)}_{\text{final}} - y_i\right)^2$$

Correlation analysis and regression reveal that higher $|M|$ predicts higher Brier scores (i.e., worse probabilistic accuracy). Specifically, the Pearson correlation coefficient

$$r = \frac{\sum_j (|M_j|-\overline{|M|})(\mathrm{BS}_j-\overline{\mathrm{BS}})}{\sqrt{\sum_j(|M_j|-\overline{|M|})^2 \sum_j(\mathrm{BS}_j-\overline{\mathrm{BS}})^2}}$$

is positive and statistically significant, supporting the validity of $M$ as a proxy for the truth-seeking ability of reasoning processes, even without ground truth (He et al., 2 Dec 2025).
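A minimal sketch of the two quantities used in this comparison is given below. The function names are hypothetical, and every number in the example is a made-up toy value used only to exercise the functions; none of them is a result from the cited paper.

```python
import math

def brier_score(final_beliefs, outcomes):
    """Mean squared error between final beliefs and binary outcomes."""
    if len(final_beliefs) != len(outcomes) or not final_beliefs:
        raise ValueError("need matched, non-empty belief and outcome lists")
    return sum((b - y) ** 2 for b, y in zip(final_beliefs, outcomes)) / len(outcomes)

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Toy forecasts: final beliefs 0.9 and 0.2 against outcomes 1 and 0.
bs_example = brier_score([0.9, 0.2], [1, 0])   # (0.01 + 0.04) / 2

# Hypothetical per-setting values of |M| and Brier score (illustrative only).
abs_scores = [0.02, 0.10, 0.25, 0.40]
brier_scores = [0.18, 0.20, 0.24, 0.30]
r = pearson_r(abs_scores, brier_scores)
print(bs_example, r)
```

Each index $j$ here stands for one evaluated configuration, with its own $|M_j|$ and $\mathrm{BS}_j$; a positive $r$ means that configurations with larger $|M|$ tend to forecast worse.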

6. Domain-General Applicability and Ground-Truth-Free Regimes

The martingale score operates in a fully unsupervised manner, requiring only pairs of prior and posterior beliefs. This enables its application in settings lacking objective outcome labels, such as value-laden discussions or peer reviews. Numeric beliefs can be extracted from unstructured reasoning chains via separate, well-calibrated LLM-based judges. Cross-rating with humans and between judge models demonstrates robust inter-rater agreement (Pearson $r \approx 0.8$), supporting the reliability of automated belief extraction in these contexts.

Beyond LLM evaluation, martingale-based scores play critical roles in other statistical contexts:

  • Score-driven martingale posteriors: Used in Bayesian inference frameworks as an alternative to classic posterior sampling, leveraging stochastic recursions governed by the score function and exhibiting key martingale properties (Cui et al., 3 Jan 2025).
  • Plug-in martingale scores for exchangeability testing: Online testing algorithms monitor conformity to exchangeability via the growth of a martingale process over p-values, providing sequential tests with finite-sample error control (Fedorova et al., 2012).
  • Test martingales in permutation/Monte Carlo testing: Constructed as nonnegative wealth processes, optimized via Kelly betting strategies, enabling anytime-valid testing for exchangeability and controlling Type I error at arbitrary stopping times (Fischer et al., 14 Jan 2024).

These diverse martingale score constructs share the underlying principle: under the targeted null (Bayesian rationality, exchangeability, etc.), predictable gain is impossible; systematic deviation from this "fair game" property is quantifiable, interpretable, and statistically valid for hypothesis testing and reliability assessment.


The martingale score, formalized as the slope $M$ of the regression of $\Delta b$ on $b$, constitutes a robust, unsupervised, and interpretable metric for detecting deviations from ideal Bayesian belief updating, identifying belief entrenchment in iterative reasoning, predicting accuracy degradation, and supporting principled evaluation in domains where ground truth is inaccessible (He et al., 2 Dec 2025).
