Martingale Score in Bayesian Reasoning

Updated 9 December 2025

Martingale Score is a metric that quantifies deviation from the martingale property, which characterizes Bayesian rationality in belief updates.
It is computed via ordinary least squares regression of belief updates on prior beliefs; the slope indicates systematic bias or belief entrenchment in iterative reasoning.
Empirical studies reveal that higher absolute martingale scores correlate with poorer probabilistic forecasts, supporting its use as an unsupervised assessment.

A martingale score is a quantitative metric derived from the martingale property in probability theory and statistics, primarily used to evaluate whether the evolution of probabilistic beliefs, predictions, or model outputs is consistent with principles of Bayesian rationality or statistical exchangeability. Martingale scores are used in multiple domains, including the assessment of iterative belief updating in large language models (LLMs), score-driven martingale posteriors in Bayesian inference, and online testing for data exchangeability. Recent work, notably "Martingale Score: An Unsupervised Metric for Bayesian Rationality in LLM Reasoning" (He et al., 2 Dec 2025), formalizes the martingale score as an unsupervised regression-based metric to detect violations of Bayesian rationality in LLMs.

1. Foundations: The Martingale Property in Bayesian Reasoning

Let $\{B_t\}_{t=0}^T$ be a sequence of probabilistic beliefs or predicted probabilities output by a model (e.g., an LLM) across $T$ reasoning steps. The martingale property, foundational in Bayesian statistics, stipulates that under rational updating, the conditional expectation of the next belief given the current belief is equal to the current belief:

$\mathbb{E}[B_{t+1} \mid B_t] = B_t$

Defining the one-step update as $\Delta B_t := B_{t+1} - B_t$, this property requires

$\mathbb{E}[\Delta B_t \mid B_t = p] = 0 \;\;\; \forall \; p \in [0,1]$

A rational Bayesian reasoner thus cannot systematically predict the direction or magnitude of belief updates solely from the present belief; any systematic deviation from unpredictability signals a violation of this principle.
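As a concrete illustration, the property can be checked exactly for a textbook Bayesian updater. The sketch below assumes a Beta($a$, $b$) belief about a coin's success probability, updated on one Bernoulli observation; this model and the function name `expected_next_belief` are illustrative choices, not taken from the source. Exact rational arithmetic shows that the expected next belief equals the current one.

```python
from fractions import Fraction


def expected_next_belief(a: int, b: int) -> Fraction:
    """Expected posterior mean after one more Bernoulli observation,
    starting from a Beta(a, b) belief (illustrative model, an assumption)."""
    belief = Fraction(a, a + b)          # current belief B_t (posterior mean)
    up = Fraction(a + 1, a + b + 1)      # belief after observing a success
    down = Fraction(a, a + b + 1)        # belief after observing a failure
    # The reasoner's own predictive probability of a success equals `belief`.
    return belief * up + (1 - belief) * down


for a, b in [(1, 1), (2, 5), (7, 3)]:
    current = Fraction(a, a + b)
    # The expected update E[B_{t+1} | B_t] - B_t is exactly zero in each case.
    print(a, b, current, expected_next_belief(a, b) - current)
```

Individual updates still move the belief up or down; only their average, conditional on the current belief, is zero.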

2. Regression-Based Definition and Computation of the Martingale Score

The martingale score $M$ operationalizes the detection of violations of the martingale property via a linear regression:

Given $N$ independent samples $(b_1, \Delta b_1), \dots, (b_N, \Delta b_N)$, where $b_i$ is the prior belief and $\Delta b_i$ the subsequent update, fit

$\Delta b_i = \beta_0 + \beta_1 b_i + \varepsilon_i$

by ordinary least squares (OLS). The martingale score is defined as the fitted slope:

$M = \hat{\beta}_1 = \frac{\sum_{i=1}^N (b_i - \bar{b})(\Delta b_i - \overline{\Delta b})}{\sum_{i=1}^N (b_i - \bar{b})^2}$

where $\bar{b}$ and $\overline{\Delta b}$ denote sample means. A two-sided t-test of $M$ versus zero yields a p-value assessing statistical significance.
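The closed-form slope above translates directly into code. The following is a minimal sketch using only the Python standard library; the function name `martingale_slope` and the toy numbers are illustrative assumptions rather than anything prescribed by the source.

```python
def martingale_slope(priors, updates):
    """OLS slope of updates (delta b_i) regressed on priors (b_i)."""
    n = len(priors)
    if n != len(updates) or n < 2:
        raise ValueError("need at least two (prior, update) pairs of equal length")
    b_bar = sum(priors) / n
    d_bar = sum(updates) / n
    # Numerator and denominator of the closed-form OLS slope.
    sxy = sum((b - b_bar) * (d - d_bar) for b, d in zip(priors, updates))
    sxx = sum((b - b_bar) ** 2 for b in priors)
    if sxx == 0:
        raise ValueError("all priors are identical; the slope is undefined")
    return sxy / sxx


# Toy data (made up): updates equal 0.2 * (prior - 0.5), so the slope is about 0.2.
priors = [0.2, 0.4, 0.6, 0.8]
updates = [-0.06, -0.02, 0.02, 0.06]
print(martingale_slope(priors, updates))
```

The intercept $\hat{\beta}_0$ does not enter the score; it absorbs any uniform drift in beliefs, while the slope captures dependence of the update on the prior.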

The following table summarizes the key elements:

Element	Definition	Implementation
$\Delta b_i$	$b_{i, t+1} - b_{i, t}$	Per step
Regression	$\Delta b_i = \beta_0 + \beta_1 b_i + \varepsilon_i$	OLS fit
Martingale Score $M$	$\hat\beta_1$ from regression	Slope
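
The link between the regression slope and the martingale property of Section 1 can be made explicit with a short derivation. This is a sketch that treats $(B_t, \Delta B_t)$ as a random pair with $\mathrm{Var}(B_t) > 0$. The population counterpart of the OLS slope is

$\beta_1 = \frac{\mathrm{Cov}(B_t, \Delta B_t)}{\mathrm{Var}(B_t)}$

Under the martingale property, $\mathbb{E}[\Delta B_t \mid B_t] = 0$, which also gives $\mathbb{E}[\Delta B_t] = 0$ by the law of total expectation, so

$\mathrm{Cov}(B_t, \Delta B_t) = \mathbb{E}\big[B_t \, \mathbb{E}[\Delta B_t \mid B_t]\big] - \mathbb{E}[B_t] \, \mathbb{E}[\Delta B_t] = 0$

and hence $\beta_1 = 0$. The converse does not hold in general: a zero slope rules out linear dependence of the update on the prior, but not every nonlinear violation of the martingale property.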

3. Interpretation: Detection of Belief Entrenchment and Bayesian Rationality

Interpretation of $M$:

$M \approx 0$: Updates are on average unpredictable from priors, consistent with Bayesian rationality.
$M \gg 0$: Positive dependence, where high priors increase further and low priors decrease; this is termed belief entrenchment (confirmation bias).
$M < 0$: Anti-entrenchment, systematic reversals of prior beliefs.

Empirically, for $b \in [0,1]$, $M$ typically lies in $[-1, 1]$, with larger $|M|$ denoting more severe deviation from the martingale property.
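
The three regimes can be illustrated on synthetic data. In the sketch below, updates are generated as $k\,(b - 0.5)$ plus Gaussian noise; this generative model, the parameter values, and the function names are assumptions made purely for illustration and do not describe how any real model behaves.

```python
import random


def ols_slope(xs, ys):
    """Closed-form OLS slope of ys regressed on xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


def simulate_updates(k, n=500, noise_sd=0.02, seed=0):
    """Synthetic (prior, update) pairs with update = k * (prior - 0.5) + noise."""
    rng = random.Random(seed)
    priors = [rng.uniform(0.2, 0.8) for _ in range(n)]
    updates = [k * (b - 0.5) + rng.gauss(0.0, noise_sd) for b in priors]
    return priors, updates


# k = 0: updates unrelated to the prior; k > 0: entrenchment; k < 0: reversal.
regimes = [("martingale-like", 0.0), ("entrenching", 0.3), ("anti-entrenching", -0.3)]
scores = {name: ols_slope(*simulate_updates(k)) for name, k in regimes}
for name, score in scores.items():
    print(f"{name}: M = {score:+.3f}")
```

In this toy setup the estimated slope recovers the sign and approximate size of $k$, which is the sense in which $M$ separates entrenchment from anti-entrenchment.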

4. Statistical and Practical Framework

Algorithmically, computing $M$ over a batch of $K$ trajectories, each a sequence of beliefs $\{B_0^{(k)}, \dots, B_{T_k}^{(k)}\}$, involves extracting pairs $(b, \Delta b)$, running the described OLS regression, and conducting hypothesis testing on $M$ using the analytical standard error:

Collate all $(b, \Delta b)$ pairs from all trajectories.
Fit OLS and estimate $M$, its variance, and its standard error.
Calculate the t-statistic $t = M/\mathrm{SE}(M)$ and the corresponding p-value.
Report $(M, \text{p-value})$; a large, statistically significant $|M|$ indicates a departure from Bayesian rationality in belief updates (He et al., 2 Dec 2025).
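
The steps above can be sketched end to end with the standard library. Two points are assumptions of this sketch rather than details from the source: the function names are hypothetical, and the p-value uses a normal approximation to the t distribution (reasonable for large $N$) to avoid an external dependency. An exact t-based p-value would use $N - 2$ degrees of freedom.

```python
import math


def collect_pairs(trajectories):
    """Collate (prior, update) pairs from consecutive beliefs in each trajectory."""
    priors, updates = [], []
    for traj in trajectories:
        for current, nxt in zip(traj, traj[1:]):
            priors.append(current)
            updates.append(nxt - current)
    return priors, updates


def martingale_test(trajectories):
    """Return (M, SE(M), t, approximate two-sided p-value)."""
    priors, updates = collect_pairs(trajectories)
    n = len(priors)
    if n < 3:
        raise ValueError("need at least three belief updates")
    b_bar, d_bar = sum(priors) / n, sum(updates) / n
    sxx = sum((b - b_bar) ** 2 for b in priors)
    if sxx == 0:
        raise ValueError("all priors are identical; the slope is undefined")
    sxy = sum((b - b_bar) * (d - d_bar) for b, d in zip(priors, updates))
    slope = sxy / sxx
    intercept = d_bar - slope * b_bar
    # Residual variance with n - 2 degrees of freedom, then the slope's SE.
    rss = sum((d - intercept - slope * b) ** 2 for b, d in zip(priors, updates))
    se = math.sqrt(rss / (n - 2) / sxx)
    if se > 0:
        t_stat = slope / se
    else:  # perfect fit: no residual noise at all
        t_stat = 0.0 if slope == 0 else math.copysign(math.inf, slope)
    # Normal approximation to the two-sided t-test p-value (assumption).
    p_value = math.erfc(abs(t_stat) / math.sqrt(2))
    return slope, se, t_stat, p_value


# Made-up trajectories in which beliefs drift away from 0.5.
example = [[0.6, 0.7, 0.8], [0.4, 0.3, 0.2], [0.5, 0.5, 0.52]]
print(martingale_test(example))
```

Pooling pairs from the same trajectory treats them as independent samples, which is the simplification implied by the collation step; a more careful analysis might cluster standard errors by trajectory.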

For stable, reproducible estimates, on the order of 100 or more belief-update events are recommended.

5. Experimental Validation and Correlation with Accuracy Metrics

In forecasting tasks where ground-truth labels $y_i \in \{0, 1\}$ are available, the martingale score $M$ is paired with accuracy metrics such as the Brier Score:

$\mathrm{BS} = \frac{1}{N} \sum_{i=1}^N (B^{(i)}_{\text{final}} - y_i)^2$

Correlation analysis and regression reveal that higher $|M|$ predicts higher Brier scores (i.e., worse probabilistic accuracy). Specifically,

$r = \frac{\sum_j (|M_j|-\overline{|M|})(\text{BS}_j-\overline{\text{BS}})}{\sqrt{\sum_j(|M_j|-\overline{|M|})^2 \sum_j(\text{BS}_j-\overline{\text{BS}})^2}}$

shows statistically significant positive correlation, supporting the validity of $M$ as a proxy for the truth-seeking ability of reasoning processes, even without ground truth (He et al., 2 Dec 2025).
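
Both formulas in this section are straightforward to compute. The sketch below uses hypothetical function names and made-up numbers (they are not results from the cited study), and it assumes that $j$ indexes groups such as experimental settings, each contributing one $|M_j|$ and one Brier score.

```python
import math


def brier_score(final_beliefs, outcomes):
    """Mean squared error between final beliefs and binary outcomes."""
    n = len(final_beliefs)
    if n == 0 or n != len(outcomes):
        raise ValueError("need equally long, non-empty inputs")
    return sum((b - y) ** 2 for b, y in zip(final_beliefs, outcomes)) / n


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Made-up per-group values of |M_j| and the Brier score, for illustration only.
abs_scores = [0.05, 0.10, 0.20, 0.30]
brier_scores = [0.18, 0.20, 0.22, 0.27]
print(brier_score([0.9, 0.2], [1, 0]))      # single-group Brier score
print(pearson_r(abs_scores, brier_scores))  # correlation across groups
```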

6. Domain-General Applicability and Ground-Truth-Free Regimes

The martingale score operates in a fully unsupervised manner, requiring only pairs of prior and posterior beliefs. This enables its application in settings lacking objective outcome labels, such as value-laden discussions or peer reviews. Numeric beliefs can be extracted from unstructured reasoning chains via separate, well-calibrated LLM-based judges. Cross-rating with humans and between judge models demonstrates robust inter-rater agreement (Pearson $r \approx 0.8$), supporting the reliability of automated belief extraction in these contexts.

Beyond LLM evaluation, martingale-based scores play critical roles in other statistical contexts:

Score-driven martingale posteriors: Used in Bayesian inference frameworks as an alternative to classic posterior sampling, leveraging stochastic recursions governed by the score function and exhibiting key martingale properties (Cui et al., 3 Jan 2025).
Plug-in martingale scores for exchangeability testing: Online testing algorithms monitor conformity to exchangeability via the growth of a martingale process over p-values, providing sequential tests with finite-sample error control (Fedorova et al., 2012).
Test martingales in permutation/Monte Carlo testing: Constructed as nonnegative wealth processes, optimized via Kelly betting strategies, enabling anytime-valid testing for exchangeability and controlling Type I error at arbitrary stopping times (Fischer et al., 14 Jan 2024).

These diverse martingale score constructs share the underlying principle: under the targeted null (Bayesian rationality, exchangeability, etc.), predictable gain is impossible; systematic deviation from this "fair game" property is quantifiable, interpretable, and statistically valid for hypothesis testing and reliability assessment.
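
For the exchangeability setting, the "fair game" idea can be illustrated with the power martingale, a simple fixed betting function from the conformal testing literature. This is a sketch of that simpler relative, not the plug-in construction of Fedorova et al. or the Kelly-optimized strategies mentioned above. It assumes the input p-values are independent and uniform on $(0, 1]$ under the null, and the function name and default `epsilon` are illustrative.

```python
def power_martingale(p_values, epsilon=0.5):
    """Wealth path of the power martingale with betting function eps * p**(eps - 1).

    The betting function integrates to 1 over (0, 1], so under uniform
    p-values the expected wealth stays at its initial value of 1.
    """
    if not 0 < epsilon < 1:
        raise ValueError("epsilon must lie strictly between 0 and 1")
    wealth = 1.0
    path = []
    for p in p_values:
        if not 0 < p <= 1:
            raise ValueError("p-values must lie in (0, 1]")
        wealth *= epsilon * p ** (epsilon - 1.0)
        path.append(wealth)
    return path


# Consistently small p-values make the wealth grow; by Ville's inequality,
# wealth reaching 1 / alpha rejects the null at level alpha at any stopping time.
print(power_martingale([0.01, 0.01, 0.01]))
print(power_martingale([0.9, 0.6, 0.8]))
```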

The martingale score, formalized as the regression slope $M$ in $\Delta b$ versus $b$, constitutes a robust, unsupervised, and interpretable metric for detecting deviations from ideal Bayesian belief updating, identifying belief entrenchment in iterative reasoning, predicting accuracy degradation, and supporting principled evaluation in domains where ground truth is inaccessible (He et al., 2 Dec 2025).