Martingale Score in Bayesian Reasoning
- The Martingale Score is a metric that quantifies deviation from the martingale property, which characterizes Bayesian rationality in belief updates.
- It is computed via ordinary least squares regression, where the slope indicates systematic bias or belief entrenchment in iterative reasoning.
- Empirical studies reveal that higher absolute martingale scores correlate with poorer probabilistic forecasts, validating its unsupervised assessment approach.
A martingale score is a quantitative metric derived from the martingale property in probability theory and statistics, primarily used to evaluate whether the evolution of probabilistic beliefs, predictions, or model outputs is consistent with principles of Bayesian rationality or statistical exchangeability. Martingale scores are utilized in multiple domains, including the assessment of iterative belief updating in LLMs, score-driven martingale posteriors in Bayesian inference, and online testing for data exchangeability. Recent work, notably "Martingale Score: An Unsupervised Metric for Bayesian Rationality in LLM Reasoning" (He et al., 2 Dec 2025), formalizes the martingale score as an unsupervised regression-based metric to detect violations of Bayesian rationality in LLMs.
1. Foundations: The Martingale Property in Bayesian Reasoning
Let $b_0, b_1, \dots, b_T$ be a sequence of probabilistic beliefs or predicted probabilities output by a model (e.g., an LLM) across reasoning steps. The martingale property, foundational in Bayesian statistics, stipulates that under rational updating, the conditional expectation of the next belief given the current belief is equal to the current belief:
$$\mathbb{E}[b_{t+1} \mid b_t] = b_t.$$
Defining the one-step update as $\Delta b_t = b_{t+1} - b_t$, this property requires
$$\mathbb{E}[\Delta b_t \mid b_t] = 0.$$
A rational Bayesian reasoner thus cannot systematically predict the direction or magnitude of belief updates solely from the present belief; any systematic deviation from unpredictability signals a violation of this principle.
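As a concrete check of the property, consider a reasoner holding a Beta belief about a coin's heads probability. The sketch below is an illustrative assumption, not an example from the cited work: it uses exact fractions to confirm that the expected posterior mean after one more flip equals the current mean. The function names are hypothetical.

```python
from fractions import Fraction

def posterior_mean(a, b):
    """Mean of a Beta(a, b) belief about the heads probability."""
    return Fraction(a, a + b)

def expected_next_belief(a, b):
    """Expected posterior mean after one more flip, averaged over the
    reasoner's own predictive distribution for that flip."""
    p_heads = posterior_mean(a, b)          # predictive probability of heads
    after_heads = posterior_mean(a + 1, b)  # belief if heads is observed
    after_tails = posterior_mean(a, b + 1)  # belief if tails is observed
    return p_heads * after_heads + (1 - p_heads) * after_tails

current = posterior_mean(3, 2)
expected = expected_next_belief(3, 2)
print(current, expected)  # both are 3/5
```

The individual update can still be large in either direction; only its expected value, conditional on the current belief, is zero.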
2. Regression-Based Definition and Computation of the Martingale Score
The martingale score operationalizes the detection of violations from the martingale property via a linear regression:
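Given independent samples $\{(b_i, \Delta b_i)\}_{i=1}^{N}$, where $b_i$ is the prior belief and $\Delta b_i$ the subsequent update, fit
$$\Delta b_i = \alpha + \beta\, b_i + \varepsilon_i$$
by ordinary least squares (OLS). The martingale score is defined as the fitted slope:
$$\hat{\beta} = \frac{\sum_{i=1}^{N} (b_i - \bar{b})(\Delta b_i - \overline{\Delta b})}{\sum_{i=1}^{N} (b_i - \bar{b})^2},$$
where $\bar{b}$ and $\overline{\Delta b}$ denote sample means. A two-sided t-test of $\hat{\beta}$ versus zero yields a p-value assessing statistical significance.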
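The slope formula can be sketched directly in plain Python. This is a minimal illustration under the definitions above; the function name `martingale_score` and the toy inputs are assumptions introduced here, not part of the cited work.

```python
def martingale_score(priors, updates):
    """OLS slope of belief updates regressed on prior beliefs
    (the regression includes an intercept)."""
    n = len(priors)
    if n != len(updates) or n < 2:
        raise ValueError("need two equally long sequences with n >= 2")
    mean_b = sum(priors) / n
    mean_d = sum(updates) / n
    sxx = sum((b - mean_b) ** 2 for b in priors)
    if sxx == 0:
        raise ValueError("priors must not all be identical")
    sxy = sum((b - mean_b) * (d - mean_d) for b, d in zip(priors, updates))
    return sxy / sxx

# Toy data (hypothetical): each update equals 0.2 * (prior - 0.5).
priors = [0.2, 0.4, 0.6, 0.8]
updates = [-0.06, -0.02, 0.02, 0.06]
print(round(martingale_score(priors, updates), 6))  # approximately 0.2
```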
The following table summarizes the key elements:
| Element | Definition | Implementation |
|---|---|---|
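| Per-step update | $\Delta b_t$ | $b_{t+1} - b_t$ |
| Regression | OLS fit | $\Delta b_i = \alpha + \beta\, b_i + \varepsilon_i$ |
| Martingale Score | $\hat{\beta}$ from regression | Slope |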
3. Interpretation: Detection of Belief Entrenchment and Bayesian Rationality
Interpretation of the fitted slope $\hat{\beta}$:
- $\hat{\beta} \approx 0$: Updates are on average unpredictable from priors, consistent with Bayesian rationality.
- $\hat{\beta} > 0$: Positive dependence, where high priors increase further and low priors decrease; this is termed belief entrenchment (confirmation bias).
- $\hat{\beta} < 0$: Anti-entrenchment, systematic reversals of prior beliefs.
In practice, for beliefs expressed as probabilities, the magnitude of the slope is what matters: larger $|\hat{\beta}|$ denotes more severe deviation from the martingale property.
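The three regimes can be illustrated with synthetic updates. The data below are hypothetical and hand-constructed to show the sign of the slope in each case; `ols_slope` and `classify` are illustrative helpers, and the tolerance in `classify` is an arbitrary assumption rather than a threshold from the source.

```python
def ols_slope(x, y):
    """OLS slope of y on x with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / sxx

def classify(slope, tol=1e-6):
    """Label a slope by sign; tol is an arbitrary illustrative tolerance."""
    if slope > tol:
        return "entrenchment"
    if slope < -tol:
        return "anti-entrenchment"
    return "consistent with martingale property"

priors = [0.1, 0.3, 0.5, 0.7, 0.9]
entrenched = [0.3 * (b - 0.5) for b in priors]   # beliefs pushed away from 0.5
reverting = [-0.3 * (b - 0.5) for b in priors]   # beliefs pulled toward 0.5
unrelated = [0.05, -0.05, 0.0, -0.05, 0.05]      # no linear trend in the prior

for name, updates in [("entrenched", entrenched),
                      ("reverting", reverting),
                      ("unrelated", unrelated)]:
    print(name, classify(ols_slope(priors, updates)))
```

A real analysis would rely on the significance test rather than a fixed tolerance to decide whether a slope differs from zero.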
4. Statistical and Practical Framework
Algorithmically, computing $\hat{\beta}$ over a batch of trajectories, each a sequence of beliefs $(b_0, \dots, b_T)$, involves extracting pairs $(b_t, \Delta b_t)$ with $\Delta b_t = b_{t+1} - b_t$, running the described OLS regression, and conducting hypothesis testing on $\hat{\beta}$ using the analytical standard error:
- Collate all $(b_t, \Delta b_t)$ pairs from all trajectories.
- Fit OLS and estimate $\hat{\beta}$, its variance, and standard error.
- Calculate the t-statistic $t = \hat{\beta} / \mathrm{SE}(\hat{\beta})$, and corresponding p-value.
- Report $\hat{\beta}$; large, significant $|\hat{\beta}|$ denotes Bayesian irrationality in belief updates (He et al., 2 Dec 2025).
For empirical stability and reproducibility, a sufficiently large number of belief-update events is recommended.
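The steps above can be sketched end to end in standard-library Python. This is a hedged illustration, not the authors' implementation: the function names and toy trajectories are hypothetical, and the p-value uses a normal approximation to the t distribution (an assumption that is only reasonable for large samples) because the standard library has no t-distribution CDF.

```python
import math
from statistics import NormalDist

def extract_pairs(trajectories):
    """Turn each belief sequence into (prior, update) pairs."""
    pairs = []
    for beliefs in trajectories:
        for prior, posterior in zip(beliefs, beliefs[1:]):
            pairs.append((prior, posterior - prior))
    return pairs

def fit_slope_with_test(pairs):
    """OLS of update on prior: slope, standard error, t, approximate p."""
    n = len(pairs)
    if n < 3:
        raise ValueError("need at least three pairs for a standard error")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    if sxx == 0:
        raise ValueError("priors must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    rss = sum((y - intercept - slope * x) ** 2 for x, y in pairs)
    se = math.sqrt(rss / (n - 2) / sxx)  # analytical standard error of slope
    if se == 0:
        t_stat = 0.0 if slope == 0 else math.copysign(math.inf, slope)
    else:
        t_stat = slope / se
    # Assumption: normal approximation to the two-sided t-test (large n).
    p_approx = 2 * (1 - NormalDist().cdf(abs(t_stat)))
    return {"slope": slope, "se": se, "t": t_stat, "p_approx": p_approx}

# Hypothetical toy trajectories; far too few events for a real analysis.
trajectories = [[0.6, 0.7, 0.8], [0.4, 0.3, 0.2],
                [0.5, 0.55, 0.5], [0.7, 0.75, 0.85]]
result = fit_slope_with_test(extract_pairs(trajectories))
print(result)
```

Pooling consecutive pairs from the same trajectory treats them as independent samples, as the regression setup assumes; with a statistics library available, an exact t-based p-value would replace the approximation.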
5. Experimental Validation and Correlation with Accuracy Metrics
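In forecasting tasks where ground-truth labels are available, the martingale score is paired with accuracy metrics such as the Brier Score:
$$\mathrm{BS} = \frac{1}{N} \sum_{i=1}^{N} (p_i - y_i)^2,$$
where $p_i$ is the forecast probability and $y_i \in \{0, 1\}$ the realized outcome.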
Correlation analysis and regression reveal that higher $|\hat{\beta}|$ predicts higher Brier scores (i.e., worse probabilistic accuracy). The association is positive and statistically significant, supporting the validity of the martingale score as a proxy for the truth-seeking ability of reasoning processes, even without ground truth (He et al., 2 Dec 2025).
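For reference, the Brier score for binary outcomes is the mean squared difference between forecast probabilities and realized outcomes. The sketch below uses a hypothetical function name and made-up forecasts purely to illustrate the computation.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes.
    Lower is better; a constant forecast of 0.5 scores 0.25."""
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - y) ** 2 for p, y in zip(forecasts, outcomes)) / len(forecasts)

# Hypothetical final beliefs for three binary questions and their outcomes.
forecasts = [0.9, 0.2, 0.7]
outcomes = [1, 0, 0]
print(round(brier_score(forecasts, outcomes), 4))  # approximately 0.18
```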
6. Domain-General Applicability and Ground-Truth-Free Regimes
The martingale score operates in a fully unsupervised manner, requiring only pairs of prior and posterior beliefs. This enables its application in settings lacking objective outcome labels, such as value-laden discussions or peer reviews. Numeric beliefs can be extracted from unstructured reasoning chains via separate, well-calibrated LLM-based judges. Cross-rating with humans and between judge models demonstrates robust inter-rater agreement, as measured by Pearson correlation, supporting the reliability of automated belief extraction in these contexts.
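Inter-rater agreement of this kind can be checked with a Pearson correlation between two raters' extracted beliefs for the same reasoning steps. The ratings below are invented for illustration, and `pearson_r` is a hypothetical helper, not a function from the cited work.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long sequences with n >= 2")
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        raise ValueError("each sequence must vary")
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

# Hypothetical beliefs assigned by two judges to the same five steps.
judge_a = [0.10, 0.35, 0.60, 0.80, 0.95]
judge_b = [0.15, 0.30, 0.65, 0.75, 0.90]
print(round(pearson_r(judge_a, judge_b), 3))
```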
7. Related Martingale-Based Metrics and Broader Statistical Context
Beyond LLM evaluation, martingale-based scores play critical roles in other statistical contexts:
- Score-driven martingale posteriors: Used in Bayesian inference frameworks as an alternative to classic posterior sampling, leveraging stochastic recursions governed by the score function and exhibiting key martingale properties (Cui et al., 3 Jan 2025).
- Plug-in martingale scores for exchangeability testing: Online testing algorithms monitor conformity to exchangeability via the growth of a martingale process over p-values, providing sequential tests with finite-sample error control (Fedorova et al., 2012).
- Test martingales in permutation/Monte Carlo testing: Constructed as nonnegative wealth processes, optimized via Kelly betting strategies, enabling anytime-valid testing for exchangeability and controlling Type I error at arbitrary stopping times (Fischer et al., 14 Jan 2024).
These diverse martingale score constructs share the underlying principle: under the targeted null (Bayesian rationality, exchangeability, etc.), predictable gain is impossible; systematic deviation from this "fair game" property is quantifiable, interpretable, and statistically valid for hypothesis testing and reliability assessment.
The martingale score, formalized as the regression slope of the belief update $\Delta b$ on the prior belief $b$, constitutes a robust, unsupervised, and interpretable metric for detecting deviations from ideal Bayesian belief updating, identifying belief entrenchment in iterative reasoning, predicting accuracy degradation, and supporting principled evaluation in domains where ground truth is inaccessible (He et al., 2 Dec 2025).