Stochastic Analysis of Elo
- Stochastic Analysis of Elo is a framework that reinterprets the Elo rating update as online gradient descent with no-regret guarantees in sparse data regimes.
- It leverages Markov chain properties and stationary distribution analysis to ensure convergence, controlled bias, and predictable variance in rating estimates.
- Extensions such as Bayesian updates and adaptive draw adjustments enhance robustness against model misspecification and support multi-outcome game scenarios.
The Elo rating system is a widely used online algorithm for estimating and tracking the relative skill levels of players or teams in competitive environments. Its original formulation, designed for chess, has undergone extensive stochastic analysis, yielding a comprehensive probabilistic and statistical understanding of its behavior, reliability, and extensions. This analysis reveals Elo as a stochastic-gradient (online learning) algorithm, admits Markov chain, Bayesian, and kinetic characterizations, and supports principled modifications for practical demands such as draws, non-stationarity, model misspecification, and multi-outcome extensions.
1. Probabilistic and Online-Optimization Foundations
The core of Elo is the stochastic update
$$\theta_{i,t+1} = \theta_{i,t} + K\,(y_t - \hat{p}_t),$$
where $y_t \in \{0,1\}$ is the observed outcome (win/loss) for the selected pair at time $t$, and $\hat{p}_t$ is the model's predicted win probability, typically from a logistic (Bradley–Terry) or Thurstone model. This is precisely a single step of online gradient descent (OGD) for a convex loss (binary cross-entropy), and thus inherits no-regret guarantees independent of whether the data-generating process actually fits the Bradley–Terry model or even remains stationary. The result is that Elo's predictive performance in sparse data regimes (roughly 10–100 matches per player) is often superior to that of complex, high-parameter alternatives, whose regret grows with model size (Tang et al., 16 Feb 2025).
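Concretely, a single Elo step can be sketched as the following gradient update on the logistic log-loss (a minimal illustration in conventional chess units; the function name and defaults are ours):

```python
import math

def elo_step(r_a, r_b, outcome, k=32.0, scale=400.0):
    """One Elo update in conventional chess units. `outcome` is 1.0
    if A wins, 0.0 if A loses. The step is exactly online gradient
    descent on the binary cross-entropy of the logistic win model."""
    # Predicted win probability for A (Bradley-Terry / logistic).
    p_a = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / scale))
    # (outcome - p_a) is the negative gradient of the log-loss in r_a.
    delta = k * (outcome - p_a)
    return r_a + delta, r_b - delta

ra, rb = elo_step(1500.0, 1500.0, 1.0)
# Equal ratings give p_a = 0.5, so the winner gains k/2 = 16 points.
```

Note that the update is zero-sum: whatever A gains, B loses, which is what confines the dynamics to the mean-constrained subspace analyzed below.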
2. Markov Chain and Stationary Distribution Analysis
The stochastic update process can be formulated as a Markov chain on the zero-sum (or mean-constrained) subspace of $\mathbb{R}^n$. Let $X_t$ denote the vector of ratings after $t$ steps and $\theta^\ast$ the fixed “true skills”; when players $i$ and $j$ meet, the chain evolves as
$$X_{t+1,i} = X_{t,i} + K\big(S_t - \sigma(X_{t,i} - X_{t,j})\big), \qquad X_{t+1,j} = X_{t,j} - K\big(S_t - \sigma(X_{t,i} - X_{t,j})\big),$$
where $S_t$ is the realized game outcome, and $\sigma$ is the (odd, increasing) score function (e.g., tanh or logistic). Under mild regularity (Lipschitz score, centered outcome models, irreducibility), the following hold (Cortez et al., 2024, Olesker-Taylor et al., 2024):
- The Markov process contracts almost surely; two Elo chains with the same random events converge.
- There is a unique stationary law (invariant measure) towards which the process converges in Wasserstein metrics.
- Exponential moments and full support: the stationary law has finite exponential moments and is supported everywhere in the legal state space.
Critically, in the stationary regime the mean Elo rating is generally biased with respect to the true skills $\theta^\ast$, but the predicted win rates computed from Elo ratings are asymptotically unbiased. Moreover, as $K \to 0$, the mean absolute error in the ratings scales as $O(\sqrt{K})$ (Cortez et al., 2024).
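A small simulation (ours, under an exactly Bradley–Terry data model with two players; all names and constants are illustrative) shows the stationary fluctuation shrinking with the step size:

```python
import math
import random

def stationary_error(k, skills=(0.5, -0.5), steps=20000, seed=0):
    """Run two-player Elo on synthetic Bradley-Terry outcomes and
    return the time-averaged absolute error of the rating gap in the
    (approximately) stationary regime."""
    rng = random.Random(seed)
    r = [0.0, 0.0]
    gap_true = skills[0] - skills[1]
    p_true = 1.0 / (1.0 + math.exp(-gap_true))   # true win probability
    err = 0.0
    n = 0
    for t in range(steps):
        y = 1.0 if rng.random() < p_true else 0.0
        q = 1.0 / (1.0 + math.exp(-(r[0] - r[1])))  # model's prediction
        d = k * (y - q)
        r[0] += d
        r[1] -= d
        if t >= steps // 2:            # discard burn-in
            err += abs((r[0] - r[1]) - gap_true)
            n += 1
    return err / n

e_big = stationary_error(k=0.4)
e_small = stationary_error(k=0.05)
# Shrinking K shrinks the stationary fluctuation (roughly like sqrt(K)).
```

The comparison only illustrates the qualitative scaling; the precise constants depend on the score function and match schedule.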
3. Stochastic Dynamics: Kalman/Bayesian and Kinetic Approaches
The Elo update can be derived as a limiting case of a stochastic Bayesian tracking framework. Under a generic linear-Gaussian skill-evolution model (a random-walk drift on the latent skills) and a probabilistic game-outcome model, the exact Bayesian posterior is approximated by a single-step Laplace (Gaussian) update, in which the learning rate is naturally interpreted as a Kalman gain that decays as player ratings become more certain. For constant prior variance and batch size, this recovers the fixed-$K$ Elo estimator (Szczecinski et al., 2021). Allowing the posterior variances to be updated online yields variants equivalent to per-player adaptive $K$-factors and reproduces the essential mechanics of the Glicko system, but with fully online updates (Hua et al., 2023).
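A schematic version of this Kalman-gain view can be sketched as follows. This is an illustrative single-step Gaussian/Laplace update under assumed scalar posteriors, not the exact recursion of the cited papers; all names and the gain formula are ours:

```python
import math

class Player:
    """Gaussian belief over one player's skill (sketch)."""
    def __init__(self, mean=0.0, var=1.0):
        self.mean, self.var = mean, var

def kalman_elo_step(a, b, outcome, process_noise=0.005):
    """One Kalman-style Elo step: the effective K-factor is a gain
    that grows with the players' current uncertainty and shrinks as
    they accumulate games."""
    # Skill drift between games inflates uncertainty (process noise).
    a.var += process_noise
    b.var += process_noise
    p = 1.0 / (1.0 + math.exp(-(a.mean - b.mean)))
    h = p * (1.0 - p)                  # curvature of the log-loss
    gain_a = a.var / (1.0 + h * (a.var + b.var))
    gain_b = b.var / (1.0 + h * (a.var + b.var))
    a.mean += gain_a * (outcome - p)
    b.mean -= gain_b * (outcome - p)
    # The Laplace step shrinks the posterior variance after each game.
    a.var /= 1.0 + h * a.var
    b.var /= 1.0 + h * b.var

p1, p2 = Player(), Player()
kalman_elo_step(p1, p2, 1.0)
# p1's rating rises, p2's falls, and both variances shrink below 1.
```

The key design point is visible in the gain: an uncertain player moves a lot per game (large effective $K$), while a well-established player barely moves, which is exactly the Glicko-like behavior described above.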
On the macroscopic scale, kinetic theory yields Boltzmann and Fokker–Planck equations for the empirical rating distribution. In the limit of many players, the joint dynamics of rating and intrinsic skill are captured by a nonlinear, nonlocal Fokker–Planck PDE, which predicts:
- The asymptotic empirical distribution is (approximately) Gaussian, with variance growing logarithmically in time.
- With player entry at rates biased below the mean, negative skew emerges, matching empirical chess rating distributions (Fenner et al., 2011).
- In learning models, intrinsic skill drifts upward, and the joint law of rating and skill converges toward a diagonal ridge in the rating–skill plane due to ongoing adaptation (Düring et al., 2018).
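An agent-based toy model (ours; uniformly random pairings, equal true skills, natural units) reproduces the qualitative population-level prediction that the empirical rating distribution spreads over time:

```python
import math
import random
import statistics

def population_variance_growth(n=200, games=2000, k=0.1, seed=1):
    """Agent-based sketch: n equal-skill players, uniformly random
    pairings, logistic Elo updates. Returns the rating variance early
    in the run and at the end."""
    rng = random.Random(seed)
    r = [0.0] * n
    var_early = 0.0
    for g in range(games):
        i, j = rng.sample(range(n), 2)
        p = 1.0 / (1.0 + math.exp(-(r[i] - r[j])))  # i's predicted score
        y = 1.0 if rng.random() < 0.5 else 0.0      # equal true skills
        d = k * (y - p)
        r[i] += d
        r[j] -= d
        if g == games // 20:
            var_early = statistics.pvariance(r)
    return var_early, statistics.pvariance(r)

v_early, v_late = population_variance_growth()
# The rating distribution spreads over time, consistent with the
# kinetic prediction of a slowly growing population variance.
```

This toy model omits player entry and learning, so it shows only the diffusive spreading; the skew and ridge effects above require the richer kinetic models in the cited works.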
4. Extensions: Draws, Discretized Outcomes, Margin of Victory
The original Elo algorithm assumes binary outcomes. Pragmatically, draws and multi-outcome games are critical. The stochastic analysis reveals:
- Classic Elo with the rule “draw = half-win” is not ad hoc: it is the stochastic-gradient update for a specific ternary-outcome likelihood, one in which the expected score satisfies $\mathbb{E}[S] = P(\mathrm{win}) + \tfrac{1}{2}P(\mathrm{draw}) = F(\theta_i - \theta_j)$, where $F$ is the logistic CDF.
- Generalizations (κ–Elo) parameterize the draw rate, enabling a better fit to real sports by tuning the draw probabilities via a Davidson parameter; the per-step update is still $\theta \leftarrow \theta + K(S - \mathbb{E}[S])$, with the appropriate fractional scores (Szczecinski et al., 2019).
- More generally, the rating update can handle discretized margin-of-victory (MOV) models, leading to “G-Elo” and similar schemes. The log-likelihood is maximized by stochastic gradient steps, and the per-player rating update continues to take the form $\theta_i \leftarrow \theta_i + K(S - \mathbb{E}[S])$, but with carefully defined score and expectation functions for each MOV bin (Szczecinski, 2020, Moreland et al., 2018).
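The half-win rule amounts to feeding a fractional score into the ordinary Elo step; a minimal sketch (function name and defaults are ours):

```python
import math

def elo_draw_step(r_a, r_b, score, k=32.0, scale=400.0):
    """Elo step with ternary outcomes: `score` is 1.0 (win), 0.5
    (draw), or 0.0 (loss) for player A. Counting a draw as a half-win
    keeps the update a stochastic-gradient step for the underlying
    ternary likelihood."""
    expected = 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / scale))  # E[S]
    delta = k * (score - expected)
    return r_a + delta, r_b - delta

ra, rb = elo_draw_step(1600.0, 1400.0, 0.5)
# The favourite's expected score exceeds 0.5, so a draw costs points.
```

The same skeleton covers the MOV extensions: only the definitions of `score` and `expected` change per outcome bin.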
5. Finite-Time Behavior, Convergence Rates, and Optimal K-factor Design
For a fixed or decaying step size $K$ and stationary match scheduling, Elo converges exponentially (in mean) toward the ground-truth skills when the step size is chosen appropriately, with variance governed by the outcome noise (game randomness) and sample complexity dictated by the number of matches per player. For round-robin tournaments, explicit formulas for the per-step mean, variance, bias, and mean-square error are given in (Zanco et al., 2022):
| Quantity | Regime/Interpretation |
|---|---|
| Mean convergence | Converges exponentially to the true skills |
| Variance | Steady-state mean-square deviation (MSD) |
| Excess loss | Excess log-loss over the Bayes-optimal predictor |
| Optimal $K$ | Explicit expressions for the optimal step size in (Zanco et al., 2022) |
To guarantee progress, $K$ must lie below a stability threshold (which depends on the skill variance), and the excess loss is minimized by tuning $K$ to each tournament horizon. Choosing $K$ too small retards adaptation; choosing it too large inflates estimation variance.
Spectral analysis of the underlying Markov chain, defined through the Laplacian of the match graph, reveals that convergence accelerates with the spectral gap $\lambda_2$, the smallest nonzero Laplacian eigenvalue. Optimal tournament design thus reduces to the fastest-mixing Markov chain problem, which can be solved by convex programming over match-frequency distributions (Olesker-Taylor et al., 2024).
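As a rough illustration of this design principle, the sketch below (pure Python, power iteration; all names are ours) computes the spectral gap of a match-graph Laplacian and compares a round-robin schedule with a path-like schedule on six players:

```python
import itertools

def laplacian(n, matches):
    """Graph Laplacian of a match schedule: `matches` is a list of
    (i, j) pairs; L[i][i] counts player i's games."""
    L = [[0.0] * n for _ in range(n)]
    for i, j in matches:
        L[i][i] += 1.0
        L[j][j] += 1.0
        L[i][j] -= 1.0
        L[j][i] -= 1.0
    return L

def spectral_gap(L, iters=500):
    """Smallest nonzero Laplacian eigenvalue (the spectral gap) via
    power iteration on c*I - L restricted to the zero-mean subspace."""
    n = len(L)
    c = 2.0 * max(L[i][i] for i in range(n)) + 1.0  # shift above lambda_max
    v = [(-1.0) ** i for i in range(n)]
    for _ in range(iters):
        mean = sum(v) / n
        v = [x - mean for x in v]       # project out the constant mode
        w = [c * v[i] - sum(L[i][m] * v[m] for m in range(n))
             for i in range(n)]
        norm = max(abs(x) for x in w)
        v = [x / norm for x in w]
    # Rayleigh quotient of L at the converged eigenvector = lambda_2.
    num = sum(v[i] * sum(L[i][m] * v[m] for m in range(n)) for i in range(n))
    return num / sum(x * x for x in v)

round_robin = list(itertools.combinations(range(6), 2))
path = [(i, i + 1) for i in range(5)]
gap_rr = spectral_gap(laplacian(6, round_robin))   # complete graph: gap = 6
gap_path = spectral_gap(laplacian(6, path))        # ~0.268, much slower
```

The complete (round-robin) schedule has the largest possible gap, while a chain of adjacent-only pairings mixes far more slowly, matching the intuition that ratings equilibrate fastest when everyone eventually plays everyone.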
6. Model Misspecification and Reliability in Practice
Empirical analysis demonstrates that real sports and game datasets rarely adhere to a perfect Bradley–Terry (logistic-difference) structure or remain stationary. Despite this, classic Elo's stochastic foundation (OGD on the log-loss) conveys a universal no-regret property: in sparse-data regimes, Elo's model-approximation error stays bounded, whereas for high-capacity alternatives the regret term growing with parameter count dominates (Tang et al., 16 Feb 2025). This explains the empirical superiority of traditional Elo (simple, low-dimensional, and robust) over complex pairwise or high-capacity extensions in real-world settings with limited per-player data.
Moreover, Elo's accuracy in win-rate estimation correlates strongly with empirical ranking performance, across competitive gaming and even large-scale application domains such as LLM comparative evaluation (Tang et al., 16 Feb 2025).
7. Summary Table: Theoretical Results and Empirical Features
| Aspect | Main Findings/Implications | Source ArXiv IDs |
|---|---|---|
| Probabilistic Model | Elo as OGD/SGD for cross-entropy loss (no-regret) | (Tang et al., 16 Feb 2025, Szczecinski et al., 2021) |
| Markov Chain Structure | Unique stationary law, convergence, bias/unbiased scores | (Cortez et al., 2024, Olesker-Taylor et al., 2024) |
| Bayesian/Variance Dynamics | Adaptive via Kalman gain, variance shrinkage, online uncertainty | (Szczecinski et al., 2021, Hua et al., 2023) |
| Macroscopic PDE/Population | Gaussian rating distribution, log growing variance, negative skew | (Fenner et al., 2011, Düring et al., 2018) |
| Draws, G-Elo, margin-of-victory | Ternary and multi-category extensions, Davidson parameterization | (Szczecinski et al., 2019, Szczecinski, 2020, Moreland et al., 2018) |
| Tournament Optimization | Spectral gap drives convergence, design for fastest mixing | (Olesker-Taylor et al., 2024) |
| Empirical Robustness | Superior prediction for sparse data, ranking/win-rate correlation | (Tang et al., 16 Feb 2025) |
References
The stochastic analysis of Elo is addressed in depth in the following works:
- (Tang et al., 16 Feb 2025) for regret-optimality, misspecification, and empirical reliability,
- (Szczecinski et al., 2021, Hua et al., 2023) for Bayesian and variance-updating (Kalman/Laplace) frameworks,
- (Cortez et al., 2024, Olesker-Taylor et al., 2024) for Markov chain, stationary law, finite-sample contraction and spectral gap analysis,
- (Fenner et al., 2011, Düring et al., 2018) for population-level PDEs and kinetic models,
- (Zanco et al., 2022) for round-robin tournament convergence and optimal $K$-factor design,
- (Szczecinski et al., 2019, Szczecinski, 2020, Moreland et al., 2018) for extension to draws, margin-of-victory, and multi-category outcome settings.