Almost Sure Convergence of Q-Learning
- Almost sure convergence of Q-learning ensures that, under sufficient exploration and diminishing step-sizes, the Q-value iterates converge to the optimal Q* with probability one.
- The analysis relies on stochastic approximation theory, including the Robbins-Monro and Robbins-Siegmund conditions, to guarantee convergence in finite MDPs, and extends to variants such as double and smoothed Q-learning.
- Recent extensions address convergence rates, concentration bounds, and applications in stochastic games and average-reward settings, enhancing the robustness of convergence analyses.
Almost sure convergence of Q-learning refers to the property that, under specific algorithmic and problem assumptions, the sequence of Q-value iterates produced by the Q-learning update converges to the optimal Q-function $Q^*$ with probability one. This property has been foundational in the theoretical underpinnings of reinforcement learning and is characterized, extended, and qualified by a diverse family of results addressing various environments, algorithmic modifications, and stochastic approximation structures.
1. Classical Tabular Q-learning: Criteria and Guarantees
For finite Markov Decision Processes (MDPs), the standard Q-learning update is
$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha_t(s_t, a_t)\left[r_t + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t)\right],$$
with all other entries of $Q_t$ left unchanged.
The almost sure convergence of Q-learning to the optimal value function $Q^*$ is guaranteed provided that:
- Every state-action pair is visited infinitely often (sufficient exploration);
- The learning rate sequences satisfy $\alpha_t(s,a) \in [0,1]$, $\sum_t \alpha_t(s,a) = \infty$, and $\sum_t \alpha_t(s,a)^2 < \infty$ for each $(s,a)$;
- The MDP is finite and all rewards are uniformly bounded.
These are sufficient conditions for stochastic approximation convergence, as originally established in foundational texts and surveys, where convergence is analyzed via contraction mappings and martingale or ODE techniques. The formal statement is that, under these conditions, $Q_t(s,a) \to Q^*(s,a)$ for every state-action pair $(s,a)$ with probability one. Key canonical references are Sutton & Barto (1998), Tsitsiklis (1994), and the elementary constructive proof in (Regehr et al., 2021), which employs only the Robbins-Monro theorem, the construction of an augmented "action-replay" process, and an explicit error recursion. Proofs generally rely on the Banach contraction property of the Bellman optimality operator, boundedness of the Q-values, and diminishing stepsizes.
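To make these conditions concrete, the following is a minimal sketch of tabular Q-learning with count-based Robbins-Monro stepsizes $\alpha_t(s,a) = 1/N_t(s,a)$; the Gym-style environment interface (`env.reset`, `env.step`) and the $\varepsilon$-greedy behavior policy are illustrative assumptions, not part of any cited result.

```python
import numpy as np

def tabular_q_learning(env, n_states, n_actions, gamma=0.99,
                       episodes=10_000, epsilon=0.1, seed=0):
    """Tabular Q-learning sketch with count-based Robbins-Monro stepsizes.

    Assumes a Gym-style `env` with integer states and actions; epsilon-greedy
    exploration is one simple way to keep every (s, a) visited infinitely often.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))  # N_t(s, a)

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection (sufficient-exploration condition)
            a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)

            visits[s, a] += 1
            alpha = 1.0 / visits[s, a]  # sum_t alpha = inf, sum_t alpha^2 < inf

            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```

The per-pair visit counts implement the summability conditions above, while $\varepsilon$-greedy exploration keeps every state-action pair visited infinitely often under the usual reachability assumptions.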
Variants:
- Double Q-learning mitigates overestimation bias and, under the same assumptions, also converges almost surely (Barber, 2023); a minimal sketch of its decoupled update follows the summary table below.
- Smoothed Q-learning (replacing the hard max in the update target with a smoothed average over actions) achieves almost sure convergence provided the smoothing vanishes asymptotically, i.e., the action distribution places all of its mass on maximizing actions as $t \to \infty$ (Barber, 2023, Lee, 20 Apr 2024).
The following table summarizes almost sure convergence results for tabular Q-learning and key variants:
| Algorithm | Almost Sure Convergence | Sufficient Conditions | Update Target |
|---|---|---|---|
| Q-learning | Yes | Sufficient exploration, RM stepsizes | Max over next-state values |
| Double Q-learning | Yes | Same as Q-learning | Decoupled argmax/value estimation |
| Smoothed Q-learning | Yes | Decaying smoothing/exploration; bounded rewards | Average/smooth over next-state values, vanishing smoothing |
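As referenced in the variants list above, a minimal sketch of the decoupled double Q-learning update is shown below; the random choice of which table to update and the stepsize argument are illustrative, following the standard form of the algorithm rather than any particular cited proof.

```python
import numpy as np

def double_q_update(Q_A, Q_B, s, a, r, s_next, done, alpha, gamma, rng):
    """One double Q-learning step: randomly pick a table, take its argmax,
    but use the *other* table's value to form the target (decoupled estimation)."""
    if rng.random() < 0.5:
        a_star = int(np.argmax(Q_A[s_next]))
        target = r + (0.0 if done else gamma * Q_B[s_next, a_star])
        Q_A[s, a] += alpha * (target - Q_A[s, a])
    else:
        a_star = int(np.argmax(Q_B[s_next]))
        target = r + (0.0 if done else gamma * Q_A[s_next, a_star])
        Q_B[s, a] += alpha * (target - Q_B[s, a])
```

Because the action selection and value estimation use different tables, the overestimation induced by the max operator is reduced, while the stochastic approximation structure (and hence the almost sure convergence argument) is preserved.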
2. Advanced Analysis: Almost Sure Rates and Concentration
Recent work has extended beyond asymptotic convergence to explicit almost sure convergence rates and associated concentration bounds. The results in (Qian et al., 20 Nov 2024) establish the following under geometrically mixing (ergodic) Markovian noise, contraction of the expected update, and generalized (non-count-based) learning rates:
- For general diminishing stepsizes, explicit almost sure convergence rates for the iterates are established;
- For logistic/log-linear stepsizes, high-probability confidence bounds valid uniformly over all times $t$ (maximal concentration) are also established.
These results substantially generalize previous analyses that were restricted to i.i.d. sampling or required count-based stepsizes, and apply not just to linear but also nonlinear stochastic approximation algorithms.
Key technical tool: a novel "skeleton iterate" discretization allowing drift and noise control over diminishing intervals, enabling the derivation of almost sure rates and uniform (all-time) exponential concentration.
3. Q-learning in Stochastic Games, Non-Discounted, and Average Reward Settings
Two-Player Zero-Sum Games and Stochastic Shortest Path (SSP)
For undiscounted (total cost) stochastic games, almost sure convergence was established in (Yu, 2014) under SSP model conditions:
- Existence of stationary (possibly randomized) policies that prevent catastrophic outcomes for each player;
- The dynamic programming equation has a unique solution;
- Iterates are shown to remain almost surely bounded by construction of auxiliary systems and monotone nonexpansive mapping theory.
The Q-learning update converges almost surely to the solution of the dynamic programming equation, with no assumptions of synchronous or frequent updates.
Two-Timescale Learning and Heterogeneous Step-sizes
For multi-agent learning in zero-sum games, two-timescale Q-learning with player-dependent rates converges almost surely to Nash equilibrium, provided the discount factor is not too large relative to the step-size heterogeneity (Sayin et al., 2021).
Average-Reward Q-learning (Relative Value Iteration and Adaptive Stepsizes)
For the average-reward regime, almost sure convergence is shown in (Wan et al., 29 Aug 2024) for RVI-Q-learning in weakly communicating MDPs. The limit set is characterized: in unichain cases, convergence is to a unique point; in more general cases, to a compact, connected set of solutions to the Bellman equation fixed by an additional normalization.
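For concreteness, a minimal sketch of the classical RVI-Q-learning update is given below; the choice of reference function $f(Q)$ as the mean of the table is an illustrative assumption (any $f$ satisfying $f(Q + c\mathbf{1}) = f(Q) + c$ plays the same normalizing role), and the sketch is not tied to the specific analysis in (Wan et al., 29 Aug 2024).

```python
import numpy as np

def rvi_q_step(Q, s, a, r, s_next, alpha):
    """One RVI Q-learning step for the average-reward setting.

    The reference function f(Q) (here the mean of the table) offsets the
    update so that Q tracks *relative* values; other choices such as
    Q[s0, a0] work as long as f(Q + c) = f(Q) + c.
    """
    f_Q = Q.mean()  # normalization term, an estimate of the gain
    target = r - f_Q + np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```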
For asynchronous average-reward Q-learning, almost sure and mean-square last-iterate convergence rates are available with adaptive (per-state-action) learning rates (crucially, not global or uniform), providing bounds in the span seminorm and sup-norm after centering (Chen, 25 Apr 2025). The use of adaptive stepsizes is proven necessary: with uniform stepsize, standard Q-learning can converge to incorrect (non-optimal) fixed points in this setting.
| Setting | Norm / Mode | Guarantee | Stepsize | Reference |
|---|---|---|---|---|
| Average-reward, weakly communicating MDP | Set convergence | a.s. to compact, connected set | Diminishing (RM) | (Wan et al., 29 Aug 2024) |
| Average-reward, asynchronous | Span / sup-norm | Mean-square and a.s. last-iterate rates | Adaptive per state-action pair | (Chen, 25 Apr 2025) |
4. Stochastic Approximation Foundations and Extensions
Convergence of Q-learning is a direct application of stochastic approximation theory. Classic results, e.g., Robbins-Monro, require that:
- The update variance is bounded;
- The expected update contracts towards the fixed point;
- Stepsizes satisfy summability and square-summability.
The Robbins-Siegmund theorem is a central technical tool: if nonnegative random variables $V_t$ satisfy $\mathbb{E}[V_{t+1} \mid \mathcal{F}_t] \le (1 + a_t)V_t - b_t + c_t$ with nonnegative, almost surely summable $a_t$ and $c_t$, then $V_t$ converges almost surely (and $\sum_t b_t < \infty$ almost surely); applied to the Q-learning error under Robbins-Monro stepsizes, it delivers almost sure convergence of the iterates.
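To illustrate how these ingredients combine, the display below sketches the standard sup-norm error recursion for synchronous tabular Q-learning; the decomposition into a contraction term and a zero-mean noise term is schematic rather than a full proof.

```latex
% Sketch (synchronous tabular case): write \Delta_t = Q_t - Q^* and let
% w_t denote the zero-mean sampling noise in the bootstrapped target.
\begin{align*}
\Delta_{t+1}(s,a)
  &= \bigl(1-\alpha_t(s,a)\bigr)\,\Delta_t(s,a)
   + \alpha_t(s,a)\Bigl[\bigl(\mathcal{T}Q_t - \mathcal{T}Q^*\bigr)(s,a) + w_t(s,a)\Bigr],\\[2pt]
\bigl\|\mathcal{T}Q_t - \mathcal{T}Q^*\bigr\|_\infty
  &\le \gamma\,\bigl\|\Delta_t\bigr\|_\infty
  \qquad\text{($\mathcal{T}$ is a $\gamma$-contraction in the sup-norm).}
\end{align*}
% Robbins--Monro stepsizes make the accumulated noise a convergent martingale,
% and a Robbins--Siegmund-type argument then drives \|\Delta_t\|_\infty \to 0
% almost surely.
```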
Recent extensions relax the classical requirement of summable noise to merely square-summable noise, provided the increments are additionally controlled and rare large jumps are prevented (Liu et al., 30 Sep 2025). This is instrumental for modern Q-learning analyses under Markovian sampling with realistic step-sizes.
5. Role of Monotonicity, Smoothness, and Function Approximation
Monotonicity of the update is a sufficient (but not necessary) condition for almost sure convergence. In the tabular case, Q-learning forms a monotone numerical scheme for stepsizes $\alpha_t \in [0,1]$, guaranteeing almost sure convergence under standard visitation and stepsize conditions (Yang, 30 May 2024).
With function approximation (linear features or neural networks), monotonicity is typically violated: even simple linear approximators can disrupt the required monotonicity unless step-sizes are impractically small or features are bounded. Consequently, almost sure convergence is generally not guaranteed for Q-learning with function approximation (the "deadly triad").
However, for linear Q-learning with an $\varepsilon$-softmax behavior policy and adaptive temperature, boundedness and convergence to a bounded set are now established, both in expectation (with an explicit rate) and almost surely (bounded iterates); pathwise convergence to a single point is not claimed and is not always achievable (Liu et al., 31 Jan 2025). Recent extensions of the Robbins-Siegmund theorem under square-summable conditions yield almost sure rates and high-probability concentration for such stochastic approximation processes (Liu et al., 30 Sep 2025).
6. Extensions: Smooth and Regularized Q-learning
Smooth Q-learning variants (e.g., softmax, mellowmax, entropy-regularization) are covered by the unified ODE-based stochastic approximation analysis. When the smoothing (softmax temperature, etc.) vanishes, almost sure convergence to the optimal $Q^*$ is guaranteed (Lee, 20 Apr 2024, Barber, 2023).
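A minimal sketch of a smoothed (softmax) update target with a vanishing temperature, in the spirit of these results, is shown below; the particular temperature schedule suggested in the comment is an illustrative assumption.

```python
import numpy as np

def softmax_target(q_next, r, gamma, tau, done):
    """Smoothed Q-learning target: a Boltzmann average over next-state values.

    As the temperature tau -> 0, the softmax weights concentrate on the
    maximizing actions and the target recovers the standard max-based one.
    """
    if done:
        return r
    z = q_next / max(tau, 1e-8)
    w = np.exp(z - z.max())   # numerically stable softmax weights
    w /= w.sum()
    return r + gamma * float(w @ q_next)

# Illustrative vanishing-temperature schedule (an assumption, not prescribed):
# tau_t = 1.0 / (1.0 + 0.01 * t)
```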
In POMDPs, regularized agent-state-based Q-learning (RASQL), where Q-tables are indexed by agent states (possibly RNN states) and updated via a strongly convex regularizer, converges almost surely to a unique fixed point of a regularized Bellman operator on the agent-state space, under standard exploration and Robbins-Monro step-size conditions. However, the limiting policy is optimal only for the induced "artificial MDP" determined by the agent-state process, and may be suboptimal with respect to the true POMDP (Sinha et al., 29 Aug 2025).
7. Empirical, Distributional, and Practical Considerations
Experience replay, when applied to tabular Q-learning, preserves almost sure convergence provided all state-action pairs continue to be sampled infinitely often and stepsizes decay appropriately (Szlak et al., 2021). However, excessively high replay ratios can degrade convergence rate, while judicious replay is highly effective in environments with rare transitions.
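The following hedged sketch shows one way to combine a replay buffer with the tabular update while preserving the conditions above; the buffer size and replay ratio are illustrative choices, not values taken from (Szlak et al., 2021).

```python
import random
from collections import deque

import numpy as np

def replay_q_learning_step(Q, buffer, transition, visits, gamma, replay_k=4):
    """Apply the tabular Q-learning update to the fresh transition and to
    `replay_k` transitions resampled from the buffer (illustrative ratio).

    Convergence is preserved as long as every (s, a) keeps being updated
    infinitely often and the per-pair stepsizes remain Robbins-Monro.
    """
    buffer.append(transition)
    batch = [transition] + random.sample(list(buffer), min(replay_k, len(buffer)))
    for (s, a, r, s_next, done) in batch:
        visits[s, a] += 1
        alpha = 1.0 / visits[s, a]
        target = r + (0.0 if done else gamma * np.max(Q[s_next]))
        Q[s, a] += alpha * (target - Q[s, a])

# Usage sketch: buffer = deque(maxlen=10_000); visits = np.zeros_like(Q)
```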
Distributional contractivity frameworks analyze the evolution of Q-learning as a Markov process over distributions of Q-functions. With constant stepsizes, only convergence in distribution (to a stationary law) is achieved; as stepsizes vanish, the process recovers the classical pathwise almost sure convergence (Amortila et al., 2020).
References and Synthesis Table
| Algorithm/Setting | Convergence Type | Sufficient Conditions | Notes/Limitations | Key Sources |
|---|---|---|---|---|
| Standard tabular Q-learning | Almost sure (a.s.) | Sufficient exploration, RM stepsizes | Finite MDP, bounded rewards | (Barber, 2023, Regehr et al., 2021, Zhang, 5 Nov 2025) |
| Double/smoothed Q-learning | a.s. | As above (+ vanishing smoothing for the smoothed variant) | Slowdown (double Q); smoothing must vanish | (Barber, 2023, Lee, 20 Apr 2024) |
| Stoch. shortest path games | a.s. | Unique DP solution, monotone/nonexpansive | General two-player zero-sum/undiscounted | (Yu, 2014) |
| Average-reward Q-learning | a.s. (to set) | Adaptive stepsizes, ergodic Markov chain | Norm depends on setting; set-valued limit | (Wan et al., 29 Aug 2024, Chen, 25 Apr 2025) |
| Function approximation (linear) | a.s. bounded, to set | $\varepsilon$-softmax policy, adaptive temperature | Converges to set, not point | (Liu et al., 31 Jan 2025, Liu et al., 30 Sep 2025) |
| Multi-agent (zero-sum) | a.s. to Nash eq. | Sufficient exploration, stepsize/discount bound | Heterogeneity affects allowable discount | (Sayin et al., 2021) |
The body of knowledge on almost sure convergence of Q-learning is now highly mature in the classical tabular setting, expanding with new quantitative tools for convergence rates, concentration, and extension to function approximation and structural modifications. Results in multi-agent, non-discounted, and average-reward regimes further reinforce the robustness and adaptability of stochastic approximation foundations in RL, with recent formalization efforts marking a new standard of mathematical rigor and machine-verifiable theory (Zhang, 5 Nov 2025).