Almost Sure Convergence of Q-Learning
- Almost sure convergence of Q-learning ensures that, under sufficient exploration and diminishing step-sizes, the Q-value iterates converge to the optimal Q* with probability one.
- The analysis relies on stochastic approximation theory, including the Robbins-Monro and Robbins-Siegmund conditions, to guarantee convergence in finite MDPs, and extends to variants such as double and smoothed Q-learning.
- Recent extensions address convergence rates, concentration bounds, and applications in stochastic games and average-reward settings, enhancing the robustness of convergence analyses.
Almost sure convergence of Q-learning refers to the property that, under specific algorithmic and problem assumptions, the sequence of Q-value iterates produced by the Q-learning update converges to the optimal Q-function $Q^*$ with probability one. This property has been foundational in the theoretical underpinnings of reinforcement learning and is characterized, extended, and qualified by a diverse family of results addressing various environments, algorithmic modifications, and stochastic approximation structures.
1. Classical Tabular Q-learning: Criteria and Guarantees
For finite Markov Decision Processes (MDPs), the standard Q-learning update is
$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha_t(s_t, a_t)\left[r_t + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t)\right],$$
with all other entries of $Q_t$ left unchanged.
The almost sure convergence of Q-learning to the optimal value function $Q^*$ is guaranteed provided that:
- Every state-action pair is visited infinitely often (sufficient exploration);
- The learning rate sequences satisfy $\alpha_t(s,a) \in [0,1]$, $\sum_t \alpha_t(s,a) = \infty$, and $\sum_t \alpha_t(s,a)^2 < \infty$ for each $(s,a)$;
- The MDP is finite and all rewards are uniformly bounded.
These are sufficient conditions for stochastic approximation convergence, as originally established in foundational texts and surveys, where convergence is analyzed via contraction mappings and martingale or ODE techniques. The formal statement is that, under these conditions, $Q_t(s,a) \to Q^*(s,a)$ for every state-action pair $(s,a)$ with probability one. Key canonical references are Sutton & Barto (1998), Tsitsiklis (1994), and the elementary constructive proof in (Regehr et al., 2021), which employs only the Robbins-Monro theorem, the construction of an augmented "action-replay" process, and an explicit error recursion. Proofs generally rely on the Banach contraction property of the Bellman optimality operator, boundedness of the Q-values, and diminishing stepsizes.
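To make these conditions concrete, the following is a minimal sketch of tabular Q-learning with count-based Robbins-Monro stepsizes $\alpha_t(s,a) = 1/N_t(s,a)$; the Gym-style environment interface (`env.reset`, `env.step`) and the $\varepsilon$-greedy behavior policy are illustrative assumptions, not part of any cited result.

```python
import numpy as np

def tabular_q_learning(env, n_states, n_actions, gamma=0.99,
                       episodes=10_000, epsilon=0.1, seed=0):
    """Tabular Q-learning sketch with count-based Robbins-Monro stepsizes.

    Assumes a Gym-style `env` with integer states and actions; epsilon-greedy
    exploration is one simple way to keep every (s, a) visited infinitely often.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))  # N_t(s, a)

    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection (sufficient-exploration condition)
            a = rng.integers(n_actions) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)

            visits[s, a] += 1
            alpha = 1.0 / visits[s, a]  # sum_t alpha = inf, sum_t alpha^2 < inf

            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```

The per-pair visit counts implement the summability conditions above, while $\varepsilon$-greedy exploration keeps every state-action pair visited infinitely often under the usual reachability assumptions.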
Variants:
- Double Q-learning mitigates overestimation bias and, under the same assumptions, also converges almost surely (Barber, 2023); a minimal sketch of its decoupled update follows the summary table below.
- Smoothed Q-learning (replacing the hard max in the update target with a smoothed average over actions) achieves almost sure convergence provided the smoothing vanishes asymptotically, i.e., the action distribution places all of its mass on maximizing actions as $t \to \infty$ (Barber, 2023, Lee, 20 Apr 2024).
The following table summarizes almost sure convergence results for tabular Q-learning and key variants:
| Algorithm | Almost Sure Convergence | Sufficient Conditions | Update Target |
|---|---|---|---|
| Q-learning | Yes | Sufficient exploration, RM stepsizes | Max over next-state values |
| Double Q-learning | Yes | Same as Q-learning | Decoupled argmax/value estimation |
| Smoothed Q-learning | Yes | Decaying smoothing/exploration; bounded rewards | Average/smooth over next-state values, vanishing smoothing |
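As referenced in the variants list above, a minimal sketch of the decoupled double Q-learning update is shown below; the random choice of which table to update and the stepsize argument are illustrative, following the standard form of the algorithm rather than any particular cited proof.

```python
import numpy as np

def double_q_update(Q_A, Q_B, s, a, r, s_next, done, alpha, gamma, rng):
    """One double Q-learning step: randomly pick a table, take its argmax,
    but use the *other* table's value to form the target (decoupled estimation)."""
    if rng.random() < 0.5:
        a_star = int(np.argmax(Q_A[s_next]))
        target = r + (0.0 if done else gamma * Q_B[s_next, a_star])
        Q_A[s, a] += alpha * (target - Q_A[s, a])
    else:
        a_star = int(np.argmax(Q_B[s_next]))
        target = r + (0.0 if done else gamma * Q_A[s_next, a_star])
        Q_B[s, a] += alpha * (target - Q_B[s, a])
```

Because the action selection and value estimation use different tables, the overestimation induced by the max operator is reduced, while the stochastic approximation structure (and hence the almost sure convergence argument) is preserved.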
2. Advanced Analysis: Almost Sure Rates and Concentration
Recent work has extended beyond asymptotic convergence to explicit almost sure convergence rates and associated concentration bounds. The results in (Qian et al., 20 Nov 2024) establish the following under geometrically mixing (ergodic) Markovian noise, contraction of the expected update, and generalized (non-count-based) learning rates:
- For general diminishing stepsizes, explicit almost sure convergence rates for the iterates are established;
- For logistic/log-linear stepsizes, high-probability confidence bounds valid uniformly over all times $t$ (maximal concentration) are also established.
These results substantially generalize previous analyses that were restricted to i.i.d. sampling or required count-based stepsizes, and apply not just to linear but also nonlinear stochastic approximation algorithms.
Key technical tool: a novel "skeleton iterate" discretization allowing drift and noise control over diminishing intervals, enabling the derivation of almost sure rates and uniform (all-time) exponential concentration.
3. Q-learning in Stochastic Games, Non-Discounted, and Average Reward Settings
Two-Player Zero-Sum Games and Stochastic Shortest Path (SSP)
For undiscounted (total cost) stochastic games, almost sure convergence was established in (Yu, 2014) under SSP model conditions:
- Existence of stationary (possibly randomized) policies that prevent catastrophic outcomes for each player;
- The dynamic programming equation has a unique solution;
- Iterates are shown to remain almost surely bounded by construction of auxiliary systems and monotone nonexpansive mapping theory.
The Q-learning update converges almost surely to the solution of the dynamic programming equation, with no assumptions of synchronous or frequent updates.
Two-Timescale Learning and Heterogeneous Step-sizes
For multi-agent learning in zero-sum games, two-timescale Q-learning with player-dependent rates converges almost surely to Nash equilibrium, provided the discount factor is not too large relative to the step-size heterogeneity (Sayin et al., 2021).
Average-Reward Q-learning (Relative Value Iteration and Adaptive Stepsizes)
For the average-reward regime, almost sure convergence is shown in (Wan et al., 29 Aug 2024) for RVI-Q-learning in weakly communicating MDPs. The limit set is characterized: in unichain cases, convergence is to a unique point; in more general cases, to a compact, connected set of solutions to the Bellman equation fixed by an additional normalization.
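For concreteness, a minimal sketch of the classical RVI-Q-learning update is given below; the choice of reference function $f(Q)$ as the mean of the table is an illustrative assumption (any $f$ satisfying $f(Q + c\mathbf{1}) = f(Q) + c$ plays the same normalizing role), and the sketch is not tied to the specific analysis in (Wan et al., 29 Aug 2024).

```python
import numpy as np

def rvi_q_step(Q, s, a, r, s_next, alpha):
    """One RVI Q-learning step for the average-reward setting.

    The reference function f(Q) (here the mean of the table) offsets the
    update so that Q tracks *relative* values; other choices such as
    Q[s0, a0] work as long as f(Q + c) = f(Q) + c.
    """
    f_Q = Q.mean()  # normalization term, an estimate of the gain
    target = r - f_Q + np.max(Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```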
For asynchronous average-reward Q-learning, almost sure and mean-square last-iterate convergence rates are available with adaptive (per-state-action) learning rates (crucially, not global or uniform), providing bounds in the span seminorm and sup-norm after centering (Chen, 25 Apr 2025). The use of adaptive stepsizes is proven necessary: with uniform stepsize, standard Q-learning can converge to incorrect (non-optimal) fixed points in this setting.
| Setting | Norm / Mode | Guarantee | Stepsize | Reference |
|---|---|---|---|---|
| Average-reward, weakly communicating MDP | Set convergence | a.s. to compact, connected set | Diminishing (RM) | (Wan et al., 29 Aug 2024) |
| Average-reward, asynchronous | Span / sup-norm | Mean-square and a.s. last-iterate rates | Adaptive per state-action pair | (Chen, 25 Apr 2025) |
4. Stochastic Approximation Foundations and Extensions
Convergence of Q-learning is a direct application of stochastic approximation theory. Classic results, e.g., Robbins-Monro, require that:
- The update variance is bounded;
- The expected update contracts towards the fixed point;
- Stepsizes satisfy summability and square-summability.
The Robbins-Siegmund theorem is a central technical tool: if nonnegative random variables $V_t$ satisfy $\mathbb{E}[V_{t+1} \mid \mathcal{F}_t] \le (1 + a_t)V_t - b_t + c_t$ with nonnegative, almost surely summable $a_t$ and $c_t$, then $V_t$ converges almost surely (and $\sum_t b_t < \infty$ almost surely); applied to the Q-learning error under Robbins-Monro stepsizes, it delivers almost sure convergence of the iterates.
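To illustrate how these ingredients combine, the display below sketches the standard sup-norm error recursion for synchronous tabular Q-learning; the decomposition into a contraction term and a zero-mean noise term is schematic rather than a full proof.

```latex
% Sketch (synchronous tabular case): write \Delta_t = Q_t - Q^* and let
% w_t denote the zero-mean sampling noise in the bootstrapped target.
\begin{align*}
\Delta_{t+1}(s,a)
  &= \bigl(1-\alpha_t(s,a)\bigr)\,\Delta_t(s,a)
   + \alpha_t(s,a)\Bigl[\bigl(\mathcal{T}Q_t - \mathcal{T}Q^*\bigr)(s,a) + w_t(s,a)\Bigr],\\[2pt]
\bigl\|\mathcal{T}Q_t - \mathcal{T}Q^*\bigr\|_\infty
  &\le \gamma\,\bigl\|\Delta_t\bigr\|_\infty
  \qquad\text{($\mathcal{T}$ is a $\gamma$-contraction in the sup-norm).}
\end{align*}
% Robbins--Monro stepsizes make the accumulated noise a convergent martingale,
% and a Robbins--Siegmund-type argument then drives \|\Delta_t\|_\infty \to 0
% almost surely.
```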
Recent extensions relax the classical requirement of summable noise to merely square-summable noise, provided the increments are additionally controlled and rare large jumps are prevented (Liu et al., 30 Sep 2025). This is instrumental for modern Q-learning analyses under Markovian sampling with realistic step-sizes.
5. Role of Monotonicity, Smoothness, and Function Approximation
Monotonicity of the update is a sufficient (but not necessary) condition for almost sure convergence. In the tabular case, Q-learning forms a monotone numerical scheme for stepsizes $\alpha_t \in [0,1]$, guaranteeing almost sure convergence under standard visitation and stepsize conditions (Yang, 30 May 2024).
With function approximation (linear features or neural networks), monotonicity is typically violated: even simple linear approximators can disrupt the required monotonicity unless step-sizes are impractically small or features are bounded. Consequently, almost sure convergence is generally not guaranteed for Q-learning with function approximation (the "deadly triad").
However, for linear Q-learning with an $\varepsilon$-softmax behavior policy and adaptive temperature, boundedness and convergence to a bounded set are now established, both in expectation (with an explicit rate) and almost surely (bounded iterates); pathwise convergence to a single point is not claimed and is not always achievable (Liu et al., 31 Jan 2025). Recent extensions of the Robbins-Siegmund theorem under square-summable conditions yield almost sure rates and high-probability concentration for such stochastic approximation processes (Liu et al., 30 Sep 2025).
6. Extensions: Smooth and Regularized Q-learning
Smooth Q-learning variants (e.g., softmax, mellowmax, entropy-regularization) are covered by the unified ODE-based stochastic approximation analysis. When the smoothing (softmax temperature, etc.) vanishes, almost sure convergence to the optimal $Q^*$ is guaranteed (Lee, 20 Apr 2024, Barber, 2023).
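A minimal sketch of a smoothed (softmax) update target with a vanishing temperature, in the spirit of these results, is shown below; the particular temperature schedule suggested in the comment is an illustrative assumption.

```python
import numpy as np

def softmax_target(q_next, r, gamma, tau, done):
    """Smoothed Q-learning target: a Boltzmann average over next-state values.

    As the temperature tau -> 0, the softmax weights concentrate on the
    maximizing actions and the target recovers the standard max-based one.
    """
    if done:
        return r
    z = q_next / max(tau, 1e-8)
    w = np.exp(z - z.max())   # numerically stable softmax weights
    w /= w.sum()
    return r + gamma * float(w @ q_next)

# Illustrative vanishing-temperature schedule (an assumption, not prescribed):
# tau_t = 1.0 / (1.0 + 0.01 * t)
```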
In POMDPs, regularized agent-state-based Q-learning (RASQL), where Q-tables are indexed by agent states (possibly RNN states) and updated via a strongly convex regularizer, converges almost surely to a unique fixed point of a regularized Bellman operator on the agent-state space, under standard exploration and Robbins-Monro step-size conditions. However, the limiting policy is optimal only for the induced "artificial MDP" determined by the agent-state process, and may be suboptimal with respect to the true POMDP (Sinha et al., 29 Aug 2025).
7. Empirical, Distributional, and Practical Considerations
Experience replay, when applied to tabular Q-learning, preserves almost sure convergence provided all state-action pairs continue to be sampled infinitely often and stepsizes decay appropriately (Szlak et al., 2021). However, excessively high replay ratios can degrade convergence rate, while judicious replay is highly effective in environments with rare transitions.
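The following hedged sketch shows one way to combine a replay buffer with the tabular update while preserving the conditions above; the buffer size and replay ratio are illustrative choices, not values taken from (Szlak et al., 2021).

```python
import random
from collections import deque

import numpy as np

def replay_q_learning_step(Q, buffer, transition, visits, gamma, replay_k=4):
    """Apply the tabular Q-learning update to the fresh transition and to
    `replay_k` transitions resampled from the buffer (illustrative ratio).

    Convergence is preserved as long as every (s, a) keeps being updated
    infinitely often and the per-pair stepsizes remain Robbins-Monro.
    """
    buffer.append(transition)
    batch = [transition] + random.sample(list(buffer), min(replay_k, len(buffer)))
    for (s, a, r, s_next, done) in batch:
        visits[s, a] += 1
        alpha = 1.0 / visits[s, a]
        target = r + (0.0 if done else gamma * np.max(Q[s_next]))
        Q[s, a] += alpha * (target - Q[s, a])

# Usage sketch: buffer = deque(maxlen=10_000); visits = np.zeros_like(Q)
```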
Distributional contractivity frameworks analyze the evolution of Q-learning as a Markov process over distributions of Q-functions. With constant stepsizes, only convergence in distribution (to a stationary law) is achieved; as stepsizes vanish, the process recovers the classical pathwise almost sure convergence (Amortila et al., 2020).
References and Synthesis Table
| Algorithm/Setting | Convergence Type | Sufficient Conditions | Notes/Limitations | Key Sources |
|---|---|---|---|---|
| Standard tabular Q-learning | Almost sure (a.s.) | Sufficient exploration, RM stepsizes | Finite MDP, bounded rewards | (Barber, 2023, Regehr et al., 2021, Zhang, 5 Nov 2025) |
| Double/smoothed Q-learning | a.s. | As above (+ vanishing smoothing for the smoothed variant) | Slowdown (double Q); smoothing must vanish | (Barber, 2023, Lee, 20 Apr 2024) |
| Stoch. shortest path games | a.s. | Unique DP solution, monotone/nonexpansive | General two-player zero-sum/undiscounted | (Yu, 2014) |
| Average-reward Q-learning | a.s. (to set) | Adaptive stepsizes, ergodic Markov chain | Norm depends on setting; set-valued limit | (Wan et al., 29 Aug 2024, Chen, 25 Apr 2025) |
| Function approximation (linear) | a.s. bounded, to set | $\varepsilon$-softmax policy, adaptive temperature | Converges to set, not point | (Liu et al., 31 Jan 2025, Liu et al., 30 Sep 2025) |
| Multi-agent (zero-sum) | a.s. to Nash eq. | Sufficient exploration, stepsize/discount bound | Heterogeneity affects allowable discount | (Sayin et al., 2021) |
The body of knowledge on almost sure convergence of Q-learning is now highly mature in the classical tabular setting, expanding with new quantitative tools for convergence rates, concentration, and extension to function approximation and structural modifications. Results in multi-agent, non-discounted, and average-reward regimes further reinforce the robustness and adaptability of stochastic approximation foundations in RL, with recent formalization efforts marking a new standard of mathematical rigor and machine-verifiable theory (Zhang, 5 Nov 2025).