
Almost Sure Convergence of Q-Learning

Updated 6 November 2025
  • Almost sure convergence of Q-learning ensures that, under sufficient exploration and diminishing step-sizes, the Q-value iterates converge to the optimal Q* with probability one.
  • The analysis relies on stochastic approximation theory, including Robbins-Monro and Robbins-Siegmund conditions, to guarantee convergence in finite MDPs and extends to variants such as double and smoothed Q-learning.
  • Recent extensions address convergence rates, concentration bounds, and applications in stochastic games and average-reward settings, enhancing the robustness of convergence analyses.

Almost sure convergence of $Q$-learning refers to the property that, under specific algorithmic and problem assumptions, the sequence of Q-value iterates produced by the Q-learning update converges to the optimal $Q^*$ function with probability one. This property has been foundational in the theoretical underpinnings of reinforcement learning and is characterized, extended, and qualified by a diverse family of results addressing various environments, algorithmic modifications, and stochastic approximation structures.

1. Classical Tabular Q-learning: Criteria and Guarantees

For finite Markov Decision Processes (MDPs), the standard Q-learning update is

$$Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + \alpha_t(s_t, a_t) \Big( r_t + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t) \Big).$$

The almost sure convergence of $Q$-learning to the optimal value $Q^*$ is guaranteed provided that:

  • Every state-action pair $(s,a)$ is visited infinitely often (sufficient exploration);
  • The learning rate sequence satisfies $0 \leq \alpha_t \leq 1$, $\sum_t \alpha_t(s,a) = \infty$, and $\sum_t \alpha_t^2(s,a) < \infty$ for each $(s,a)$;
  • The MDP is finite and all rewards are uniformly bounded.

These are sufficient conditions for stochastic approximation convergence, as originally established in foundational texts and surveys, where convergence is analyzed via contraction mappings and martingale or ODE techniques. The formal statement is
$$\lim_{t\to\infty} Q_t(s, a) = Q^*(s, a) \quad \text{with probability one for all } (s,a).$$
Key canonical references are Sutton & Barto (1998), Tsitsiklis (1994), and the elementary constructive proof in (Regehr et al., 2021), which employs only the Robbins-Monro theorem, the construction of an augmented "action-replay" process, and an explicit error recursion. Proofs generally rely on the Banach contraction property of the Bellman optimality operator, boundedness of the Q-values, and diminishing stepsizes.
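As a concrete reference point, the sketch below implements tabular Q-learning with count-based stepsizes $\alpha_t(s,a) = 1/N_t(s,a)$, which satisfy the Robbins-Monro conditions above, together with $\epsilon$-greedy exploration. The environment interface (`env.n_states`, `env.n_actions`, `env.reset()`, `env.step()`) is a hypothetical stand-in for any finite MDP, not an API from the cited works.

```python
import numpy as np

def tabular_q_learning(env, gamma=0.99, episodes=5000, epsilon=0.1, seed=0):
    """Tabular Q-learning with count-based stepsizes alpha_t(s, a) = 1 / N_t(s, a),
    which satisfy sum_t alpha_t = inf and sum_t alpha_t^2 < inf for every (s, a)
    that is visited infinitely often."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.n_states, env.n_actions))
    N = np.zeros((env.n_states, env.n_actions))        # per-pair visit counts

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy exploration keeps state-action pairs visited
            if rng.random() < epsilon:
                a = int(rng.integers(env.n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)

            N[s, a] += 1
            alpha = 1.0 / N[s, a]                       # diminishing Robbins-Monro stepsize
            target = r + (0.0 if done else gamma * np.max(Q[s_next]))
            Q[s, a] += alpha * (target - Q[s, a])       # standard Q-learning update
            s = s_next
    return Q
```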

Variants:

  • Double Q-learning mitigates overestimation bias and, under the same assumptions, also converges almost surely (Barber, 2023).
  • Smoothed Q-learning (replacing the $\max$ with a smoothed average) achieves almost sure convergence provided the smoothing vanishes asymptotically, i.e., the action distribution $q_t(a|s)$ places all of its mass on maximizing actions as $t \to \infty$ (Barber, 2023, Lee, 20 Apr 2024); a minimal sketch of such a smoothed target follows this list.
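The sketch below shows one way to form a smoothed bootstrap target using a Boltzmann (softmax) weighting whose temperature decays over time, so that the weights concentrate on the maximizing actions as $t \to \infty$. The decay schedule and constants are illustrative assumptions, not the specific choices analyzed in (Barber, 2023) or (Lee, 20 Apr 2024).

```python
import numpy as np

def smoothed_target(q_next, r, gamma, t, tau0=1.0, decay=0.999):
    """Smoothed Q-learning target: replace max_a' Q(s', a') with a Boltzmann
    average of the next-state action values q_next whose temperature tau_t -> 0,
    so the action weights concentrate on maximizing actions as t -> infinity."""
    tau = max(tau0 * decay**t, 1e-8)          # vanishing temperature (illustrative schedule)
    z = (q_next - q_next.max()) / tau         # numerically stable softmax
    w = np.exp(z) / np.exp(z).sum()
    return r + gamma * float(w @ q_next)      # smoothed bootstrap value
```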

The following table summarizes almost sure convergence results for tabular Q-learning and key variants:

| Algorithm | Almost Sure Convergence | Sufficient Conditions | Update Target |
|---|---|---|---|
| Q-learning | Yes | Sufficient exploration, RM stepsizes | $\max$ over next-state values |
| Double Q-learning | Yes | Same as Q-learning | Decoupled argmax/value estimation |
| Smoothed Q-learning | Yes | Decaying smoothing/exploration; bounded rewards | Average/smooth over next-state values, vanishing smoothing |

2. Advanced Analysis: Almost Sure Rates and Concentration

Recent work extends beyond asymptotic convergence to explicit almost sure convergence rates and associated concentration bounds. The results in (Qian et al., 20 Nov 2024) establish the following under geometrically mixing (ergodic) Markovian noise, contraction of the expected update, and generalized (non-count-based) learning rates:

  • For stepsizes $\alpha_t = C_\alpha/(t+3)^{\nu}$ with $\nu \in (2/3, 1]$,

$$\lim_{t \to \infty} t^{\zeta} \, \|q_t - q^*\|^2 = 0 \ \text{a.s.} \quad \forall\, \zeta \in \left(0, \tfrac{3}{2}\nu - 1\right).$$

  • For logistic/log-linear stepsizes, high-probability confidence bounds valid for all $t$ (maximal concentration) are also established.

These results substantially generalize previous analyses that were restricted to i.i.d. sampling or required count-based stepsizes, and they apply not only to linear but also to nonlinear stochastic approximation algorithms.

Key technical tool: a novel "skeleton iterate" discretization allowing drift and noise control over diminishing intervals, enabling the derivation of almost sure rates and uniform (all-time) exponential concentration.
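A small helper, mirroring the stepsize family and rate window stated above, makes the relationship between $\nu$ and the attainable almost-sure exponent $\zeta$ explicit; the function and constant names are purely illustrative.

```python
def polynomial_stepsizes(nu, C_alpha=1.0):
    """For alpha_t = C_alpha / (t + 3)**nu with nu in (2/3, 1], the almost-sure
    rate t**zeta * ||q_t - q*||**2 -> 0 holds for every zeta < 1.5 * nu - 1."""
    assert 2.0 / 3.0 < nu <= 1.0, "the stated rate result requires nu in (2/3, 1]"
    alpha = lambda t: C_alpha / (t + 3) ** nu     # stepsize schedule
    zeta_max = 1.5 * nu - 1.0                     # supremum of attainable exponents
    return alpha, zeta_max

alpha, zeta_max = polynomial_stepsizes(nu=0.8)    # here zeta_max = 0.2
```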

3. Q-learning in Stochastic Games, Non-Discounted, and Average Reward Settings

Two-Player Zero-Sum Games and Stochastic Shortest Path (SSP)

For undiscounted (total-cost) stochastic games, almost sure convergence was established in (Yu, 2014) under SSP model conditions:

  • Existence of stationary (possibly randomized) policies that prevent catastrophic outcomes for each player;
  • The dynamic programming equation has a unique solution;
  • Iterates are shown to remain almost surely bounded by construction of auxiliary systems and monotone nonexpansive mapping theory.

The Q-learning update converges almost surely to the solution of the dynamic programming equation, with no assumptions of synchronous or frequent updates.

Two-Timescale Learning and Heterogeneous Step-sizes

For multi-agent learning in zero-sum games, two-timescale Q-learning with player-dependent rates converges almost surely to Nash equilibrium, provided the discount is not too large relative to the step-size heterogeneity: $\gamma \leq d_{\alpha} d_{\beta}$ (see (Sayin et al., 2021)).

Average-Reward Q-learning (Relative Value Iteration and Adaptive Stepsizes)

For the average-reward regime, almost sure convergence is shown in (Wan et al., 29 Aug 2024) for RVI-Q-learning in weakly communicating MDPs. The limit set is characterized: in unichain cases, convergence is to a unique point; in more general cases, to a compact, connected set of solutions to the Bellman equation fixed by an additional normalization.

For asynchronous average-reward Q-learning, almost sure and mean-square last-iterate convergence rates are available with adaptive (per-state-action) learning rates (crucially, not global or uniform), providing $O(1/k)$ bounds in the span seminorm and sup-norm after centering (Chen, 25 Apr 2025). The use of adaptive stepsizes is proven necessary: with a uniform stepsize, standard Q-learning can converge to incorrect (non-optimal) fixed points in this setting.
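To make the role of adaptive stepsizes concrete, the sketch below performs one asynchronous RVI-style average-reward Q-learning update with a per-$(s,a)$ stepsize $1/N(s,a)$. The update form, the reference state-action pair, and the $1/N$ schedule are illustrative assumptions, not the exact algorithm or stepsize of (Chen, 25 Apr 2025).

```python
import numpy as np

def rvi_q_update(Q, counts, s, a, r, s_next, ref=(0, 0)):
    """One asynchronous average-reward (RVI-style) Q-learning update with an
    adaptive per-(s, a) stepsize alpha = 1 / N(s, a). A single uniform/global
    stepsize is deliberately avoided, since with a uniform stepsize standard
    Q-learning can converge to non-optimal fixed points in this setting."""
    counts[s, a] += 1
    alpha = 1.0 / counts[s, a]                   # adaptive, per visited (s, a)
    gain = Q[ref]                                # reference value used as the gain estimate
    td = r - gain + np.max(Q[s_next]) - Q[s, a]  # relative (average-reward) TD error
    Q[s, a] += alpha * td
    return Q, counts
```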

| Setting | Norm / Mode | Guarantee | Stepsize | Reference |
|---|---|---|---|---|
| Average-reward, weakly communicating MDP | Set convergence | a.s. to compact, connected set | Diminishing (RM) | (Wan et al., 29 Aug 2024) |
| Average-reward, asynchronous | Span / sup-norm | Mean-square, a.s. bounded | Adaptive per visited $(s,a)$ | (Chen, 25 Apr 2025) |

4. Stochastic Approximation Foundations and Extensions

Convergence of Q-learning is a direct application of stochastic approximation theory. Classic results, e.g., Robbins-Monro, require that:

  • The update variance is bounded;
  • The expected update contracts towards the fixed point;
  • Stepsizes satisfy summability and square-summability.

The Robbins-Siegmund theorem is a central technical tool: if $z_{n+1} \leq (1 - T_n) z_n + C T_n^2$ in expectation given the past, and $T_n$ satisfies the Robbins-Monro conditions, then $z_n \to 0$ almost surely.
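As a quick numerical illustration (not part of any cited proof), the snippet below iterates the deterministic envelope $z_{n+1} = (1 - T_n) z_n + C T_n^2$ with the Robbins-Monro schedule $T_n = 1/n$; the iterate decays to zero, roughly like $C \log n / n$.

```python
def robbins_siegmund_demo(z0=1.0, C=5.0, steps=100_000):
    """Iterate the deterministic envelope z_{n+1} = (1 - T_n) z_n + C * T_n**2
    with T_n = 1/n, which satisfies the Robbins-Monro conditions."""
    z = z0
    for n in range(1, steps + 1):
        T = 1.0 / n
        z = (1.0 - T) * z + C * T * T
    return z

print(robbins_siegmund_demo())  # tends to 0, roughly like C * log(n) / n
```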

Recent extensions remove classical limitations in settings where the "noise" is only square-summable (not summable), provided that the increments are additionally controlled and rare jumps are prevented (Liu et al., 30 Sep 2025). This is instrumental for modern Q-learning analyses under Markovian sampling with realistic step-sizes.

5. Role of Monotonicity, Smoothness, and Function Approximation

Monotonicity of the update is a sufficient (but not necessary) condition for almost sure convergence. In the tabular case, Q-learning forms a monotone numerical scheme for stepsizes $0 \leq \alpha_t \leq 1$, guaranteeing almost sure convergence under standard visitation and stepsize conditions (Yang, 30 May 2024).

With function approximation (linear features or neural networks), monotonicity is typically violated: even simple linear approximators can disrupt the required monotonicity unless step-sizes are impractically small or features are bounded. Consequently, almost sure convergence is generally not guaranteed for Q-learning with function approximation (the "deadly triad").

However, for linear Q-learning with an $\epsilon$-softmax behavior policy and adaptive temperature, $L^2$ boundedness and convergence to a bounded set are now established, both in expectation ($L^2$ rate) and almost surely (bounded iterates); pathwise convergence to a single point is not claimed and is not always achievable (Liu et al., 31 Jan 2025). Recent extensions of the Robbins-Siegmund theorem under square-summable conditions yield almost sure rates and high-probability concentration for such stochastic approximation processes (Liu et al., 30 Sep 2025).
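As an illustration of such a behavior policy, the sketch below samples actions from an $\epsilon$-softmax distribution over linearly parameterized Q-values $Q_\theta(s,a) = \theta^\top \phi(s,a)$. The feature map `phi`, the fixed `eps`, and the temperature `tau` are placeholders; the adaptive-temperature construction of (Liu et al., 31 Jan 2025) is not reproduced here.

```python
import numpy as np

def eps_softmax_action(theta, phi, s, n_actions, eps, tau, rng):
    """Epsilon-softmax behavior policy for linear Q-learning: with probability
    eps act uniformly at random, otherwise sample from a softmax over the
    linear Q-values Q_theta(s, a) = theta . phi(s, a)."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    q = np.array([theta @ phi(s, a) for a in range(n_actions)])
    z = (q - q.max()) / max(tau, 1e-8)        # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(n_actions, p=p))
```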

6. Extensions: Smooth and Regularized Q-learning

Smooth Q-learning variants (e.g., softmax, mellowmax, entropy regularization) are covered by the unified ODE-based stochastic approximation analysis. When the smoothing (softmax temperature, etc.) vanishes, almost sure convergence to the optimal $Q^*$ is guaranteed (Lee, 20 Apr 2024, Barber, 2023).
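For concreteness, a numerically stable mellowmax operator, one of the smoothing operators covered by such analyses, can be written as below; driving $\omega \to \infty$ recovers the hard max, which is the vanishing-smoothing regime required for convergence to $Q^*$.

```python
import numpy as np

def mellowmax(q, omega):
    """Mellowmax smoothing of a vector of action values:
    mm_omega(q) = (1/omega) * log( mean_i exp(omega * q_i) ).
    As omega -> infinity this approaches max(q); smaller omega smooths more."""
    q = np.asarray(q, dtype=float)
    m = q.max()                                # shift for a stable log-sum-exp
    return m + np.log(np.mean(np.exp(omega * (q - m)))) / omega
```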

In POMDPs, regularized agent-state-based Q-learning (RASQL), where Q-tables are indexed by agent states (possibly RNN states) and updated via a strongly convex regularizer, converges almost surely to a unique fixed point of a regularized Bellman operator on the agent-state space, under standard exploration and Robbins-Monro step-size conditions. However, the limiting policy is optimal only for the induced "artificial MDP" determined by the agent-state process, and may be suboptimal with respect to the true POMDP (Sinha et al., 29 Aug 2025).

7. Empirical, Distributional, and Practical Considerations

Experience replay, when applied to tabular Q-learning, preserves almost sure convergence provided all state-action pairs continue to be sampled infinitely often and stepsizes decay appropriately (Szlak et al., 2021). However, excessively high replay ratios can degrade convergence rate, while judicious replay is highly effective in environments with rare transitions.
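A minimal sketch of tabular Q-learning with uniform replay is given below, assuming the same hypothetical finite-MDP interface as in the earlier sketch; the replay ratio and buffer size are illustrative knobs, not the schedules studied in (Szlak et al., 2021).

```python
import numpy as np
from collections import deque

def replay_q_learning(env, gamma=0.99, steps=100_000, replay_ratio=4,
                      buffer_size=10_000, epsilon=0.1, seed=0):
    """Tabular Q-learning with uniform experience replay. Convergence is preserved
    as long as every (s, a) keeps being sampled infinitely often and the per-pair
    stepsizes 1 / N(s, a) keep diminishing."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.n_states, env.n_actions))
    N = np.zeros_like(Q)
    buffer = deque(maxlen=buffer_size)

    s = env.reset()
    for _ in range(steps):
        # epsilon-greedy behavior policy
        a = int(rng.integers(env.n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r, done = env.step(a)
        buffer.append((s, a, r, s_next, done))

        # replay a few stored transitions (sampled uniformly, with replacement)
        for j in rng.integers(len(buffer), size=min(replay_ratio, len(buffer))):
            s_i, a_i, r_i, s_n, d_i = buffer[j]
            N[s_i, a_i] += 1
            alpha = 1.0 / N[s_i, a_i]
            target = r_i + (0.0 if d_i else gamma * np.max(Q[s_n]))
            Q[s_i, a_i] += alpha * (target - Q[s_i, a_i])

        s = env.reset() if done else s_next
    return Q
```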

Distributional contractivity frameworks analyze the evolution of Q-learning as a Markov process over distributions of Q-functions. With constant stepsizes, only convergence in distribution (to a stationary law) is obtained; as the stepsizes vanish, this process recovers the classical pathwise almost sure convergence (Amortila et al., 2020).

References and Synthesis Table

| Algorithm/Setting | Convergence Type | Sufficient Conditions | Notes/Limitations | Key Sources |
|---|---|---|---|---|
| Standard tabular Q-learning | Almost sure (a.s.) | Sufficient exploration, RM stepsizes | Finite MDP, bounded rewards | (Barber, 2023, Regehr et al., 2021, Zhang, 5 Nov 2025) |
| Double/smoothed Q-learning | a.s. | As above (+ vanishing smoothing for smoothed variant) | Slowdown (double Q), smoothing must vanish | (Barber, 2023, Lee, 20 Apr 2024) |
| Stoch. shortest path games | a.s. | Unique DP solution, monotone/nonexpansive mappings | General two-player zero-sum, undiscounted | (Yu, 2014) |
| Average-reward Q-learning | a.s. (to set) | Adaptive stepsizes, ergodic Markov chain | Norm depends on setting; set-valued limit | (Wan et al., 29 Aug 2024, Chen, 25 Apr 2025) |
| Function approximation (linear) | a.s. bounded, $L^2$ to set | $\epsilon$-softmax policy, adaptive temperature | Converges to a set, not a point | (Liu et al., 31 Jan 2025, Liu et al., 30 Sep 2025) |
| Multi-agent (zero-sum) | a.s. to Nash eq. | Sufficient exploration, stepsize/discount bound | Heterogeneity affects allowable discount | (Sayin et al., 2021) |

The body of knowledge on almost sure convergence of Q-learning is now highly mature in the classical tabular setting, expanding with new quantitative tools for convergence rates, concentration, and extension to function approximation and structural modifications. Results in multi-agent, non-discounted, and average-reward regimes further reinforce the robustness and adaptability of stochastic approximation foundations in RL, with recent formalization efforts marking a new standard of mathematical rigor and machine-verifiable theory (Zhang, 5 Nov 2025).
