
RVI Q-Learning

Updated 9 December 2025
  • RVI Q-Learning is a model-free stochastic approximation method for average-reward MDPs and SMDPs, explicitly using a reference baseline to ensure convergence.
  • The algorithm employs asynchronous updates in both tabular and function-approximation settings, improving stability in weakly communicating and unichain environments.
  • Extensions to risk-aware, robust, and deep reinforcement learning demonstrate its practical efficiency and theoretical soundness in managing uncertainty.

Relative Value Iteration (RVI) Q-Learning is the canonical model-free stochastic approximation analogue of Schweitzer’s classical relative value iteration for average-reward Markov Decision Processes (MDPs) and semi-Markov decision processes (SMDPs). It is distinguished by its explicit use of a reference functional to “pin” the action-value functions, ensuring stability and almost-sure convergence in weakly communicating and unichain MDPs and SMDPs, including settings with risk sensitivity and robustness. RVI Q-Learning admits extensive theoretical guarantees and algorithmic extensions, notably asynchronous update schemes and compatibility with both tabular and function-approximation settings.

1. Foundations: Average-Reward Optimality Equation and Shift-Invariance

The average-reward problem in finite MDPs seeks a policy $\pi$ maximizing the long-run average reward

$$\beta^* = \max_{\pi} \liminf_{N\to\infty} \mathbb{E}\left[\frac{1}{N} \sum_{n=0}^{N-1} r(X_n, U_n)\right].$$

Unlike the discounted setting, the Bellman optimality equations for the average-reward criterion admit solution sets that are not unique even up to additive shifts, particularly in weakly communicating MDPs where multiple recurrent classes may exist under optimal policies. Given the optimal average reward $r_*$ and an optimal action-value function $q^*(s,a)$, the fundamental average-reward optimality equation is

$$q^*(s,a) = r(s,a) - r_* + \sum_{s'} p_{ss'}^a \max_{a'} q^*(s',a').$$

Because any additive shift of $q^*$ is again a solution unless constrained, RVI Q-Learning enforces a reference constraint (through a shift-invariant, homogeneous, and Lipschitz function $f$), thereby selecting a unique member (or a section) of the solution set (Wan et al., 29 Aug 2024, Wan et al., 2022).
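
To make the role of the reference functional concrete, the following is a minimal sketch of model-based relative value iteration (the classical, known-model counterpart) on a toy two-state MDP, with $f(Q)$ taken as the value at a fixed reference pair. The transition kernel, rewards, and damping factor below are illustrative choices, not taken from the cited papers.

```python
import numpy as np

# Toy 2-state, 2-action unichain MDP (illustrative numbers, not from the cited papers).
# P[s, a, s'] = transition probability, r[s, a] = expected one-step reward.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],
              [2.0, 0.5]])

Q = np.zeros((2, 2))
ref = (0, 0)                 # reference pair; f(Q) = Q[ref]
alpha = 0.5                  # damping factor for the relative update

for _ in range(2000):
    V = Q.max(axis=1)                          # max_{a'} Q(s', a')
    TQ = r + np.einsum('sap,p->sa', P, V)      # Bellman backup T(Q)
    Q = Q + alpha * (TQ - Q[ref] - Q)          # subtract f(Q) to pin the solution

print("estimated optimal gain r_* ≈", Q[ref])  # at the fixed point, f(Q) = r_*
print("greedy policy:", Q.argmax(axis=1))
```

At the fixed point, $Q$ satisfies the optimality equation above with $r_*$ equal to $Q$ at the reference pair, which is exactly the "pinning" effect the reference constraint provides.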

2. The RVI Q-Learning Algorithm and Its Update Rule

RVI Q-Learning iteratively adjusts Q-values by a temporal-difference error, with each update subtracting a running baseline $f(Q_n)$:

$$Q_{n+1}(s,a) = Q_n(s,a) + \alpha_{\nu_n(s,a)} \left[R_{n+1}^{s,a} - f(Q_n) + \max_{a'} Q_n(s',a') - Q_n(s,a)\right],$$

where $s'$ is the next state sampled from $(s,a)$ and $R_{n+1}^{s,a}$ the observed reward. All other entries stay unchanged in the asynchronous setting. The baseline $f(Q_n)$ is commonly chosen as the value at a fixed "reference" state-action pair (e.g., $Q(s_0,a_0)$), but more general forms (e.g., convex combinations, max, or mean) are permitted as long as they are strictly increasing under scalar translation (SISTr), Lipschitz, and homogeneous (Wan et al., 29 Aug 2024, Yu et al., 5 Sep 2024, Yu et al., 5 Dec 2025).
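
A minimal tabular sketch of this asynchronous update is given below, assuming an $\epsilon$-greedy behavior policy, a per-pair step size $\alpha_n \propto 1/n^\beta$, and $f(Q) = Q(s_0,a_0)$. The environment interface (`reset`, `step`) and these particular exploration and step-size choices are placeholders rather than prescriptions from the cited works.

```python
import numpy as np

def rvi_q_learning(env, num_steps, num_states, num_actions,
                   ref=(0, 0), epsilon=0.1, beta=0.8, seed=0):
    """Tabular RVI Q-learning sketch with f(Q) = Q[ref] as the reference baseline.

    `env` is assumed to expose reset() -> s and step(a) -> (s_next, reward);
    this interface and the epsilon-greedy exploration are illustrative choices.
    """
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))
    visits = np.zeros((num_states, num_actions), dtype=int)

    s = env.reset()
    for _ in range(num_steps):
        # epsilon-greedy behavior policy (any sufficiently exploratory policy works)
        if rng.random() < epsilon:
            a = int(rng.integers(num_actions))
        else:
            a = int(Q[s].argmax())

        s_next, reward = env.step(a)

        # per-pair step size: alpha = 1 / (visit count)^beta, beta in (0.5, 1]
        visits[s, a] += 1
        alpha = 1.0 / visits[s, a] ** beta

        # RVI Q-learning temporal-difference update with baseline f(Q) = Q[ref]
        td = reward - Q[ref] + Q[s_next].max() - Q[s, a]
        Q[s, a] += alpha * td          # asynchronous: only the visited (s, a) is updated

        s = s_next

    return Q, Q[ref]                   # Q[ref] estimates the optimal gain r_*
```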

The update admits tabular and function-approximation versions and requires only single-sample transition and reward observations under an arbitrary (sufficiently exploratory) behavior policy. In SMDPs, the RVI update normalizes by an estimated holding time:

$$Q_{n+1}(s,a) = Q_n(s,a) + \alpha_{\nu_n(s,a)} \left(\frac{R_{n+1}^{sa}+\max_{a'}Q_n(S_{n+1}^{sa},a')-Q_n(s,a)}{T_n(s,a)\vee \eta_n} - f(Q_n)\right).$$

Here $T_n(s,a)$, the running estimate of the holding time for $(s,a)$, is updated by stochastic gradient descent (Yu et al., 5 Dec 2025, Yu et al., 5 Sep 2024).
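
The sketch below shows one such SMDP step, assuming the sampled holding time $\tau$ is observed together with the reward; the learning rate for the holding-time estimate and the floor $\eta$ are illustrative choices.

```python
import numpy as np

def smdp_rvi_update(Q, T, s, a, reward, s_next, tau,
                    alpha, gamma_t, eta, ref=(0, 0)):
    """One asynchronous SMDP RVI Q-learning step (illustrative sketch).

    Q   : action-value table, shape (S, A)
    T   : running holding-time estimates, shape (S, A)
    tau : sampled holding time of the executed (s, a) transition
    eta : small positive floor, keeps the denominator bounded away from zero
    """
    # running estimate of the mean holding time T_n(s, a)
    T[s, a] += gamma_t * (tau - T[s, a])

    # TD error normalized by the (floored) holding-time estimate,
    # minus the reference baseline f(Q) = Q[ref]
    td = reward + Q[s_next].max() - Q[s, a]
    Q[s, a] += alpha * (td / max(T[s, a], eta) - Q[ref])

    return Q, T
```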

3. Convergence Analysis and Solution Set Structure

RVI Q-Learning is underpinned by the asynchronous stochastic approximation (SA) framework. Under appropriate step-size conditions ($\sum_n\alpha_n=\infty$, $\sum_n\alpha_n^2<\infty$), sufficiently frequent updates of each pair $(s,a)$, weakly communicating dynamics, and a reference function $f$ satisfying SISTr, the iterates converge almost surely to a compact, connected subset of the Bellman-equation solution set subject to $f(q)=r_*$ (Wan et al., 29 Aug 2024, Yu et al., 5 Dec 2025, Wan et al., 2022, Yu et al., 5 Sep 2024).

For general weakly communicating MDPs (multiple recurrent classes), the limiting set $Q_\infty$ is homeomorphic to an $(n^*-1)$-dimensional convex polyhedron, where $n^*$ is the minimal number of recurrent classes under optimal policies. Imposing $f(q)=r_*$ removes exactly one degree of freedom, yielding a well-characterized but not necessarily singleton solution set (Wan et al., 29 Aug 2024).

Advanced SA theory, including Borkar–Meyn stability criteria and ODE shadowing arguments, ensures that, under stricter step-size/asynchrony conditions, every path converges to a unique, sample-path-dependent point in this set (Yu et al., 5 Sep 2024, Yu et al., 5 Dec 2025).

4. Reference Choices, Asynchrony, and Extensions

The choice of $f$ is both a practical and a theoretical consideration. The mapping $f$ must be strictly increasing under scalar translation (SISTr) and Lipschitz to guarantee stability and avoid unbounded drift. Common choices include:

| Reference $f(Q)$ | Properties / Note |
| --- | --- |
| $Q(s_0,a_0)$ | Anchors a specific state-action pair |
| $\sum_{s,a}\eta_{s,a}Q(s,a)$ | Convex combination |
| $\max_{s,a} Q(s,a)$ | Upper anchoring |
| $\frac{1}{\lvert S\rvert\,\lvert A\rvert}\sum_{s,a} Q(s,a)$ | Global average |
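
All four choices satisfy $f(Q + c\mathbf{1}) = f(Q) + c$, which is the property underlying SISTr; the snippet below checks this numerically for each (the weight vector $\eta$ is an arbitrary probability vector chosen purely for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 3))                              # arbitrary value table, 4 states x 3 actions
eta = rng.dirichlet(np.ones(Q.size)).reshape(Q.shape)    # illustrative weights summing to 1

reference_fns = {
    "fixed pair Q(s0,a0)": lambda q: float(q[0, 0]),
    "convex combination":  lambda q: float((eta * q).sum()),
    "max over (s,a)":      lambda q: float(q.max()),
    "global average":      lambda q: float(q.mean()),
}

c = 1.7  # arbitrary scalar shift
for name, f in reference_fns.items():
    # each choice satisfies f(Q + c) = f(Q) + c, hence is strictly increasing
    # under scalar translation (SISTr) and 1-Lipschitz in the sup-norm
    assert np.isclose(f(Q + c), f(Q) + c), name
    print(f"{name:>22}: f(Q) = {f(Q):+.3f}, f(Q + {c}) = {f(Q + c):+.3f}")
```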

In SMDPs, careful adjustment for holding times is required: $Q$-updates are normalized by the expected mean holding time, itself estimated online (Yu et al., 5 Dec 2025, Yu et al., 5 Sep 2024).

RVI Q-Learning is robust to asynchrony: updates may be distributed arbitrarily across state-action pairs (as long as each is visited infinitely often), facilitating scalable and parallel implementations. Extensions exist for the options framework in hierarchical RL, with inter- and intra-option RVI Q-Learning updates also converging to Bellman-optimal solution submanifolds (Wan et al., 29 Aug 2024).

5. Risk-Aware and Robust RVI Q-Learning

RVI Q-Learning generalizes naturally to risk-aware and robust settings. For dynamic risk measures or robust control (under, e.g., contamination, total-variation, KL, or Wasserstein ambiguity sets), the core update is modified so that the Bellman operator incorporates worst-case or risk-adjusted next-value distributions. Nonetheless, the baseline subtraction $-f(Q_n)$ is retained, and the convergence arguments carry through to these settings under appropriate generalized operator properties and unbiasedness of inner estimators (Wang et al., 22 Mar 2025, Wang et al., 2023).

For example, robust RVI Q-Learning employs an inner "support function" $\sigma_{P^a_s}(V)$ capturing the worst-case expected next value, with unbiased multilevel Monte Carlo estimators used when exact computation is impractical (Wang et al., 2023, Wang et al., 22 Mar 2025).
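
As one concrete instance, the $R$-contamination ambiguity set admits a closed-form support function, $\sigma_{P^a_s}(V) = (1-R)\,\mathbb{E}_{P^a_s}[V] + R \min_{s'} V(s')$; the sketch below uses it in a synchronous robust backup with the baseline subtraction retained. The nominal kernel `P_hat` and radius are placeholder inputs, and other ambiguity sets (TV, KL, Wasserstein) require their own estimators, such as the multilevel Monte Carlo constructions cited above.

```python
import numpy as np

def contamination_support(V, p_nominal, radius):
    """Worst-case expected next value over the R-contamination set
    {(1 - R) * p_nominal + R * q : q any distribution}.
    Closed form: (1 - R) * E_p[V] + R * min_s V(s)."""
    return (1.0 - radius) * (p_nominal @ V) + radius * V.min()

def robust_rvi_target(Q, r, P_hat, radius, ref=(0, 0)):
    """Synchronous robust RVI backup (sketch). P_hat is a nominal/estimated
    kernel of shape (S, A, S); the radius and kernel are illustrative inputs."""
    S, A = r.shape
    V = Q.max(axis=1)
    target = np.empty_like(Q)
    for s in range(S):
        for a in range(A):
            sigma = contamination_support(V, P_hat[s, a], radius)
            target[s, a] = r[s, a] + sigma - Q[ref]   # baseline subtraction retained
    return target
```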

6. Function Approximation and Empirical Results

RVI Q-Learning can be extended to function approximation, both linear (e.g., $Q_w(s,a) = w^\top \phi(s,a)$) and nonlinear (e.g., deep Q-networks). In linear settings, RVI-style batch updates have been combined with Bayesian least-squares and posterior sampling, yielding improved exploration efficiency over $\epsilon$-greedy (as observed in the "Angrier Birds" experiments) (Ibarra et al., 2016).
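
As a simple illustration of the linear case, the following sketch performs one semi-gradient RVI-style update with $Q_w(s,a) = w^\top\phi(s,a)$ and $f(Q_w) = Q_w(s_0,a_0)$; this generic variant is for exposition only and is not the Bayesian least-squares/posterior-sampling scheme used in the cited experiments.

```python
import numpy as np

def linear_rvi_step(w, phi, s, a, reward, s_next, actions,
                    alpha, ref=(0, 0)):
    """One semi-gradient RVI-style step with Q_w(s, a) = w @ phi(s, a).

    phi(s, a) -> feature vector; `actions` is the finite action set.
    This linear variant is illustrative; the cited work instead uses
    Bayesian least-squares updates with posterior-sampling exploration.
    """
    q_sa   = w @ phi(s, a)
    q_next = max(w @ phi(s_next, b) for b in actions)
    f_w    = w @ phi(*ref)                       # baseline f(Q_w) = Q_w(s0, a0)

    td = reward - f_w + q_next - q_sa            # RVI temporal-difference error
    return w + alpha * td * phi(s, a)            # semi-gradient parameter update
```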

In nonlinear, deep RL settings, architectures such as Full-Gradient DQN minimize Bellman residuals with running gain subtraction, increasing stability and convergence speed compared to discounted or semi-gradient DQN variants (Pagare et al., 2023).

Empirical evaluations demonstrate that RVI Q-Learning achieves faster convergence and better long-run average reward performance than alternatives when sufficient exploration and proper reference selection are enforced. In risk-aware and robust settings, model-free RVI Q-Learning delivers convergence to risk-sensitive or robustly optimal policies, with manageable per-iteration complexity (Pagare et al., 2023, Wang et al., 2023, Wang et al., 22 Mar 2025, Ibarra et al., 2016).

7. Practical Implementation and Extensions

Key practical guidelines include:

  • Use of step sizes $\alpha_n$ that decrease, but not too rapidly, e.g., $\alpha_n = 1/(n+1)^\beta$ with $\beta \in (0.5,1]$ (see the step-size sketch after this list).
  • Asynchronous update patterns in which every $(s,a)$ is updated with positive limiting frequency.
  • Computation or online estimation of holding times $T_n(s,a)$ in SMDPs.
  • Rigorous choice of $f$ with verified SISTr and Lipschitz properties, ensuring the baseline subtraction self-regulates the value scale.
  • Plugging in unbiased risk or robustness estimators when handling nonlinear Bellman operators.
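
For the step-size guideline in the first bullet, a small helper (illustrative) makes the admissible range explicit:

```python
# Step-size schedule alpha_n = 1/(n+1)**beta (first bullet above).
# For beta in (0.5, 1]: sum_n alpha_n diverges (since beta <= 1) while
# sum_n alpha_n**2 converges (since 2*beta > 1), matching the step-size
# conditions used in the convergence analysis of Section 3.
def step_size(n, beta=0.8):
    assert 0.5 < beta <= 1.0, "beta must lie in (0.5, 1]"
    return 1.0 / (n + 1) ** beta
```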

Ongoing research directions include advanced distributed asynchronous variants, function approximation in large or continuous spaces, relaxing shift-invariance properties, analyzing convergence in more general nonconvex solution sets, and empirical benchmarks in demanding RL domains (Yu et al., 5 Dec 2025, Yu et al., 5 Sep 2024, Wan et al., 29 Aug 2024, Wan et al., 2022).


References:

(Ibarra et al., 2016, Wan et al., 2022, Pagare et al., 2023, Wang et al., 2023, Wan et al., 29 Aug 2024, Yu et al., 5 Sep 2024, Wang et al., 22 Mar 2025, Yu et al., 5 Dec 2025)
