Average-Reward Semi-Markov Decision Processes

Updated 9 December 2025
  • Average-reward SMDPs are defined by variable sojourn times, finite state/action spaces, and the objective of maximizing the long-run average reward, with optimality characterized by a Bellman-type equation.
  • Algorithms like Optimal Nudging and asynchronous RVI Q-learning tackle the fractional optimization structure with proven convergence under weakly communicating or unichain assumptions.
  • Hierarchical methods and options-based approaches extend SMDPs to address temporal abstraction in reinforcement learning, enhancing scalability and practical performance.

Average-reward semi-Markov decision processes (SMDPs) generalize Markov decision processes by incorporating variable, possibly unbounded sojourn times between transitions and by modeling the agent's objective as the maximization of the long-run average reward per unit time. This formulation is suited to control and reinforcement learning problems that are not naturally discretized into fixed time intervals, particularly in settings with temporal abstraction, options, or asynchronous events. The study and algorithmic solution of average-reward SMDPs have evolved through analysis of fractional programming, stochastic approximation, and options theory, culminating in a suite of provably convergent, scalable learning methods supported by rigorous convergence results.

1. Formal Definition and Structural Properties

An average-reward SMDP is defined by a finite state set $S$, a finite action set $A$, and, for each pair $(s,a)$, a probability distribution $P_{sa}$ over the next state $S'$, the holding (sojourn) time $\tau \ge 0$, and the reward $R$. The expected immediate reward $r_{sa} = E_{sa}[R]$, expected holding time $t_{sa} = E_{sa}[\tau]$, and transition probabilities $p_{ss'}^a$ fully characterize the process (Yu et al., 5 Dec 2025, Yu et al., 5 Sep 2024, Wan et al., 29 Aug 2024).

  • A (possibly stochastic) policy $\pi$ specifies action choices, conditioned either on the entire history or, for stationary (Markov) policies, only on the current state.
  • Under $\pi$, the infinite-horizon performance is measured by the average reward rate:

$$r(\pi, s) = \liminf_{t \to \infty} \frac{E^{\pi}_s\left[ \sum_{n=1}^{N_t} R_n \right]}{t}$$

where $N_t$ is the maximal number of transitions by time $t$.

Standard weakly communicating or unichain assumptions guarantee that the optimal rate $\rho^* = \sup_\pi r(\pi,s)$ is independent of $s$ and that an optimal stationary policy exists (Yu et al., 5 Dec 2025). Long-run optimality is characterized by the average-reward optimality equation (AOE), expressing a Bellman-type fixed point for the bias/differential value (in state or action form):

  • State value:

$$\rho^* + h(i) = \max_{a \in A(i)} \sum_{j} p_{ij}(a)\left[\, r_{ij}(a) - \rho^*\, t_{ij}(a) + h(j) \,\right]$$

  • Action value:

$$q(i,a) = r_{ia} - t_{ia}\,\rho^* + \sum_{j} p_{ij}(a) \max_{b} q(j, b)$$

The solution set of the AOE is unique up to constant shifts in the bias for unichain models, but in weakly communicating SMDPs it comprises a (generally nonconvex) compact, connected set, homeomorphic to a polyhedron of dimension $n^*-1$, where $n^*$ is the minimal number of recurrent classes among optimal stationary policies (Wan et al., 29 Aug 2024).
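
To ground the notation, the following sketch (Python; the names `p`, `r`, `t` and the toy numbers are illustrative assumptions, not taken from the cited papers) stores a tabular SMDP via its transition, expected-reward, and expected-holding-time arrays and evaluates the residual of the action-value optimality equation for a candidate pair $(q, \rho)$.

```python
import numpy as np

def aoe_residual(p, r, t, q, rho):
    """Max-norm residual of the action-value optimality equation
    q(s,a) = r_sa - rho * t_sa + sum_j p(j|s,a) * max_b q(j,b).
    p: (S, A, S) transition probabilities, r: (S, A) expected rewards,
    t: (S, A) expected holding times, q: (S, A) candidate action values."""
    backup = r - rho * t + p @ q.max(axis=1)   # one Bellman backup, shape (S, A)
    return np.max(np.abs(backup - q))

# Toy two-state, two-action SMDP (illustrative numbers only).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 2.0], [0.5, 3.0]])
t = np.array([[1.0, 4.0], [2.0, 5.0]])

# A pair (q, rho) solves the AOE exactly when the residual is zero.
print(aoe_residual(p, r, t, q=np.zeros((2, 2)), rho=0.6))
```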

2. Algorithms for Average-Reward SMDPs

2.1 Optimal Nudging

Optimal Nudging addresses the fractional structure of the objective by iteratively reducing the gain-maximization problem $\arg\max_\pi v^\pi / c^\pi$ to a sequence of cumulative-reward tasks of the form $\arg\max_\pi [v^\pi - \rho_i c^\pi]$, where $v^\pi$ and $c^\pi$ are the total expected reward and cost, and $\rho_i$ is the gain parameter (Muriel et al., 2015).

  • At each iteration, the algorithm fixes $\rho$, translates the problem into a cumulative-reward MDP (reward $r - \rho k$), and applies any reinforcement learning or dynamic programming "black box."
  • The gain $\rho$ is updated via a minimax geometric rule based on the $(w, l)$ mapping of policies, shrinking the interval containing $\rho^*$.
  • Termination is triggered by a sign change in the total return, indicating that $\rho^*$ has been bracketed to the desired precision.
  • The number of cumulative-reward calls is $O(\log(D/\epsilon))$ for final precision $\epsilon$, with practical performance improved by early termination and value transfer across iterations.

This approach requires no gain step-size tuning and is agnostic to the cumulative-reward solver used.
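
A minimal sketch of this reduction, assuming a hypothetical black-box `solve_cumulative(rho)` that returns the optimal total return $\max_\pi [v^\pi - \rho\, c^\pi]$ of the nudged task; the interval update shown is plain bisection on the sign of that return, a simplification of the paper's minimax geometric nudging rule.

```python
def optimal_nudging(solve_cumulative, rho_lo, rho_hi, eps=1e-6):
    """Reduce gain maximization max_pi v^pi / c^pi to a sequence of
    cumulative-reward tasks g(rho) = max_pi [v^pi - rho * c^pi].
    g is nonincreasing in rho with g(rho*) = 0, so the sign of the nudged
    return tells us on which side of rho* the current guess lies.
    NOTE: simplified bisection in place of the paper's (w, l)-based minimax
    geometric update; solve_cumulative is a hypothetical black-box RL or
    dynamic-programming solver."""
    while rho_hi - rho_lo > eps:
        rho = 0.5 * (rho_lo + rho_hi)
        if solve_cumulative(rho) > 0.0:   # nudged return positive: rho < rho*
            rho_lo = rho
        else:                             # sign change: rho* is bracketed above
            rho_hi = rho
    return 0.5 * (rho_lo + rho_hi)        # approximation of rho*
```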

2.2 Relative Value Iteration Q-Learning and Stochastic Approximation

Model-free reinforcement learning in average-reward SMDPs is dominated by asynchronous stochastic approximation (SA) variants of relative value iteration (RVI) Q-learning (Yu et al., 5 Dec 2025, Wan et al., 29 Aug 2024, Yu et al., 5 Sep 2024). These updates generalize the Abounadi–Bertsekas–Borkar algorithm for MDPs to the SMDP and options settings, with established almost-sure convergence guarantees.

  • The central recursion maintains $Q_n(s,a)$ and $T_n(s,a)$ (estimates of the mean holding times), updated from empirical samples $(S_{n+1}, \tau_{n+1}, R_{n+1})$:

$$Q_{n+1}(s,a) = Q_n(s,a) + \alpha_{\nu(n,s,a)}\left[ \frac{R_{n+1} + \max_{a'} Q_n(S_{n+1}, a') - Q_n(s,a)}{T_n(s,a) \vee \eta_n} - f(Q_n) \right]$$

$$T_{n+1}(s,a) = T_n(s,a) + \beta_{\nu(n,s,a)}\left(\tau_{n+1} - T_n(s,a)\right)$$

The function $f(\cdot)$ normalizes $Q$ and estimates $\rho^*$. It must be Lipschitz and strictly increasing under scalar translation (SISTr); both affine choices and nonlinear functionals are permitted (Yu et al., 5 Dec 2025, Yu et al., 5 Sep 2024). A schematic code sketch of this recursion follows the list below.

  • Convergence holds under standard assumptions: diminishing step sizes ($\sum_k \alpha_k = \infty$, $\sum_k \alpha_k^2 < \infty$), partial asynchrony (every pair $(s,a)$ is updated infinitely often), and bounded second moments of rewards and holding times.
  • The limit set of $Q_n$ is a compact, connected subset of the solution set of the AOE with $f(Q) = \rho^*$; all greedy policies are optimal. Under additional step-size decay and bias-control conditions, almost-sure convergence to a unique equilibrium holds (Yu et al., 5 Dec 2025).
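
The recursion above translates into a short tabular update. The sketch below is schematic: it fixes $f(Q)$ as the value at a reference state–action pair (one admissible SISTr choice), uses simple $1/n$ step sizes, and treats $\eta$ as a small constant floor on the holding-time estimate; none of these specific choices are prescribed by the cited papers.

```python
import numpy as np

def rvi_q_smdp_step(Q, T, counts, s, a, s_next, reward, tau,
                    eta=1e-3, ref=(0, 0)):
    """One asynchronous RVI Q-learning update for an average-reward SMDP.
    Q[s, a]: action-value estimate;  T[s, a]: mean holding-time estimate;
    counts[s, a]: visit counter driving the 1/n step sizes (assumption).
    f(Q) = Q[ref] serves as the running estimate of rho*.
    All arrays are updated in place."""
    counts[s, a] += 1
    alpha = 1.0 / counts[s, a]
    beta = 1.0 / counts[s, a]
    f_q = Q[ref]
    td = (reward + Q[s_next].max() - Q[s, a]) / max(T[s, a], eta) - f_q
    Q[s, a] += alpha * td
    T[s, a] += beta * (tau - T[s, a])
```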

2.3 Hierarchical and Options-Based Methods

Hierarchical RL via options induces an SMDP at the option level. Both inter-option and intra-option differential Q-learning methods, including off-policy variants, are supported by stochastic approximation convergence results (Wan et al., 2021, Wan et al., 29 Aug 2024).

  • Inter-option learning updates $Q(s, \omega)$ and $L(s, \omega)$ (with $\omega$ an option), based on sampled option trajectories.
  • Intra-option learning enables credit assignment and model estimation at the level of primitive actions within options, leveraging off-policy data and eligibility trace structures.
  • Option-interrupting behavior, where options are terminated early based on greedy re-evaluation, improves reward rate and is theoretically justified.

Model-based variants use sample-based planning (Dyna-style) and recursive TD updates for model estimation, converging to the fixed points of their respective Bellman equations.
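
For concreteness, here is a schematic inter-option update in the differential Q-learning style; the exact update rules in (Wan et al., 2021) may differ, and `alpha` and `eta` are hypothetical step-size parameters.

```python
def inter_option_step(Q, rho_bar, s, omega, s_next, R, tau,
                      alpha=0.1, eta=0.1):
    """Schematic inter-option update in the differential Q-learning style,
    applied when option omega, started in state s, terminates in s_next
    after duration tau, having accumulated total reward R along the way.
    rho_bar is the running estimate of the average reward rate; it is
    driven by the same TD error through one extra gain step-size eta."""
    delta = R - rho_bar * tau + Q[s_next].max() - Q[s, omega]
    Q[s, omega] += alpha * delta        # option-value update (in place)
    rho_bar += eta * alpha * delta      # reward-rate update
    return rho_bar
```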

3. Theoretical Convergence Analysis

The convergence of average-reward SMDP learning algorithms relies on coupling stochastic approximation with dynamical systems (ODE) analysis (Yu et al., 5 Dec 2025, Wan et al., 29 Aug 2024, Yu et al., 5 Sep 2024):

  • Asynchronous Borkar–Meyn stability guarantees establish boundedness of the $Q_n$ sequences via scaling-limit ODEs with a globally attracting origin.
  • The main mean-field ODE has the form $\dot x(t) = h(x(t)) = T(x(t)) - x(t) - f(x(t))$, with $T$ the Bellman operator (a toy numerical integration of this ODE is sketched after the list).
  • Uniqueness and translation invariance of the solution set are enforced by the SISTr property for ff.
  • For weakly communicating SMDPs, the limiting set is a connected, compact polyhedron, with each element’s greedy policy being optimal. For unichain models, the limit (modulo constants) is unique.
  • The Benaïm–Hirsch shadowing arguments ensure that stochastic iterates track the ODE solution set, giving almost-sure convergence in the presence of noise.
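
As a conceptual aid only, the sketch below Euler-integrates the mean-field ODE $\dot x = T(x) - x - f(x)$ for a user-supplied operator `T_op` and normalization `f`; the function names and step parameters are assumptions, not taken from the cited analyses.

```python
import numpy as np

def integrate_mean_field_ode(T_op, f, x0, dt=0.01, steps=20_000):
    """Forward-Euler integration of dx/dt = T(x) - x - f(x), where T_op is a
    Bellman-type operator on Q-tables and f is a scalar SISTr normalization
    (the scalar f(x) is broadcast against the table x). Under the conditions
    discussed above, trajectories approach the AOE solution set and f(x)
    approaches the optimal reward rate rho*."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x += dt * (T_op(x) - x - f(x))
    return x, f(x)
```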

4. Implications for Practical Reinforcement Learning

Empirical and theoretical findings highlight several features relevant to large-scale or hierarchical RL:

| Algorithm | Sample/Iteration Complexity | Gain Step-Size Tuning | Structure Supported |
| --- | --- | --- | --- |
| Optimal Nudging (Muriel et al., 2015) | $O(\log D/\epsilon)$ MDP calls | No | Complete SMDP |
| RVI Q-Learning (Yu et al., 5 Dec 2025, Wan et al., 29 Aug 2024) | $O(1)$ per update; scalable | No | SMDP, hierarchical |
| Differential Q (Wan et al., 2021) | $O(1)$ per update; asynchronous | One gain step-size | SMDP, options |
  • These methods avoid gain step-size tuning in their most robust forms.
  • Asynchronous, online updates and local learning allow scalability to large state-action spaces, as updates are made for visited state–action (or state–option) tuples only.
  • Degrees of freedom in the value solution for weakly communicating SMDPs do not affect policy optimality, as all greedy policies with respect to any limit point are optimal.

5. Extensions: Hierarchical SMDPs and Option Learning

The extension of SMDP-based average-reward learning to include temporally extended actions (“options”) enables multi-scale temporal abstraction (Wan et al., 2021, Wan et al., 29 Aug 2024):

  • The options framework is formalized as an SMDP with an augmented action set $\Omega$ of options, each equipped with an intra-option policy and a termination rule.
  • The overall RL problem is specified via option-level reward and duration kernels, supporting both inter-option and intra-option updates.
  • Option-interrupting policies, which terminate ongoing options when an alternative’s expected value is better, provably increase or maintain the average reward rate.
  • Sample-based planning algorithms iteratively simulate option-level or action-level transitions to update value and model estimates, enabling Dyna-style acceleration of learning.
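
A minimal sketch of the interruption rule described above, with hypothetical interfaces: an ongoing option is cut short whenever greedy re-evaluation of the current option values prefers some alternative.

```python
def should_interrupt(Q, s, omega):
    """Return True if the ongoing option omega should be interrupted in
    state s, i.e. if some other option currently looks strictly better
    under the option-value estimates Q (greedy re-evaluation)."""
    return Q[s].max() > Q[s, omega]
```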

6. Methodological Foundations and Recent Advancements

The rigorous convergence of reinforcement learning in average-reward SMDPs has been established only recently via technical advances in asynchronous stochastic approximation, most notably the generalization of Borkar–Meyn stability analysis and the introduction of monotonicity (SISTr) conditions for the optimal reward estimator in the learning updates (Yu et al., 5 Dec 2025, Yu et al., 5 Sep 2024):

  • The convergence of asynchronous SA algorithms is ensured even under relaxed conditions for update schedules and noise.
  • The key to stability and non-degeneracy of the limit set is the SISTr property, which enables translation uniqueness in the bias value function and supports both affine and nonlinear normalizations.
  • The structure of the solution set is now characterized: for weakly communicating models, the set of Q-functions to which the algorithms converge is a compact, connected polyhedron with dimension matching the minimal number of recurrent classes under optimal policies minus one (Wan et al., 29 Aug 2024).
  • These findings subsume classical approaches and extend theoretical support to hierarchical RL, temporal abstraction, and planning with average reward (Wan et al., 2021).
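
To illustrate the SISTr condition (Lipschitz, and strictly increasing along scalar translations of the argument), here are two example normalizations of the kind the condition admits, with a numerical check of the translation behavior; the specific functionals below are chosen for illustration and are not claimed to be the ones used in the cited papers.

```python
import numpy as np

def f_affine(Q):
    """Affine normalization: f(Q + c) = f(Q) + c for any scalar c."""
    return float(np.mean(Q))

def f_softmean(Q):
    """Nonlinear (log-mean-exp) normalization; it also satisfies
    f(Q + c) = f(Q) + c and is 1-Lipschitz in the max norm."""
    return float(np.log(np.mean(np.exp(Q))))

Q = np.array([[0.3, -1.2], [0.7, 0.1]])
for c in (0.0, 0.5, 1.0):
    # Both values increase strictly in the shift c.
    print(round(f_affine(Q + c), 4), round(f_softmean(Q + c), 4))
```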

7. Summary and Outlook

Research over the past decade has produced a precise formulation and mature algorithmic toolkit for average-reward SMDPs, incorporating both model-based and model-free, synchronous and asynchronous, and hierarchically structured methods. The convergence and practical utility of these methods are supported by detailed ODE-based analysis, covering a broad model class beyond traditional unichain MDPs. Remaining directions include sharper, problem-dependent sample complexity bounds, tighter characterization of convergence rates, and the extension of these techniques to continuous state/action or partially observed semi-Markov settings.

Key references: (Muriel et al., 2015, Yu et al., 5 Dec 2025, Wan et al., 29 Aug 2024, Wan et al., 2021, Yu et al., 5 Sep 2024).
