Average-Reward/Cost Criterion in Markov Processes

Updated 5 July 2025
  • The average-reward/cost criterion is a framework for evaluating steady-state per-step rewards or costs in infinite-horizon Markov and partially observable decision processes.
  • It employs methods like relative value iteration and policy gradient to solve for optimal policies when discounting is inappropriate, enabling robust and scalable solutions.
  • Key applications span operations research, robotics, and control systems where continuous performance evaluation and risk management are critical, with solid theoretical guarantees on convergence and complexity.

The average-reward (or average-cost) criterion is a central concept in the theory and practice of Markov decision processes (MDPs), partially observable MDPs (POMDPs), reinforcement learning (RL), and control. It provides a framework for assessing the long-term performance of policies in infinite-horizon, continuing environments by focusing on steady-state or per-stage cumulative outcomes rather than temporal discounting or episodic returns. The criterion finds broad application in operations research, stochastic control, robotics, and RL, where long-term average behavior is often the natural evaluation metric.

1. Fundamental Definition and Mathematical Formulation

The average-reward (cost) criterion evaluates the expected long-run per-step reward (or cost) under a given policy. For a Markov (or partially observable) process with state $x_t$, action $u_t$, (possibly random) reward $r(x_t, u_t)$ (or cost $c(x_t, u_t)$), and a stationary policy $\pi$, the average reward is defined as

$$\rho^{\pi} = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_\pi \left[ \sum_{t=0}^{T-1} r(x_t, u_t) \right].$$

If the environment is ergodic or unichain, this limit is independent of the starting state and converges to a steady-state rate. The analogous average-cost formulation replaces $r(x_t, u_t)$ with $c(x_t, u_t)$ and interprets the objective as minimization.
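
As a concrete illustration, the following sketch (a hypothetical three-state ergodic MDP and fixed policy, numpy only; not taken from any cited paper) estimates $\rho^\pi$ by averaging rewards along a single long trajectory:

```python
import numpy as np

# Hypothetical ergodic MDP: 3 states, 2 actions.
# P[a, x, x'] is the transition probability, R[x, a] the expected reward.
rng = np.random.default_rng(0)
P = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5]],   # action 0
    [[0.1, 0.6, 0.3], [0.4, 0.4, 0.2], [0.3, 0.3, 0.4]],   # action 1
])
R = np.array([[1.0, 0.5], [0.0, 2.0], [0.5, 1.5]])

# A fixed stationary policy pi[x, a] = probability of action a in state x.
pi = np.array([[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]])

def estimate_average_reward(T=200_000, x0=0):
    """Monte Carlo estimate of rho^pi = lim_T (1/T) E[sum_t r(x_t, u_t)]."""
    x, total = x0, 0.0
    for _ in range(T):
        u = rng.choice(2, p=pi[x])
        total += R[x, u]
        x = rng.choice(3, p=P[u, x])
    return total / T

print("estimated rho^pi ~", estimate_average_reward())
```

Because the chain is ergodic, the estimate is (up to sampling error) the same for any starting state `x0`.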

The average-reward Bellman optimality equation expresses the value function up to an additive constant:

$$v_b^*(x) + v_g^* = \max_{u} \Big[ r(x, u) + \sum_{x'} p(x' \mid x, u)\, v_b^*(x') \Big],$$

where $v_g^*$ is the optimal long-run average reward (the gain) and $v_b^*(x)$ is the relative (bias) function capturing transient effects (2010.08920).
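
For a tabular model, this optimality equation can be solved by model-based relative value iteration (anticipating Section 3.1). The sketch below uses a hypothetical two-state, two-action unichain MDP; it is an illustrative sketch, not an implementation from the cited work:

```python
import numpy as np

# Hypothetical 2-state, 2-action unichain MDP: P[a, x, x'], R[x, a].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.array([[1.0, 0.0], [2.0, 0.5]])

def relative_value_iteration(P, R, ref_state=0, tol=1e-10, max_iter=10_000):
    """Iterate v(x) <- max_u [ r(x,u) + sum_x' p(x'|x,u) v(x') ] - v(ref_state),
    so that the subtracted reference value converges to the gain v_g^*."""
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    for _ in range(max_iter):
        # Q[x, u] = r(x, u) + sum_x' p(x'|x, u) v(x')
        Q = R + np.einsum('axy,y->xa', P, v)
        v_new = Q.max(axis=1)
        gain = v_new[ref_state]          # current estimate of v_g^*
        v_new -= gain                    # keep the bias estimate bounded
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    return gain, v_new, Q.argmax(axis=1)

gain, bias, greedy_policy = relative_value_iteration(P, R)
print("optimal gain:", gain, "greedy policy:", greedy_policy)
```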

The same principle applies to POMDPs and risk-averse formulations and, with minor modifications, to constrained and multi-agent (game-theoretic) settings.

2. Theoretical Properties, Optimality, and Comparison with Discounted Criteria

The average-reward criterion contrasts with the discounted-reward criterion, which computes

$$v_\gamma^\pi(x) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t r(x_t, u_t) \,\middle|\, x_0 = x \right],$$

with a discount factor $\gamma \in [0,1)$. While discounting guarantees contraction and boundedness, it artificially prioritizes immediate rewards and complicates the policy evaluation in environments lacking an inherent time preference.

The gain (average reward) formulation is particularly natural in environments where every time step is equally important, and policy performance is evaluated by sustained, steady-state metrics (2107.01348). Important theoretical relationships include:

  • The scaled-limit connection (a numerical check follows this list):

$$v_g^\pi(x) = \lim_{\gamma \to 1}\, (1-\gamma)\, v_\gamma^\pi(x)$$

  • The average-reward Bellman operator does not enjoy contraction except under additional assumptions (e.g., unichain property, communicating MDPs), complicating both analysis and algorithm design.
  • As shown in the Laurent expansion of the discounted value function, the average reward (gain) constitutes the leading term as $\gamma \to 1$ (2107.01348).
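
The scaled-limit relation is easy to verify numerically. The sketch below uses a hypothetical three-state Markov reward process induced by a fixed policy (transition matrix and reward vector chosen arbitrarily for illustration):

```python
import numpy as np

# Hypothetical policy-induced chain: P_pi[x, x'] transitions, r_pi[x] rewards.
P_pi = np.array([[0.5, 0.4, 0.1],
                 [0.3, 0.3, 0.4],
                 [0.2, 0.5, 0.3]])
r_pi = np.array([1.0, 0.0, 2.0])

# Gain via the stationary distribution d (left eigenvector of P_pi for eigenvalue 1).
w, V = np.linalg.eig(P_pi.T)
d = np.real(V[:, np.argmin(np.abs(w - 1.0))])
d /= d.sum()
rho = d @ r_pi

# (1 - gamma) * v_gamma(x) approaches rho for every state x as gamma -> 1.
for gamma in [0.9, 0.99, 0.999, 0.9999]:
    v_gamma = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
    print(f"gamma={gamma}: (1-gamma)*v_gamma = {(1 - gamma) * v_gamma}, rho = {rho:.4f}")
```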

Key optimality equations, such as the Average Cost Optimality Equation (ACOE) and its relaxed form, the Average Cost Optimality Inequality (ACOI), generalize to Borel-space processes with minimal regularity via majorization conditions (1901.03374):

$$\rho^* + h(x) \geq \inf_{a \in A(x)} \Big\{ c(x,a) + \int h(y)\, q(dy \mid x,a) \Big\}$$

3. Algorithms and Approximate Solution Methods

Several classes of algorithms are designed or adapted for the average-reward/cost setting:

3.1 Value Iteration and Relative Value Iteration (RVI)

Relative Value Iteration stabilizes unbounded value-function growth by subtracting a reference state or state-action value at each step. In reinforcement learning, this leads to RVI Q-learning:

$$\widehat{q}_b(s,a) \leftarrow \widehat{q}_b(s,a) + \beta \left[ r(s,a) + \max_{a'} \widehat{q}_b(s',a') - \widehat{v}_g - \widehat{q}_b(s,a) \right],$$

where $\widehat{v}_g$ is an estimate of the long-run gain (2010.08920).
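
A minimal tabular sketch of this update (hypothetical two-state MDP used only to generate transitions; the gain estimate $\widehat{v}_g$ is taken as the value of a fixed reference state-action pair, one common choice):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-state, 2-action MDP used only as a transition/reward sampler.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.array([[1.0, 0.0], [2.0, 0.5]])

def step(x, u):
    """Sample the next state and reward from the hypothetical model."""
    return rng.choice(2, p=P[u, x]), R[x, u]

def rvi_q_learning(n_steps=200_000, beta=0.01, eps=0.1, ref=(0, 0)):
    """Tabular RVI Q-learning with a fixed reference state-action pair."""
    q = np.zeros((2, 2))
    x = 0
    for _ in range(n_steps):
        u = rng.integers(2) if rng.random() < eps else int(q[x].argmax())
        x_next, r = step(x, u)
        v_g = q[ref]                               # running gain estimate
        q[x, u] += beta * (r + q[x_next].max() - v_g - q[x, u])
        x = x_next
    return q, q[ref]

q, gain_estimate = rvi_q_learning()
print("q values:\n", q, "\nestimated gain:", gain_estimate)
```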

Stochastic approximation versions, including asynchronous and model-free updates, are well established, with convergence guarantees extended to weakly communicating MDPs and the semi-Markov setting (2408.16262, 2409.03915). The convergence set is characterized as compact, connected, and, in weakly communicating MDPs, of degree $n^* - 1$, with $n^*$ the number of recurrent classes.

3.2 Policy Iteration and Policy Gradient

Policy iteration alternates between evaluation (solving a Poisson equation for the differential value) and improvement (greedy with respect to average-reward Q). Policy gradient formulations exploit the connection

$$\nabla_\theta\, \rho(\theta) = \sum_{x,a} d^\pi(x,a)\, \nabla_\theta \log \pi(a \mid x; \theta)\, Q(x,a;\theta)$$

with $d^\pi(x,a)$ the steady-state state-action distribution (2010.08920, 2305.12239, 2403.05738).
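
The sketch below instantiates this gradient exactly for a hypothetical two-state tabular MDP with a softmax policy: it computes $d^\pi$, the gain, and the differential action values $Q$ from the model (solving the associated Poisson equation via a fundamental-matrix system), then performs plain gradient ascent on $\rho(\theta)$. It is an illustrative sketch, not code from the cited papers:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; tabular softmax policy with parameters theta[x, a].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.array([[1.0, 0.0], [2.0, 0.5]])
S, A = 2, 2

def policy(theta):
    """Softmax policy pi(a|x; theta)."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def exact_gradient(theta):
    """Return rho(theta) and grad rho = sum_{x,a} d(x,a) grad log pi(a|x) Q(x,a)."""
    pi = policy(theta)
    P_pi = np.einsum('xa,axy->xy', pi, P)        # policy-induced transition matrix
    r_pi = (pi * R).sum(axis=1)
    # Stationary state distribution d: left eigenvector of P_pi for eigenvalue 1.
    w, V = np.linalg.eig(P_pi.T)
    d = np.real(V[:, np.argmin(np.abs(w - 1.0))]); d /= d.sum()
    rho = d @ r_pi
    # Differential (bias) values: solve (I - P_pi + 1 d^T) h = r_pi - rho.
    h = np.linalg.solve(np.eye(S) - P_pi + np.outer(np.ones(S), d), r_pi - rho)
    Q = R - rho + np.einsum('axy,y->xa', P, h)   # differential action values
    grad = np.zeros_like(theta)
    for x in range(S):
        for a in range(A):
            glog = -pi[x].copy(); glog[a] += 1.0  # grad_theta[x,:] of log pi(a|x)
            grad[x] += d[x] * pi[x, a] * glog * Q[x, a]
    return rho, grad

theta = np.zeros((S, A))
for _ in range(300):                              # plain gradient ascent on rho
    rho, g = exact_gradient(theta)
    theta += 0.5 * g
print("average reward after ascent:", exact_gradient(theta)[0])
```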

Variance-constrained extensions optimize long-run variability by introducing a risk measure $\Lambda(\theta) = \eta(\theta) - \rho(\theta)^2$ and deploying multi-timescale actor-critic updates (1403.6530).

3.3 Approximations in POMDPs and High-Dimensional Systems

Discretized approximation schemes for POMDPs use finite sets of belief points and convex combinations to lower-bound and upper-bound the true average cost, reducing the partially observable problem to a multi-chain finite-state MDP, which can be efficiently solved (1207.4154).
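
The sketch below shows only the belief-space ingredient of such schemes for a hypothetical two-state POMDP: an exact Bayes belief update followed by projection onto a finite grid of belief points. It is a simplified stand-in; the cited scheme's convex-combination lower/upper bounding is not reproduced here:

```python
import numpy as np

# Hypothetical 2-state, 2-action, 2-observation POMDP:
# P[a, x, x'] transition probabilities, O[a, x', y] observation probabilities.
P = np.array([[[0.9, 0.1], [0.3, 0.7]],
              [[0.5, 0.5], [0.2, 0.8]]])
O = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.7, 0.3], [0.4, 0.6]]])

def belief_update(b, a, y):
    """Bayes update: b'(x') proportional to O(y | x', a) * sum_x P(x'|x, a) b(x)."""
    pred = b @ P[a]                # predicted next-state distribution
    post = pred * O[a, :, y]
    return post / post.sum()

# Finite grid of belief points on the 2-state simplex (resolution 1/10).
grid = np.array([[k / 10, 1 - k / 10] for k in range(11)])

def project_to_grid(b):
    """Snap a belief to its nearest grid point."""
    return grid[np.argmin(np.linalg.norm(grid - b, axis=1))]

b = np.array([0.5, 0.5])
b = project_to_grid(belief_update(b, a=0, y=1))
print("discretized belief:", b)
```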

Factored MDPs leverage structure for improved regret minimization (e.g. via DBN-UCRL), attaining regret bounds that scale with the size of local factors rather than the full state space (2009.04575).

3.4 Robust and Constrained Learning

Robust average-reward RL with distributional robustness (e.g., uncertainty sets such as contamination and $\ell_p$-norm balls) is addressed by the Robust Halpern Iteration (RHI), offering near-optimal polynomial finite-sample complexity and requiring no prior knowledge of the bias span (2505.12462).

CMDPs (constrained MDPs) under average cost are tackled with algorithms such as UCRL-CMDP, which simultaneously control multidimensional reward and cost regret vectors within provable regret bounds (2002.12435).

4. Extensions and Applications

The average-reward/cost criterion is foundational in a number of advanced contexts:

  • Hierarchical RL and Options: Hierarchical average-reward policy gradient methods extend the option-critic framework to maximize long-term reward while handling temporal abstraction, showing provable convergence and robustness to credit assignment issues (1911.08826, 2408.16262).
  • Multi-Agent and Game-Theoretic RL: Markov potential games under average-reward objectives gain global convergence guarantees to Nash equilibria using independent and natural policy gradient methods, supported by explicit finite-time and sample complexity results (2403.05738).
  • Inverse Reinforcement Learning (IRL): IRL frameworks with average-reward objectives yield new stochastic first-order methods (e.g., SPMD, IPMD) showing favorable convergence rates and robust policy recovery without reliance on discount factors (2305.14608).
  • Control and Tracking: In deterministic linear systems, the average-cost criterion is adapted to optimal tracking control, using deviation variables to handle unbounded steady-state costs, and solved approximately via Model Predictive Control (MPC) (2507.01556).
  • Formal Specification and Logic-Based RL: Omega-regular or LTL specifications defining infinite-horizon behavioral objectives are translated into automata-based or "reward machine" models, supporting average-reward optimization directly without resorting to discounting or resetting, and facilitating lexicographic multi-objective reward structures (2505.15693).

5. Convergence, Bounds, and Complexity

Rigorous analysis of average-reward algorithms is more intricate than the discounted case due to the absence of contraction and the non-uniqueness of the value function (defined up to constants). Nonetheless:

  • Convergence guarantees for value iteration, policy iteration, and stochastic actor-critic methods are established under assumptions of unichain, communicating, or weakly communicating structure (2409.03915, 2408.16262).
  • Approximate solutions (discretized, function-approximation, or robust) often include error bounds both on the lower and upper side of the optimal cost (1207.4154).
  • Recent advances provide the first polynomial sample complexity for robust average-reward RL with the RHI algorithm, of order $\tilde{\mathcal{O}}\!\left(\frac{S A \mathcal{H}^2}{\epsilon^2}\right)$, where $S$ and $A$ are the numbers of states and actions, $\mathcal{H}$ is the bias-span parameter, and $\epsilon$ the target accuracy (2505.12462).
  • For constrained and multi-objective problems, explicit trade-offs and tuning mechanisms are introduced, e.g., via multidimensional regret vectors (2002.12435).

6. Practical Implementation and Impact

Average-reward/cost criterion methods are well-suited for non-episodic, long-horizon settings where the performance metric is the steady-state or mean rate. As established in empirical studies across traffic control, network management, queuing systems, robotics, and continuous-control RL (including MuJoCo benchmarks), average-reward approaches typically align naturally with the objective evaluation metric, avoid the complications and instability of high discount factors, and permit more interpretable deployment in settings where transient rewards are not meaningful (2106.03442, 2408.01972, 2305.12239, 2505.15693).

In summary, the average-reward/cost criterion underpins a wide spectrum of modern RL and control research, extending from foundational theory and optimality guarantees to efficient, scalable algorithms suitable for high-dimensional, partially observable, risk-sensitive, robust, constrained, hierarchical, and multi-agent systems.
