Average-Reward/Cost Criterion in Markov Processes

Updated 5 July 2025
  • The average-reward/cost criterion is a framework for evaluating steady-state per-step rewards or costs in infinite-horizon Markov and partially observable decision processes.
  • It employs methods like relative value iteration and policy gradient to solve for optimal policies when discounting is inappropriate, enabling robust and scalable solutions.
  • Key applications span operations research, robotics, and control systems where continuous performance evaluation and risk management are critical, with solid theoretical guarantees on convergence and complexity.

The average-reward (or average-cost) criterion is a central concept in the theory and practice of Markov decision processes (MDPs), partially observable MDPs (POMDPs), reinforcement learning (RL), and control. It provides a framework for assessing the long-term performance of policies in infinite-horizon, continuing environments by focusing on steady-state or per-stage cumulative outcomes rather than temporal discounting or episodic returns. The criterion finds broad application in operations research, stochastic control, robotics, and RL, where long-term average behavior is often the natural evaluation metric.

1. Fundamental Definition and Mathematical Formulation

The average-reward (cost) criterion evaluates the expected long-run per-step reward (or cost) under a given policy. For a Markov (or partially observable) process with state $x_t$, action $u_t$, (possibly random) reward $r(x_t, u_t)$ (or cost $c(x_t, u_t)$), and a stationary policy $\pi$, the average reward is defined as

$$\rho^{\pi} = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_\pi \left[ \sum_{t=0}^{T-1} r(x_t, u_t) \right].$$

If the environment is ergodic or unichain, this limit is independent of the starting state and converges to a steady-state rate. The analogous average-cost formulation replaces $r(x_t, u_t)$ with $c(x_t, u_t)$ and interprets the objective as minimization.
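
As a concrete illustration, the following sketch (a hypothetical three-state ergodic MDP and fixed policy, numpy only; not taken from any cited paper) estimates $\rho^\pi$ by averaging rewards along a single long trajectory:

```python
import numpy as np

# Hypothetical ergodic MDP: 3 states, 2 actions.
# P[a, x, x'] is the transition probability, R[x, a] the expected reward.
rng = np.random.default_rng(0)
P = np.array([
    [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.3, 0.5]],   # action 0
    [[0.1, 0.6, 0.3], [0.4, 0.4, 0.2], [0.3, 0.3, 0.4]],   # action 1
])
R = np.array([[1.0, 0.5], [0.0, 2.0], [0.5, 1.5]])

# A fixed stationary policy pi[x, a] = probability of action a in state x.
pi = np.array([[0.5, 0.5], [0.2, 0.8], [0.9, 0.1]])

def estimate_average_reward(T=200_000, x0=0):
    """Monte Carlo estimate of rho^pi = lim_T (1/T) E[sum_t r(x_t, u_t)]."""
    x, total = x0, 0.0
    for _ in range(T):
        u = rng.choice(2, p=pi[x])
        total += R[x, u]
        x = rng.choice(3, p=P[u, x])
    return total / T

print("estimated rho^pi ~", estimate_average_reward())
```

Because the chain is ergodic, the estimate is (up to sampling error) the same for any starting state `x0`.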

The average-reward Bellman optimality equation expresses the value function up to an additive constant:

$$v_b^*(x) + v_g^* = \max_{u} \Big[ r(x, u) + \sum_{x'} p(x' \mid x, u)\, v_b^*(x') \Big],$$

where $v_g^*$ is the optimal long-run average reward (the gain) and $v_b^*(x)$ is the relative (bias) function capturing transient effects (2010.08920).
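
For a tabular model, this optimality equation can be solved by model-based relative value iteration (anticipating Section 3.1). The sketch below uses a hypothetical two-state, two-action unichain MDP; it is an illustrative sketch, not an implementation from the cited work:

```python
import numpy as np

# Hypothetical 2-state, 2-action unichain MDP: P[a, x, x'], R[x, a].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.array([[1.0, 0.0], [2.0, 0.5]])

def relative_value_iteration(P, R, ref_state=0, tol=1e-10, max_iter=10_000):
    """Iterate v(x) <- max_u [ r(x,u) + sum_x' p(x'|x,u) v(x') ] - v(ref_state),
    so that the subtracted reference value converges to the gain v_g^*."""
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    for _ in range(max_iter):
        # Q[x, u] = r(x, u) + sum_x' p(x'|x, u) v(x')
        Q = R + np.einsum('axy,y->xa', P, v)
        v_new = Q.max(axis=1)
        gain = v_new[ref_state]          # current estimate of v_g^*
        v_new -= gain                    # keep the bias estimate bounded
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    return gain, v_new, Q.argmax(axis=1)

gain, bias, greedy_policy = relative_value_iteration(P, R)
print("optimal gain:", gain, "greedy policy:", greedy_policy)
```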

The same principle applies to POMDPs and risk-averse formulations and, with minor modifications, to constrained and multi-agent (game-theoretic) settings.

2. Theoretical Properties, Optimality, and Comparison with Discounted Criteria

The average-reward criterion contrasts with the discounted-reward criterion, which computes

$$v_\gamma^\pi(x) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t r(x_t, u_t) \,\middle|\, x_0 = x \right],$$

with a discount factor $\gamma \in [0,1)$. While discounting guarantees contraction and boundedness, it artificially prioritizes immediate rewards and complicates the policy evaluation in environments lacking an inherent time preference.

The gain (average reward) formulation is particularly natural in environments where every time step is equally important, and policy performance is evaluated by sustained, steady-state metrics (2107.01348). Important theoretical relationships include:

  • The scaled-limit connection (a numerical check follows this list):

$$v_g^\pi(x) = \lim_{\gamma \to 1}\, (1-\gamma)\, v_\gamma^\pi(x)$$

  • The average-reward Bellman operator does not enjoy contraction except under additional assumptions (e.g., unichain property, communicating MDPs), complicating both analysis and algorithm design.
  • As shown in the Laurent expansion of the discounted value function, the average reward (gain) constitutes the leading term as $\gamma \to 1$ (2107.01348).
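
The scaled-limit relation is easy to verify numerically. The sketch below uses a hypothetical three-state Markov reward process induced by a fixed policy (transition matrix and reward vector chosen arbitrarily for illustration):

```python
import numpy as np

# Hypothetical policy-induced chain: P_pi[x, x'] transitions, r_pi[x] rewards.
P_pi = np.array([[0.5, 0.4, 0.1],
                 [0.3, 0.3, 0.4],
                 [0.2, 0.5, 0.3]])
r_pi = np.array([1.0, 0.0, 2.0])

# Gain via the stationary distribution d (left eigenvector of P_pi for eigenvalue 1).
w, V = np.linalg.eig(P_pi.T)
d = np.real(V[:, np.argmin(np.abs(w - 1.0))])
d /= d.sum()
rho = d @ r_pi

# (1 - gamma) * v_gamma(x) approaches rho for every state x as gamma -> 1.
for gamma in [0.9, 0.99, 0.999, 0.9999]:
    v_gamma = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
    print(f"gamma={gamma}: (1-gamma)*v_gamma = {(1 - gamma) * v_gamma}, rho = {rho:.4f}")
```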

Key optimality equations, such as the Average Cost Optimality Equation (ACOE) and its relaxed form, the Average Cost Optimality Inequality (ACOI), generalize to Borel-space processes with minimal regularity via majorization conditions (1901.03374):

$$\rho^* + h(x) \geq \inf_{a \in A(x)} \Big\{ c(x,a) + \int h(y)\, q(dy \mid x,a) \Big\}$$

3. Algorithms and Approximate Solution Methods

Several classes of algorithms are designed or adapted for the average-reward/cost setting:

3.1 Value Iteration and Relative Value Iteration (RVI)

Relative Value Iteration stabilizes unbounded value-function growth by subtracting a reference state or state-action value at each step. In reinforcement learning, this leads to RVI Q-learning:

$$\widehat{q}_b(s,a) \leftarrow \widehat{q}_b(s,a) + \beta \left[ r(s,a) + \max_{a'} \widehat{q}_b(s',a') - \widehat{v}_g - \widehat{q}_b(s,a) \right],$$

where $\widehat{v}_g$ is an estimate of the long-run gain (2010.08920).
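
A minimal tabular sketch of this update (hypothetical two-state MDP used only to generate transitions; the gain estimate $\widehat{v}_g$ is taken as the value of a fixed reference state-action pair, one common choice):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 2-state, 2-action MDP used only as a transition/reward sampler.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.array([[1.0, 0.0], [2.0, 0.5]])

def step(x, u):
    """Sample the next state and reward from the hypothetical model."""
    return rng.choice(2, p=P[u, x]), R[x, u]

def rvi_q_learning(n_steps=200_000, beta=0.01, eps=0.1, ref=(0, 0)):
    """Tabular RVI Q-learning with a fixed reference state-action pair."""
    q = np.zeros((2, 2))
    x = 0
    for _ in range(n_steps):
        u = rng.integers(2) if rng.random() < eps else int(q[x].argmax())
        x_next, r = step(x, u)
        v_g = q[ref]                               # running gain estimate
        q[x, u] += beta * (r + q[x_next].max() - v_g - q[x, u])
        x = x_next
    return q, q[ref]

q, gain_estimate = rvi_q_learning()
print("q values:\n", q, "\nestimated gain:", gain_estimate)
```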

Stochastic approximation versions, including asynchronous and model-free updates, are well established, with convergence guarantees extended to weakly communicating MDPs and the semi-Markov setting (2408.16262, 2409.03915). The convergence set is characterized as compact, connected, and, in weakly communicating MDPs, of degree $n^* - 1$, with $n^*$ the number of recurrent classes.

3.2 Policy Iteration and Policy Gradient

Policy iteration alternates between evaluation (solving a Poisson equation for the differential value) and improvement (greedy with respect to average-reward Q). Policy gradient formulations exploit the connection

$$\nabla_\theta\, \rho(\theta) = \sum_{x,a} d^\pi(x,a)\, \nabla_\theta \log \pi(a \mid x; \theta)\, Q(x,a;\theta)$$

with $d^\pi(x,a)$ the steady-state state-action distribution (2010.08920, 2305.12239, 2403.05738).
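
The sketch below instantiates this gradient exactly for a hypothetical two-state tabular MDP with a softmax policy: it computes $d^\pi$, the gain, and the differential action values $Q$ from the model (solving the associated Poisson equation via a fundamental-matrix system), then performs plain gradient ascent on $\rho(\theta)$. It is an illustrative sketch, not code from the cited papers:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; tabular softmax policy with parameters theta[x, a].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.6, 0.4]]])
R = np.array([[1.0, 0.0], [2.0, 0.5]])
S, A = 2, 2

def policy(theta):
    """Softmax policy pi(a|x; theta)."""
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def exact_gradient(theta):
    """Return rho(theta) and grad rho = sum_{x,a} d(x,a) grad log pi(a|x) Q(x,a)."""
    pi = policy(theta)
    P_pi = np.einsum('xa,axy->xy', pi, P)        # policy-induced transition matrix
    r_pi = (pi * R).sum(axis=1)
    # Stationary state distribution d: left eigenvector of P_pi for eigenvalue 1.
    w, V = np.linalg.eig(P_pi.T)
    d = np.real(V[:, np.argmin(np.abs(w - 1.0))]); d /= d.sum()
    rho = d @ r_pi
    # Differential (bias) values: solve (I - P_pi + 1 d^T) h = r_pi - rho.
    h = np.linalg.solve(np.eye(S) - P_pi + np.outer(np.ones(S), d), r_pi - rho)
    Q = R - rho + np.einsum('axy,y->xa', P, h)   # differential action values
    grad = np.zeros_like(theta)
    for x in range(S):
        for a in range(A):
            glog = -pi[x].copy(); glog[a] += 1.0  # grad_theta[x,:] of log pi(a|x)
            grad[x] += d[x] * pi[x, a] * glog * Q[x, a]
    return rho, grad

theta = np.zeros((S, A))
for _ in range(300):                              # plain gradient ascent on rho
    rho, g = exact_gradient(theta)
    theta += 0.5 * g
print("average reward after ascent:", exact_gradient(theta)[0])
```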

Variance-constrained extensions optimize long-run variability by introducing a risk measure $\Lambda(\theta) = \eta(\theta) - \rho(\theta)^2$ and deploying multi-timescale actor-critic updates (1403.6530).

3.3 Approximations in POMDPs and High-Dimensional Systems

Discretized approximation schemes for POMDPs use finite sets of belief points and convex combinations to lower-bound and upper-bound the true average cost, reducing the partially observable problem to a multi-chain finite-state MDP, which can be efficiently solved (1207.4154).
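
The sketch below shows only the belief-space ingredient of such schemes for a hypothetical two-state POMDP: an exact Bayes belief update followed by projection onto a finite grid of belief points. It is a simplified stand-in; the cited scheme's convex-combination lower/upper bounding is not reproduced here:

```python
import numpy as np

# Hypothetical 2-state, 2-action, 2-observation POMDP:
# P[a, x, x'] transition probabilities, O[a, x', y] observation probabilities.
P = np.array([[[0.9, 0.1], [0.3, 0.7]],
              [[0.5, 0.5], [0.2, 0.8]]])
O = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.7, 0.3], [0.4, 0.6]]])

def belief_update(b, a, y):
    """Bayes update: b'(x') proportional to O(y | x', a) * sum_x P(x'|x, a) b(x)."""
    pred = b @ P[a]                # predicted next-state distribution
    post = pred * O[a, :, y]
    return post / post.sum()

# Finite grid of belief points on the 2-state simplex (resolution 1/10).
grid = np.array([[k / 10, 1 - k / 10] for k in range(11)])

def project_to_grid(b):
    """Snap a belief to its nearest grid point."""
    return grid[np.argmin(np.linalg.norm(grid - b, axis=1))]

b = np.array([0.5, 0.5])
b = project_to_grid(belief_update(b, a=0, y=1))
print("discretized belief:", b)
```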

Factored MDPs leverage structure for improved regret minimization (e.g. via DBN-UCRL), attaining regret bounds that scale with the size of local factors rather than the full state space (2009.04575).

3.4 Robust and Constrained Learning

Robust average-reward RL with distributional robustness (e.g., uncertainty sets such as contamination and $\ell_p$-norm balls) is addressed by the Robust Halpern Iteration (RHI), offering near-optimal polynomial finite-sample complexity and requiring no prior knowledge of the bias span (2505.12462).

CMDPs (constrained MDPs) under average cost are tackled with algorithms such as UCRL-CMDP, which simultaneously control multidimensional reward and cost regret vectors within provable regret bounds (2002.12435).

4. Extensions and Applications

The average-reward/cost criterion is foundational in a number of advanced contexts:

  • Hierarchical RL and Options: Hierarchical average-reward policy gradient methods extend the option-critic framework to maximize long-term reward while handling temporal abstraction, showing provable convergence and robustness to credit assignment issues (1911.08826, 2408.16262).
  • Multi-Agent and Game-Theoretic RL: Markov potential games under average-reward objectives gain global convergence guarantees to Nash equilibria using independent and natural policy gradient methods, supported by explicit finite-time and sample complexity results (2403.05738).
  • Inverse Reinforcement Learning (IRL): IRL frameworks with average-reward objectives yield new stochastic first-order methods (e.g., SPMD, IPMD) showing favorable convergence rates and robust policy recovery without reliance on discount factors (2305.14608).
  • Control and Tracking: In deterministic linear systems, the average-cost criterion is adapted to optimal tracking control, using deviation variables to handle unbounded steady-state costs, and solved approximately via Model Predictive Control (MPC) (2507.01556).
  • Formal Specification and Logic-Based RL: Omega-regular or LTL specifications defining infinite-horizon behavioral objectives are translated into automata-based or "reward machine" models, supporting average-reward optimization directly without resorting to discounting or resetting, and facilitating lexicographic multi-objective reward structures (2505.15693).

5. Convergence, Bounds, and Complexity

Rigorous analysis of average-reward algorithms is more intricate than the discounted case due to the absence of contraction and the non-uniqueness of the value function (defined up to constants). Nonetheless:

  • Convergence guarantees for value iteration, policy iteration, and stochastic actor-critic methods are established under assumptions of unichain, communicating, or weakly communicating structure (2409.03915, 2408.16262).
  • Approximate solutions (discretized, function-approximation, or robust) often include error bounds both on the lower and upper side of the optimal cost (1207.4154).
  • Recent advances provide the first polynomial sample complexity for robust average-reward RL with the RHI algorithm, of order $\tilde{\mathcal{O}}\!\left(\frac{S A \mathcal{H}^2}{\epsilon^2}\right)$, where $S$ and $A$ are the numbers of states and actions, $\mathcal{H}$ is the bias-span parameter, and $\epsilon$ the target accuracy (2505.12462).
  • For constrained and multi-objective problems, explicit trade-offs and tuning mechanisms are introduced, e.g., via multidimensional regret vectors (2002.12435).

6. Practical Implementation and Impact

Average-reward/cost criterion methods are well-suited for non-episodic, long-horizon settings where the performance metric is the steady-state or mean rate. As established in empirical studies across traffic control, network management, queuing systems, robotics, and continuous-control RL (including MuJoCo benchmarks), average-reward approaches typically align naturally with the objective evaluation metric, avoid the complications and instability of high discount factors, and permit more interpretable deployment in settings where transient rewards are not meaningful (2106.03442, 2408.01972, 2305.12239, 2505.15693).

In summary, the average-reward/cost criterion underpins a wide spectrum of modern RL and control research, extending from foundational theory and optimality guarantees to efficient, scalable algorithms suitable for high-dimensional, partially observable, risk-sensitive, robust, constrained, hierarchical, and multi-agent systems.
