
Advantage Normalization in MDPs

Updated 13 August 2025
  • Advantage normalization is a reward transformation for MDPs that leaves the advantage of every action, under every policy, unchanged.
  • It builds on a geometric framework that maps each action to a vector in an affine space, enabling reward balancing and a zero-value normal form for efficient policy selection.
  • This perspective yields value-free solvers with improved convergence rates and more robust policy improvement in reinforcement learning applications.

Advantage normalization is a procedure in sequential decision-making and reinforcement learning that transforms the reward structure and value function of a Markov Decision Process (MDP) without altering the advantage of any action under any policy. This approach is motivated by a geometric perspective on MDPs, in which every action is mapped to a vector in an affine space and each policy corresponds to a hyperplane, orthogonal to the policy's extended value vector, that contains the vectors of the actions the policy selects. By leveraging an advantage-preserving reward transformation, advantage normalization facilitates the design of algorithms (such as reward balancing solvers) that efficiently compute near-optimal policies, often with improved convergence properties and "value-free" dynamics. In practical terms, advantage normalization is used to simplify policy selection, reduce dependence on value-function computations, and clarify the theoretical connections between MDP geometry, policy improvement, and optimality (Mustafin et al., 9 Jul 2024).

1. Geometric Representation of MDPs

The geometric framework for MDPs begins by embedding the action space into $(n+1)$ dimensions, with each action $a$ represented by a vector $a_+ = (r^a, \gamma p_1^a - 1, \gamma p_2^a, \ldots, \gamma p_n^a)$, where $r^a$ is the immediate reward and $p_i^a$ are the transition probabilities (the $-1$ sits in the component of the action's own state, shown here for an action taken at state 1). The Bellman equation is expressed as:

$$0 = r^a + (\gamma p_1^a - 1) V_1 + \gamma p_2^a V_2 + \ldots + \gamma p_n^a V_n$$

This formulation is interpreted as an inner product between the action vector $a_+$ and an extended policy vector $V_+^\pi = (1, V_1, V_2, \ldots, V_n)$. When an action is part of the policy $\pi$, $a_+$ is orthogonal to $V_+^\pi$ (as formalized in Proposition 2.1). The set of allowable action vectors for a state $s$ is defined by affine constraints:

  • $x_1 + \ldots + x_n = \gamma - 1$
  • $x_i \geq 0$ for all $i \neq s$, and $x_s < 0$.

A policy is thus geometrically characterized by the hyperplane through the action vectors it selects.
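
For concreteness, the following minimal numerical sketch (not from the paper; the two-state MDP, rewards, and transition probabilities are made up for illustration) checks the orthogonality condition $a_+ \cdot V_+^\pi = 0$ for the actions a policy selects:

```python
import numpy as np

# Toy illustration (assumed numbers, not from the paper): embed each action as
#   a_+ = (r^a, gamma*p_1^a, ..., gamma*p_n^a) with an extra -1 in the component
# of the action's own state, and represent a policy by V_+ = (1, V_1, ..., V_n).
# The Bellman equation then reads  a_+ . V_+ = 0  for every action the policy selects.

gamma = 0.9
n = 2  # number of states

def action_vector(reward, probs, state):
    """Embed an action taken at `state` with immediate `reward` and transition probs `probs`."""
    v = np.concatenate(([reward], gamma * np.asarray(probs, dtype=float)))
    v[1 + state] -= 1.0  # the -1 sits in the action's own state component
    return v

# A deterministic policy: one chosen action per state (toy rewards/transitions).
policy_actions = {
    0: action_vector(reward=1.0, probs=[0.8, 0.2], state=0),
    1: action_vector(reward=0.0, probs=[0.5, 0.5], state=1),
}

# Evaluate the policy: V = (I - gamma * P)^{-1} r for its transition matrix P and reward vector r.
P = np.array([[0.8, 0.2], [0.5, 0.5]])
r = np.array([1.0, 0.0])
V = np.linalg.solve(np.eye(n) - gamma * P, r)

V_plus = np.concatenate(([1.0], V))  # extended policy vector (1, V_1, ..., V_n)
for s, a_plus in policy_actions.items():
    print(s, np.dot(a_plus, V_plus))  # ~0: each selected action vector is orthogonal to V_+
```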

2. Advantage-Preserving Normalization Procedure

The core advantage normalization procedure adjusts the reward of every action in the MDP such that the values of the optimal policy become zero at every state, while leaving the advantage of any action under any policy unchanged. For state $s$ and a chosen shift $\delta$, the transformation $\mathcal{L}_s^\delta$ modifies rewards as follows:

  • For actions $a$ with $\operatorname{st}(a) = s$: $r^a \gets r^a - \delta(\gamma p_s^a - 1)$.
  • For actions $b$ with $\operatorname{st}(b) \neq s$: $r^b \gets r^b - \delta \gamma p_s^b$.

This corresponds geometrically to the map $(x_0, x_1, \ldots, x_n) \mapsto (x_0 - \delta x_s, x_1, \ldots, x_n)$, effectively “lifting” the self-loop line $L_s$ by $\delta(1-\gamma)$. Applying $\mathcal{L}_s^{-V^*(s)}$ for all states yields a normalized MDP $\mathcal{M}^*$ in which the optimal policy achieves zero value everywhere.
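
A brief sketch of this two-case update is given below; the flat-array MDP layout (rewards[k], probs[k, :], state_of[k] for each action k) is an assumption introduced purely for illustration, not the paper's notation:

```python
import numpy as np

def apply_shift(rewards, probs, state_of, s, delta, gamma=0.9):
    """Apply L_s^delta: shift every action's reward as described above; transitions are untouched."""
    r = np.asarray(rewards, dtype=float).copy()
    at_s = np.asarray(state_of) == s
    p_s = np.asarray(probs, dtype=float)[:, s]    # each action's probability of transitioning to s
    r[at_s] -= delta * (gamma * p_s[at_s] - 1.0)  # actions taken at state s
    r[~at_s] -= delta * gamma * p_s[~at_s]        # actions taken at other states
    return r
```

Since the shifts are additive and commute, applying this once per state with $\delta = -V^*(s)$ reproduces the normalized MDP $\mathcal{M}^*$ described above.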

3. Advantage Invariance and Its Consequences

The principal property of the normalization transformation is the invariance of the advantage function:

$$\operatorname{adv}(b, \pi) = r^b + \gamma \sum_s p_s^b V_s^\pi - V_{\operatorname{st}(b)}^\pi$$

Under $\mathcal{L}_s^\delta$, both rewards and policy values are shifted in such a way that the advantage of every action remains constant. Algebraic derivations in the source establish that $a_+ \cdot V_+ = a_+' \cdot V_+'$ post-transformation for all actions and policies. Thus, the critical ordering of actions (for policy improvement and selection) is unaffected, and optimal policy characteristics persist under normalization.
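
To make the algebra concrete, here is a short sketch of the calculation (not quoted from the source): one checks from the Bellman equations that the values of any fixed policy $\pi$ in the transformed MDP are $V'^\pi = V^\pi + \delta e_s$, and the two reward cases above combine into $r'^b = r^b - \delta(\gamma p_s^b - [\operatorname{st}(b)=s])$, where $[\cdot]$ denotes the indicator. Substituting into the advantage,

$$\operatorname{adv}'(b, \pi) = \underbrace{\big(r^b - \delta\gamma p_s^b + \delta[\operatorname{st}(b)=s]\big)}_{r'^b} + \underbrace{\big(\gamma \textstyle\sum_i p_i^b V_i^\pi + \delta\gamma p_s^b\big)}_{\gamma \sum_i p_i^b V_i'^\pi} - \underbrace{\big(V_{\operatorname{st}(b)}^\pi + \delta[\operatorname{st}(b)=s]\big)}_{V_{\operatorname{st}(b)}'^\pi} = \operatorname{adv}(b, \pi),$$

so every $\delta$-dependent term cancels.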

4. Reward Balancing Algorithms

Advantage normalization motivates a category of “reward balancing” algorithms that iteratively adjust action rewards to reach a normal form where the maximum reward at each state is zero. In this normal form, optimal actions correspond exactly to those with zero reward, and all others have strictly negative reward. The Value-Free Solver (VFS) is an explicit example:

  • For each state $s$, compute $\delta_s = r_s^{\min} / (\gamma p_s - 1)$, where $r_s^{\min}$ is the minimum reward among actions at state $s$.
  • Update the MDP via $\mathcal{L}_s^{\delta_s}$.
  • Repeat until the maximum reward in every state is zero (or nearly zero, subject to an approximation lemma).

A proven lemma (Lemma 4.1) bounds the suboptimality by $\varepsilon = -r_{\min}/(1-\gamma)$, with $r_{\min}$ being the minimal reward in any state. Thus, reward balancing ensures $\varepsilon$-optimality without explicit computation of policy value functions.
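
The sketch below illustrates a reward-balancing loop of this kind numerically. It is not the paper's VFS: for simplicity the shift is taken as $\delta_s = -(\text{current maximum reward at } s)$, which reproduces ordinary value-iteration dynamics in value-free form, whereas the VFS rule above is stated in terms of the minimum reward; the flat-array MDP layout and the toy numbers are likewise assumptions of this sketch.

```python
import numpy as np

def balance_rewards(rewards, probs, state_of, n_states, gamma=0.9, tol=1e-6, max_sweeps=10_000):
    """Repeatedly apply L_s^{delta_s} with delta_s = -(max reward at s) until every
    per-state maximum reward is numerically zero (the normal form described above)."""
    r = np.asarray(rewards, dtype=float).copy()
    probs = np.asarray(probs, dtype=float)
    state_of = np.asarray(state_of)
    for _ in range(max_sweeps):
        # Current maximum reward at each state.
        m = np.array([r[state_of == s].max() for s in range(n_states)])
        if np.max(np.abs(m)) <= tol:
            break
        # Apply L_s^{delta_s} for every state s.
        for s in range(n_states):
            delta = -m[s]
            at_s = state_of == s
            r[at_s] -= delta * (gamma * probs[at_s, s] - 1.0)   # actions taken at s
            r[~at_s] -= delta * gamma * probs[~at_s, s]         # actions taken elsewhere
    # In the (near-)normal form, the actions with reward ~0 at each state are the (near-)optimal ones.
    greedy = {s: int(np.flatnonzero(state_of == s)[np.argmax(r[state_of == s])])
              for s in range(n_states)}
    return r, greedy

# Toy 2-state MDP with two actions per state (made-up numbers).
rewards = np.array([1.0, 0.5, 0.0, 2.0])
probs = np.array([[0.8, 0.2], [0.1, 0.9], [0.5, 0.5], [0.9, 0.1]])
state_of = np.array([0, 0, 1, 1])
balanced, greedy = balance_rewards(rewards, probs, state_of, n_states=2)
print(np.round(balanced, 3), greedy)  # per-state maxima ~0; remaining rewards are non-positive
```

In this run the per-state maximum rewards decay toward zero at roughly the rate $\gamma$, and the zero-reward actions form the greedy policy, matching the normal form described above.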

5. Convergence Analysis and Comparison to Standard Methods

Within the geometric framework, traditional Policy Iteration (PI) and Value Iteration (VI) can be interpreted as sequences of hyperplane constructions approaching the optimal one. PI selects hyperplanes such that selected actions have maximal advantage relative to the current estimate, with improvements driven by “lifting” toward optimality. VI updates state values by:

$$V_{t+1}(s) = V_t(s) + (1-\gamma)\operatorname{adv}(a^*, V_t)$$

where $a^*$ is the action maximizing the advantage with respect to $V_t$. The convergence of reward balancing solvers, particularly in tree-structured MDPs, is demonstrated to occur at a rate faster than the discount factor $\gamma$, and their sample complexity can outperform previous state-of-the-art results in settings with unknown transitions. Some open theoretical questions remain about the equivalence of VFS and standard VI dynamics.

6. Practical Implications and Applications

Advantage normalization clarifies the geometric structure underlying MDP solution methods, simplifying policy selection by reducing the problem to finding actions with zero reward after normalization. The value-free nature of reward balancing solvers eliminates dependence on value-function estimation, mitigating sensitivity to initialization and facilitating rapid convergence in various MDP architectures (especially hierarchical and tree-based). The approach may inform the development of more robust reinforcement learning algorithms and function-approximation methods, particularly where partial knowledge of the MDP prevails. Moreover, the geometric and normalization perspective enhances the understanding of advantage as the fundamental currency of policy improvement, potentially influencing future research in dynamic programming and reinforcement learning (Mustafin et al., 9 Jul 2024).
