
Geometric Policy Optimization (GeoPO)

Updated 15 December 2025
  • GeoPO is a reinforcement learning framework that leverages the geometric structure of policy spaces, including convexity and manifold theory, for efficient control.
  • It reformulates POMDP control as linear optimization over convex components, enabling significant dimensionality reduction and deterministic bounds for stochastic policies.
  • The approach informs policy-gradient methods by advocating low-dimensional model selection and projection-based updates to achieve faster and more reliable convergence.

Geometric Policy Optimization (GeoPO) refers to a class of policy optimization methods in reinforcement learning and control that explicitly leverage the geometric structure of the policy space—typically via convexity, polyhedral cones, Riemannian or information geometry, or manifold theory—to formulate more efficient, structured, or theoretically grounded policy updates. In the context of stationary control in finite POMDPs, GeoPO provides a geometric characterization of the set of memoryless stochastic policies and reformulates the control objective as linear optimization over complex, often nonconvex spaces with exploitable low-dimensional structure. This framework yields new deterministic bounds for stochastic policies, establishes explicit dimensionality reduction results, and suggests algorithmic design principles for policy-gradient methods with provable improvements in convergence behavior (Montufar et al., 2015).

1. Geometric Structure of Policy Space in POMDPs

Under the GeoPO framework, the space of memoryless stochastic policies in a partially observable Markov decision process (POMDP) with finite observation set $\mathcal O$ and action set $\mathcal A$ is formalized as a product of simplices, $\Delta_{\mathcal O, \mathcal A} = \prod_{o \in \mathcal O} \Delta_{\mathcal A}$, where each factor $\Delta_{\mathcal A}$ encodes a probability distribution over $\mathcal A$ given observation $o$. The resulting policy space is a convex polytope of dimension $|\mathcal O|\,(|\mathcal A| - 1)$.

Running a policy $\pi$ in the POMDP induces a stationary joint distribution over world states and actions, $p_\pi(w, a)$. The average reward per time step is linear in $p_\pi(w, a)$: $R(\pi) = \sum_{w, a} p_\pi(w, a)\, R(w, a)$. However, not all distributions $p(w, a)$ arise from admissible policies, owing to constraints imposed by the observation and transition kernels; thus, the control problem is a linear program over the feasible set $F \cap J$, where $F$ encodes representability constraints and $J$ enforces stationarity (the Kirchhoff polytope).
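To make these objects concrete, the following minimal NumPy sketch computes $p_\pi(w, a)$ and $R(\pi)$ for a given finite POMDP and memoryless policy; the array names and shapes (`T[w, a, w']`, `beta[w, o]`, `R[w, a]`, `pi[o, a]`) are illustrative conventions rather than notation fixed by the paper.

```python
import numpy as np

def stationary_joint_and_reward(T, beta, R, pi):
    """Compute p_pi(w, a) and the average reward R(pi) for a memoryless policy.

    T[w, a, w'] : transition kernel      beta[w, o] : observation kernel
    R[w, a]     : reward table           pi[o, a]   : row-stochastic policy
    (Array conventions are illustrative, not from the paper; the induced
    chain over world states is assumed ergodic.)
    """
    # Effective action distribution per world state: P(a | w) = sum_o beta(o|w) pi(a|o)
    act_given_state = beta @ pi                      # shape (W, A)

    # Induced Markov chain on world states: P(w' | w) = sum_a P(a|w) T(w'|w, a)
    P = np.einsum('wa,waz->wz', act_given_state, T)  # shape (W, W)

    # Stationary state distribution mu: left eigenvector of P for eigenvalue 1
    evals, evecs = np.linalg.eig(P.T)
    mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
    mu = mu / mu.sum()

    # Stationary joint distribution; the average reward is linear in it
    p_joint = mu[:, None] * act_given_state          # p_pi(w, a)
    return p_joint, float(np.sum(p_joint * R))
```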

The geometric insight is that, although $F$ can be nonconvex (generally a union of twisted copies of a convex set), the optimization over each convex component behaves like a linear program, and hence optima are located at extreme points of those components (Montufar et al., 2015).

2. Dimensionality Reduction and Determinism Bound

A central result is the reduction in the effective dimension of the stochastic policy needed for optimality. In fully observable MDPs, optimal memoryless policies are always deterministic. In POMDPs, perceptual aliasing leads to intrinsically stochastic optimal policies. GeoPO quantifies this necessity and provides a tight upper bound.

Define the "ambiguous observation set"

$$U = \{\, o \in \mathcal O : |\operatorname{supp}\, \beta(o \mid \cdot)| > 1 \,\},$$

where $\beta(o \mid w)$ is the observation kernel. GeoPO establishes:

Determinism Bound for POMDPs:

There exists an optimal stationary policy $\pi^*$ that is $m$-stochastic, where

$$m = |U|\,(|\mathcal A| - 1).$$

That is, $\pi^*$ has at most $|\mathcal O| + m$ nonzero entries and lies in an $m$-face of the policy polytope. This bound is worst-case tight and enables a strong dimensionality reduction: only ambiguous observations require additional stochasticity, and even then only within a low-dimensional face of the product of simplices (Montufar et al., 2015).
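Under the same illustrative array conventions as the earlier sketch, the ambiguous observation set $U$ and the bound $m$ can be read off directly from the observation kernel:

```python
import numpy as np

def determinism_bound(beta, n_actions, tol=1e-12):
    """Ambiguous observation set U and the bound m = |U| (|A| - 1).

    beta[w, o] is the observation kernel (illustrative convention); an
    observation is ambiguous if more than one world state can emit it.
    """
    support_sizes = (beta > tol).sum(axis=0)   # |supp beta(o | .)| for each o
    U = np.flatnonzero(support_sizes > 1)      # indices of ambiguous observations
    m = len(U) * (n_actions - 1)
    return U, m
```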

This geometric insight allows for focusing policy search—both analytically and algorithmically—on low-dimensional families or structured mixtures, drastically reducing overparameterization.

3. Algorithmic Implications for Policy-Gradient Methods

In practice, policy-gradient algorithms such as GPOMDP perform unconstrained stochastic gradient ascent in the full policy simplex. GeoPO recommends two complementary modifications:

  • Model selection: Choose a differentiable parametric family $\mathcal M \subset \Delta_{\mathcal O, \mathcal A}$ of dimension $d \approx m$ containing all possible $m$-faces. Notable examples are:
    • $k$-interaction exponential families with $k$ such that $2^k - 1 \geq |\mathcal O| + m$
    • Mixtures of up to $m+1$ deterministic policies
    • Conditional RBMs with at least $|\mathcal O| + m - 1$ hidden units
  • Projected gradient method: After each unconstrained update, project the policy back onto the chosen low-dimensional manifold or face. This may be implemented via a projection minimization step (e.g., Newton or mirror descent) or by explicit reparametrization (e.g., softmax over low-dimensional subspaces); a simplified sketch follows this list.
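The paper's projection targets the chosen model $\mathcal M$ or face; as a simplified stand-in, the sketch below applies a Euclidean projection of each conditional $\pi(\cdot \mid o)$ onto the probability simplex, optionally restricted to a prescribed support so that the iterate stays on the corresponding low-dimensional face (the `support` argument and helper names are hypothetical).

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u > css / np.arange(1, len(v) + 1))[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def projected_update(pi, grad, step, support=None):
    """One projected-gradient step for a policy table pi[o, a].

    `support[o]` (optional) lists the actions allowed under observation o;
    each conditional is then projected onto that face only.  This is a
    simplified illustration, not the paper's projection onto the model M.
    """
    new_pi = pi + step * grad
    out = np.zeros_like(new_pi)
    for o in range(new_pi.shape[0]):
        idx = np.arange(new_pi.shape[1]) if support is None else np.asarray(support[o])
        out[o, idx] = project_to_simplex(new_pi[o, idx])
    return out
```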

Explicitly, if the policy has exponential-family form

$$\pi_\theta(a \mid o) = \frac{\exp(\theta \cdot F(o, a))}{\sum_{a'} \exp(\theta \cdot F(o, a'))}$$

and $F$ spans the $m$-face, standard gradient ascent remains within this face. Otherwise, projection is required to enforce feasibility (Montufar et al., 2015).
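A minimal sketch of this reparametrization and of a gradient-ascent step in $\theta$ is given below; the feature tensor `F[o, a, :]` and the reward-gradient input are illustrative placeholders rather than the paper's specific model choice.

```python
import numpy as np

def softmax_policy(theta, F):
    """Exponential-family policy pi_theta(a | o) = softmax_a(theta . F(o, a)).

    F[o, a, :] holds a feature vector per observation-action pair and theta
    its weight vector (both are illustrative placeholders).
    """
    logits = F @ theta                               # shape (O, A)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    expl = np.exp(logits)
    return expl / expl.sum(axis=1, keepdims=True)

def ascent_step(theta, F, reward_grad_wrt_pi, lr=0.1):
    """One gradient-ascent step in theta; the iterate stays inside the family.

    reward_grad_wrt_pi[o, a] is dR/dpi(a|o), e.g. from a GPOMDP-style
    estimator (assumed to be supplied externally).
    """
    pi = softmax_policy(theta, F)
    # Softmax Jacobian: dpi(a|o)/dtheta = pi(a|o) * (F(o,a) - E_{pi(.|o)} F(o,.))
    mean_F = np.einsum('oa,oad->od', pi, F)
    dpi_dtheta = pi[..., None] * (F - mean_F[:, None, :])   # shape (O, A, D)
    grad_theta = np.einsum('oa,oad->d', reward_grad_wrt_pi, dpi_dtheta)
    return theta + lr * grad_theta
```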

These procedures ensure that the policy-gradient iterates do not drift into unnecessarily high-dimensional or poorly structured regions of the policy space.

4. Empirical Results and Validation

Empirical validation is provided on the "ambiguous-maze" benchmark, with $|\mathcal O| = 14$, $|\mathcal A| = 4$, and $|U| = 2$ ambiguous observations (so $m = 6$). The policy search space has full dimension $30$, but optimization is carried out using $k$-interaction models of dimensions $2, 11, 23, 29, 30$ for $k = 1, \ldots, 5$, as well as the full simplex.

Results demonstrate:

  • Models with $k = 1, 2$ (dimensions $2, 11$) cannot reach optimal reward.
  • $k = 3$ (dimension $23$) suffices for optimality and achieves the fastest empirical convergence.
  • Larger models ($k = 4, 5$ and the full $30$-dimensional simplex) also reach optimal reward, but with slower convergence due to increased noise sensitivity and overparameterization.

Convergence trajectories show that the minimally sufficient, geometrically motivated family achieves superior practical performance. This empirically substantiates the value of the determinism bound and dimensionality reduction provided by GeoPO (Montufar et al., 2015).

5. Extensions and Broader Context

The GeoPO framework for POMDPs arises in a broader context of geometric methods for policy optimization across reinforcement learning, mathematical programming, and control. Related approaches include:

  • Geometric improvement cones and support-sparsity results, offering explicit certificate-based bounds on the required stochasticity in memoryless policies (Montufar et al., 2017).
  • Riemannian geometry and Hessian-based updates in control, particularly in linear-quadratic-Gaussian (LQG) and $\mathcal H_\infty$ settings, where the manifold structure and natural metrics over stabilizing controllers enable enhanced convergence and avoidance of spurious stationary points (Kraisler et al., 25 Mar 2024, Talebi et al., 6 Jun 2024).
  • Information geometry-based policy optimization (natural gradient), which leverages the Fisher-Rao metric for invariant updates and underpins parametrization-independent learning in complex stochastic models (Bensadon, 2013).

GeoPO remains an active area of research, serving as a theoretical foundation for both principled algorithm design and concrete algorithmic acceleration in stochastic control, reinforcement learning, and machine learning systems characterized by partial observability or nontrivial policy geometry.

