Partially Group-Invariant MDP (PI-MDP)
- PI-MDP is a framework that generalizes group-invariant MDPs to handle environments with locally preserved symmetries.
- It uses a state–action gating function to mix equivariant and standard Bellman updates, effectively controlling error propagation.
- PI-MDP underpins algorithms like PE-DQN and PE-SAC, leading to improved sample efficiency and robust performance in partial symmetry settings.
A partially group-invariant Markov Decision Process (PI-MDP) generalizes the notion of group-invariant MDPs to account for environments in which the symmetries underlying state–action spaces are only locally or partially preserved. In reinforcement learning (RL), exploiting equivariance to group actions over state and action spaces provides a strong inductive bias and yields improved sample efficiency. However, real-world domains typically exhibit only approximate or local symmetries, owing to constraints and design choices that break group invariance. PI-MDPs formalize the selective exploitation of symmetries where they hold, robustly reverting to regular RL in regions where symmetry is broken and thereby preventing the catastrophic global propagation of estimation errors. This framework enables the construction of RL algorithms such as PE-DQN and PE-SAC that dynamically interpolate between equivariant and non-equivariant updates, resulting in increased robustness and generalizability (Chang et al., 30 Nov 2025, Pol et al., 2020).
1. Formal Foundations and Symmetry Structure
Let $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$ be a Markov Decision Process with measurable state space $\mathcal{S}$, finite or continuous action space $\mathcal{A}$, transition kernel $P(s' \mid s, a)$, reward function $R(s, a)$, and discount factor $\gamma \in [0, 1)$.
Consider a group $G$ acting on $\mathcal{S}$ and $\mathcal{A}$ via linear representations $L_g$ and $K_g$, respectively. A function $f$ is $G$-equivariant if $f(L_g s) = K_g f(s)$ for all $g \in G$. The MDP is called fully $G$-invariant if, for all $g \in G$ and all $(s, a, s')$,

$$P(L_g s' \mid L_g s, K_g a) = P(s' \mid s, a), \qquad R(L_g s, K_g a) = R(s, a).$$
A group-structured MDP homomorphism is defined by abstracting state–action orbits under $G$, yielding optimal-value equivalence across each orbit (Pol et al., 2020).
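As a concrete illustration of these definitions, the following sketch checks $G$-equivariance of a linear map under the C₄ rotation representation acting on 2-D states; the maps `W_equivariant` and `W_broken` are hypothetical examples for illustration, not drawn from the paper:

```python
import numpy as np

# C4 representation: rotations by multiples of 90 degrees acting on 2-D states.
def rho(k):
    c, s = np.cos(k * np.pi / 2), np.sin(k * np.pi / 2)
    return np.round(np.array([[c, -s], [s, c]]))

# A linear map f(s) = W s is rho-equivariant iff W commutes with every rho(k).
W_equivariant = 2.0 * np.eye(2)                 # scalar multiples of I commute with rotations
W_broken = np.array([[1.0, 0.0], [0.0, 3.0]])   # anisotropic scaling breaks equivariance

def is_equivariant(W, tol=1e-9):
    return all(np.allclose(W @ rho(k), rho(k) @ W, atol=tol) for k in range(4))

print(is_equivariant(W_equivariant))  # True
print(is_equivariant(W_broken))       # False
```

The broken map plays the role of a symmetry-violating component: exactly the kind of local structure the PI-MDP gating mechanism is designed to detect.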
In practice, most environments violate these equalities outside a strict subset of $\mathcal{S} \times \mathcal{A}$. The PI-MDP formalism explicitly accounts for such local symmetry-breaking using a gating function that selects equivariant or unconstrained Bellman updates according to local symmetry validity.
2. PI-MDP Definition and Bellman Operator Mixing
Let $G$ denote the full group of symmetries and $H \subseteq G$ the subgroup under which invariance is assumed to hold robustly (possibly context-dependent). For the true environment kernel $P$ and the group-averaged symmetric kernel $\bar{P}(s' \mid s, a) = \frac{1}{|G|} \sum_{g \in G} P(L_g s' \mid L_g s, K_g a)$ (with $\bar{R}$ defined analogously), define the pointwise error metrics

$$\varepsilon_P(s, a) = \big\| P(\cdot \mid s, a) - \bar{P}(\cdot \mid s, a) \big\|_{TV}, \qquad \varepsilon_R(s, a) = \big| R(s, a) - \bar{R}(s, a) \big|.$$

Symmetry is defined as broken at $(s, a)$ if either error is nonzero.
The PI-MDP defines a measurable gating function $\beta : \mathcal{S} \times \mathcal{A} \to [0, 1]$; the symmetric and asymmetric components are linearly mixed:

$$P_\beta = \beta \bar{P} + (1 - \beta) P, \qquad R_\beta = \beta \bar{R} + (1 - \beta) R.$$
The corresponding "PI-Bellman operator" is

$$(\mathcal{T}_\beta Q)(s, a) = \beta(s, a)\, (\mathcal{T}_{\mathrm{sym}} Q)(s, a) + \big(1 - \beta(s, a)\big)\, (\mathcal{T} Q)(s, a),$$

with $\mathcal{T}$ and $\mathcal{T}_{\mathrm{sym}}$ the hard-max Bellman operators in the true and group-invariant MDP, respectively.
Crucially, the affine mixture form ensures that $\mathcal{T}_\beta$ is a $\gamma$-contraction in the sup-norm, guaranteeing a unique fixed point $Q_\beta^*$ (Chang et al., 30 Nov 2025). When symmetry is entirely preserved ($\beta \equiv 1$), the dynamics are strictly equivariant; when fully broken ($\beta \equiv 0$), the system reduces to standard non-equivariant RL.
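A minimal tabular sketch of the PI-Bellman operator, using randomly generated stand-in kernels, rewards, and gate values (not the paper's benchmarks); it also checks the $\gamma$-contraction property numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9

def random_kernel():
    P = rng.random((nS, nA, nS))
    return P / P.sum(axis=-1, keepdims=True)

P, P_sym = random_kernel(), random_kernel()        # true vs. group-averaged dynamics
R, R_sym = rng.random((nS, nA)), rng.random((nS, nA))
beta = rng.random((nS, nA))                        # gate: weight on the symmetric update

def bellman(Q, Pk, Rk):
    # Hard-max Bellman backup: R + gamma * E_{s'}[max_a' Q(s', a')].
    return Rk + gamma * Pk @ Q.max(axis=1)

def T_beta(Q):
    return beta * bellman(Q, P_sym, R_sym) + (1 - beta) * bellman(Q, P, R)

# The affine mixture of two gamma-contractions is itself a gamma-contraction.
Q1, Q2 = rng.random((nS, nA)), rng.random((nS, nA))
assert np.abs(T_beta(Q1) - T_beta(Q2)).max() <= gamma * np.abs(Q1 - Q2).max() + 1e-12

# Hence fixed-point iteration converges to the unique PI value function Q_beta*.
Q = np.zeros((nS, nA))
for _ in range(500):
    Q = T_beta(Q)
print(np.abs(T_beta(Q) - Q).max())  # ~0: Q is (numerically) the fixed point
```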
3. Error Propagation and Theoretical Guarantees
The PI-MDP framework enables explicit analysis of error propagation from local symmetry-breaking, quantified via the pointwise symmetry-breaking error

$$\varepsilon(s, a) = \varepsilon_R(s, a) + \gamma\, \varepsilon_P(s, a)\, \| Q^* \|_\infty,$$

where $\varepsilon_R$ and $\varepsilon_P$ are the pointwise reward and transition errors defined above. The one-step Bellman error then admits the bound

$$\big| (\mathcal{T}_{\mathrm{sym}} Q^*)(s, a) - (\mathcal{T} Q^*)(s, a) \big| \le \varepsilon(s, a).$$
The global value gap under full symmetry breaking obeys

$$\| Q_{\mathrm{sym}}^* - Q^* \|_\infty \le \frac{\sup_{s, a} \varepsilon(s, a)}{1 - \gamma}.$$

The PI-MDP tightens this bound by zeroing out $\beta(s, a)$ wherever $\varepsilon(s, a) > 0$, controlling error propagation. Specifically, letting $Q_\beta^*$ be the fixed point of $\mathcal{T}_\beta$,

$$\| Q_\beta^* - Q^* \|_\infty \le \frac{\sup_{s, a} \beta(s, a)\, \varepsilon(s, a)}{1 - \gamma}.$$
If $\beta(s, a) = 0$ at every symmetry-breaking location, the PI-MDP solution exactly matches the true value function, $Q_\beta^* = Q^*$ (Chang et al., 30 Nov 2025).
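This recovery property can be checked numerically. The sketch below builds a hypothetical 4-state, 2-action MDP whose symmetric model is wrong at exactly one state–action pair, gates that pair off, and verifies that the PI fixed point matches the true one (all quantities are illustrative stand-ins):

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 4, 2, 0.9

P = rng.random((nS, nA, nS))
P /= P.sum(-1, keepdims=True)
R = rng.random((nS, nA))

# Symmetric model agrees everywhere except one "broken" state-action pair.
P_sym, R_sym = P.copy(), R.copy()
R_sym[2, 1] += 5.0               # local symmetry breaking in the reward

beta = np.ones((nS, nA))
beta[2, 1] = 0.0                 # gate off exactly at the broken point

def bellman(Q, Pk, Rk):
    return Rk + gamma * Pk @ Q.max(axis=1)

def fixed_point(T, iters=2000):
    Q = np.zeros((nS, nA))
    for _ in range(iters):
        Q = T(Q)
    return Q

Q_true = fixed_point(lambda Q: bellman(Q, P, R))
Q_pi = fixed_point(lambda Q: beta * bellman(Q, P_sym, R_sym)
                             + (1 - beta) * bellman(Q, P, R))
print(np.abs(Q_pi - Q_true).max())  # ~0: gating recovers the true value function
```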
4. Symmetry Detection and Gating Mechanism
Local symmetry-breaking is automatically identified via a disagreement-based gating process:
- Two one-step predictors, an equivariant model $\hat{P}_{\mathrm{eq}}$ and an unconstrained model $\hat{P}_{\mathrm{free}}$, are trained on transitions $(s, a, r, s')$.
- A disagreement score $d(s, a)$ (e.g., next-state prediction error or total variation between the predicted distributions) quantifies local symmetry validity.
- A running mean $\mu$, standard deviation $\sigma$, and threshold $\tau$ of the disagreement scores are maintained, assigning pseudo-labels $y(s, a) = \mathbb{1}[d(s, a) \le \tau]$.
- A gating network $\beta(s, a)$ is trained with a binary cross-entropy loss to predict $y$. In continuous-action settings, a state-dependent "actor-gate" $\beta(s)$ is trained via expectile regression to approximate the state-conditional gate, yielding a sampled binary gate at inference.
This gating process enables selective application of equivariant or non-equivariant network updates, ensuring that TD errors arising from symmetry assumption violations are not globally propagated (Chang et al., 30 Nov 2025).
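The pseudo-labeling step above can be sketched as follows. The simulated disagreement stream, Welford-style running statistics, threshold multiplier `kappa`, and label convention (1 = symmetry assumed valid) are all illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated stream of disagreement scores between the equivariant and
# unconstrained one-step predictors: mostly small, occasionally large
# (the large scores stand in for symmetry-breaking transitions).
scores = np.where(rng.random(1000) < 0.1,
                  rng.normal(5.0, 0.5, 1000),
                  rng.normal(0.5, 0.1, 1000))

kappa = 2.0                      # threshold multiplier (assumed)
mu, M2, n = 0.0, 0.0, 0          # Welford running mean / sum of squared deviations
labels = []
for d in scores:
    n += 1
    delta = d - mu
    mu += delta / n
    M2 += delta * (d - mu)
    sigma = np.sqrt(M2 / n) if n > 1 else 0.0
    # Pseudo-label: 1 = symmetry assumed valid (low disagreement), 0 = broken.
    labels.append(1 if d <= mu + kappa * sigma else 0)

labels = np.array(labels)
print(labels.mean())  # fraction of transitions treated as symmetric
```

The resulting labels would then supervise the gating network via binary cross-entropy, as described above.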
5. PI-MDP Algorithmic Instantiations
The framework yields Partially Equivariant RL algorithms:
PE-DQN (discrete control): The critic blends the outputs of an equivariant (e.g., MDP-homomorphic) critic and a non-equivariant MLP critic, updating via the DQN loss on the mixture and employing $\varepsilon$-greedy exploration. One-step predictors and gates are trained online from the replay buffer (Chang et al., 30 Nov 2025).
PE-SAC (continuous control): The actor is specified as a product-of-experts (PoE) mixture,

$$\pi(a \mid s) \propto \pi_{\mathrm{eq}}(a \mid s)^{\beta(s)}\, \pi_{\mathrm{free}}(a \mid s)^{1 - \beta(s)},$$

with $\beta(s) \in [0, 1]$. Q-functions are blended as in PE-DQN. Updates for critics, actor, predictors, and gates use separate parameter trunks and employ stochastic target updates (Chang et al., 30 Nov 2025).
Network architectures leverage EMLP or steerable CNNs for equivariant trunks, with standard MLPs for non-equivariant components. This design allows the system to exploit symmetry where valid, while robustly defaulting to empirically driven learning elsewhere.
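For 1-D Gaussian policy heads, the PoE blend admits a closed form: the product of the two densities raised to exponents $\beta$ and $1 - \beta$ is again Gaussian, with precision-weighted parameters. The sketch below assumes such Gaussian experts and is not the paper's exact parameterization:

```python
import math

# Product-of-experts blend of an equivariant and an unconstrained Gaussian
# policy head; the exponent beta(s) in [0, 1] would come from the actor-gate.
def poe_gaussian(mu_eq, sig_eq, mu_free, sig_free, beta):
    # Precisions add, weighted by the exponents; the mean is precision-weighted.
    prec = beta / sig_eq**2 + (1.0 - beta) / sig_free**2
    mu = (beta * mu_eq / sig_eq**2 + (1.0 - beta) * mu_free / sig_free**2) / prec
    return mu, 1.0 / math.sqrt(prec)

print(poe_gaussian(1.0, 0.5, -1.0, 1.0, 1.0))  # (1.0, 0.5)
print(poe_gaussian(1.0, 0.5, -1.0, 1.0, 0.0))  # (-1.0, 1.0)
mu, sig = poe_gaussian(1.0, 0.5, -1.0, 1.0, 0.5)
# Intermediate beta yields a precision-weighted compromise between the heads.
```

Setting $\beta = 1$ or $\beta = 0$ recovers the pure equivariant or unconstrained expert, matching the limiting behavior of the PI-Bellman operator.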
6. Empirical Behavior and Sample Efficiency
Empirical evaluation demonstrates substantial improvements in sample efficiency and robustness across various domains:
| Domain | Symmetry Group | Symmetry Breaking Source | Result for PI-MDP |
|---|---|---|---|
| Grid-World | C₄ | Fixed obstacles, reward | 2–3× more sample-efficient than all baselines; graceful degradation as obstacles increase |
| MuJoCo Locomotion | | Asymmetric parameters, R(s) | Strict equivariance collapses under large breaking; PE-SAC consistently best |
| Robotic Manipulation | | Collisions, kinematic limits | PE-SAC robust; achieves 20–30% higher returns under partial symmetry breaking |
PE-DQN and PE-SAC consistently outperform vanilla RL, strictly equivariant, and approximately equivariant baselines, especially in settings with moderate symmetry-breaking. When symmetries are almost entirely absent, the PI-MDP naturally falls back to standard RL, matching baseline performance (Chang et al., 30 Nov 2025).
7. Broader Implications and Future Directions
The PI-MDP formalism provides a theoretically justified, empirically validated approach for selectively exploiting group-induced inductive biases. The selective equivariance mechanism avoids catastrophic error propagation by reverting to unconstrained updates where group structure fails, leading to robust and efficient learning.
Current limitations include the additional computational overhead of the extra networks and gating, and reduced benefit in environments where symmetry is mostly absent (e.g., gravity-dominated tasks). Prospective research directions include:
- Integration with pixel-based state representations and equivariant CNNs.
- Automatic discovery and maintenance of the group structure $G$.
- Adaptive, hierarchical gating across state–action space (Chang et al., 30 Nov 2025).
In sum, PI-MDPs represent a principled generalization of equivariant RL, delivering efficient sample utilization and robust value estimation in environments exhibiting partial or locally valid symmetries (Chang et al., 30 Nov 2025, Pol et al., 2020).