Partially Group-Invariant MDP (PI-MDP)
- PI-MDP is a framework that generalizes group-invariant MDPs to handle environments with locally preserved symmetries.
- It uses a state–action gating function to mix equivariant and standard Bellman updates, effectively controlling error propagation.
- PI-MDP underpins algorithms like PE-DQN and PE-SAC, leading to improved sample efficiency and robust performance in partial symmetry settings.
A partially group-invariant Markov Decision Process (PI-MDP) generalizes the notion of group-invariant MDPs to account for environments in which the symmetries underlying state–action spaces are only locally or partially preserved. In reinforcement learning (RL), exploiting equivariance to group actions over state and action spaces provides a strong inductive bias and yields improved sample efficiency. However, real-world domains typically exhibit only approximate or local symmetries, owing to constraints and design choices that break group invariance. PI-MDPs formalize the selective exploitation of symmetries where they hold, robustly reverting to regular RL in regions where symmetry is broken and thereby preventing the catastrophic global propagation of estimation errors. This framework enables the construction of RL algorithms such as PE-DQN and PE-SAC that dynamically interpolate between equivariant and non-equivariant updates, resulting in increased robustness and generalizability (Chang et al., 30 Nov 2025, Pol et al., 2020).
1. Formal Foundations and Symmetry Structure
Let $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$ be a Markov Decision Process with measurable state space $\mathcal{S}$, finite or continuous action space $\mathcal{A}$, transition kernel $P(s' \mid s, a)$, reward function $R(s, a)$, and discount factor $\gamma \in [0, 1)$.
Consider a group $G$ acting on $\mathcal{S}$ and $\mathcal{A}$ via linear representations $L_g$ and $K_g$, respectively. A function $f$ is $G$-equivariant if $f(L_g s) = K_g f(s)$ for all $g \in G$. The MDP is called fully $G$-invariant if, for all $g \in G$ and all $(s, a, s')$,

$$P(L_g s' \mid L_g s, K_g a) = P(s' \mid s, a), \qquad R(L_g s, K_g a) = R(s, a).$$
A group-structured MDP homomorphism is defined by abstracting state–action orbits under $G$, yielding optimal-value equivalence across each orbit (Pol et al., 2020).
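As a concrete illustration of these definitions, the following sketch checks $G$-equivariance of a linear map under the C₄ rotation representation acting on 2-D states; the maps `W_equivariant` and `W_broken` are hypothetical examples for illustration, not drawn from the paper:

```python
import numpy as np

# C4 representation: rotations by multiples of 90 degrees acting on 2-D states.
def rho(k):
    c, s = np.cos(k * np.pi / 2), np.sin(k * np.pi / 2)
    return np.round(np.array([[c, -s], [s, c]]))

# A linear map f(s) = W s is rho-equivariant iff W commutes with every rho(k).
W_equivariant = 2.0 * np.eye(2)                 # scalar multiples of I commute with rotations
W_broken = np.array([[1.0, 0.0], [0.0, 3.0]])   # anisotropic scaling breaks equivariance

def is_equivariant(W, tol=1e-9):
    return all(np.allclose(W @ rho(k), rho(k) @ W, atol=tol) for k in range(4))

print(is_equivariant(W_equivariant))  # True
print(is_equivariant(W_broken))       # False
```

The broken map plays the role of a symmetry-violating component: exactly the kind of local structure the PI-MDP gating mechanism is designed to detect.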
In practice, most environments violate these equalities outside a strict subset of $\mathcal{S} \times \mathcal{A}$. The PI-MDP formalism explicitly accounts for such local symmetry-breaking using a gating function that selects equivariant or unconstrained Bellman updates according to local symmetry validity.
2. PI-MDP Definition and Bellman Operator Mixing
Let $G$ denote the full group of symmetries and $H \subseteq G$ the subgroup under which invariance is assumed to hold robustly (possibly context-dependent). For the true environment kernel $P$ and the group-averaged symmetric kernel $\bar{P}(s' \mid s, a) = \frac{1}{|G|} \sum_{g \in G} P(L_g s' \mid L_g s, K_g a)$ (with $\bar{R}$ defined analogously), define the pointwise error metrics

$$\varepsilon_P(s, a) = \big\| P(\cdot \mid s, a) - \bar{P}(\cdot \mid s, a) \big\|_{TV}, \qquad \varepsilon_R(s, a) = \big| R(s, a) - \bar{R}(s, a) \big|.$$

Symmetry is defined as broken at $(s, a)$ if either error is nonzero.
The PI-MDP defines a measurable gating function $\beta : \mathcal{S} \times \mathcal{A} \to [0, 1]$; the symmetric and asymmetric components are linearly mixed:

$$P_\beta = \beta \bar{P} + (1 - \beta) P, \qquad R_\beta = \beta \bar{R} + (1 - \beta) R.$$
The corresponding "PI-Bellman operator" is

$$(\mathcal{T}_\beta Q)(s, a) = \beta(s, a)\, (\mathcal{T}_{\mathrm{sym}} Q)(s, a) + \big(1 - \beta(s, a)\big)\, (\mathcal{T} Q)(s, a),$$

with $\mathcal{T}$ and $\mathcal{T}_{\mathrm{sym}}$ the hard-max Bellman operators in the true and group-invariant MDP, respectively.
Crucially, the affine mixture form ensures that $\mathcal{T}_\beta$ is a $\gamma$-contraction in the sup-norm, guaranteeing a unique fixed point $Q_\beta^*$ (Chang et al., 30 Nov 2025). When symmetry is entirely preserved ($\beta \equiv 1$), the dynamics are strictly equivariant; when fully broken ($\beta \equiv 0$), the system reduces to standard non-equivariant RL.
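A minimal tabular sketch of the PI-Bellman operator, using randomly generated stand-in kernels, rewards, and gate values (not the paper's benchmarks); it also checks the $\gamma$-contraction property numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 2, 0.9

def random_kernel():
    P = rng.random((nS, nA, nS))
    return P / P.sum(axis=-1, keepdims=True)

P, P_sym = random_kernel(), random_kernel()        # true vs. group-averaged dynamics
R, R_sym = rng.random((nS, nA)), rng.random((nS, nA))
beta = rng.random((nS, nA))                        # gate: weight on the symmetric update

def bellman(Q, Pk, Rk):
    # Hard-max Bellman backup: R + gamma * E_{s'}[max_a' Q(s', a')].
    return Rk + gamma * Pk @ Q.max(axis=1)

def T_beta(Q):
    return beta * bellman(Q, P_sym, R_sym) + (1 - beta) * bellman(Q, P, R)

# The affine mixture of two gamma-contractions is itself a gamma-contraction.
Q1, Q2 = rng.random((nS, nA)), rng.random((nS, nA))
assert np.abs(T_beta(Q1) - T_beta(Q2)).max() <= gamma * np.abs(Q1 - Q2).max() + 1e-12

# Hence fixed-point iteration converges to the unique PI value function Q_beta*.
Q = np.zeros((nS, nA))
for _ in range(500):
    Q = T_beta(Q)
print(np.abs(T_beta(Q) - Q).max())  # ~0: Q is (numerically) the fixed point
```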
3. Error Propagation and Theoretical Guarantees
The PI-MDP framework enables explicit analysis of error propagation from local symmetry-breaking, quantified via the pointwise symmetry-breaking error

$$\varepsilon(s, a) = \varepsilon_R(s, a) + \gamma\, \varepsilon_P(s, a)\, \| Q^* \|_\infty,$$

where $\varepsilon_R$ and $\varepsilon_P$ are the pointwise reward and transition errors defined above. The one-step Bellman error then admits the bound

$$\big| (\mathcal{T}_{\mathrm{sym}} Q^*)(s, a) - (\mathcal{T} Q^*)(s, a) \big| \le \varepsilon(s, a).$$
The global value gap under full symmetry breaking obeys

$$\| Q_{\mathrm{sym}}^* - Q^* \|_\infty \le \frac{\sup_{s, a} \varepsilon(s, a)}{1 - \gamma}.$$

The PI-MDP tightens this bound by zeroing out $\beta(s, a)$ wherever $\varepsilon(s, a) > 0$, controlling error propagation. Specifically, letting $Q_\beta^*$ be the fixed point of $\mathcal{T}_\beta$,

$$\| Q_\beta^* - Q^* \|_\infty \le \frac{\sup_{s, a} \beta(s, a)\, \varepsilon(s, a)}{1 - \gamma}.$$
If $\beta(s, a) = 0$ at every symmetry-breaking location, the PI-MDP solution exactly matches the true value function, $Q_\beta^* = Q^*$ (Chang et al., 30 Nov 2025).
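This recovery property can be checked numerically. The sketch below builds a hypothetical 4-state, 2-action MDP whose symmetric model is wrong at exactly one state–action pair, gates that pair off, and verifies that the PI fixed point matches the true one (all quantities are illustrative stand-ins):

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 4, 2, 0.9

P = rng.random((nS, nA, nS))
P /= P.sum(-1, keepdims=True)
R = rng.random((nS, nA))

# Symmetric model agrees everywhere except one "broken" state-action pair.
P_sym, R_sym = P.copy(), R.copy()
R_sym[2, 1] += 5.0               # local symmetry breaking in the reward

beta = np.ones((nS, nA))
beta[2, 1] = 0.0                 # gate off exactly at the broken point

def bellman(Q, Pk, Rk):
    return Rk + gamma * Pk @ Q.max(axis=1)

def fixed_point(T, iters=2000):
    Q = np.zeros((nS, nA))
    for _ in range(iters):
        Q = T(Q)
    return Q

Q_true = fixed_point(lambda Q: bellman(Q, P, R))
Q_pi = fixed_point(lambda Q: beta * bellman(Q, P_sym, R_sym)
                             + (1 - beta) * bellman(Q, P, R))
print(np.abs(Q_pi - Q_true).max())  # ~0: gating recovers the true value function
```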
4. Symmetry Detection and Gating Mechanism
Local symmetry-breaking is automatically identified via a disagreement-based gating process:
- Two one-step predictors, an equivariant model $\hat{P}_{\mathrm{eq}}$ and an unconstrained model $\hat{P}_{\mathrm{free}}$, are trained on transitions $(s, a, r, s')$.
- A disagreement score $d(s, a)$ (e.g., next-state prediction error or total variation between the predicted distributions) quantifies local symmetry validity.
- A running mean $\mu$, standard deviation $\sigma$, and threshold $\tau$ of the disagreement scores are maintained, assigning pseudo-labels $y(s, a) = \mathbb{1}[d(s, a) \le \tau]$.
- A gating network $\beta(s, a)$ is trained with a binary cross-entropy loss to predict $y$. In continuous-action settings, a state-dependent "actor-gate" $\beta(s)$ is trained via expectile regression to approximate the state-conditional gate, yielding a sampled binary gate at inference.
This gating process enables selective application of equivariant or non-equivariant network updates, ensuring that TD errors arising from symmetry assumption violations are not globally propagated (Chang et al., 30 Nov 2025).
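The pseudo-labeling step above can be sketched as follows. The simulated disagreement stream, Welford-style running statistics, threshold multiplier `kappa`, and label convention (1 = symmetry assumed valid) are all illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated stream of disagreement scores between the equivariant and
# unconstrained one-step predictors: mostly small, occasionally large
# (the large scores stand in for symmetry-breaking transitions).
scores = np.where(rng.random(1000) < 0.1,
                  rng.normal(5.0, 0.5, 1000),
                  rng.normal(0.5, 0.1, 1000))

kappa = 2.0                      # threshold multiplier (assumed)
mu, M2, n = 0.0, 0.0, 0          # Welford running mean / sum of squared deviations
labels = []
for d in scores:
    n += 1
    delta = d - mu
    mu += delta / n
    M2 += delta * (d - mu)
    sigma = np.sqrt(M2 / n) if n > 1 else 0.0
    # Pseudo-label: 1 = symmetry assumed valid (low disagreement), 0 = broken.
    labels.append(1 if d <= mu + kappa * sigma else 0)

labels = np.array(labels)
print(labels.mean())  # fraction of transitions treated as symmetric
```

The resulting labels would then supervise the gating network via binary cross-entropy, as described above.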
5. PI-MDP Algorithmic Instantiations
The framework yields Partially Equivariant RL algorithms:
PE-DQN (discrete control): The critic blends the outputs of an equivariant (e.g., MDP-homomorphic) critic and a non-equivariant MLP critic, updating via the DQN loss on the mixture and employing $\varepsilon$-greedy exploration. One-step predictors and gates are trained online from the replay buffer (Chang et al., 30 Nov 2025).
PE-SAC (continuous control): The actor is specified as a product-of-experts (PoE) mixture,

$$\pi(a \mid s) \propto \pi_{\mathrm{eq}}(a \mid s)^{\beta(s)}\, \pi_{\mathrm{free}}(a \mid s)^{1 - \beta(s)},$$

with $\beta(s) \in [0, 1]$. Q-functions are blended as in PE-DQN. Updates for critics, actor, predictors, and gates use separate parameter trunks and employ stochastic target updates (Chang et al., 30 Nov 2025).
Network architectures leverage EMLP or steerable CNNs for equivariant trunks, with standard MLPs for non-equivariant components. This design allows the system to exploit symmetry where valid, while robustly defaulting to empirically driven learning elsewhere.
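For 1-D Gaussian policy heads, the PoE blend admits a closed form: the product of the two densities raised to exponents $\beta$ and $1 - \beta$ is again Gaussian, with precision-weighted parameters. The sketch below assumes such Gaussian experts and is not the paper's exact parameterization:

```python
import math

# Product-of-experts blend of an equivariant and an unconstrained Gaussian
# policy head; the exponent beta(s) in [0, 1] would come from the actor-gate.
def poe_gaussian(mu_eq, sig_eq, mu_free, sig_free, beta):
    # Precisions add, weighted by the exponents; the mean is precision-weighted.
    prec = beta / sig_eq**2 + (1.0 - beta) / sig_free**2
    mu = (beta * mu_eq / sig_eq**2 + (1.0 - beta) * mu_free / sig_free**2) / prec
    return mu, 1.0 / math.sqrt(prec)

print(poe_gaussian(1.0, 0.5, -1.0, 1.0, 1.0))  # (1.0, 0.5)
print(poe_gaussian(1.0, 0.5, -1.0, 1.0, 0.0))  # (-1.0, 1.0)
mu, sig = poe_gaussian(1.0, 0.5, -1.0, 1.0, 0.5)
# Intermediate beta yields a precision-weighted compromise between the heads.
```

Setting $\beta = 1$ or $\beta = 0$ recovers the pure equivariant or unconstrained expert, matching the limiting behavior of the PI-Bellman operator.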
6. Empirical Behavior and Sample Efficiency
Empirical evaluation demonstrates substantial improvements in sample efficiency and robustness across various domains:
| Domain | Symmetry Group | Symmetry Breaking Source | Result for PI-MDP |
|---|---|---|---|
| Grid-World | C₄ | Fixed obstacles, reward | 2–3× more sample-efficient than all baselines; graceful degradation as obstacles increase |
| MuJoCo Locomotion | | Asymmetric parameters, R(s) | Strict equivariance collapses under large breaking; PE-SAC consistently best |
| Robotic Manipulation | | Collisions, kinematic limits | PE-SAC robust; achieves 20–30% higher returns under partial symmetry breaking |
PE-DQN and PE-SAC consistently outperform vanilla RL, strictly equivariant, and approximately equivariant baselines, especially in settings with moderate symmetry-breaking. When symmetries are almost entirely absent, the PI-MDP naturally falls back to standard RL, matching baseline performance (Chang et al., 30 Nov 2025).
7. Broader Implications and Future Directions
The PI-MDP formalism provides a theoretically justified, empirically validated approach for selectively exploiting group-induced inductive biases. The selective equivariance mechanism avoids catastrophic error propagation by reverting to unconstrained updates where group structure fails, leading to robust and efficient learning.
Current limitations include the additional computational overhead of the extra networks and gating, and reduced benefit in environments where symmetry is mostly absent (e.g., gravity-dominated tasks). Prospective research directions include:
- Integration with pixel-based state representations and equivariant CNNs.
- Automatic discovery and maintenance of the group structure $G$.
- Adaptive, hierarchical gating across state–action space (Chang et al., 30 Nov 2025).
In sum, PI-MDPs represent a principled generalization of equivariant RL, delivering efficient sample utilization and robust value estimation in environments exhibiting partial or locally valid symmetries (Chang et al., 30 Nov 2025, Pol et al., 2020).