Equivariant Diffusion Policies (EDPs)

Updated 15 December 2025
  • EDPs are visuomotor policy learning methods that combine score-based diffusion with explicit symmetry constraints for geometric equivariance.
  • They leverage groups such as SO(2), SO(3), SE(3), and SIM(3) to ensure consistent policy outputs under various spatial transformations.
  • Empirical results demonstrate high sample efficiency and robust generalization in both simulation and real-world robotic manipulation tasks.

Equivariant Diffusion Policies (EDPs) are a class of visuomotor policy learning methods that integrate score-based diffusion models with explicit symmetry constraints, imposing equivariance with respect to geometric transformation groups such as SO(2), SO(3), SE(3), or SIM(3). By leveraging domain-appropriate symmetries, EDPs achieve superior sample efficiency and generalization in both simulation and real-world robotic manipulation tasks. The theoretical core of EDPs is the design of denoising (score) networks whose outputs transform predictably under group actions, thereby conferring equivariance at the level of both the policy and the entire diffusion process. This enforces consistency across symmetric task configurations and permits robust policy extraction from limited demonstration data.

1. Theoretical Foundations of Equivariance in Diffusion Models

Central to EDPs is the concept of group equivariance: a neural network $\epsilon_\theta$ is $G$-equivariant if, for a group $G$ acting on both observations and actions, it holds that

$$\epsilon_\theta(g \cdot o,\, g \cdot a^k,\, k) = g \cdot \epsilon_\theta(o,\, a^k,\, k), \quad \forall g \in G,$$

where $o$ denotes the observation, $a^k$ the noisy action at denoising step $k$, and $g \cdot$ the group action (Wang et al., 1 Jul 2024). The diffusion process remains $G$-equivariant after marginalizing over the noise, provided the expert policy is itself $G$-equivariant: $\pi(g \cdot o) = g \cdot \pi(o)$. This property has been established for a range of groups, including SO(2) (planar rotations) (Wang et al., 1 Jul 2024), SO(3) (rotations in 3D) (Zhu et al., 2 Jul 2025), SE(3) (rigid transforms) (Ryu et al., 2023, Tie et al., 6 Nov 2024), and SIM(3) (rigid transforms plus uniform scaling) (Yang et al., 1 Jul 2024).
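As a concrete illustration, consider a toy SO(2) setting where the observation and action are planar vectors. Any map built from invariant scalar coefficients multiplying equivariant vectors satisfies the constraint above; the sketch below (the toy network and all names are illustrative, not taken from the cited papers) verifies this numerically:

```python
import numpy as np

def rot(theta):
    """The action of g in SO(2) on planar vectors: a 2D rotation matrix."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def eps_theta(o, a_k, k):
    """Toy SO(2)-equivariant denoiser: invariant scalars gate equivariant vectors.
    Rotations preserve dot products, so alpha and beta are G-invariant, and the
    output transforms exactly like its vector inputs."""
    alpha = np.tanh(np.dot(o, o) + k)   # invariant coefficient
    beta = np.tanh(np.dot(o, a_k))      # invariant coefficient
    return alpha * a_k + beta * o       # equivariant combination

o, a_k, k = np.random.randn(2), np.random.randn(2), 10
g = rot(np.pi / 3)

lhs = eps_theta(g @ o, g @ a_k, k)      # transform the inputs first
rhs = g @ eps_theta(o, a_k, k)          # transform the output instead
assert np.allclose(lhs, rhs)            # the two paths agree: equivariance holds
```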

Score-based diffusion models, notably DDPMs, add Gaussian noise to the demonstration actions through a forward Markov process, $$q(a^k \mid a^{k-1}) = \mathcal{N}\left(a^k;\, \sqrt{1 - \beta_k}\, a^{k-1},\, \beta_k I\right),$$ and learn a reverse denoising process to reconstruct the clean action sequence. Equivariant diffusion models ensure that every step, including both the forward noising and the reverse denoising, respects the target symmetry.
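A minimal sketch of this forward process (the linear $\beta$ schedule, step count, and action shape are illustrative assumptions): the stepwise chain and the closed form $a^k = \sqrt{\bar{\alpha}_k}\, a^0 + \sqrt{1 - \bar{\alpha}_k}\, \epsilon$ yield the same marginal distribution. Because isotropic Gaussian noise is invariant under rotations, the noising step commutes with any group acting by isometries.

```python
import numpy as np

K = 100                                  # number of diffusion steps (illustrative)
betas = np.linspace(1e-4, 2e-2, K)       # noise schedule (assumed linear)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)          # \bar{alpha}_k = prod_{i<=k} (1 - beta_i)

a0 = np.random.randn(16, 2)              # clean demonstration action sequence (toy)

# Stepwise forward process: repeatedly apply q(a^k | a^{k-1}).
a = a0.copy()
for k in range(K):
    a = np.sqrt(1 - betas[k]) * a + np.sqrt(betas[k]) * np.random.randn(*a.shape)

# Closed form with the same marginal: a^K = sqrt(abar_K) a^0 + sqrt(1 - abar_K) eps.
eps = np.random.randn(*a0.shape)
a_K = np.sqrt(alpha_bars[-1]) * a0 + np.sqrt(1 - alpha_bars[-1]) * eps
```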

2. Architectures and Implementation of Equivariant Diffusion Policies

EDP architectures incorporate equivariance primarily at the observation encoder, action encoder, and the denoising network. Several strategies are established:

  • Group convolutional and linear layers: Using frameworks like escnn, layers are constructed so that the weights satisfy $W \rho_{\text{in}}(g) = \rho_{\text{out}}(g) W$, ensuring that the layer commutes with the group actions (Wang et al., 1 Jul 2024); see the projection sketch after this list.
  • G-equivariant feature maps: Both observations and noisy actions are encoded into group-feature representations (e.g., regular representations of cyclic subgroups $C_u$ of SO(2)), which are processed with equivariant 1D temporal U-Nets and decoders (Wang et al., 1 Jul 2024).
  • SIM(3) and SE(3) Canonicalization: For higher-dimensional groups, features are normalized for translation and scale using centroid and scale estimation, placing scene and proprioceptive information in a canonical frame before further processing (Yang et al., 1 Jul 2024).
  • Spatiotemporal Spherical Fourier networks: For continuous SO(3) and SE(3) equivariance, states and actions are embedded in spherical Fourier space, with Spherical FiLM layers and spatiotemporal U-Nets ensuring equivariant channel mixing (Zhu et al., 2 Jul 2025).
  • Bi-equivariant GNNs: For diffusion directly on SE(3), networks compute field representations invariant/equivariant under both left and right actions, crucial for SE(3) policy extraction via Langevin sampling (Ryu et al., 2023).
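The weight constraint in the first bullet can be realized by projecting an unconstrained weight matrix onto the equivariant subspace via group averaging. The sketch below is a generic construction (not escnn's actual implementation) for the regular representation of the cyclic group $C_4$:

```python
import numpy as np

N = 4                                    # cyclic group C_4

def rho(j, n=N):
    """Regular representation of C_n: the permutation matrix for a cyclic shift by j."""
    return np.roll(np.eye(n), j, axis=0)

# Project an arbitrary W onto the equivariant subspace by averaging over the group:
#   W_eq = (1/|G|) * sum_g rho_out(g) @ W @ rho_in(g)^{-1}
W = np.random.randn(N, N)
W_eq = sum(rho(j) @ W @ rho(j).T for j in range(N)) / N   # rho(j)^{-1} = rho(j).T

# Verify the intertwining constraint W_eq rho_in(g) = rho_out(g) W_eq for all g.
for j in range(N):
    assert np.allclose(W_eq @ rho(j), rho(j) @ W_eq)
```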

Table: Representative EDP architectural strategies

| Paper | Symmetry Group | Encoder | Denoiser/U-Net |
|---|---|---|---|
| (Wang et al., 1 Jul 2024) | SO(2) | escnn $C_u$ G-conv | Equivariant U-Net |
| (Yang et al., 1 Jul 2024) | SIM(3) | PointNet++ variant | SO(3)-equivariant U-Net |
| (Zhu et al., 2 Jul 2025) | SE(3) | EquiformerV2 | Spherical Fourier U-Net |
| (Tie et al., 6 Nov 2024) | SE(3) | SE(3)-Transformer | Invariant + equivariant |
| (Ryu et al., 2023) | SE(3) | SE(3) GNN fields | Bi-equivariant U-Net |

Implementations may also leverage group-invariant representations, such as relative or delta trajectory action parameterizations in SE(3), which intrinsically encode the symmetry and allow simple (non-equivariant) diffusion heads to yield equivariant policies when paired with equivariant encoders (Wang et al., 19 May 2025).
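A short sketch of why such delta parameterizations are symmetric by construction (homogeneous 4x4 matrices; the helper names are illustrative): expressing the next end-effector pose in the current end-effector frame cancels any global SE(3) transform $g$ applied to the scene.

```python
import numpy as np

def se3(R, t):
    """Assemble a 4x4 homogeneous transform from a rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

def rand_se3():
    """Sample a random rigid transform (QR-based rotation sampling)."""
    q, _ = np.linalg.qr(np.random.randn(3, 3))
    q *= np.sign(np.linalg.det(q))          # flip sign if needed so det(R) = +1
    return se3(q, np.random.randn(3))

T_t, T_t1 = rand_se3(), rand_se3()          # world-frame ee poses at steps t, t+1
delta = np.linalg.inv(T_t) @ T_t1           # relative action in the gripper frame

g = rand_se3()                              # global SE(3) transform of the scene
delta_g = np.linalg.inv(g @ T_t) @ (g @ T_t1)
assert np.allclose(delta, delta_g)          # the g factors cancel: delta is invariant
```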

3. Training and Inference Procedures

Training EDPs typically involves a diffusion score-matching loss, $$L(\theta) = \mathbb{E}_{o, a, k, \epsilon}\left\|\epsilon_\theta\left(o,\, a^k,\, k\right) - \epsilon\right\|^2,$$ with $a^k = \sqrt{\bar{\alpha}_k}\, a + \sqrt{1 - \bar{\alpha}_k}\, \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$, and $\bar{\alpha}_k = \prod_{i \le k} (1 - \beta_i)$. In some instances, an auxiliary imitation (behavior cloning) loss is included (Yang et al., 1 Jul 2024).
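A minimal PyTorch-style sketch of one training step under these definitions (the `eps_theta` denoiser, batch layout, and shapes are placeholders, not any paper's exact code):

```python
import torch

def diffusion_loss(eps_theta, o, a0, alpha_bars):
    """One score-matching step: sample a step k and noise eps per example,
    form a^k in closed form, and regress the denoiser output onto eps."""
    B = a0.shape[0]
    k = torch.randint(0, len(alpha_bars), (B,))             # random diffusion step
    ab = alpha_bars[k].view(B, *([1] * (a0.dim() - 1)))     # broadcastable \bar{alpha}_k
    eps = torch.randn_like(a0)
    a_k = ab.sqrt() * a0 + (1 - ab).sqrt() * eps            # closed-form noising
    return ((eps_theta(o, a_k, k) - eps) ** 2).mean()       # MSE against true noise
```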

Inference proceeds via iterated denoising, following the exact reverse diffusion dynamics, and, where necessary, canonicalizing the output action to the global coordinate frame (Yang et al., 1 Jul 2024, Tie et al., 6 Nov 2024). For certain architectures, policy extraction uses Langevin MCMC sampling on SE(3), yielding physically consistent 6-DoF end-effector poses (Ryu et al., 2023).
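A sketch of the corresponding iterated-denoising loop, using the standard DDPM reverse update (shapes and schedule are illustrative; a $G$-equivariant `eps_theta` plus isotropic initial noise makes the sampled action distribution inherit the symmetry):

```python
import torch

@torch.no_grad()
def sample_action(eps_theta, o, betas):
    """Start from Gaussian noise and apply the learned reverse dynamics."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    a = torch.randn(1, 16, 2)                    # initial noisy action sequence (toy)
    for k in reversed(range(len(betas))):
        eps = eps_theta(o, a, torch.tensor([k]))
        mean = (a - betas[k] / (1 - alpha_bars[k]).sqrt() * eps) / alphas[k].sqrt()
        noise = torch.randn_like(a) if k > 0 else 0.0
        a = mean + betas[k].sqrt() * noise       # one reverse denoising step
    return a
```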

The use of relative/delta action encodings in conjunction with eye-in-hand perception guarantees equivariance even if only the encoder is constructed to respect the relevant group, facilitating efficient learning and scalable implementation (Wang et al., 19 May 2025).

4. Empirical Evaluation and Benchmarking

EDPs are evaluated on simulated suites (e.g., MimicGen, Robomimic) and real-robot manipulation tasks.

  • Sample efficiency: EDPs consistently demonstrate high success rates with fewer training samples than standard (non-equivariant) diffusion policies. For example, SO(2)-equivariant EDPs achieve a 21.9% higher mean success rate on 12 MimicGen tasks versus baselines at 100 demonstrations (Wang et al., 1 Jul 2024). SIM(3)-equivariant models retain high performance using just 25 demonstrations (Yang et al., 1 Jul 2024), and SE(3)-equivariant methods require as few as 5–10 demonstrations to reach >90% real-robot success rates on representative tasks (Ryu et al., 2023).
  • Generalization: EDPs exhibit robust performance under domain shifts corresponding to novel scene rotations, translations, and scalings. For example, SIM(3)-equivariant EDPs show a <5% performance drop across all out-of-distribution (OOD) settings, whereas non-equivariant baselines drop by 30–60% (Yang et al., 1 Jul 2024). Spherical Fourier EDPs maintain 0.92 average success under SE(3) tilting versus 0.45 for $C_8$-equivariant baselines (Zhu et al., 2 Jul 2025).
  • Ablations: Removing explicit equivariance or switching to non-symmetric encoders or action parameterizations incurs 10–18% relative drops in task success (Wang et al., 1 Jul 2024, Wang et al., 19 May 2025, Zhu et al., 2 Jul 2025).

5. Steering and Fine-Tuning: Equivariant RL with EDPs

"Steering" refers to optimizing pre-trained EDPs on downstream reward signals via reinforcement learning. The symmetry-aware steering framework recognizes that if the EDP and environment dynamics are GG-equivariant, the corresponding latent-noise MDP inherits group-invariant reward and transition structure (Park et al., 12 Dec 2025). Three steering strategies are compared:

  • Standard RL (no symmetry constraints): High sample complexity, instability in value estimation, and poor OOD generalization.
  • Strict equivariant RL: Actor networks are $G$-equivariant and critics are $G$-invariant; sample efficient and stable, but brittle under real-world symmetry breaking (a symmetrization sketch for such a critic follows this list).
  • Approximate equivariant RL: Includes both equivariant and non-equivariant residuals, trading off stability and robustness for imperfect symmetry (Park et al., 12 Dec 2025).
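One simple way to realize the $G$-invariant critic mentioned above, shown for a discretized planar rotation group: symmetrize an unconstrained Q-network by averaging it over group elements. This is a generic construction for illustration, not the specific architecture of (Park et al., 12 Dec 2025):

```python
import torch

def rot2d(theta):
    """2D rotation matrix as a torch tensor."""
    c, s = torch.cos(theta), torch.sin(theta)
    return torch.stack([torch.stack([c, -s]), torch.stack([s, c])])

def invariant_q(q_net, o, a, n=4):
    """C_n-invariant critic: Q(o, a) = (1/n) * sum_j q_net(R_j o, R_j a).
    Replacing (o, a) by (R_k o, R_k a) merely permutes the summands,
    so the averaged output is unchanged."""
    vals = []
    for j in range(n):
        R = rot2d(torch.tensor(2 * torch.pi * j / n))
        vals.append(q_net(o @ R.T, a @ R.T))
    return torch.stack(vals).mean(dim=0)
```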

Empirically, strict and approximate equivariant steering achieve rapid policy improvement, particularly in high-symmetry tasks. For instance, "Equi-DSRL" and "Approx-Equi-DSRL" steering reach peak success rates of 0.84/0.82 on Lift, 0.73/0.80 on Stack D1, and 0.64/0.60 on Square D2, while standard RL lags behind and often exhibits training divergence.

6. Design Choices, Practical Guidelines, and Limitations

  • Choice of symmetry group: Selection depends on the task's invariances: SO(2) for planar tasks, SO(3) for full orientation, SE(3) for rigid-body motion, SIM(3) for similarity transforms. Incorrect or overly strict symmetry constraints can be detrimental when the environment breaks the assumed symmetries (e.g., due to joint limits or asymmetric dynamics) (Park et al., 12 Dec 2025).
  • Equivariant vs. invariant representations: Relative/delta action representations with eye-in-hand perception ensure SE(3) invariance and are easier to implement, achieving success rates within 2.5% of full voxel-based SE(3)-equivariant models while enabling the use of simple U-Net diffusion heads (Wang et al., 19 May 2025).
  • Network complexity: Fully end-to-end equivariant architectures (e.g., SE(3) spherical U-Nets) incur higher implementation complexity and are most justified for tasks with strong, explicit symmetries. Frame Averaging offers a computationally inexpensive alternative for incorporating symmetry into pre-trained vision encoders (Wang et al., 19 May 2025); a simplified sketch follows this list.
  • Future directions: Exploring broader groups (e.g., reflections, permutations), accommodating approximate symmetry, and extending to hierarchical or multi-task scenarios remain open problems. Efficient equivariant architectures for high-dimensional or continuous groups and integration with online fine-tuning in partially symmetric real-world settings are prominent challenges (Wang et al., 1 Jul 2024, Park et al., 12 Dec 2025, Tie et al., 6 Nov 2024).
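As a rough illustration of averaging-based symmetrization for pre-trained vision encoders (a simplified group-averaging variant; Frame Averaging proper averages over a small data-dependent set of frames rather than the whole group):

```python
import torch

def c4_invariant_features(encoder, img):
    """Cheap symmetrization of a pre-trained encoder: average its features over
    the four 90-degree image rotations (the C_4 subgroup of SO(2)), which yields
    C_4-invariant features without retraining the encoder."""
    feats = [encoder(torch.rot90(img, k, dims=(-2, -1))) for k in range(4)]
    return torch.stack(feats).mean(dim=0)
```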

7. Comparative Summary of Major EDP Variants

| Approach / Paper | Symmetry | Core Encoder | Action Rep | Notable Result |
|---|---|---|---|---|
| Equivariant Diff. Policy (Wang et al., 1 Jul 2024) | SO(2), $C_u$ | G-equivariant CNN | Absolute/relative | +21.9% vs. DP on MimicGen; robust to low data |
| EquiBot (Yang et al., 1 Jul 2024) | SIM(3) | PointNet++ | Absolute | <5% OOD drop, 80% success with 10 demos |
| Spherical Diff. Policy (Zhu et al., 2 Jul 2025) | SE(3) | EquiformerV2 | Relative | 0.92 success, +61% vs. EquiDiff |
| Practical Guide (Wang et al., 19 May 2025) | SE(3) (via relative actions) | FrameAvg G-CNN/ResNet | Relative/delta | Simple to implement, +14.7% gain over baseline |
| ET-SEED (Tie et al., 6 Nov 2024) | SE(3) | SE(3)-Transformer | Absolute | >70% generalization under new pose |
| Diffusion-EDFs (Ryu et al., 2023) | SE(3) | Bi-equivariant GNN | Absolute* | >90% real-robot success with 5–10 demos |

*Actions are defined directly on the SE(3) manifold, with diffusion implemented as Brownian motion.
