Equivariant Diffusion Policies (EDPs)

Updated 15 December 2025
  • EDPs are visuomotor policy learning methods that combine score-based diffusion with explicit symmetry constraints for geometric equivariance.
  • They leverage groups such as SO(2), SO(3), SE(3), and SIM(3) to ensure consistent policy outputs under various spatial transformations.
  • Empirical results demonstrate high sample efficiency and robust generalization in both simulation and real-world robotic manipulation tasks.

Equivariant Diffusion Policies (EDPs) are a class of visuomotor policy learning methods that integrate score-based diffusion models with explicit symmetry constraints, imposing equivariance with respect to geometric transformation groups such as SO(2), SO(3), SE(3), or SIM(3). By leveraging domain-appropriate symmetries, EDPs achieve superior sample efficiency and generalization in both simulation and real-world robotic manipulation tasks. The theoretical core of EDPs is the design of denoising (score) networks whose outputs transform predictably under group actions, thereby conferring equivariance at the level of both the policy and the entire diffusion process. This enforces consistency across symmetric task configurations and permits robust policy extraction from limited demonstration data.

1. Theoretical Foundations of Equivariance in Diffusion Models

Central to EDPs is the concept of group equivariance: a neural network $\epsilon_\theta$ is $G$-equivariant if, for a group $G$ acting on both observations and actions, it holds that

$$\epsilon_\theta(g \cdot o,\, g \cdot a^k,\, k) = g \cdot \epsilon_\theta(o,\, a^k,\, k), \quad \forall g \in G,$$

where $o$ denotes the observation, $a^k$ the noisy action at denoising step $k$, and $g \cdot$ the group action (Wang et al., 1 Jul 2024). The diffusion process remains $G$-equivariant after marginalizing over the noise, provided the expert policy is itself $G$-equivariant: $\pi(g \cdot o) = g \cdot \pi(o)$. This property has been established for a range of groups, including SO(2) (planar rotations) (Wang et al., 1 Jul 2024), SO(3) (rotations in 3D) (Zhu et al., 2 Jul 2025), SE(3) (rigid transforms) (Ryu et al., 2023, Tie et al., 6 Nov 2024), and SIM(3) (rigid transforms plus uniform scaling) (Yang et al., 1 Jul 2024).
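As a concrete illustration, consider a toy SO(2) setting where the observation and action are planar vectors. Any map built from invariant scalar coefficients multiplying equivariant vectors satisfies the constraint above; the sketch below (the toy network and all names are illustrative, not taken from the cited papers) verifies this numerically:

```python
import numpy as np

def rot(theta):
    """The action of g in SO(2) on planar vectors: a 2D rotation matrix."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def eps_theta(o, a_k, k):
    """Toy SO(2)-equivariant denoiser: invariant scalars gate equivariant vectors.
    Rotations preserve dot products, so alpha and beta are G-invariant, and the
    output transforms exactly like its vector inputs."""
    alpha = np.tanh(np.dot(o, o) + k)   # invariant coefficient
    beta = np.tanh(np.dot(o, a_k))      # invariant coefficient
    return alpha * a_k + beta * o       # equivariant combination

o, a_k, k = np.random.randn(2), np.random.randn(2), 10
g = rot(np.pi / 3)

lhs = eps_theta(g @ o, g @ a_k, k)      # transform the inputs first
rhs = g @ eps_theta(o, a_k, k)          # transform the output instead
assert np.allclose(lhs, rhs)            # the two paths agree: equivariance holds
```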

Score-based diffusion models, notably DDPMs, add Gaussian noise to the demonstration actions through a forward Markov process, $$q(a^k \mid a^{k-1}) = \mathcal{N}\left(a^k;\, \sqrt{1 - \beta_k}\, a^{k-1},\, \beta_k I\right),$$ and learn a reverse denoising process to reconstruct the clean action sequence. Equivariant diffusion models ensure that every step, including both the forward noising and the reverse denoising, respects the target symmetry.
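A minimal sketch of this forward process (the linear $\beta$ schedule, step count, and action shape are illustrative assumptions): the stepwise chain and the closed form $a^k = \sqrt{\bar{\alpha}_k}\, a^0 + \sqrt{1 - \bar{\alpha}_k}\, \epsilon$ yield the same marginal distribution. Because isotropic Gaussian noise is invariant under rotations, the noising step commutes with any group acting by isometries.

```python
import numpy as np

K = 100                                  # number of diffusion steps (illustrative)
betas = np.linspace(1e-4, 2e-2, K)       # noise schedule (assumed linear)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)          # \bar{alpha}_k = prod_{i<=k} (1 - beta_i)

a0 = np.random.randn(16, 2)              # clean demonstration action sequence (toy)

# Stepwise forward process: repeatedly apply q(a^k | a^{k-1}).
a = a0.copy()
for k in range(K):
    a = np.sqrt(1 - betas[k]) * a + np.sqrt(betas[k]) * np.random.randn(*a.shape)

# Closed form with the same marginal: a^K = sqrt(abar_K) a^0 + sqrt(1 - abar_K) eps.
eps = np.random.randn(*a0.shape)
a_K = np.sqrt(alpha_bars[-1]) * a0 + np.sqrt(1 - alpha_bars[-1]) * eps
```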

2. Architectures and Implementation of Equivariant Diffusion Policies

EDP architectures incorporate equivariance primarily at the observation encoder, action encoder, and the denoising network. Several strategies are established:

  • Group convolutional and linear layers: Using frameworks like escnn, layers are constructed so that the weights satisfy $W \rho_{\text{in}}(g) = \rho_{\text{out}}(g) W$, ensuring that the layer commutes with the group actions (Wang et al., 1 Jul 2024); see the projection sketch after this list.
  • G-equivariant feature maps: Both observations and noisy actions are encoded into group-feature representations (e.g., regular representations of cyclic subgroups $C_u$ of SO(2)), which are processed with equivariant 1D temporal U-Nets and decoders (Wang et al., 1 Jul 2024).
  • SIM(3) and SE(3) Canonicalization: For higher-dimensional groups, features are normalized for translation and scale using centroid and scale estimation, placing scene and proprioceptive information in a canonical frame before further processing (Yang et al., 1 Jul 2024).
  • Spatiotemporal Spherical Fourier networks: For continuous SO(3) and SE(3) equivariance, states and actions are embedded in spherical Fourier space, with Spherical FiLM layers and spatiotemporal U-Nets ensuring equivariant channel mixing (Zhu et al., 2 Jul 2025).
  • Bi-equivariant GNNs: For diffusion directly on SE(3), networks compute field representations invariant/equivariant under both left and right actions, crucial for SE(3) policy extraction via Langevin sampling (Ryu et al., 2023).
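The weight constraint in the first bullet can be realized by projecting an unconstrained weight matrix onto the equivariant subspace via group averaging. The sketch below is a generic construction (not escnn's actual implementation) for the regular representation of the cyclic group $C_4$:

```python
import numpy as np

N = 4                                    # cyclic group C_4

def rho(j, n=N):
    """Regular representation of C_n: the permutation matrix for a cyclic shift by j."""
    return np.roll(np.eye(n), j, axis=0)

# Project an arbitrary W onto the equivariant subspace by averaging over the group:
#   W_eq = (1/|G|) * sum_g rho_out(g) @ W @ rho_in(g)^{-1}
W = np.random.randn(N, N)
W_eq = sum(rho(j) @ W @ rho(j).T for j in range(N)) / N   # rho(j)^{-1} = rho(j).T

# Verify the intertwining constraint W_eq rho_in(g) = rho_out(g) W_eq for all g.
for j in range(N):
    assert np.allclose(W_eq @ rho(j), rho(j) @ W_eq)
```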

Table: Representative EDP architectural strategies

| Paper | Symmetry Group | Encoder | Denoiser/U-Net |
|---|---|---|---|
| (Wang et al., 1 Jul 2024) | SO(2) | escnn $C_u$ G-conv | Equivariant U-Net |
| (Yang et al., 1 Jul 2024) | SIM(3) | PointNet++ variant | SO(3)-equivariant U-Net |
| (Zhu et al., 2 Jul 2025) | SE(3) | EquiformerV2 | Spherical Fourier U-Net |
| (Tie et al., 6 Nov 2024) | SE(3) | SE(3)-Transformer | Invariant + equivariant |
| (Ryu et al., 2023) | SE(3) | SE(3) GNN fields | Bi-equivariant U-Net |

Implementations may also leverage group-invariant representations, such as relative or delta trajectory action parameterizations in SE(3), which intrinsically encode the symmetry and allow simple (non-equivariant) diffusion heads to yield equivariant policies when paired with equivariant encoders (Wang et al., 19 May 2025).
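A short sketch of why such delta parameterizations are symmetric by construction (homogeneous 4x4 matrices; the helper names are illustrative): expressing the next end-effector pose in the current end-effector frame cancels any global SE(3) transform $g$ applied to the scene.

```python
import numpy as np

def se3(R, t):
    """Assemble a 4x4 homogeneous transform from a rotation R and translation t."""
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, t
    return T

def rand_se3():
    """Sample a random rigid transform (QR-based rotation sampling)."""
    q, _ = np.linalg.qr(np.random.randn(3, 3))
    q *= np.sign(np.linalg.det(q))          # flip sign if needed so det(R) = +1
    return se3(q, np.random.randn(3))

T_t, T_t1 = rand_se3(), rand_se3()          # world-frame ee poses at steps t, t+1
delta = np.linalg.inv(T_t) @ T_t1           # relative action in the gripper frame

g = rand_se3()                              # global SE(3) transform of the scene
delta_g = np.linalg.inv(g @ T_t) @ (g @ T_t1)
assert np.allclose(delta, delta_g)          # the g factors cancel: delta is invariant
```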

3. Training and Inference Procedures

Training EDPs typically involves a diffusion score-matching loss, $$L(\theta) = \mathbb{E}_{o, a, k, \epsilon}\left\|\epsilon_\theta\left(o,\, a^k,\, k\right) - \epsilon\right\|^2,$$ with $a^k = \sqrt{\bar{\alpha}_k}\, a + \sqrt{1 - \bar{\alpha}_k}\, \epsilon$, $\epsilon \sim \mathcal{N}(0, I)$, and $\bar{\alpha}_k = \prod_{i \le k} (1 - \beta_i)$. In some instances, an auxiliary imitation (behavior cloning) loss is included (Yang et al., 1 Jul 2024).
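A minimal PyTorch-style sketch of one training step under these definitions (the `eps_theta` denoiser, batch layout, and shapes are placeholders, not any paper's exact code):

```python
import torch

def diffusion_loss(eps_theta, o, a0, alpha_bars):
    """One score-matching step: sample a step k and noise eps per example,
    form a^k in closed form, and regress the denoiser output onto eps."""
    B = a0.shape[0]
    k = torch.randint(0, len(alpha_bars), (B,))             # random diffusion step
    ab = alpha_bars[k].view(B, *([1] * (a0.dim() - 1)))     # broadcastable \bar{alpha}_k
    eps = torch.randn_like(a0)
    a_k = ab.sqrt() * a0 + (1 - ab).sqrt() * eps            # closed-form noising
    return ((eps_theta(o, a_k, k) - eps) ** 2).mean()       # MSE against true noise
```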

Inference proceeds via iterated denoising, following the exact reverse diffusion dynamics, and, where necessary, canonicalizing the output action to the global coordinate frame (Yang et al., 1 Jul 2024, Tie et al., 6 Nov 2024). For certain architectures, policy extraction uses Langevin MCMC sampling on SE(3), yielding physically consistent 6-DoF end-effector poses (Ryu et al., 2023).
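A sketch of the corresponding iterated-denoising loop, using the standard DDPM reverse update (shapes and schedule are illustrative; a $G$-equivariant `eps_theta` plus isotropic initial noise makes the sampled action distribution inherit the symmetry):

```python
import torch

@torch.no_grad()
def sample_action(eps_theta, o, betas):
    """Start from Gaussian noise and apply the learned reverse dynamics."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    a = torch.randn(1, 16, 2)                    # initial noisy action sequence (toy)
    for k in reversed(range(len(betas))):
        eps = eps_theta(o, a, torch.tensor([k]))
        mean = (a - betas[k] / (1 - alpha_bars[k]).sqrt() * eps) / alphas[k].sqrt()
        noise = torch.randn_like(a) if k > 0 else 0.0
        a = mean + betas[k].sqrt() * noise       # one reverse denoising step
    return a
```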

The use of relative/delta action encodings in conjunction with eye-in-hand perception guarantees equivariance even if only the encoder is constructed to respect the relevant group, facilitating efficient learning and scalable implementation (Wang et al., 19 May 2025).

4. Empirical Evaluation and Benchmarking

EDPs are evaluated on simulated suites (e.g., MimicGen, Robomimic) and real-robot manipulation tasks.

  • Sample efficiency: EDPs consistently demonstrate high success rates with fewer training samples than standard (non-equivariant) diffusion policies. For example, SO(2)-equivariant EDPs achieve a 21.9% higher mean success rate on 12 MimicGen tasks versus baselines at 100 demonstrations (Wang et al., 1 Jul 2024). SIM(3)-equivariant models retain high performance using just 25 demonstrations (Yang et al., 1 Jul 2024), and SE(3)-equivariant methods require as few as 5–10 demonstrations to reach >90% real-robot success rates on representative tasks (Ryu et al., 2023).
  • Generalization: EDPs exhibit robust performance under domain shifts corresponding to novel scene rotations, translations, and scalings. For example, SIM(3)-equivariant EDPs show a <5% performance drop across all out-of-distribution (OOD) settings, whereas non-equivariant baselines drop by 30–60% (Yang et al., 1 Jul 2024). Spherical Fourier EDPs maintain 0.92 average success under SE(3) tilting versus 0.45 for $C_8$-equivariant baselines (Zhu et al., 2 Jul 2025).
  • Ablations: Removing explicit equivariance or switching to non-symmetric encoders or action parameterizations incurs 10–18% relative drops in task success (Wang et al., 1 Jul 2024, Wang et al., 19 May 2025, Zhu et al., 2 Jul 2025).

5. Steering and Fine-Tuning: Equivariant RL with EDPs

"Steering" refers to optimizing pre-trained EDPs on downstream reward signals via reinforcement learning. The symmetry-aware steering framework recognizes that if the EDP and environment dynamics are GG-equivariant, the corresponding latent-noise MDP inherits group-invariant reward and transition structure (Park et al., 12 Dec 2025). Three steering strategies are compared:

  • Standard RL (no symmetry constraints): High sample complexity, instability in value estimation, and poor OOD generalization.
  • Strict equivariant RL: Actor networks are $G$-equivariant and critics are $G$-invariant; sample efficient and stable, but brittle under real-world symmetry breaking (a symmetrization sketch for such a critic follows this list).
  • Approximate equivariant RL: Includes both equivariant and non-equivariant residuals, trading off stability and robustness for imperfect symmetry (Park et al., 12 Dec 2025).
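One simple way to realize the $G$-invariant critic mentioned above, shown for a discretized planar rotation group: symmetrize an unconstrained Q-network by averaging it over group elements. This is a generic construction for illustration, not the specific architecture of (Park et al., 12 Dec 2025):

```python
import torch

def rot2d(theta):
    """2D rotation matrix as a torch tensor."""
    c, s = torch.cos(theta), torch.sin(theta)
    return torch.stack([torch.stack([c, -s]), torch.stack([s, c])])

def invariant_q(q_net, o, a, n=4):
    """C_n-invariant critic: Q(o, a) = (1/n) * sum_j q_net(R_j o, R_j a).
    Replacing (o, a) by (R_k o, R_k a) merely permutes the summands,
    so the averaged output is unchanged."""
    vals = []
    for j in range(n):
        R = rot2d(torch.tensor(2 * torch.pi * j / n))
        vals.append(q_net(o @ R.T, a @ R.T))
    return torch.stack(vals).mean(dim=0)
```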

Empirically, strict and approximate equivariant steering achieve rapid policy improvement, particularly in high-symmetry tasks. For instance, "Equi-DSRL" and "Approx-Equi-DSRL" steering reach peak success rates of 0.84/0.82 on Lift, 0.73/0.80 on Stack D1, and 0.64/0.60 on Square D2, while standard RL lags behind and often exhibits training divergence.

6. Design Choices, Practical Guidelines, and Limitations

  • Choice of symmetry group: Selection depends on the task's invariances: SO(2) for planar tasks, SO(3) for full orientation, SE(3) for rigid-body motion, SIM(3) for similarity transforms. Incorrect or overly strict symmetry constraints can be detrimental when the environment breaks the assumed symmetries (e.g., due to joint limits or asymmetric dynamics) (Park et al., 12 Dec 2025).
  • Equivariant vs. invariant representations: Relative/delta action representations with eye-in-hand perception ensure SE(3) invariance and are easier to implement, achieving success rates within 2.5% of full voxel-based SE(3)-equivariant models while enabling the use of simple U-Net diffusion heads (Wang et al., 19 May 2025).
  • Network complexity: Fully end-to-end equivariant architectures (e.g., SE(3) spherical U-Nets) incur higher implementation complexity and are most justified for tasks with strong, explicit symmetries. Frame Averaging offers a computationally inexpensive alternative for incorporating symmetry into pre-trained vision encoders (Wang et al., 19 May 2025); a simplified sketch follows this list.
  • Future directions: Exploring broader groups (e.g., reflections, permutations), accommodating approximate symmetry, and extending to hierarchical or multi-task scenarios remain open problems. Efficient equivariant architectures for high-dimensional or continuous groups and integration with online fine-tuning in partially symmetric real-world settings are prominent challenges (Wang et al., 1 Jul 2024, Park et al., 12 Dec 2025, Tie et al., 6 Nov 2024).
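As a rough illustration of averaging-based symmetrization for pre-trained vision encoders (a simplified group-averaging variant; Frame Averaging proper averages over a small data-dependent set of frames rather than the whole group):

```python
import torch

def c4_invariant_features(encoder, img):
    """Cheap symmetrization of a pre-trained encoder: average its features over
    the four 90-degree image rotations (the C_4 subgroup of SO(2)), which yields
    C_4-invariant features without retraining the encoder."""
    feats = [encoder(torch.rot90(img, k, dims=(-2, -1))) for k in range(4)]
    return torch.stack(feats).mean(dim=0)
```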

7. Comparative Summary of Major EDP Variants

| Approach / Paper | Symmetry | Core Encoder | Action Rep | Notable Result |
|---|---|---|---|---|
| Equivariant Diff. Policy (Wang et al., 1 Jul 2024) | SO(2), $C_u$ | G-equivariant CNN | Absolute/relative | +21.9% vs. DP on MimicGen; robust to low data |
| EquiBot (Yang et al., 1 Jul 2024) | SIM(3) | PointNet++ | Absolute | <5% OOD drop, 80% success with 10 demos |
| Spherical Diff. Policy (Zhu et al., 2 Jul 2025) | SE(3) | EquiformerV2 | Relative | 0.92 success, +61% vs. EquiDiff |
| Practical Guide (Wang et al., 19 May 2025) | SE(3) (via relative actions) | FrameAvg G-CNN/ResNet | Relative/delta | Simple to implement, +14.7% gain over baseline |
| ET-SEED (Tie et al., 6 Nov 2024) | SE(3) | SE(3)-Transformer | Absolute | >70% generalization under new pose |
| Diffusion-EDFs (Ryu et al., 2023) | SE(3) | Bi-equivariant GNN | Absolute* | >90% real-robot success with 5–10 demos |

*Actions are defined directly on the SE(3) manifold, with diffusion implemented as Brownian motion.
