
Equivariant Soft Actor-Critic (Equi-SAC)

Updated 19 December 2025
  • Equi-SAC is a reinforcement learning algorithm that embeds exact SO(2)-equivariance through steerable group convolutions in both actor and critic networks.
  • The architecture integrates equivariant convolutional layers and group-max pooling to ensure invariant value estimations and consistent policy outputs under rotational transformations.
  • Empirical evaluations demonstrate that Equi-SAC significantly improves sample efficiency and task performance in robotic visual-control benchmarks compared to standard CNN-based methods.

Equivariant Soft Actor-Critic (Equi-SAC) is a reinforcement learning (RL) algorithm that enforces exact SO(2)-equivariance in both actor and critic networks via steerable group convolutions. By embedding symmetry properties at the algorithmic and architectural levels, Equi-SAC achieves substantial improvements in sample efficiency, particularly in robotic manipulation tasks characterized by underlying rotational symmetries (Wang et al., 2022).

1. Theoretical Foundations

Equi-SAC is grounded in the mathematical theory of group actions and equivariant representations. Consider the group G = SO(2) of planar rotations, approximated in practice by its discrete cyclic subgroup C_n = { Rot_θ | θ = 2πi/n, i = 0, …, n−1 }.

States are encoded as m-channel images F_s : ℝ² → ℝ^m, with the action of a group element g ∈ C_n implemented as a rotation in pixel space,

(g F_s)(x, y) = F_s(g⁻¹(x, y)),

where channel values follow the trivial representation ρ0.
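For the discrete subgroup C_4, this pixel-space action is just an array rotation. A minimal NumPy sketch (illustrative only; the actual observation pipeline may differ):

```python
import numpy as np

# A toy single-channel "depth image" F_s on a 3x3 grid.
F_s = np.array([[0., 1., 0.],
                [0., 0., 2.],
                [0., 0., 0.]])

# g in C_4: rotation by 90 degrees acting on pixel coordinates.
# (g F_s)(x, y) = F_s(g^-1(x, y)) corresponds to rotating the array.
gF_s = np.rot90(F_s)

# Applying the generator n = 4 times recovers the identity action.
assert np.array_equal(np.rot90(F_s, 4), F_s)

# Channel values are untouched (trivial representation rho_0):
# the multiset of pixel values is preserved.
assert sorted(F_s.ravel()) == sorted(gF_s.ravel())
```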

Actions a ∈ A ⊂ ℝ^k are partitioned as

a = (a_equiv, a_inv),

with a_equiv ∈ ℝ² transforming under the standard representation ρ1 and a_inv under the trivial representation ρ0; that is,

g a = (Rot_θ a_equiv, a_inv).
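This two-part action transform can be sketched numerically (a NumPy illustration, not the paper's code; component names are for exposition):

```python
import numpy as np

def rotate_action(a_equiv, a_inv, theta):
    """Apply g = Rot_theta: standard rep rho_1 on a_equiv, trivial rho_0 on a_inv."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return R @ a_equiv, a_inv  # a_inv is left unchanged

a_equiv = np.array([1.0, 0.0])   # e.g. a planar displacement (dx, dy)
a_inv   = np.array([0.3, -0.1])  # e.g. gripper and z components (illustrative)

ga_equiv, ga_inv = rotate_action(a_equiv, a_inv, np.pi / 2)
# Rotating (1, 0) by 90 degrees gives (0, 1); the invariant part is untouched.
assert np.allclose(ga_equiv, [0.0, 1.0])
assert np.array_equal(ga_inv, a_inv)
```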

Under these definitions, the optimal value and policy functions exhibit strict group-theoretic structure:

Q*(g s, g a) = Q*(s, a),    π*(g s) = g π*(s).

Equi-SAC enforces these properties by construction:

  • Critic invariance: Q(g s, g a) = Q(s, a).
  • Actor equivariance: π(g s) = g π(s).

2. Network Architectures

Equivariant layers in Equi-SAC are implemented using steerable group convolutions, whose kernels K satisfy the steerability constraint

K(g x) = ρ_out(g) K(x) ρ_in(g)⁻¹   for all g ∈ C_n, x ∈ ℝ²,

where ρ_in and ρ_out are the representations of the input and output feature fields. These layers are instantiated with the E2CNN library and respect the rotational symmetries of C_n.
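The equivariance of such group convolutions can be checked numerically. The sketch below builds a C_4 lifting convolution in plain NumPy (for illustration only; Equi-SAC uses E2CNN's steerable layers): rotating the input rotates each output map and cyclically permutes the group channels.

```python
import numpy as np

def correlate_same(img, kernel):
    """2-D cross-correlation with zero padding ('same' output size)."""
    K = kernel.shape[0]
    pad = K // 2
    p = np.pad(img, pad)
    H, W = img.shape
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(p[i:i + K, j:j + K] * kernel)
    return out

def lift_c4(img, kernel):
    """Lifting convolution over C_4: one channel per rotated copy of the kernel."""
    return np.stack([correlate_same(img, np.rot90(kernel, g)) for g in range(4)])

rng = np.random.default_rng(0)
img = rng.standard_normal((7, 7))
kernel = rng.standard_normal((3, 3))

feat = lift_c4(img, kernel)                # features of the original image
feat_rot = lift_c4(np.rot90(img), kernel)  # features of the rotated image

# Equivariance: rotating the input rotates each map and shifts group channels.
for g in range(4):
    assert np.allclose(feat_rot[g], np.rot90(feat[(g - 1) % 4]))
```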

Actor Network

  • Input: a single-channel depth image (trivial representation ρ0).
  • Core: a stack of steerable convolutional layers with ReLU activations, equivariant under C_n.
  • Output: a mixed-representation tensor comprising:
    • One ρ1 vector (two degrees of freedom) for the mean of a_equiv.
    • Eight ρ0 scalars for the invariant action means and the Gaussian standard deviations.

The steerable basis enables a single convolution to handle all rotated filter versions.

Critic Network

  • Encoder: steerable convolutional layers mapping the depth-image observation to regular-representation features.
  • Action concatenation: the action (one ρ1 vector plus ρ0 scalars) is appended to the encoder output.
  • Heads: two Q-functions, each a stack of two convolutional layers (regular to trivial representation), concluding with max-pooling over group channels to overcome Schur's-Lemma constraints.
  • Output: two scalar Q-values, Q1(s, a) and Q2(s, a).
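The effect of the group-max pooling in the critic heads can be illustrated numerically: combining a C_4-lifted convolution with a rotation-invariant spatial reduction and a max over group channels yields an output unchanged by input rotations. A self-contained NumPy sketch (not the paper's E2CNN implementation):

```python
import numpy as np

def corr(img, k):
    """Zero-padded 'same' cross-correlation."""
    K, pad = k.shape[0], k.shape[0] // 2
    p = np.pad(img, pad)
    return np.array([[np.sum(p[i:i + K, j:j + K] * k)
                      for j in range(img.shape[1])]
                     for i in range(img.shape[0])])

def invariant_readout(img, kernel):
    """Lift over C_4, spatially sum each map, then max-pool over the group."""
    group_feats = [corr(img, np.rot90(kernel, g)).sum() for g in range(4)]
    return max(group_feats)

rng = np.random.default_rng(1)
img = rng.standard_normal((7, 7))
kernel = rng.standard_normal((3, 3))

# The readout is invariant: any C_4 rotation of the input gives the same value,
# because rotation merely permutes the group channels before the max.
vals = [invariant_readout(np.rot90(img, g), kernel) for g in range(4)]
assert np.allclose(vals, vals[0])
```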

Architecture Sketch

depth image (ρ0) → steerable convs (C_n) → actor head → (ρ1 vector, ρ0 scalars)
depth image (ρ0) + action → steerable convs (C_n) → group-max pool → Q1, Q2

3. Algorithm and Training Protocol

The Equi-SAC learning process follows an adaptation of the standard Soft Actor-Critic (SAC) framework, with explicit equivariant/invariant losses and update rules:

  1. Replay Buffer: Transitions are stored in a replay buffer, optionally seeded with demonstration data.
  2. Action Sampling: The actor π outputs a mean and standard deviation; actions are generated via the reparameterization trick. Equivariant components are rotated into a base frame before execution.
  3. Critic Update: The target value is computed using soft target networks and incorporates the log-probability under the policy, maintaining critic invariance under C_n.
  4. Actor Update: The loss combines entropy maximization with critic evaluation, enforcing actor equivariance through the network structure.
  5. Temperature Update: The entropy temperature α is optionally updated to match a target entropy.
  6. Target Network: Soft (Polyak) updates with rate τ are used for the critic target networks.

All convolutional operations inherit C_n-equivariance, ensuring that both actor and critic respect the symmetry constraints throughout optimization.
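Steps 3 and 6 above reduce to a few lines of generic SAC arithmetic. A sketch with illustrative numbers (the function names and values are assumptions, not the paper's code); note that because Q is C_n-invariant, a rotated transition yields the same target:

```python
import numpy as np

def soft_target(r, done, q1_targ, q2_targ, log_pi, gamma=0.99, alpha=0.2):
    """Soft Bellman backup: y = r + gamma * (1 - done) * (min Q' - alpha * log pi)."""
    return r + gamma * (1.0 - done) * (np.minimum(q1_targ, q2_targ) - alpha * log_pi)

def polyak_update(target, online, tau=0.005):
    """Soft target-network update: theta' <- tau * theta + (1 - tau) * theta'."""
    return tau * online + (1.0 - tau) * target

# Illustrative values for a single transition.
y = soft_target(r=0.0, done=0.0, q1_targ=2.0, q2_targ=1.5, log_pi=-1.0)
assert np.isclose(y, 0.99 * (1.5 + 0.2))  # 1.683

w_targ = polyak_update(target=np.array([1.0]), online=np.array([2.0]))
assert np.isclose(w_targ[0], 1.005)
```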

Pseudocode (abridged)

initialize actor π, critics Q1, Q2, target critics Q1′, Q2′, replay buffer D
for each environment step:
    sample a ~ π(s); execute a; store (s, a, r, s′) in D
    sample a minibatch from D
    a′ ~ π(s′)
    y ← r + γ (min(Q1′(s′, a′), Q2′(s′, a′)) − α log π(a′ | s′))
    update Q1, Q2 by minimizing (Qi(s, a) − y)²
    ã ~ π(s); update π by maximizing min(Q1(s, ã), Q2(s, ã)) − α log π(ã | s)
    optionally update α toward the target entropy
    soft-update targets: θi′ ← τ θi + (1 − τ) θi′

4. Empirical Evaluation

Benchmarks

Equi-SAC was benchmarked on six robotic visual-control tasks:

  • Simple: Block Pulling, Object Picking, Drawer Opening
  • Hard: Block Stacking, House Building, Corner Picking

Each environment used depth images as state observations, a continuous action space, and sparse rewards (+1 on task success).

Sample Efficiency

Steps required to achieve 80% success rate (mean across 4 seeds):

Task          Equi-SAC   CNN-SAC   DrQ    RAD    FERM
Block Pull    45k        180k      150k   170k   130k
Object Pick   80k        ×         ×      ×      ×
Drawer Open   90k        ×         ×      ×      ×

(× indicates failure to solve within 300k steps.)

Ablation Study

Effect of equivariant actors and critics (final reward after 200k steps):

Task   EqActor+EqCrit   EqActor+CNNCrit   CNNActor+EqCrit
Pull   0.97             0.82              0.75
Pick   0.92             0.70              0.65
Open   0.95             0.78              0.72

C_8 symmetry (eightfold rotation) consistently outperformed C_4 (fourfold) in 5 of 6 domains.

5. Implementation and Practical Considerations

Key engineering and hyperparameter guidelines:

  • Steerable CNNs: Use E2CNN or similar libraries for group convolutions. Specify input/output representations (ρ_in, ρ_out) to enable weight sharing across group actions.
  • Critic Pooling: Apply group-max pooling before the final scalar output to avoid overconstraint (Schur's Lemma).
  • Action Augmentation: When employing buffer augmentations, rotate a_equiv by the same SO(2) element applied to the state.
  • Critical Hyperparameters:
    • Soft update rate τ for the target networks
    • Actor and critic learning rates (Adam optimizer)
    • Entropy temperature initialization and target entropy
    • Batch size
    • Replay buffer capacity
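The action-augmentation rule above can be sketched as follows, assuming C_4 with np.rot90 as the image-space action (the `augment` helper is hypothetical, for illustration):

```python
import numpy as np

def augment(state_img, a_equiv, a_inv, k):
    """Apply the same g = Rot_(k * 90deg) in C_4 to both state and action."""
    theta = k * np.pi / 2
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    # Image rotates in pixel space; a_equiv rotates by R; a_inv is fixed.
    return np.rot90(state_img, k), R @ a_equiv, a_inv

state = np.arange(9.0).reshape(3, 3)
a_equiv = np.array([1.0, 0.0])
a_inv = np.array([0.5])

s2, ae2, ai2 = augment(state, a_equiv, a_inv, k=1)
assert np.allclose(ae2, [0.0, 1.0])   # (1, 0) rotated 90 degrees CCW
assert np.array_equal(ai2, a_inv)     # invariant part untouched
assert np.array_equal(s2, np.rot90(state))
```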

This structure ensures exact C_n-equivariance, resulting in marked improvements in both sample efficiency and final policy performance on visual-control benchmarks.

6. Significance and Context

Equi-SAC demonstrates that explicitly modeling group symmetry within RL architectures can lead to substantial empirical gains in data efficiency and task performance, particularly for domains where physical laws or robot/environment geometry induce SO(2)-invariant MDPs. These results provide a basis for adopting symmetry-preserving architectures in broader robot learning applications and for extending equivariant RL principles to other symmetry groups beyond SO(2) (Wang et al., 2022).

References

Wang, D., Walters, R., & Platt, R. (2022). SO(2)-Equivariant Reinforcement Learning. International Conference on Learning Representations (ICLR).
