
Equivariant Soft Actor-Critic (Equi-SAC)

Updated 19 December 2025
  • Equi-SAC is a reinforcement learning algorithm that embeds exact SO(2)-equivariance through steerable group convolutions in both actor and critic networks.
  • The architecture integrates equivariant convolutional layers and group-max pooling to ensure invariant value estimations and consistent policy outputs under rotational transformations.
  • Empirical evaluations demonstrate that Equi-SAC significantly improves sample efficiency and task performance in robotic visual-control benchmarks compared to standard CNN-based methods.

Equivariant Soft Actor-Critic (Equi-SAC) is a reinforcement learning (RL) algorithm that enforces exact SO(2)-equivariance in both actor and critic networks via steerable group convolutions. By embedding symmetry properties at the algorithmic and architectural levels, Equi-SAC achieves substantial improvements in sample efficiency, particularly in robotic manipulation tasks characterized by underlying rotational symmetries (Wang et al., 2022).

1. Theoretical Foundations

Equi-SAC is grounded in the mathematical theory of group actions and equivariant representations. Consider the group G = SO(2) of planar rotations, approximated in practice by its discrete cyclic subgroup C_n = { Rot_θ | θ = 2πi/n, i = 0, …, n−1 }.

States are encoded as m-channel images F_s : ℝ² → ℝ^m, with the action of a group element g ∈ C_n implemented as a rotation in pixel space,

(g F_s)(x, y) = F_s(g⁻¹(x, y)),

where channel values follow the trivial representation ρ0.
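For the discrete subgroup C_4, this pixel-space action is just an array rotation. A minimal NumPy sketch (illustrative only; the actual observation pipeline may differ):

```python
import numpy as np

# A toy single-channel "depth image" F_s on a 3x3 grid.
F_s = np.array([[0., 1., 0.],
                [0., 0., 2.],
                [0., 0., 0.]])

# g in C_4: rotation by 90 degrees acting on pixel coordinates.
# (g F_s)(x, y) = F_s(g^-1(x, y)) corresponds to rotating the array.
gF_s = np.rot90(F_s)

# Applying the generator n = 4 times recovers the identity action.
assert np.array_equal(np.rot90(F_s, 4), F_s)

# Channel values are untouched (trivial representation rho_0):
# the multiset of pixel values is preserved.
assert sorted(F_s.ravel()) == sorted(gF_s.ravel())
```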

Actions a ∈ A ⊂ ℝ^k are partitioned as

a = (a_equiv, a_inv),

with a_equiv ∈ ℝ² transforming under the standard representation ρ1 and a_inv under the trivial representation ρ0; that is,

g a = (Rot_θ a_equiv, a_inv).
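This two-part action transform can be sketched numerically (a NumPy illustration, not the paper's code; component names are for exposition):

```python
import numpy as np

def rotate_action(a_equiv, a_inv, theta):
    """Apply g = Rot_theta: standard rep rho_1 on a_equiv, trivial rho_0 on a_inv."""
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return R @ a_equiv, a_inv  # a_inv is left unchanged

a_equiv = np.array([1.0, 0.0])   # e.g. a planar displacement (dx, dy)
a_inv   = np.array([0.3, -0.1])  # e.g. gripper and z components (illustrative)

ga_equiv, ga_inv = rotate_action(a_equiv, a_inv, np.pi / 2)
# Rotating (1, 0) by 90 degrees gives (0, 1); the invariant part is untouched.
assert np.allclose(ga_equiv, [0.0, 1.0])
assert np.array_equal(ga_inv, a_inv)
```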

Under these definitions, the optimal value and policy functions exhibit strict group-theoretic structure:

Q*(g s, g a) = Q*(s, a),    π*(g s) = g π*(s).

Equi-SAC enforces these properties by construction:

  • Critic invariance: Q(g s, g a) = Q(s, a).
  • Actor equivariance: π(g s) = g π(s).

2. Network Architectures

Equivariant layers in Equi-SAC are implemented using steerable group convolutions, whose kernels K satisfy the steerability constraint

K(g x) = ρ_out(g) K(x) ρ_in(g)⁻¹   for all g ∈ C_n, x ∈ ℝ²,

where ρ_in and ρ_out are the representations of the input and output feature fields. These layers are instantiated with the E2CNN library and respect the rotational symmetries of C_n.
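The equivariance of such group convolutions can be checked numerically. The sketch below builds a C_4 lifting convolution in plain NumPy (for illustration only; Equi-SAC uses E2CNN's steerable layers): rotating the input rotates each output map and cyclically permutes the group channels.

```python
import numpy as np

def correlate_same(img, kernel):
    """2-D cross-correlation with zero padding ('same' output size)."""
    K = kernel.shape[0]
    pad = K // 2
    p = np.pad(img, pad)
    H, W = img.shape
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(p[i:i + K, j:j + K] * kernel)
    return out

def lift_c4(img, kernel):
    """Lifting convolution over C_4: one channel per rotated copy of the kernel."""
    return np.stack([correlate_same(img, np.rot90(kernel, g)) for g in range(4)])

rng = np.random.default_rng(0)
img = rng.standard_normal((7, 7))
kernel = rng.standard_normal((3, 3))

feat = lift_c4(img, kernel)                # features of the original image
feat_rot = lift_c4(np.rot90(img), kernel)  # features of the rotated image

# Equivariance: rotating the input rotates each map and shifts group channels.
for g in range(4):
    assert np.allclose(feat_rot[g], np.rot90(feat[(g - 1) % 4]))
```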

Actor Network

  • Input: a single-channel depth image (trivial representation ρ0).
  • Core: a stack of steerable convolutional layers with ReLU activations, equivariant under C_n.
  • Output: a mixed-representation tensor comprising:
    • One ρ1 vector (two degrees of freedom) for the mean of a_equiv.
    • Eight ρ0 scalars for the invariant action means and the Gaussian standard deviations.

The steerable basis enables a single convolution to handle all rotated filter versions.

Critic Network

  • Encoder: steerable convolutional layers mapping the depth-image observation to regular-representation features.
  • Action concatenation: the action (one ρ1 vector plus ρ0 scalars) is appended to the encoder output.
  • Heads: two Q-functions, each a stack of two convolutional layers (regular to trivial representation), concluding with max-pooling over group channels to overcome Schur's-Lemma constraints.
  • Output: two scalar Q-values, Q1(s, a) and Q2(s, a).
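The effect of the group-max pooling in the critic heads can be illustrated numerically: combining a C_4-lifted convolution with a rotation-invariant spatial reduction and a max over group channels yields an output unchanged by input rotations. A self-contained NumPy sketch (not the paper's E2CNN implementation):

```python
import numpy as np

def corr(img, k):
    """Zero-padded 'same' cross-correlation."""
    K, pad = k.shape[0], k.shape[0] // 2
    p = np.pad(img, pad)
    return np.array([[np.sum(p[i:i + K, j:j + K] * k)
                      for j in range(img.shape[1])]
                     for i in range(img.shape[0])])

def invariant_readout(img, kernel):
    """Lift over C_4, spatially sum each map, then max-pool over the group."""
    group_feats = [corr(img, np.rot90(kernel, g)).sum() for g in range(4)]
    return max(group_feats)

rng = np.random.default_rng(1)
img = rng.standard_normal((7, 7))
kernel = rng.standard_normal((3, 3))

# The readout is invariant: any C_4 rotation of the input gives the same value,
# because rotation merely permutes the group channels before the max.
vals = [invariant_readout(np.rot90(img, g), kernel) for g in range(4)]
assert np.allclose(vals, vals[0])
```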

Architecture Sketch

depth image (ρ0) → steerable convs (C_n) → actor head → (ρ1 vector, ρ0 scalars)
depth image (ρ0) + action → steerable convs (C_n) → group-max pool → Q1, Q2

3. Algorithm and Training Protocol

The Equi-SAC learning process follows an adaptation of the standard Soft Actor-Critic (SAC) framework, with explicit equivariant/invariant losses and update rules:

  1. Replay Buffer: Transitions are stored in a replay buffer, optionally seeded with demonstration data.
  2. Action Sampling: The actor π outputs a mean and standard deviation; actions are generated via the reparameterization trick. Equivariant components are rotated into a base frame before execution.
  3. Critic Update: The target value is computed using soft target networks and incorporates the log-probability under the policy, maintaining critic invariance under C_n.
  4. Actor Update: The loss combines entropy maximization with critic evaluation, enforcing actor equivariance through the network structure.
  5. Temperature Update: The entropy temperature α is optionally updated to match a target entropy.
  6. Target Network: Soft (Polyak) updates with rate τ are used for the critic target networks.

All convolutional operations inherit C_n-equivariance, ensuring that both actor and critic respect the symmetry constraints throughout optimization.
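Steps 3 and 6 above reduce to a few lines of generic SAC arithmetic. A sketch with illustrative numbers (the function names and values are assumptions, not the paper's code); note that because Q is C_n-invariant, a rotated transition yields the same target:

```python
import numpy as np

def soft_target(r, done, q1_targ, q2_targ, log_pi, gamma=0.99, alpha=0.2):
    """Soft Bellman backup: y = r + gamma * (1 - done) * (min Q' - alpha * log pi)."""
    return r + gamma * (1.0 - done) * (np.minimum(q1_targ, q2_targ) - alpha * log_pi)

def polyak_update(target, online, tau=0.005):
    """Soft target-network update: theta' <- tau * theta + (1 - tau) * theta'."""
    return tau * online + (1.0 - tau) * target

# Illustrative values for a single transition.
y = soft_target(r=0.0, done=0.0, q1_targ=2.0, q2_targ=1.5, log_pi=-1.0)
assert np.isclose(y, 0.99 * (1.5 + 0.2))  # 1.683

w_targ = polyak_update(target=np.array([1.0]), online=np.array([2.0]))
assert np.isclose(w_targ[0], 1.005)
```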

Pseudocode (abridged)

initialize actor π, critics Q1, Q2, target critics Q1′, Q2′, replay buffer D
for each environment step:
    sample a ~ π(s); execute a; store (s, a, r, s′) in D
    sample a minibatch from D
    a′ ~ π(s′)
    y ← r + γ (min(Q1′(s′, a′), Q2′(s′, a′)) − α log π(a′ | s′))
    update Q1, Q2 by minimizing (Qi(s, a) − y)²
    ã ~ π(s); update π by maximizing min(Q1(s, ã), Q2(s, ã)) − α log π(ã | s)
    optionally update α toward the target entropy
    soft-update targets: θi′ ← τ θi + (1 − τ) θi′

4. Empirical Evaluation

Benchmarks

Equi-SAC was benchmarked on six robotic visual-control tasks:

  • Simple: Block Pulling, Object Picking, Drawer Opening
  • Hard: Block Stacking, House Building, Corner Picking

Each environment used depth images as state observations, a continuous action space, and sparse rewards (+1 on task success).

Sample Efficiency

Steps required to achieve 80% success rate (mean across 4 seeds):

Task          Equi-SAC   CNN-SAC   DrQ    RAD    FERM
Block Pull    45k        180k      150k   170k   130k
Object Pick   80k        ×         ×      ×      ×
Drawer Open   90k        ×         ×      ×      ×

(× indicates failure to solve within 300k steps.)

Ablation Study

Effect of equivariant actors and critics (final reward after 200k steps):

Task   EqActor+EqCrit   EqActor+CNNCrit   CNNActor+EqCrit
Pull   0.97             0.82              0.75
Pick   0.92             0.70              0.65
Open   0.95             0.78              0.72

C_8 symmetry (eightfold rotation) consistently outperformed C_4 (fourfold) in 5 of 6 domains.

5. Implementation and Practical Considerations

Key engineering and hyperparameter guidelines:

  • Steerable CNNs: Use E2CNN or similar libraries for group convolutions. Specify input/output representations (ρ_in, ρ_out) to enable weight sharing across group actions.
  • Critic Pooling: Apply group-max pooling before the final scalar output to avoid overconstraint (Schur's Lemma).
  • Action Augmentation: When employing buffer augmentations, rotate a_equiv by the same SO(2) element applied to the state.
  • Critical Hyperparameters:
    • Soft update rate τ for the target networks
    • Actor and critic learning rates (Adam optimizer)
    • Entropy temperature initialization and target entropy
    • Batch size
    • Replay buffer capacity
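The action-augmentation rule above can be sketched as follows, assuming C_4 with np.rot90 as the image-space action (the `augment` helper is hypothetical, for illustration):

```python
import numpy as np

def augment(state_img, a_equiv, a_inv, k):
    """Apply the same g = Rot_(k * 90deg) in C_4 to both state and action."""
    theta = k * np.pi / 2
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    # Image rotates in pixel space; a_equiv rotates by R; a_inv is fixed.
    return np.rot90(state_img, k), R @ a_equiv, a_inv

state = np.arange(9.0).reshape(3, 3)
a_equiv = np.array([1.0, 0.0])
a_inv = np.array([0.5])

s2, ae2, ai2 = augment(state, a_equiv, a_inv, k=1)
assert np.allclose(ae2, [0.0, 1.0])   # (1, 0) rotated 90 degrees CCW
assert np.array_equal(ai2, a_inv)     # invariant part untouched
assert np.array_equal(s2, np.rot90(state))
```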

This structure ensures exact C_n-equivariance, resulting in marked improvements in both sample efficiency and final policy performance on visual-control benchmarks.

6. Significance and Context

Equi-SAC demonstrates that explicitly modeling group symmetry within RL architectures can lead to substantial empirical gains in data efficiency and task performance, particularly for domains where physical laws or robot/environment geometry induce SO(2)-invariant MDPs. These results provide a basis for adopting symmetry-preserving architectures in broader robot learning applications and for extending equivariant RL principles to other symmetry groups beyond SO(2) (Wang et al., 2022).

References

Wang, D., Walters, R., & Platt, R. (2022). SO(2)-Equivariant Reinforcement Learning. International Conference on Learning Representations (ICLR).
