Equivariant Sparse Autoencoders (E-SAEs)

Updated 5 January 2026

Equivariant Sparse Autoencoders combine sparsity with group-symmetry enforcement to yield structured latent representations that transform predictably under transformations such as rotations.
They use a TopK sparse encoder and learn transformation matrices that align latent features with group actions, improving interpretability at a small cost in sparsity and reconstruction error.
Experiments show that E-SAEs achieve higher probe classification F1 scores and offer more mechanistic insight than non-equivariant sparse autoencoders.

Equivariant Sparse Autoencoders (E-SAEs) are a class of representation learning models that combine the interpretability advantages of sparse autoencoders (SAEs) with explicit incorporation of group symmetries, such as spatial rotations. By enforcing equivariance constraints with respect to these symmetries, E-SAEs yield latent features that transform predictably under group actions, facilitating more structured decompositions and downstream interpretability, especially in domains where scientific data naturally exhibit such symmetries (Erdogan et al., 12 Nov 2025).

1. Mathematical Formulation of Group Equivariance

E-SAEs are built around the concept of $G$-equivariance, where $G$ is a group of symmetries (e.g., the cyclic group of 90° rotations, $C_4$). The group $G$ acts on the input space $\mathcal{X} \subset \mathbb{R}^n$ via

$g \cdot x \in \mathcal{X}, \quad \forall g \in G, x \in \mathcal{X},$

satisfying the group-action axioms $e \cdot x = x$ and $(g_1 g_2) \cdot x = g_1 \cdot (g_2 \cdot x)$, where $e$ is the identity element of $G$. The encoder $f: \mathcal{X} \to \mathbb{R}^m$ and a linear representation $\rho: G \to \mathrm{GL}(m)$ are trained such that

$f(g \cdot x) = \rho(g)\, f(x), \quad \forall g \in G, x \in \mathcal{X}.$

This ensures that the latent space is structured to reflect the symmetries present in the input domain.
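The equivariance condition can be checked numerically on a toy case. The sketch below is illustrative only: it assumes $C_4$ acts on small square grids by 90° rotation, uses a hypothetical encoder `f` that merely flattens the grid, and builds `rho` as the matching permutation matrix. None of these helper names come from the source.

```python
def rot90(img):
    """Rotate a square grid (list of rows) by 90 degrees counter-clockwise."""
    n = len(img)
    return [[img[c][n - 1 - r] for c in range(n)] for r in range(n)]


def act(k, img):
    """Action of g^k in C_4: apply the 90-degree rotation k times (mod 4)."""
    for _ in range(k % 4):
        img = rot90(img)
    return img


def f(img):
    """Toy encoder (illustrative assumption): flatten the grid row by row."""
    return [v for row in img for v in row]


def rho(k, n):
    """Permutation matrix that represents g^k on flattened n x n grids."""
    # Rotating a grid of flat indices shows which input entry lands in each output slot.
    source = f(act(k, [[r * n + c for c in range(n)] for r in range(n)]))
    return [[1 if j == source[i] else 0 for j in range(n * n)] for i in range(n * n)]


def matvec(M, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


x = [[1, 2], [3, 4]]
# Equivariance: encoding the rotated input equals transforming the code with rho(g).
equivariant = all(f(act(k, x)) == matvec(rho(k, 2), f(x)) for k in range(4))
print(equivariant)
```

In this toy setting the representation is an exact permutation. A learned encoder would generally satisfy the condition only approximately.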

2. E-SAE Architecture and Objective Functions

The E-SAE follows the TopK sparse autoencoder paradigm:

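Encoder ($f$): A two-layer multilayer perceptron (MLP) with hidden dimension $h$ and ReLU activation, outputting

$z = f(x) = \mathrm{TopK}\big(W_2\, \mathrm{ReLU}(W_1 x + b_1) + b_2\big),$

where TopK keeps only the $k$ largest entries, enforcing sparsity.

Decoder ($D$): A linear map or “dictionary”

$\hat{x} = D z.$

Training is driven by a combined objective,

$\mathcal{L} = \mathcal{L}_{\mathrm{rec}} + \lambda_1 \mathcal{L}_{\mathrm{sparse}} + \lambda_2 \mathcal{L}_{\mathrm{eq}},$

where:

Reconstruction loss: $\mathcal{L}_{\mathrm{rec}} = \|x - \hat{x}\|_2^2$,
Sparsity regularizer: $\mathcal{L}_{\mathrm{sparse}} = \|z\|_1$,
Equivariance loss: $\mathcal{L}_{\mathrm{eq}} = \|f(g \cdot x) - \rho(g)\, f(x)\|_2^2$, with $\lambda_1, \lambda_2$ as hyperparameters.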
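The forward pass and combined objective can be sketched as follows. This is a minimal, dependency-free illustration rather than the authors' implementation: it assumes a squared-error reconstruction term, an L1 penalty on the code, and a squared equivariance residual. The parameter names (`W1`, `b1`, `W2`, `b2`, `D`, `lam_sparse`, `lam_eq`) and the cyclic-shift toy action are hypothetical.

```python
def affine(W, b, v):
    """Compute W v + b for a matrix W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def relu(v):
    return [max(0.0, x) for x in v]


def topk(v, k):
    """Keep the k largest entries of v and set the rest to zero."""
    keep = set(sorted(range(len(v)), key=lambda i: v[i], reverse=True)[:k])
    return [x if i in keep else 0.0 for i, x in enumerate(v)]


def encode(x, p, k):
    """Two-layer MLP encoder followed by TopK."""
    hidden = relu(affine(p["W1"], p["b1"], x))
    return topk(affine(p["W2"], p["b2"], hidden), k)


def decode(z, p):
    """Linear dictionary decoder: x_hat = D z."""
    return [sum(d * zi for d, zi in zip(row, z)) for row in p["D"]]


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def esae_loss(x, gx, rho_g, p, k, lam_sparse, lam_eq):
    """Combined loss; the three term definitions are assumptions of this sketch."""
    z = encode(x, p, k)
    rec = sq_dist(x, decode(z, p))             # reconstruction error
    sparse = sum(abs(zi) for zi in z)          # L1 penalty on the code
    rho_z = [sum(r * zi for r, zi in zip(row, z)) for row in rho_g]
    eq = sq_dist(encode(gx, p, k), rho_z)      # f(g.x) compared with rho(g) f(x)
    return rec + lam_sparse * sparse + lam_eq * eq


eye = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
params = {"W1": eye, "b1": [0.0] * 3, "W2": eye, "b2": [0.0] * 3, "D": eye}
shift = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # toy cyclic-shift action
x = [1.0, 3.0, 2.0]
gx = [2.0, 1.0, 3.0]  # the shift applied to x
loss = esae_loss(x, gx, shift, params, k=2, lam_sparse=0.1, lam_eq=1.0)
```

With identity weights the encoder is trivially equivariant to the shift, so the equivariance term vanishes and only the reconstruction and sparsity terms contribute. A practical implementation would use an array library with automatic differentiation.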

3. Estimation of Group Transformation Matrices

Focusing on rotational symmetry, the generator $g$ (e.g., the 90° rotation) is represented by a single matrix $M$, and its higher powers $g^k$ by $M^k$. The action of $g$ on pretrained activations $a(x)$ is fitted via least squares:

$\min_M \sum_x \|a(g \cdot x) - M\, a(x)\|_2^2.$

In practice, $M$ is initialized to the identity and optimized with Adam.
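A minimal sketch of this fitting step is shown below. It substitutes plain gradient descent for Adam to stay dependency-free, and it uses synthetic two-dimensional "activations" whose true movement under the generator is a 90° rotation. The function names, step count, and learning rate are illustrative assumptions.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def identity(d):
    return [[1.0 if i == j else 0.0 for j in range(d)] for i in range(d)]


def fit_generator(acts, moved_acts, steps=200, lr=0.2):
    """Fit M minimizing the mean of ||a(g.x) - M a(x)||^2, starting from the identity."""
    d, n = len(acts[0]), len(acts)
    M = identity(d)
    for _ in range(steps):
        grad = [[0.0] * d for _ in range(d)]
        for a, y in zip(acts, moved_acts):
            pred = [sum(M[i][j] * a[j] for j in range(d)) for i in range(d)]
            for i in range(d):
                for j in range(d):
                    # Gradient of the squared residual with respect to M[i][j].
                    grad[i][j] -= 2.0 * (y[i] - pred[i]) * a[j] / n
        M = [[M[i][j] - lr * grad[i][j] for j in range(d)] for i in range(d)]
    return M


def matpow(M, k):
    """k-th matrix power, representing g^k."""
    out = identity(len(M))
    for _ in range(k):
        out = matmul(out, M)
    return out


# Synthetic activations whose true movement under g is a 90-degree rotation.
R = [[0.0, -1.0], [1.0, 0.0]]
acts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
moved = [[sum(r * v for r, v in zip(row, a)) for row in R] for a in acts]

M = fit_generator(acts, moved)
M4 = matpow(M, 4)  # represents g^4, which should be close to the identity for C_4
```

Because only the generator is fitted, every other group element is obtained by taking powers of the matrix. Checking how close the fourth power is to the identity gives a rough indication of how well the fit respects the $C_4$ structure.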

4. Adaptive Equivariance Strategy

E-SAE introduces adaptivity by decoupling invariance and equivariance learning. Initially, an invariant SAE is trained to decode all group-transformed activations to the canonical activation $a(x)$, with objective

$\mathcal{L}_{\mathrm{inv}} = \sum_{g \in G} \|a(x) - D\, f(a(g \cdot x))\|_2^2.$

Separately, a matrix $M$ is fit to model the true movement of base-model activations under the group action:

$\min_M \sum_x \|a(g \cdot x) - M\, a(x)\|_2^2.$

This adaptivity allows E-SAE to align with the degree of equivariance actually present in the base model without imposing exact symmetry.
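The invariance stage can be illustrated with a toy objective that sums, over the group orbit, the error between the canonical activation and the SAE output for each transformed activation. Everything in the sketch below is a stand-in chosen for illustration: the group acts on activations by cyclic shifts, and the two "SAEs" are hand-written functions rather than trained models.

```python
def shift(v, k):
    """Toy group action: cyclic shift of a vector by k positions."""
    k %= len(v)
    return v[-k:] + v[:-k] if k else list(v)


def invariant_loss(sae, canonical, group_size=4):
    """Sum over the orbit of ||canonical - sae(g^k . canonical)||^2."""
    total = 0.0
    for k in range(group_size):
        recon = sae(shift(canonical, k))
        total += sum((c - r) ** 2 for c, r in zip(canonical, recon))
    return total


def identity_sae(v):
    """Stand-in for an SAE that reconstructs its input but is not invariant."""
    return list(v)


def canonicalizing_sae(v):
    """Stand-in for an invariant SAE: undo the shift by moving the largest entry first."""
    i = v.index(max(v))
    return v[i:] + v[:i]


a = [3.0, 1.0, 0.0, 2.0]
loss_identity = invariant_loss(identity_sae, a)
loss_invariant = invariant_loss(canonicalizing_sae, a)
```

The non-invariant stand-in is penalized on every transformed copy, while the canonicalizing one maps the whole orbit to the same output and incurs no loss. That is the behavior the invariance objective rewards.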

5. Experimental Benchmarks and Performance

The empirical analysis used a synthetic dataset of 10,000 […] images, each containing four of eight possible shapes, arranged with rotation symmetries. Two base autoencoders (MLP-AE, CNN-AE) were evaluated with a bottleneck of […].

Probe performance was measured on 180 binary classification tasks:

S: Shape present (rotation invariant)
SP: Shape in a specific quadrant
SO: Shape in a specific orientation
SPO: Shape in a specific quadrant and orientation

Baselines included regular one-layer SAEs and “wide” SAEs replicating latents for each group element.

Results demonstrated that E-SAE consistently outperformed baselines on reconstructed activations, with the learned transformation matrix […] achieving […] in predicting […] from […] for the CNN-AE, versus […] for […]. Average F1 for E-SAE on the CNN-AE with truncation 32 and […] was approximately […], compared to […] for the best regular SAE. Non-equivariant two-layer encoders improved performance over one-layer versions but still trailed E-SAE, especially on orientation-sensitive (SO/SPO) tasks.
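For reference, the F1 score of a single binary probe and its average over several tasks can be computed as in the sketch below. The labels and predictions are made up for illustration and are not results from the paper.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels in {0, 1}; returns 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_f1(tasks):
    """Mean F1 over a list of (y_true, y_pred) pairs, one pair per probe task."""
    return sum(f1_score(t, p) for t, p in tasks) / len(tasks)


# Hypothetical probe outputs for two tasks.
task_a = ([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])  # 2 TP, 1 FP, 1 FN
task_b = ([0, 1, 1, 0], [0, 1, 1, 0])        # perfect predictions
score_a = f1_score(*task_a)
mean_score = average_f1([task_a, task_b])
```

Averaging per-task F1 in this way weights every probe task equally, regardless of how many positive examples each task has.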

One observed trade-off is that E-SAE slightly reduces code sparsity and increases reconstruction error relative to non-equivariant SAEs, while yielding significantly more semantically informative latent representations.

6. Impact on Mechanistic Interpretability

By enforcing group equivariance, E-SAE produces sparse representations that naturally decompose into:

Invariant features: Stable across the group orbit (e.g., shape identity),
Equivariant features: Transformed by $\rho(g)$ under group action (e.g., orientation or position).

This decomposition aligns model features with human concepts and yields improvements in task-specific probe performance. Importantly, the transformed dictionary ([…]) allows direct inspection of how features move under the group, supporting scientific interpretation and model transparency.
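One simple way to inspect this decomposition, sketched here under stated assumptions, is to encode every element of an input's group orbit and label each latent coordinate as invariant if its value stays constant across the orbit. The function name, tolerance, and latent codes below are hypothetical.

```python
def split_features(codes, tol=1e-8):
    """codes[k] is the latent code of g^k . x; return (invariant, equivariant) indices."""
    invariant, equivariant = [], []
    for i in range(len(codes[0])):
        values = [z[i] for z in codes]
        if max(values) - min(values) <= tol:
            invariant.append(i)    # constant over the orbit
        else:
            equivariant.append(i)  # changes as the input is transformed
    return invariant, equivariant


# Hypothetical codes for the four rotations of one input: coordinate 0 behaves like
# a shape-identity feature, while coordinates 1-4 take turns being active.
codes = [
    [1.5, 2.0, 0.0, 0.0, 0.0],
    [1.5, 0.0, 2.0, 0.0, 0.0],
    [1.5, 0.0, 0.0, 2.0, 0.0],
    [1.5, 0.0, 0.0, 0.0, 2.0],
]
inv, eqv = split_features(codes)
```

A coordinate that is inactive across the whole orbit is also reported as invariant by this check, so a practical analysis would aggregate the test over many inputs.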

A plausible implication is that adaptively enforcing equivariance not only supports interpretability but also enables feature spaces better suited for downstream analysis in scientific and symmetry-structured domains (Erdogan et al., 12 Nov 2025).


References (1)

Erdogan et al. (2025). Group Equivariance Meets Mechanistic Interpretability: Equivariant Sparse Autoencoders.
