Equivariant Sparse Autoencoders (E-SAEs)

Updated 5 January 2026
  • Equivariant Sparse Autoencoders are models that combine sparsity with group-symmetry constraints. This yields structured latent representations that transform predictably under symmetries such as rotations.
  • They use a TopK sparse encoder and learn a transformation matrix that models how activations move under group actions. This improves interpretability and probe performance at a small cost in sparsity and reconstruction error.
  • Experiments show that E-SAEs achieve higher probe F1 scores and offer more mechanistic insight than non-equivariant SAE baselines.

Equivariant Sparse Autoencoders (E-SAEs) are a class of representation learning models that combine the interpretability advantages of sparse autoencoders (SAEs) with explicit incorporation of group symmetries, such as spatial rotations. By enforcing equivariance constraints with respect to these symmetries, E-SAEs yield latent features that transform predictably under group actions, facilitating more structured decompositions and downstream interpretability, especially in domains where scientific data naturally exhibit such symmetries (Erdogan et al., 12 Nov 2025).

1. Mathematical Formulation of Group Equivariance

E-SAEs are built around the concept of $G$-equivariance, where $G$ is a group representing symmetries (e.g., $C_4$, the cyclic group of 90° rotations). The group $G$ acts on the input space $\mathcal{X} \subset \mathbb{R}^n$ via

$$g \cdot x \in \mathcal{X}, \quad \forall g \in G,\ x \in \mathcal{X},$$

subject to the group-action axioms $e \cdot x = x$ (where $e$ is the identity element) and $(g_1 g_2) \cdot x = g_1 \cdot (g_2 \cdot x)$. The encoder $f: \mathcal{X} \to \mathbb{R}^m$ and a linear representation $\rho: G \to \mathrm{GL}(m)$ are trained such that

$$f(g \cdot x) = \rho(g)\, f(x) \quad \forall g \in G,\ x \in \mathcal{X}.$$

In other words, transforming the input by $g$ corresponds to applying the fixed linear map $\rho(g)$ to the latent code. As a result, the latent space mirrors the symmetries present in the input domain.
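The following minimal NumPy sketch makes the definition concrete. It assumes that $C_4$ acts on square images through `np.rot90`. The encoder `lifted_encoder` is a hypothetical construction, not the E-SAE encoder. It applies one linear map `w` to every rotation of the input and stacks the results. That construction makes it exactly equivariant under the regular representation `rho`, which is a block-cyclic shift.

```python
import numpy as np

def act(p, x):
    """Action of g^p in C4 on a square image: rotate by p * 90 degrees."""
    return np.rot90(x, k=p % 4)

def lifted_encoder(x, w):
    """Hypothetical encoder: apply the linear map w to every rotation of x and stack the results."""
    return np.concatenate([w @ act(p, x).ravel() for p in range(4)])

def rho(p, d):
    """Regular representation of C4 on R^(4d): cyclically shifts the four stacked d-dim blocks."""
    shift = np.roll(np.eye(4), -p, axis=0)  # row i is the basis vector e_{(i + p) mod 4}
    return np.kron(shift, np.eye(d))

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
d = 3
w = rng.standard_normal((d, 64))

# Group-action axioms: e . x = x and (g^1 g^2) . x = g^1 . (g^2 . x)
print(np.allclose(act(0, x), x), np.allclose(act(3, x), act(1, act(2, x))))  # True True
# Equivariance: f(g^p . x) = rho(g^p) f(x) for every p
print(all(np.allclose(lifted_encoder(act(p, x), w), rho(p, d) @ lifted_encoder(x, w)) for p in range(4)))  # True
```

Stacking features over the whole orbit is the simplest way to get exact equivariance by construction. E-SAE instead learns the group's action from data.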

2. E-SAE Architecture and Objective Functions

The E-SAE follows the TopK sparse autoencoder paradigm:

  • Encoder ($E$): a two-layer network followed by a TopK operation,

$$z = E(x) = \mathrm{TopK}\big(W_2\,\sigma(W_1 x + b_1) + b_2\big) \in \mathbb{R}^m,$$

where $\mathrm{TopK}$ keeps only the $K$ largest entries and sets the rest to zero, enforcing sparsity.

  • Decoder ($D$): a linear map, or “dictionary”,

$$\hat x = D(z) = W_D z + b_D \in \mathbb{R}^n.$$
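The NumPy sketch below shows a forward pass through this encoder and decoder. Several choices are assumptions for illustration: ReLU for $\sigma$, the random initialization, and the sizes `n=256`, `hidden=512` and `m=1024`. The value `k=16` mirrors the $K$ reported in the experiments.

```python
import numpy as np

def topk(a, k):
    """Keep the k largest entries in each row of a and set all others to zero."""
    out = np.zeros_like(a)
    idx = np.argpartition(a, -k, axis=-1)[..., -k:]  # indices of the k largest entries (unordered)
    np.put_along_axis(out, idx, np.take_along_axis(a, idx, axis=-1), axis=-1)
    return out

def relu(a):
    return np.maximum(a, 0.0)

class TopKSAE:
    """Sketch of z = TopK(W2 sigma(W1 x + b1) + b2) and x_hat = W_D z + b_D, with sigma = ReLU (assumed)."""

    def __init__(self, n, hidden, m, k, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.standard_normal((hidden, n)) / np.sqrt(n)
        self.b1 = np.zeros(hidden)
        self.W2 = rng.standard_normal((m, hidden)) / np.sqrt(hidden)
        self.b2 = np.zeros(m)
        self.W_D = rng.standard_normal((n, m)) / np.sqrt(m)
        self.b_D = np.zeros(n)
        self.k = k

    def encode(self, x):
        # x: (batch, n) -> z: (batch, m), with at most k nonzeros per row
        h = relu(x @ self.W1.T + self.b1)
        return topk(h @ self.W2.T + self.b2, self.k)

    def decode(self, z):
        # Linear dictionary: each column of W_D is one feature direction
        return z @ self.W_D.T + self.b_D

sae = TopKSAE(n=256, hidden=512, m=1024, k=16)
x = np.random.default_rng(1).standard_normal((4, 256))
z = sae.encode(x)
x_hat = sae.decode(z)
print(z.shape, x_hat.shape)  # (4, 1024) (4, 256)
print((z != 0).sum(axis=1))
```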

Training is driven by a combined objective,

$$\mathcal{L}_\mathrm{total} = \mathcal{L}_\mathrm{rec} + \lambda \mathcal{R}_\mathrm{sparse} + \gamma \mathcal{L}_\mathrm{eq},$$

where $\lambda, \gamma > 0$ are hyperparameters, $\mathcal{D}$ is the data distribution, and:

  • Reconstruction loss: $\mathcal{L}_\mathrm{rec} = \mathbb{E}_{x \sim \mathcal{D}} \| x - D(E(x)) \|_2^2$,
  • Sparsity regularizer: $\mathcal{R}_\mathrm{sparse} = \mathbb{E}_{x \sim \mathcal{D}} \| E(x) \|_1$,
  • Equivariance loss: $\mathcal{L}_\mathrm{eq} = \mathbb{E}_{x \sim \mathcal{D}} \sum_{g \in G} \| E(g \cdot x) - \rho(g) E(x) \|_2^2$.
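The combined objective can be estimated on a batch by replacing each expectation with a batch mean. The toy setup below is hypothetical and was chosen so that every term is easy to check by hand. In it, $C_4$ acts on $\mathbb{R}^4$ by cyclic shifts (`np.roll`), and `perm` is the matching permutation-matrix representation.

```python
import numpy as np

def total_loss(xs, E, D, act, rho, group, lam, gam):
    """Batch estimate of L_rec + lam * R_sparse + gam * L_eq, with batch means in place of expectations."""
    rec = np.mean([np.sum((x - D(E(x))) ** 2) for x in xs])
    sparse = np.mean([np.sum(np.abs(E(x))) for x in xs])
    eq = np.mean([sum(np.sum((E(act(g, x)) - rho(g) @ E(x)) ** 2) for g in group) for x in xs])
    return rec + lam * sparse + gam * eq

def shift(g, x):
    # Hypothetical C4 action on R^4: cyclic shift by g positions
    return np.roll(x, g)

def perm(g):
    # Matching representation: perm(g) @ x equals np.roll(x, g)
    return np.roll(np.eye(4), g, axis=0)

def identity(v):
    return v

xs = [np.array([1.0, -2.0, 0.0, 3.0])]
print(total_loss(xs, identity, identity, shift, perm, range(4), lam=0.5, gam=1.0))  # 3.0
```

With $E$ and $D$ both set to the identity, the reconstruction and equivariance terms vanish. Only $\lambda \|x\|_1 = 0.5 \cdot 6 = 3$ remains.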

3. Estimation of Group Transformation Matrices

Focusing on rotational symmetry, the generator $g$ (e.g., a 90° rotation) is represented by a single matrix $M = \rho(g) \in \mathbb{R}^{m \times m}$. Its powers $M^p = \rho(g^p)$ represent the remaining group elements. The action of $M$ on the pretrained (base-model) activations $\psi(x)$ is fitted by least squares:

$$M = \arg\min_M \sum_i \| \psi(g \cdot x_i) - M \psi(x_i) \|_2^2 = \Big( \sum_i \psi(g \cdot x_i)\, \psi(x_i)^\top \Big) \Big( \sum_i \psi(x_i)\, \psi(x_i)^\top \Big)^{-1},$$

where the closed form holds whenever $\sum_i \psi(x_i)\psi(x_i)^\top$ is invertible. In practice, $M$ is initialized to the identity and optimized with Adam.
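The least-squares fit can be sketched with NumPy on synthetic data. This is only a stand-in under stated assumptions. `M_true`, the Gaussian activations and the noise level are hypothetical. The sketch also solves the problem directly with `np.linalg.lstsq`, rather than running Adam from an identity initialization as described above. When the Gram matrix is invertible, a converged gradient-based fit should approach the same minimizer.

```python
import numpy as np

def fit_transform_matrix(psi_x, psi_gx):
    """Least-squares M minimizing sum_i ||psi(g x_i) - M psi(x_i)||^2.

    psi_x, psi_gx: arrays of shape (N, m) whose rows are psi(x_i) and psi(g . x_i).
    """
    # Row form of psi(g x_i) = M psi(x_i) is psi_gx = psi_x @ M.T, so lstsq returns M.T.
    Mt, *_ = np.linalg.lstsq(psi_x, psi_gx, rcond=None)
    return Mt.T

def closed_form(psi_x, psi_gx):
    # (sum_i psi(g x_i) psi(x_i)^T) (sum_i psi(x_i) psi(x_i)^T)^{-1}
    return (psi_gx.T @ psi_x) @ np.linalg.inv(psi_x.T @ psi_x)

# Hypothetical synthetic activations: psi(g x) = M_true psi(x) + small noise, with M_true orthogonal
rng = np.random.default_rng(0)
m, N = 8, 500
M_true, _ = np.linalg.qr(rng.standard_normal((m, m)))
psi_x = rng.standard_normal((N, m))
psi_gx = psi_x @ M_true.T + 0.01 * rng.standard_normal((N, m))

M_hat = fit_transform_matrix(psi_x, psi_gx)
print(np.allclose(M_hat, closed_form(psi_x, psi_gx)), np.abs(M_hat - M_true).max() < 0.01)  # True True
```

`lstsq` is used in place of the explicit inverse because it avoids forming and inverting the Gram matrix, which is numerically safer when activations are nearly collinear.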

4. Adaptive Equivariance Strategy

E-SAE introduces adaptivity by decoupling invariance learning from equivariance learning. First, an invariant SAE is trained to decode every group-transformed activation back to the canonical activation $\psi(x)$, with objective

$$\mathcal{L}_\mathrm{inv} = \mathbb{E}_{x, p} \Big[ \big\| \psi(x) - D\big(E(\psi(g^p \cdot x))\big) \big\|_2^2 + \lambda \big\| E(\psi(g^p \cdot x)) \big\|_1 \Big].$$

Separately, $M$ is fitted to model how base-model activations actually move under the group action. The fit minimizes the least-squares objective of Section 3, $\mathcal{L}_M = \sum_i \| \psi(g \cdot x_i) - M \psi(x_i) \|_2^2$:

$$\min_{E, D} \mathcal{L}_\mathrm{inv}, \qquad \min_M \mathcal{L}_M.$$

Because $M$ is learned from the base model rather than fixed in advance, E-SAE adapts to whatever degree of equivariance the base model actually has, without imposing exact symmetry.
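The invariant objective $\mathcal{L}_\mathrm{inv}$ can be sketched in NumPy under a toy setting. Both parts of that setting are hypothetical: $C_4$ acts on $\mathbb{R}^4$ by cyclic shifts, and the base model `psi` is the identity. The fit of $M$ does not enter this loss, reflecting the decoupling described above.

```python
import numpy as np

def invariant_loss(xs, psi, E, D, act, n_group, lam):
    """Batch estimate of L_inv: decode the code of psi(g^p . x) back to psi(x), plus L1 on that code."""
    terms = []
    for x in xs:
        target = psi(x)  # canonical activation
        for p in range(n_group):
            z = E(psi(act(p, x)))  # code of the transformed input
            terms.append(np.sum((target - D(z)) ** 2) + lam * np.sum(np.abs(z)))
    return float(np.mean(terms))

def shift(p, x):
    # Hypothetical C4 action on R^4: cyclic shift by p positions
    return np.roll(x, p)

def identity(v):
    return v

def sum_encoder(v):
    # Rotation-invariant 1-D code: the total activation is unchanged by cyclic shifts
    return np.array([v.sum()])

def e0_decoder(z):
    return z[0] * np.array([1.0, 0.0, 0.0, 0.0])

xs = [np.array([1.0, 0.0, 0.0, 0.0])]
print(invariant_loss(xs, identity, identity, identity, shift, 4, lam=0.1))       # ≈ 1.6
print(invariant_loss(xs, identity, sum_encoder, e0_decoder, shift, 4, lam=0.1))  # ≈ 0.1
```

The rotation-invariant `sum_encoder` reaches a lower loss than the identity encoder. Its code, and hence its reconstruction, is the same for every element of the orbit.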

5. Experimental Benchmarks and Performance

The empirical analysis used a synthetic dataset of 10,000 $64 \times 64$ images, each containing four of eight possible shapes, arranged with rotational symmetries. Two base autoencoders were evaluated: an MLP autoencoder (MLP-AE) and a convolutional autoencoder (CNN-AE), each with a bottleneck of dimension $m = 256$.

Probe performance was measured on 180 binary classification tasks:

  • S: Shape present (rotation invariant)
  • SP: Shape in a specific quadrant
  • SO: Shape in specific orientation
  • SPO: Shape in quadrant and orientation

Baselines included regular one-layer SAEs and “wide” SAEs that replicate their latents once per group element.

E-SAE consistently outperformed the baselines in probe performance on reconstructed activations. For the CNN-AE, the learned transformation matrix $M$ achieved $R^2 = 0.987 \pm 0.001$ in predicting $\psi(g^p \cdot x)$ from $\psi(x)$, versus $R^2 \approx 0.05$ for the identity baseline $M = I$. Average F1 for E-SAE on the CNN-AE with truncation 32 and $K = 16$ was approximately $0.88$, compared with $0.82$ for the best regular SAE. Non-equivariant two-layer encoders improved over one-layer versions but still trailed E-SAE, especially on the orientation-sensitive (SO/SPO) tasks.
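The $R^2$ figures compare the predictions $M^p \psi(x)$ with the true $\psi(g^p \cdot x)$. The sketch below assumes one common convention: $R^2$ pooled over all samples, coordinates and group powers $p = 1, 2, 3$. The exact convention behind the reported numbers is not stated here. The sketch uses a hypothetical block-rotation `M_true` in place of a fitted matrix, along with Gaussian activations. Its identity-baseline value is therefore not comparable to the reported $0.05$.

```python
import numpy as np

def r2(pred, target):
    """Coefficient of determination, pooled over all rows and coordinates."""
    ss_res = np.sum((target - pred) ** 2)
    ss_tot = np.sum((target - target.mean(axis=0)) ** 2)
    return 1.0 - ss_res / ss_tot

def r2_over_group(M, psi_x, psi_gpx):
    """Pooled R^2 of M^p psi(x) against psi(g^p . x); psi_gpx maps p -> array of shape (N, m)."""
    preds, targets = [], []
    for p, target in psi_gpx.items():
        Mp = np.linalg.matrix_power(M, p)
        preds.append(psi_x @ Mp.T)  # row form of M^p psi(x_i)
        targets.append(target)
    return r2(np.concatenate(preds), np.concatenate(targets))

# Hypothetical activations: M_true rotates four 2-D planes by 90 degrees, so M_true^4 = I
rng = np.random.default_rng(0)
N, m = 1000, 8
M_true = np.kron(np.eye(m // 2), np.array([[0.0, -1.0], [1.0, 0.0]]))
psi_x = rng.standard_normal((N, m))
psi_gpx = {p: psi_x @ np.linalg.matrix_power(M_true, p).T + 0.01 * rng.standard_normal((N, m))
           for p in (1, 2, 3)}

r2_M = r2_over_group(M_true, psi_x, psi_gpx)
r2_I = r2_over_group(np.eye(m), psi_x, psi_gpx)
print(f"R^2 with M: {r2_M:.4f}, with M = I: {r2_I:.4f}")
```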

The observed trade-off is that E-SAE slightly reduces code sparsity and increases reconstruction error relative to non-equivariant SAEs. In exchange, it yields latent representations that are more semantically informative, as measured by the probe tasks.

6. Impact on Mechanistic Interpretability

By enforcing group equivariance, E-SAE produces sparse representations that naturally decompose into:

  • Invariant features: Stable across the group orbit (e.g., shape identity),
  • Equivariant features: Transformed by MM under group action (e.g., orientation or position).

This decomposition aligns model features with human-interpretable concepts and improves task-specific probe performance. Importantly, the transformed dictionary $M^p D$ allows direct inspection of how features move under the group, supporting scientific interpretation and model transparency.
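One way to use the transformed dictionary is to check, for each atom $d_j$ (a column of the decoder matrix), which atom $M^p d_j$ most resembles. The sketch below is a hypothetical heuristic, not a procedure from the source. An atom mapped onto itself behaves like an invariant feature, while an atom mapped onto a different atom belongs to an equivariant orbit. The toy `M` and dictionary are chosen so that the answer is known in advance.

```python
import numpy as np

def match_atoms(M, W_D, p=1):
    """For each atom d_j (column of W_D), find the atom d_k most aligned with M^p d_j.

    Hypothetical heuristic: best[j] == j suggests an invariant feature, best[j] != j an equivariant one.
    """
    T = np.linalg.matrix_power(M, p) @ W_D          # transformed dictionary M^p W_D
    Tn = T / np.linalg.norm(T, axis=0, keepdims=True)
    Dn = W_D / np.linalg.norm(W_D, axis=0, keepdims=True)
    sims = Tn.T @ Dn                                # sims[j, k] = cos(M^p d_j, d_k)
    best = sims.argmax(axis=1)
    return best, sims[np.arange(len(best)), best]

# Toy example: M cycles coordinates 0 -> 1 -> 2 -> 3 -> 0 and fixes coordinate 4
M = np.eye(5)
M[:4, :4] = np.roll(np.eye(4), 1, axis=0)
W_D = np.eye(5)  # atoms 0-3 form one orbit; atom 4 is invariant
best, sim = match_atoms(M, W_D)
print(best.tolist(), np.round(sim, 3).tolist())  # [1, 2, 3, 0, 4] [1.0, 1.0, 1.0, 1.0, 1.0]
```

Atoms 0 to 3 form a single orbit, as an orientation feature would, while atom 4 stays fixed, like a shape-identity feature.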

A plausible implication is that adaptively enforcing equivariance not only supports interpretability but also enables feature spaces better suited for downstream analysis in scientific and symmetry-structured domains (Erdogan et al., 12 Nov 2025).
