PolySAE: Sparse Autoencoder with Polynomial Decoding

Updated 8 February 2026
  • PolySAE is a sparse autoencoder variant that extends linear reconstructions by incorporating quadratic and cubic terms to model pairwise and triple feature interactions.
  • It employs a low-rank shared‐subspace factorization to efficiently approximate high‐order tensors and reduce parameter overhead in the decoding process.
  • Empirical evaluations demonstrate an 8–10% F1 improvement and enhanced semantic separation, achieved with a minimal increase in decoder complexity.

PolySAE is a sparse autoencoder (SAE) variant designed to capture the compositional structure in neural network representations by extending the linear reconstruction found in traditional SAEs to include higher-order polynomial feature interactions. Unlike classic SAEs, which decompose activations into sparse superpositions of additive dictionary atoms, PolySAE introduces quadratic and cubic decoding terms that enable modeling of pairwise and triple feature bindings. This extension preserves the interpretability provided by a linear encoder but allows the decoder to represent meanings not expressible in a purely additive basis, such as compounds, morphological binding, and multi-entity composition within LLM activations (Koromilas et al., 1 Feb 2026).

1. Motivation and Theoretical Foundations

Sparse autoencoders are widely used for interpreting the superposed internal representations of LLMs. These models express a given activation vector $x \in \mathbb{R}^d$ as a sparse code $z \in \mathbb{R}^{d_{\rm sae}}$, with $W_{\rm enc}$ and $W_{\rm dec}$ constituting the encoder and decoder, respectively. The conventional form:

$$z = S(\mathrm{ReLU}(W_{\rm enc}^\top x + b_{\rm enc}))$$

$$\hat{x} = W_{\rm dec}\, z$$

where $S$ is a sparsifier (e.g., a Top-$K$ operator), produces reconstructions via linear combination of dictionary atoms. However, the additive structure cannot disambiguate true compositional semantics from simple co-occurrence. For instance, distinguishing "Starbucks" as a composition of "star" and "coffee" features is not feasible with purely linear models; such SAEs must allocate dedicated features for the compound, compromising atomicity.
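The conventional encode/decode pipeline above can be sketched in a few lines. This is a minimal illustrative implementation, not the paper's code; all dimensions and initializations here are toy values chosen for the sketch.

```python
import numpy as np

def topk_sparsify(z, k):
    """Sparsifier S: keep the k largest entries of z, zero out the rest."""
    out = np.zeros_like(z)
    idx = np.argsort(z)[-k:]
    out[idx] = z[idx]
    return out

d, d_sae, k = 8, 32, 4          # toy sizes for illustration
rng = np.random.default_rng(0)
W_enc = rng.normal(size=(d, d_sae)) / np.sqrt(d)
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d, d_sae)) / np.sqrt(d_sae)

x = rng.normal(size=d)
z = topk_sparsify(np.maximum(W_enc.T @ x + b_enc, 0.0), k)  # sparse code
x_hat = W_dec @ z               # purely linear (additive) reconstruction
```

The reconstruction is a sum of at most $K$ dictionary atoms, which is exactly the additive structure the text argues cannot express feature binding.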

PolySAE addresses this by incorporating polynomial decoding: it enables the decoder to express both $z_i z_j$ and $z_i z_j z_k$ interactions, thus enriching the representational power without compromising sparse, interpretable encodings.

2. Polynomial Decoder Architecture

Let $z = (z_1, \ldots, z_{d_{\rm sae}})$ denote the output of the linear encoder and sparsifier. The PolySAE decoder reconstructs activations as:

$$\hat{x} = W^{(1)} z + \lambda_2 \sum_{i < j} z_i z_j P_{ij} + \lambda_3 \sum_{i < j < k} z_i z_j z_k T_{ijk}$$

with:

  • $x^{(1)} = W^{(1)} z$ (linear term, typically $W^{(1)} = W_{\rm dec}$)
  • $x^{(2)} = \sum_{i < j} z_i z_j P_{ij}$ (quadratic term)
  • $x^{(3)} = \sum_{i < j < k} z_i z_j z_k T_{ijk}$ (cubic term)
  • $\lambda_2, \lambda_3$ are learned scalars modulating higher-order contributions.

Storage of $P_{ij} \in \mathbb{R}^d$ and $T_{ijk} \in \mathbb{R}^d$ naively scales as $O(d_{\rm sae}^2)$ and $O(d_{\rm sae}^3)$, prohibitive for $d_{\rm sae} \sim 10^4$. PolySAE circumvents this via low-rank tensor factorization on a shared projection subspace.
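For concreteness, the decoding equation can be implemented naively with explicit interaction tensors. This is a toy sketch with made-up tiny dimensions; at realistic dictionary widths the $T$ tensor alone would need $O(d_{\rm sae}^3 d)$ storage, which is what motivates the factorization of Section 3.

```python
import numpy as np
from itertools import combinations

d, d_sae = 6, 10                 # toy sizes; real d_sae ~ 10^4 makes this infeasible
rng = np.random.default_rng(1)
W1 = rng.normal(size=(d, d_sae))
P = rng.normal(size=(d_sae, d_sae, d))          # P[i, j] in R^d  -> O(d_sae^2) vectors
T = rng.normal(size=(d_sae, d_sae, d_sae, d))   # T[i, j, k] in R^d -> O(d_sae^3) vectors
lam2, lam3 = 0.5, 0.25           # stand-ins for the learned scalars

z = rng.normal(size=d_sae)       # sparse code from the encoder
x_hat = W1 @ z                                   # linear term
for i, j in combinations(range(d_sae), 2):       # quadratic term, i < j
    x_hat += lam2 * z[i] * z[j] * P[i, j]
for i, j, k in combinations(range(d_sae), 3):    # cubic term, i < j < k
    x_hat += lam3 * z[i] * z[j] * z[k] * T[i, j, k]
```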

3. Low-Rank Shared-Subspace Factorization

Feature interaction tensors are approximated in a low-rank fashion using a shared basis. Defining $U \in \mathbb{R}^{d_{\rm sae} \times R_1}$ (feature-projection matrix, typically $R_1 = d$), project $z$ as $\alpha = U^\top z \in \mathbb{R}^{R_1}$. Output-projection matrices $V^{(1)} \in \mathbb{R}^{d \times R_1}$, $V^{(2)} \in \mathbb{R}^{d \times R_2}$, $V^{(3)} \in \mathbb{R}^{d \times R_3}$ with $R_1 \geq R_2 \geq R_3$ parameterize the outputs. The reconstruction is

$$x^{(1)} = V^{(1)} \alpha, \quad x^{(2)} = V^{(2)} (\alpha * \alpha), \quad x^{(3)} = V^{(3)} (\alpha * \alpha * \alpha)$$

with $*$ denoting the elementwise product. This construction yields the tensor factorizations:

$$P_{ij} \approx \sum_{r=1}^{R_2} U_{i,r} U_{j,r} V^{(2)}_{:,r}$$

$$T_{ijk} \approx \sum_{r=1}^{R_3} U_{i,r} U_{j,r} U_{k,r} V^{(3)}_{:,r}$$

All interaction orders share $U$, ensuring the interaction structure remains aligned with the learned features. Enforcing $U^\top U = I$ (orthonormality) prevents degeneracy. Empirical ablations show that increasing $R_2, R_3$ beyond $\sim 0.1 R_1$ yields negligible reconstruction gains, supporting the hypothesis that interaction structure is low-rank in practice.
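The factorized decoder can be sketched as follows. One detail the text leaves implicit is how $V^{(2)} \in \mathbb{R}^{d \times R_2}$ receives an $R_2$-dimensional input when $\alpha \in \mathbb{R}^{R_1}$; since the factorization sums only over $r \leq R_2$ (resp. $R_3$), this sketch assumes the quadratic and cubic branches use the first $R_2$ / $R_3$ coordinates of $\alpha$. That truncation is an assumption, not stated in the source.

```python
import numpy as np

d, d_sae = 16, 64
R1, R2, R3 = d, 4, 4            # toy ranks; the text suggests R2, R3 ~ 0.1 * R1
rng = np.random.default_rng(2)

# Shared feature-projection basis U with orthonormal columns (via reduced QR).
U, _ = np.linalg.qr(rng.normal(size=(d_sae, R1)))
V1 = rng.normal(size=(d, R1))
V2 = rng.normal(size=(d, R2))
V3 = rng.normal(size=(d, R3))
lam2, lam3 = 1.0, 1.0           # stand-ins for the learned scalars

def poly_decode(z):
    alpha = U.T @ z                       # project sparse code into shared subspace
    x1 = V1 @ alpha                       # linear term
    x2 = V2 @ (alpha[:R2] * alpha[:R2])   # quadratic term (assumed truncation)
    x3 = V3 @ (alpha[:R3] ** 3)           # cubic term (assumed truncation)
    return x1 + lam2 * x2 + lam3 * x3

z = rng.normal(size=d_sae)
x_hat = poly_decode(z)
```

Note that the per-sample cost is three small matrix-vector products, so decoding stays linear in $d_{\rm sae}$ despite modeling quadratic and cubic interactions.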

4. Training Objective and Computational Considerations

PolySAE is trained with a reconstruction loss plus optional sparsity regularization:

$$z^{(b)} = S(\mathrm{ReLU}(W_{\rm enc}^\top x^{(b)} + b_{\rm enc}))$$

$$\hat{x}^{(b)} = \text{PolySAE-decode}(z^{(b)})$$

$$\mathcal{L} = \frac{1}{B} \sum_{b=1}^{B} \| x^{(b)} - \hat{x}^{(b)} \|_2^2 + \text{(optionally)}\ \lambda_{\ell_1} \|z^{(b)}\|_1$$

Typically, hard $K$-sparsity is used (no $\ell_1$ penalty), with the Adam optimizer (learning rate $3 \times 10^{-4}$). $U$ is maintained orthonormal via a positive-QR retraction after each update.
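One way to realize the positive-QR retraction is sketched below: after a gradient step, $U$ is re-orthonormalized by QR, with column signs flipped so that $\mathrm{diag}(R) \geq 0$, making the factorization unique. This is a plausible reading of "positive-QR retraction", not necessarily the paper's exact implementation; the gradient here is a random stand-in.

```python
import numpy as np

def positive_qr_retract(U):
    """Retract U back onto the set of matrices with orthonormal columns."""
    Q, R = np.linalg.qr(U)                 # reduced QR: Q has orthonormal columns
    signs = np.sign(np.diag(R))            # fix signs so diag(R) >= 0 (uniqueness)
    signs[signs == 0] = 1.0
    return Q * signs

d_sae, R1 = 32, 8
rng = np.random.default_rng(3)
U, _ = np.linalg.qr(rng.normal(size=(d_sae, R1)))

grad = rng.normal(size=(d_sae, R1))        # stand-in for dL/dU from backprop
lr = 3e-4                                  # learning rate from the text
U = positive_qr_retract(U - lr * grad)     # gradient step, then retraction
```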

The parameter overhead, for $R_1 = d$ and $R_2 = R_3 \approx 0.1 d$, adds $\sim 1.2 d^2$ parameters. On GPT-2 Small ($d = 768$), this results in a decoder parameter increase of only $\sim 2.5$–$3\%$ over a vanilla Top-$K$ SAE. This efficiency makes PolySAE tractable for large dictionary widths ($d_{\rm sae} \sim 16{,}384$).
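A back-of-envelope check of the stated overhead, under the assumption (implied by the $\sim 1.2 d^2$ count) that only $V^{(1)}, V^{(2)}, V^{(3)}$ are counted as added parameters, with the overhead measured against the full encoder-plus-decoder parameter count of a vanilla SAE:

```python
# GPT-2 Small sizes from the text.
d, d_sae = 768, 16_384
R1, R2, R3 = d, round(0.1 * d), round(0.1 * d)

vanilla = 2 * d_sae * d                    # encoder + decoder of a plain SAE
extra = d * R1 + d * R2 + d * R3           # V1, V2, V3: ~1.2 * d^2 added parameters

print(f"overhead: {100 * extra / vanilla:.1f}%")   # lands in the quoted 2.5-3% range
```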

5. Empirical Evaluation

PolySAE was evaluated on residual activations from four LLMs (GPT-2 Small, Pythia-410M, Pythia-1.4B, Gemma-2-2B) and three sparsifiers (TopK, BatchTopK, Matryoshka), with $K = 64$ and $d_{\rm sae} = 16{,}384$. Key empirical findings:

  • Reconstruction Error: MSE remains nearly unchanged (e.g., on GPT-2: SAE MSE $= 0.52$, PolySAE MSE $= 0.55$).
  • Probing F1 Score: Average improvement of $\approx 8\%$ in F1 on six linguistic tasks across all models and sparsifiers (e.g., GPT-2 TopK: $67.1\% \rightarrow 77.9\%$).
  • Distributional Separation: Wasserstein distances between class-conditional feature codes increase by $2$–$10\times$, indicating better separation of semantic classes.
  • Compositionality vs. Co-occurrence: PolySAE's learned quadratic interaction strengths ($B_{ij}$) have low correlation with empirical co-occurrence ($r = 0.06$), compared to vanilla SAE feature covariances ($r = 0.82$). This demonstrates capacity for capturing true compositional (not surface-level) interactions.
| Metric | SAE | PolySAE | Key Improvement |
|---|---|---|---|
| Reconstruction MSE (GPT-2) | 0.52 | 0.55 | $\sim$ unchanged |
| Probing F1 (GPT-2, TopK) | 67.1% | 77.9% | +10.8% absolute |
| Pearson $r$ (co-occurrence vs. interaction) | 0.82 | 0.06 | Drastic decorrelation |
| Decoder overhead (GPT-2) | — | $\sim 3\%$ | Minimal size increase |

6. Representative Feature Interaction Examples

Observed qualitative differences between PolySAE and standard SAEs underscore the model’s capability for capturing genuine composition:

Second-Order Interactions:

  • [star, stars] × [coffee, tea]: Correctly binds to “Starbucks” in appropriate contexts, contrasted with generic proper-noun firing in SAEs.
  • [surgery, repair] × [Trans, LGBT]: Specializes “surgery” under “Trans/LGBT,” tightening semantic scope.
  • [DNA, genetic] × [mod, mods]: Recovers the meaning of “genetic modification,” not just an agglomeration of "edit" features.

Third-Order Interactions:

  • [proved, proven] × [star, stars, superstar] × [reputation, fame]: Isolates relevant multiway semantic binding for sentences like “David Bowie proved some stars are big enough...”.
  • [black, racial] × [Americans, Canadians] × [people, women]: Disambiguates “Black Americans” intersection away from generic ethnicity features.

These interactions show that PolySAE allocates decoding capacity to morphology, phrasal semantics, and named-entity composition without proliferating monolithic dictionary atoms.

7. Summary and Significance

PolySAE introduces tractable, low-rank higher-order decoding to sparse autoencoders, enabling feature interactions that reflect genuine compositional semantics in LLM activations. It maintains the linear, interpretable encoder—critical for feature analysis—while enhancing decoder expressivity with minimal computational and parameter overhead ($\sim 3\%$ on GPT-2 Small). PolySAE demonstrates robust empirical improvements: an average $+8\%$ gain in probing F1, $2$–$10\times$ better class separation, and nearly zero correlation with surface-level co-occurrence. Its design and evaluation suggest a substantial advance in the analysis of compositional structure in neural representations (Koromilas et al., 1 Feb 2026).
