PolySAE: Sparse Autoencoder with Polynomial Decoding

Updated 8 February 2026
  • PolySAE is a sparse autoencoder variant that extends linear reconstructions by incorporating quadratic and cubic terms to model pairwise and triple feature interactions.
  • It employs a low-rank shared‐subspace factorization to efficiently approximate high‐order tensors and reduce parameter overhead in the decoding process.
  • Empirical evaluations demonstrate an 8–10% F1 improvement and enhanced semantic separation, achieved with a minimal increase in decoder complexity.

PolySAE is a sparse autoencoder (SAE) variant designed to capture the compositional structure in neural network representations by extending the linear reconstruction found in traditional SAEs to include higher-order polynomial feature interactions. Unlike classic SAEs, which decompose activations into sparse superpositions of additive dictionary atoms, PolySAE introduces quadratic and cubic decoding terms that enable modeling of pairwise and triple feature bindings. This extension preserves the interpretability provided by a linear encoder but allows the decoder to represent meanings not expressible in a purely additive basis, such as compounds, morphological binding, and multi-entity composition within LLM activations (Koromilas et al., 1 Feb 2026).

1. Motivation and Theoretical Foundations

Sparse autoencoders are widely used for interpreting the superposed internal representations of LLMs. These models express a given activation vector $x \in \mathbb{R}^d$ as a sparse code $z \in \mathbb{R}^{d_{\rm sae}}$, with $W_{\rm enc}$ and $W_{\rm dec}$ constituting the encoder and decoder, respectively. The conventional form:

$$z = S(\mathrm{ReLU}(W_{\rm enc}^\top x + b_{\rm enc}))$$

$$\hat{x} = W_{\rm dec}\, z$$

where $S$ is a sparsifier (e.g., a Top-$K$ operator), produces reconstructions via linear combination of dictionary atoms. However, the additive structure cannot disambiguate true compositional semantics from simple co-occurrence. For instance, distinguishing "Starbucks" as a composition of "star" and "coffee" features is not feasible with purely linear models; such SAEs must allocate dedicated features for the compound, compromising atomicity.
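The conventional encode/decode pipeline above can be sketched in a few lines. This is a minimal illustrative implementation, not the paper's code; all dimensions and initializations here are toy values chosen for the sketch.

```python
import numpy as np

def topk_sparsify(z, k):
    """Sparsifier S: keep the k largest entries of z, zero out the rest."""
    out = np.zeros_like(z)
    idx = np.argsort(z)[-k:]
    out[idx] = z[idx]
    return out

d, d_sae, k = 8, 32, 4          # toy sizes for illustration
rng = np.random.default_rng(0)
W_enc = rng.normal(size=(d, d_sae)) / np.sqrt(d)
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d, d_sae)) / np.sqrt(d_sae)

x = rng.normal(size=d)
z = topk_sparsify(np.maximum(W_enc.T @ x + b_enc, 0.0), k)  # sparse code
x_hat = W_dec @ z               # purely linear (additive) reconstruction
```

The reconstruction is a sum of at most $K$ dictionary atoms, which is exactly the additive structure the text argues cannot express feature binding.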

PolySAE addresses this by incorporating polynomial decoding: it enables the decoder to express both $z_i z_j$ and $z_i z_j z_k$ interactions, thus enriching the representational power without compromising sparse, interpretable encodings.

2. Polynomial Decoder Architecture

Let $z = (z_1, \ldots, z_{d_{\rm sae}})$ denote the output of the linear encoder and sparsifier. The PolySAE decoder reconstructs activations as:

$$\hat{x} = W^{(1)} z + \lambda_2 \sum_{i < j} z_i z_j P_{ij} + \lambda_3 \sum_{i < j < k} z_i z_j z_k T_{ijk}$$

with:

  • $x^{(1)} = W^{(1)} z$ (linear term, typically $W^{(1)} = W_{\rm dec}$)
  • $x^{(2)} = \sum_{i < j} z_i z_j P_{ij}$ (quadratic term)
  • $x^{(3)} = \sum_{i < j < k} z_i z_j z_k T_{ijk}$ (cubic term)
  • $\lambda_2, \lambda_3$ are learned scalars modulating higher-order contributions.

Storage of $P_{ij} \in \mathbb{R}^d$ and $T_{ijk} \in \mathbb{R}^d$ naively scales as $O(d_{\rm sae}^2)$ and $O(d_{\rm sae}^3)$, prohibitive for $d_{\rm sae} \sim 10^4$. PolySAE circumvents this via low-rank tensor factorization on a shared projection subspace.
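For concreteness, the decoding equation can be implemented naively with explicit interaction tensors. This is a toy sketch with made-up tiny dimensions; at realistic dictionary widths the $T$ tensor alone would need $O(d_{\rm sae}^3 d)$ storage, which is what motivates the factorization of Section 3.

```python
import numpy as np
from itertools import combinations

d, d_sae = 6, 10                 # toy sizes; real d_sae ~ 10^4 makes this infeasible
rng = np.random.default_rng(1)
W1 = rng.normal(size=(d, d_sae))
P = rng.normal(size=(d_sae, d_sae, d))          # P[i, j] in R^d  -> O(d_sae^2) vectors
T = rng.normal(size=(d_sae, d_sae, d_sae, d))   # T[i, j, k] in R^d -> O(d_sae^3) vectors
lam2, lam3 = 0.5, 0.25           # stand-ins for the learned scalars

z = rng.normal(size=d_sae)       # sparse code from the encoder
x_hat = W1 @ z                                   # linear term
for i, j in combinations(range(d_sae), 2):       # quadratic term, i < j
    x_hat += lam2 * z[i] * z[j] * P[i, j]
for i, j, k in combinations(range(d_sae), 3):    # cubic term, i < j < k
    x_hat += lam3 * z[i] * z[j] * z[k] * T[i, j, k]
```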

3. Low-Rank Shared-Subspace Factorization

Feature interaction tensors are approximated in a low-rank fashion using a shared basis. Defining $U \in \mathbb{R}^{d_{\rm sae} \times R_1}$ (feature-projection matrix, typically $R_1 = d$), project $z$ as $\alpha = U^\top z \in \mathbb{R}^{R_1}$. Output-projection matrices $V^{(1)} \in \mathbb{R}^{d \times R_1}$, $V^{(2)} \in \mathbb{R}^{d \times R_2}$, $V^{(3)} \in \mathbb{R}^{d \times R_3}$ with $R_1 \geq R_2 \geq R_3$ parameterize the outputs. The reconstruction is

$$x^{(1)} = V^{(1)} \alpha, \quad x^{(2)} = V^{(2)} (\alpha * \alpha), \quad x^{(3)} = V^{(3)} (\alpha * \alpha * \alpha)$$

with $*$ denoting the elementwise product. This construction yields the tensor factorizations:

$$P_{ij} \approx \sum_{r=1}^{R_2} U_{i,r} U_{j,r} V^{(2)}_{:,r}$$

$$T_{ijk} \approx \sum_{r=1}^{R_3} U_{i,r} U_{j,r} U_{k,r} V^{(3)}_{:,r}$$

All interaction orders share $U$, ensuring the interaction structure remains aligned with the learned features. Enforcing $U^\top U = I$ (orthonormality) prevents degeneracy. Empirical ablations show that increasing $R_2, R_3$ beyond $\sim 0.1 R_1$ yields negligible reconstruction gains, supporting the hypothesis that interaction structure is low-rank in practice.
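The factorized decoder can be sketched as follows. One detail the text leaves implicit is how $V^{(2)} \in \mathbb{R}^{d \times R_2}$ receives an $R_2$-dimensional input when $\alpha \in \mathbb{R}^{R_1}$; since the factorization sums only over $r \leq R_2$ (resp. $R_3$), this sketch assumes the quadratic and cubic branches use the first $R_2$ / $R_3$ coordinates of $\alpha$. That truncation is an assumption, not stated in the source.

```python
import numpy as np

d, d_sae = 16, 64
R1, R2, R3 = d, 4, 4            # toy ranks; the text suggests R2, R3 ~ 0.1 * R1
rng = np.random.default_rng(2)

# Shared feature-projection basis U with orthonormal columns (via reduced QR).
U, _ = np.linalg.qr(rng.normal(size=(d_sae, R1)))
V1 = rng.normal(size=(d, R1))
V2 = rng.normal(size=(d, R2))
V3 = rng.normal(size=(d, R3))
lam2, lam3 = 1.0, 1.0           # stand-ins for the learned scalars

def poly_decode(z):
    alpha = U.T @ z                       # project sparse code into shared subspace
    x1 = V1 @ alpha                       # linear term
    x2 = V2 @ (alpha[:R2] * alpha[:R2])   # quadratic term (assumed truncation)
    x3 = V3 @ (alpha[:R3] ** 3)           # cubic term (assumed truncation)
    return x1 + lam2 * x2 + lam3 * x3

z = rng.normal(size=d_sae)
x_hat = poly_decode(z)
```

Note that the per-sample cost is three small matrix-vector products, so decoding stays linear in $d_{\rm sae}$ despite modeling quadratic and cubic interactions.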

4. Training Objective and Computational Considerations

PolySAE is trained with a reconstruction loss plus optional sparsity regularization:

$$z^{(b)} = S(\mathrm{ReLU}(W_{\rm enc}^\top x^{(b)} + b_{\rm enc}))$$

$$\hat{x}^{(b)} = \text{PolySAE-decode}(z^{(b)})$$

$$\mathcal{L} = \frac{1}{B} \sum_{b=1}^{B} \| x^{(b)} - \hat{x}^{(b)} \|_2^2 + \text{(optionally)}\ \lambda_{\ell_1} \|z^{(b)}\|_1$$

Typically, hard $K$-sparsity is used (no $\ell_1$ penalty), with the Adam optimizer (learning rate $3 \times 10^{-4}$). $U$ is maintained orthonormal via a positive-QR retraction after each update.
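One way to realize the positive-QR retraction is sketched below: after a gradient step, $U$ is re-orthonormalized by QR, with column signs flipped so that $\mathrm{diag}(R) \geq 0$, making the factorization unique. This is a plausible reading of "positive-QR retraction", not necessarily the paper's exact implementation; the gradient here is a random stand-in.

```python
import numpy as np

def positive_qr_retract(U):
    """Retract U back onto the set of matrices with orthonormal columns."""
    Q, R = np.linalg.qr(U)                 # reduced QR: Q has orthonormal columns
    signs = np.sign(np.diag(R))            # fix signs so diag(R) >= 0 (uniqueness)
    signs[signs == 0] = 1.0
    return Q * signs

d_sae, R1 = 32, 8
rng = np.random.default_rng(3)
U, _ = np.linalg.qr(rng.normal(size=(d_sae, R1)))

grad = rng.normal(size=(d_sae, R1))        # stand-in for dL/dU from backprop
lr = 3e-4                                  # learning rate from the text
U = positive_qr_retract(U - lr * grad)     # gradient step, then retraction
```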

The parameter overhead, for $R_1 = d$ and $R_2 = R_3 \approx 0.1 d$, adds $\sim 1.2 d^2$ parameters. On GPT-2 Small ($d = 768$), this results in a decoder parameter increase of only $\sim 2.5$–$3\%$ over a vanilla Top-$K$ SAE. This efficiency makes PolySAE tractable for large dictionary widths ($d_{\rm sae} \sim 16{,}384$).
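A back-of-envelope check of the stated overhead, under the assumption (implied by the $\sim 1.2 d^2$ count) that only $V^{(1)}, V^{(2)}, V^{(3)}$ are counted as added parameters, with the overhead measured against the full encoder-plus-decoder parameter count of a vanilla SAE:

```python
# GPT-2 Small sizes from the text.
d, d_sae = 768, 16_384
R1, R2, R3 = d, round(0.1 * d), round(0.1 * d)

vanilla = 2 * d_sae * d                    # encoder + decoder of a plain SAE
extra = d * R1 + d * R2 + d * R3           # V1, V2, V3: ~1.2 * d^2 added parameters

print(f"overhead: {100 * extra / vanilla:.1f}%")   # lands in the quoted 2.5-3% range
```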

5. Empirical Evaluation

PolySAE was evaluated on residual activations from four LLMs (GPT-2 Small, Pythia-410M, Pythia-1.4B, Gemma-2-2B) and three sparsifiers (TopK, BatchTopK, Matryoshka), with $K = 64$ and $d_{\rm sae} = 16{,}384$. Key empirical findings:

  • Reconstruction Error: MSE remains nearly unchanged (e.g., on GPT-2: SAE MSE $= 0.52$, PolySAE MSE $= 0.55$).
  • Probing F1 Score: Average improvement of $\approx 8\%$ in F1 on six linguistic tasks across all models and sparsifiers (e.g., GPT-2 TopK: $67.1\% \rightarrow 77.9\%$).
  • Distributional Separation: Wasserstein distances between class-conditional feature codes increase by $2$–$10\times$, indicating better separation of semantic classes.
  • Compositionality vs. Co-occurrence: PolySAE's learned quadratic interaction strengths ($B_{ij}$) have low correlation with empirical co-occurrence ($r = 0.06$), compared to vanilla SAE feature covariances ($r = 0.82$). This demonstrates capacity for capturing true compositional (not surface-level) interactions.
| Metric | SAE | PolySAE | Key Improvement |
|---|---|---|---|
| Reconstruction MSE (GPT-2) | 0.52 | 0.55 | $\sim$ unchanged |
| Probing F1 (GPT-2, TopK) | 67.1% | 77.9% | +10.8% absolute |
| Pearson $r$ (co-occurrence vs. interaction) | 0.82 | 0.06 | Drastic decorrelation |
| Decoder overhead (GPT-2) | — | $\sim 3\%$ | Minimal size increase |

6. Representative Feature Interaction Examples

Observed qualitative differences between PolySAE and standard SAEs underscore the model’s capability for capturing genuine composition:

Second-Order Interactions:

  • [star, stars] × [coffee, tea]: Correctly binds to “Starbucks” in appropriate contexts, contrasted with generic proper-noun firing in SAEs.
  • [surgery, repair] × [Trans, LGBT]: Specializes “surgery” under “Trans/LGBT,” tightening semantic scope.
  • [DNA, genetic] × [mod, mods]: Recovers the meaning of “genetic modification,” not just an agglomeration of "edit" features.

Third-Order Interactions:

  • [proved, proven] × [star, stars, superstar] × [reputation, fame]: Isolates relevant multiway semantic binding for sentences like “David Bowie proved some stars are big enough...”.
  • [black, racial] × [Americans, Canadians] × [people, women]: Disambiguates “Black Americans” intersection away from generic ethnicity features.

These interactions show that PolySAE allocates decoding capacity to morphology, phrasal semantics, and named-entity composition without proliferating monolithic dictionary atoms.

7. Summary and Significance

PolySAE introduces tractable, low-rank higher-order decoding to sparse autoencoders, enabling feature interactions that reflect genuine compositional semantics in LLM activations. It maintains the linear, interpretable encoder—critical for feature analysis—while enhancing decoder expressivity with minimal computational and parameter overhead ($\sim 3\%$ on GPT-2 Small). PolySAE demonstrates robust empirical improvements: an average $+8\%$ gain in probing F1, $2$–$10\times$ better class separation, and nearly zero correlation with surface-level co-occurrence. Its design and evaluation suggest a substantial advance in the analysis of compositional structure in neural representations (Koromilas et al., 1 Feb 2026).
