PolySAE: Sparse Autoencoder with Polynomial Decoding
- PolySAE is a sparse autoencoder variant that extends linear reconstructions by incorporating quadratic and cubic terms to model pairwise and triple feature interactions.
- It employs a low-rank shared-subspace factorization to efficiently approximate high-order tensors and reduce parameter overhead in the decoding process.
- Empirical evaluations demonstrate an 8%-10% F1 improvement and enhanced semantic separation, all achieved with a minimal increase in decoder complexity.
PolySAE is a sparse autoencoder (SAE) variant designed to capture the compositional structure in neural network representations by extending the linear reconstruction found in traditional SAEs to include higher-order polynomial feature interactions. Unlike classic SAEs, which decompose activations into sparse superpositions of additive dictionary atoms, PolySAE introduces quadratic and cubic decoding terms that enable modeling of pairwise and triple feature bindings. This extension preserves the interpretability provided by a linear encoder but allows the decoder to represent meanings not expressible in a purely additive basis, such as compounds, morphological binding, and multi-entity composition within LLM activations (Koromilas et al., 1 Feb 2026).
1. Motivation and Theoretical Foundations
Sparse autoencoders are widely used for interpreting the superposed internal representations of LLMs. These models express a given activation vector $x \in \mathbb{R}^d$ as a sparse code $z \in \mathbb{R}^m$, with $W_{\text{enc}}$ and $W_{\text{dec}}$ constituting the encoder and decoder, respectively. The conventional form

$$z = \sigma(W_{\text{enc}} x + b_{\text{enc}}), \qquad \hat{x} = W_{\text{dec}}\, z + b_{\text{dec}},$$

where $\sigma$ is a sparsifier (e.g., the Top-$K$ operator), produces reconstructions via linear combination of dictionary atoms. However, the additive structure cannot disambiguate true compositional semantics from simple co-occurrence. For instance, distinguishing "Starbucks" as a composition of "star" and "coffee" features is not feasible with purely linear models; such SAEs must allocate dedicated features for the compound, compromising atomicity.
PolySAE addresses this by incorporating polynomial decoding: it enables the decoder to express both pairwise ($z_i z_j$) and triple ($z_i z_j z_k$) interactions, thus enriching the representational power without compromising sparse, interpretable encodings.
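For reference, the purely additive baseline that PolySAE extends can be sketched in a few lines of NumPy. This is a minimal toy sketch with random weights and hypothetical sizes, not the paper's implementation; `topk` is an illustrative helper realizing the hard Top-$K$ sparsifier.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k = 8, 32, 4                  # toy sizes: model dim d, dictionary width m, sparsity K

W_enc = rng.standard_normal((m, d)) * 0.1
W_dec = rng.standard_normal((d, m)) * 0.1
b = np.zeros(d)

def topk(a, k):
    """Hard Top-K sparsifier: keep the k largest pre-activations, zero the rest."""
    z = np.zeros_like(a)
    idx = np.argpartition(a, -k)[-k:]
    z[idx] = a[idx]
    return z

x = rng.standard_normal(d)          # an activation vector to reconstruct
z = topk(W_enc @ (x - b), k)        # sparse code: at most k active features
x_hat = W_dec @ z + b               # purely additive reconstruction of x
```

The reconstruction here is a weighted sum of dictionary atoms only; no term couples two active features, which is exactly the limitation the polynomial decoder targets.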
2. Polynomial Decoder Architecture
Let $z \in \mathbb{R}^m$ denote the output of the linear encoder and sparsifier. The PolySAE decoder reconstructs activations as:

$$\hat{x} = W_{\text{dec}}\, z + \alpha_2\, T^{(2)}(z, z) + \alpha_3\, T^{(3)}(z, z, z),$$

with:
- $W_{\text{dec}}\, z$ (linear term)
- $T^{(2)}(z, z)$, with $T^{(2)} \in \mathbb{R}^{d \times m \times m}$ contracted against $z$ twice (quadratic term)
- $T^{(3)}(z, z, z)$, with $T^{(3)} \in \mathbb{R}^{d \times m \times m \times m}$ contracted against $z$ three times (cubic term)
- $\alpha_2, \alpha_3$: learned scalars modulating higher-order contributions.
Storage of $T^{(2)}$ and $T^{(3)}$ naively scales as $O(dm^2)$ and $O(dm^3)$, prohibitive at realistic dictionary widths. PolySAE circumvents this via low-rank tensor factorization on a shared projection subspace.
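To make the scaling concrete, a back-of-the-envelope parameter count comparing the dense tensors with the low-rank replacement described in the next section; the sizes are illustrative assumptions, not the paper's exact configuration.

```python
# Dense interaction tensors vs. the shared low-rank scheme.
# d = model dim, m = dictionary width, r = shared-subspace rank (all assumed).
d, m, r = 768, 24576, 64

naive_quadratic = d * m ** 2          # dense T2: d x m x m entries
naive_cubic     = d * m ** 3          # dense T3: d x m x m x m entries
low_rank        = r * m + 2 * d * r   # P (r x m) plus V2, V3 (each d x r)

print(f"dense quadratic tensor: {naive_quadratic:,}")
print(f"dense cubic tensor:     {naive_cubic:,}")
print(f"low-rank replacement:   {low_rank:,}")
```

Even the quadratic tensor alone is hundreds of billions of entries at these sizes, while the shared-subspace parameterization stays in the low millions.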
3. Low-Rank Shared-Subspace Factorization
Feature interaction tensors are approximated in a low-rank fashion using a shared basis. Defining $P \in \mathbb{R}^{r \times m}$ (feature-projection matrix, typically $r \ll m$), project the code as $u = Pz$. Output-projection matrices $V_2, V_3 \in \mathbb{R}^{d \times r}$ parameterize the higher-order outputs. The reconstruction is

$$\hat{x} = W_{\text{dec}}\, z + \alpha_2\, V_2 (u \odot u) + \alpha_3\, V_3 (u \odot u \odot u),$$

with $\odot$ denoting the elementwise product. This construction yields the tensor factorizations

$$T^{(2)}_{aij} = \sum_{k=1}^{r} (V_2)_{ak} P_{ki} P_{kj}, \qquad T^{(3)}_{aijl} = \sum_{k=1}^{r} (V_3)_{ak} P_{ki} P_{kj} P_{kl}.$$
All interaction orders share $P$, ensuring interaction structure remains aligned with learned features. Enforcing $PP^{\top} = I_r$ (row orthonormality) prevents degeneracy. Empirical ablations show that increasing the rank $r$ beyond modest values yields negligible reconstruction gains, supporting the hypothesis that interaction structure is low-rank in practice.
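The factorized decoder above can be sketched directly in NumPy. This is a minimal demonstration with toy dimensions and random parameters; the scalars `alpha2`, `alpha3` are fixed here rather than learned, and all sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, r = 8, 32, 4                   # toy sizes; real dictionaries are far wider

W_dec = rng.standard_normal((d, m))  # linear dictionary, as in a vanilla SAE
P = np.linalg.qr(rng.standard_normal((m, r)))[0].T   # r x m with orthonormal rows
V2 = rng.standard_normal((d, r))     # output projection, quadratic branch
V3 = rng.standard_normal((d, r))     # output projection, cubic branch
alpha2, alpha3 = 0.1, 0.01           # higher-order scalars, fixed for the demo

def poly_decode(z):
    """x_hat = W_dec z + a2 * V2 (u o u) + a3 * V3 (u o u o u), with u = P z."""
    u = P @ z
    return W_dec @ z + alpha2 * (V2 @ (u * u)) + alpha3 * (V3 @ (u * u * u))

z = np.zeros(m)
z[[1, 5, 9]] = 1.0                   # a sparse code with three active features
x_hat = poly_decode(z)
```

Note that the quadratic and cubic branches only ever touch the $r$-dimensional projection $u$, which is what keeps the cost linear in $m$.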
4. Training Objective and Computational Considerations
PolySAE is trained with a reconstruction loss plus optional sparsity regularization:

$$\mathcal{L} = \lVert x - \hat{x} \rVert_2^2 + \lambda \lVert z \rVert_1.$$

Typically, hard Top-$K$ sparsity is used (no $\ell_1$ penalty, i.e., $\lambda = 0$) with the Adam optimizer. $P$ is maintained orthonormal via a positive-QR retraction after each update.
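The retraction step can be sketched as a standard QR-based projection back onto row-orthonormal matrices. The sign-flip convention used here to realize "positive QR" (so that $\operatorname{diag}(R) > 0$, making the factorization unique) is an assumption about the paper's exact recipe.

```python
import numpy as np

def positive_qr_retract(P):
    """Project P (r x m) back onto matrices with orthonormal rows via QR.

    Factor P^T = Q R, then flip column signs so diag(R) > 0; the 'positive
    QR' convention makes the factorization unique and the retraction
    well defined.
    """
    Q, R = np.linalg.qr(P.T)           # Q: m x r with orthonormal columns
    sign = np.sign(np.diag(R))
    sign[sign == 0] = 1.0
    return (Q * sign).T                # r x m, rows orthonormal again

rng = np.random.default_rng(0)
P = np.linalg.qr(rng.standard_normal((32, 4)))[0].T   # start orthonormal
P = P + 0.01 * rng.standard_normal(P.shape)           # stand-in for an Adam step
P = positive_qr_retract(P)                            # restore P P^T = I_r
```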
The parameter overhead, dominated by $P$, $V_2$, and $V_3$, adds roughly $rm + 2dr$ parameters. On GPT-2 Small, this results in a decoder parameter increase of only about 3% over a vanilla Top-$K$ SAE. This efficiency makes PolySAE tractable at large dictionary widths.
5. Empirical Evaluation
PolySAE was evaluated on residual activations from four LLMs (GPT-2 Small, Pythia-410M, Pythia-1.4B, Gemma-2-2B) and three sparsifiers (TopK, BatchTopK, Matryoshka), with matched dictionary width $m$ and sparsity $K$ across comparisons. Key empirical findings:
- Reconstruction Error: MSE remains nearly unchanged (e.g., on GPT-2: SAE MSE 0.52, PolySAE MSE 0.55).
- Probing F1 Score: Average improvement of 8%-10% absolute in F1 on six linguistic tasks across all models and sparsifiers (e.g., GPT-2 TopK: 67.1% to 77.9%).
- Distributional Separation: Wasserstein distances between class-conditional feature codes increase by a factor of 2 or more, indicating better separation of semantic classes.
- Compositionality vs. Co-occurrence: PolySAE's learned quadratic interaction strengths have low correlation with empirical feature co-occurrence (Pearson $r = 0.06$), compared to vanilla SAE feature covariances ($r = 0.82$). This demonstrates capacity for capturing true compositional (not surface-level) interactions.
| Metric | SAE | PolySAE | Key Improvement |
|---|---|---|---|
| Reconstruction MSE (GPT-2) | 0.52 | 0.55 | unchanged |
| Probing F1 (GPT-2, TopK) | 67.1% | 77.9% | +10.8% absolute |
| Pearson $r$ (co-occurrence vs. interaction) | 0.82 | 0.06 | Drastic decorrelation |
| Decoder Overhead (GPT-2) | — | 3% | Minimally increased size |
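As an illustration only, the co-occurrence-vs-interaction diagnostic from the table can be computed along these lines. The data here are synthetic stand-ins, and the paper's exact estimator is not restated in this summary; `upper_offdiag` and `pearson` are hypothetical helpers.

```python
import numpy as np

def upper_offdiag(A):
    """Flatten the strict upper triangle of a square matrix into a vector."""
    i, j = np.triu_indices(A.shape[0], k=1)
    return A[i, j]

def pearson(a, b):
    """Pearson correlation between two equal-length vectors."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

rng = np.random.default_rng(0)
m, n = 8, 1000
Z = (rng.random((n, m)) < 0.3) * rng.random((n, m))   # synthetic sparse codes

active = (Z > 0).astype(float)
cooc = (active.T @ active) / n          # empirical pairwise co-occurrence rates
S = rng.standard_normal((m, m))         # stand-in for learned interaction strengths
S = np.abs(S + S.T) / 2                 # symmetric, nonnegative magnitudes

r_val = pearson(upper_offdiag(S), upper_offdiag(cooc))
```

A value of `r_val` near zero, as the paper reports for PolySAE, indicates that the learned interactions are not merely restating which features fire together.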
6. Representative Feature Interaction Examples
Observed qualitative differences between PolySAE and standard SAEs underscore the model's capability for capturing genuine composition:
Second-Order Interactions:
- [star, stars] × [coffee, tea]: Correctly binding to "Starbucks" in appropriate contexts, contrasted with generic proper-noun firing in SAEs.
- [surgery, repair] × [Trans, LGBT]: Specializes "surgery" under "Trans/LGBT," tightening semantic scope.
- [DNA, genetic] × [mod, mods]: Recovers the meaning of "genetic modification," not just an agglomeration of "edit" features.
Third-Order Interactions:
- [proved, proven] × [star, stars, superstar] × [reputation, fame]: Isolates the relevant multiway semantic binding for sentences like "David Bowie proved some stars are big enough...".
- [black, racial] × [Americans, Canadians] × [people, women]: Disambiguates the "Black Americans" intersection away from generic ethnicity features.
These interactions show that PolySAE allocates decoding capacity to morphology, phrasal semantics, and named-entity composition without proliferating monolithic dictionary atoms.
7. Summary and Significance
PolySAE introduces tractable, low-rank higher-order decoding to sparse autoencoders, enabling feature interactions that reflect genuine compositional semantics in LLM activations. It maintains the linear, interpretable encoder, critical for feature analysis, while enhancing decoder expressivity with minimal computational and parameter overhead (about 3% on GPT-2 Small). PolySAE demonstrates robust empirical improvements: an 8%-10% average gain in probing F1, a factor of 2 or better class separation, and nearly zero correlation with surface-level co-occurrence. Its design and evaluation suggest a substantial advance in the analysis of compositional structure in neural representations (Koromilas et al., 1 Feb 2026).