
SynthSAEBench-16k: A Benchmark for SAEs

Updated 18 February 2026
  • SynthSAEBench-16k is a synthetic benchmark that generates activation vectors by superimposing ground-truth features to mimic realistic LLM representation properties.
  • It employs a controlled sparse, correlated, and hierarchical structure to assess SAE models’ feature recovery, overfitting, and architectural innovations.
  • Empirical evaluations reveal trade-offs between reconstruction accuracy and feature alignment, setting a precise baseline for SAE comparisons.

SynthSAEBench-16k defines a standardized large-scale synthetic benchmark for sparse autoencoders (SAEs), designed to address the limitations of previous noisy or unrealistic benchmarks for analyzing the emergent properties and failure modes of SAE models. SynthSAEBench-16k provides ground-truth feature structure, correlations, hierarchy, and superposition phenomena at a scale meaningful for direct comparison to LLM neuron interpretability, enabling precise evaluation of architectural innovations, probing performance, and overfitting behaviors in SAEs (Chanin et al., 16 Feb 2026).

1. Synthetic Data Generation

SynthSAEBench-16k generates activation vectors $a \in \mathbb{R}^D$ by superimposing a sparse linear combination of $N = 16{,}384$ ground-truth feature vectors. This process explicitly encodes several realistic properties seen in LLM representation structure:

  • Ground-Truth Dictionary and Superposition:

Feature vectors $\{\mathbf{d}_i\}$, each in $\mathbb{R}^D$, are generated via

$$\mathbf{d}_i = \frac{\mathbf{g}_i}{\|\mathbf{g}_i\|_2}, \quad \mathbf{g}_i \sim \mathcal{N}(0, I_D)$$

The mean-maximum superposition overlap $\rho_{\mathrm{mm}}$ is regulated through an orthogonality-promoting loss, giving $\rho_{\mathrm{mm}} \approx 0.15$ for $D = 768$, $N = 16{,}384$.
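The sampling step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the benchmark's code: the function names are hypothetical, the orthogonality-promoting loss is not implemented, and `mean_max_overlap` assumes that the mean-maximum overlap is the average over features of the largest absolute cosine similarity with any other feature.

```python
import numpy as np

def random_unit_dictionary(n_features, dim, seed=0):
    """Draw Gaussian vectors and scale each row to unit L2 norm."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n_features, dim))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

def mean_max_overlap(d):
    """Assumed definition: mean over features of the largest |cosine|
    similarity with any *other* feature."""
    sims = np.abs(d @ d.T)
    np.fill_diagonal(sims, 0.0)  # ignore each feature's overlap with itself
    return float(sims.max(axis=1).mean())

# Small illustrative sizes; the benchmark itself uses D = 768 and N = 16,384.
d = random_unit_dictionary(n_features=512, dim=64)
rho_mm = mean_max_overlap(d)
print(f"mean-max overlap: {rho_mm:.3f}")
```

Shrinking `dim` relative to `n_features` raises the overlap, which is the sense in which a smaller hidden dimension means more superposition.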

  • Sparse, Correlated, and Hierarchical Coefficients:

Binary firing patterns $z_i$ are sampled using a Gaussian copula whose covariance combines a term built from a small-rank random matrix with a second term that ensures nonnegative variances. The marginal firing probabilities follow a Zipfian law, reflecting feature frequency skew.
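A Gaussian copula with Zipfian marginals can be sketched as follows. Everything specific here is an assumption made for illustration: the low-rank-plus-diagonal form of the covariance, the rank, the Zipf exponent, the maximum firing probability `p_max`, and the function names. The idea is to draw correlated unit-variance Gaussians and fire feature `i` whenever its Gaussian falls below the quantile matching its target probability.

```python
import numpy as np
from statistics import NormalDist

def zipf_probs(n, exponent=1.0, p_max=0.2):
    """Firing probabilities decaying as a power of the feature rank (assumed form)."""
    ranks = np.arange(1, n + 1, dtype=float)
    return p_max * ranks ** (-exponent)

def sample_firing(n_samples, probs, rank=4, seed=0):
    """Sample correlated binary firing patterns with the given marginals."""
    rng = np.random.default_rng(seed)
    n = probs.shape[0]
    factors = rng.standard_normal((n, rank)) / np.sqrt(rank)  # low-rank factor
    noise_var = np.ones(n)  # diagonal term keeps every variance positive
    std = np.sqrt((factors ** 2).sum(axis=1) + noise_var)
    # Gaussian with covariance factors @ factors.T + diag(noise_var)
    latent = rng.standard_normal((n_samples, rank)) @ factors.T
    latent += rng.standard_normal((n_samples, n)) * np.sqrt(noise_var)
    latent /= std  # rescale to unit-variance marginals
    # Feature i fires when its Gaussian falls below the p_i quantile.
    thresholds = np.array([NormalDist().inv_cdf(p) for p in probs])
    return latent < thresholds

probs = zipf_probs(50)
z = sample_firing(20_000, probs)
print(z[:, 0].mean(), probs[0])  # empirical vs. target firing rate
```

Because each marginal is standard normal after rescaling, the empirical firing rate of every feature matches its target probability up to sampling noise, while the shared factors induce correlations between features.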

Hierarchical structure enforces parent gating, mutual exclusion (only one sibling firing per parent), and (optionally) magnitude scaling by parent feature intensity over a forest of 128 trees, branching factor 4, depth up to 3.
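Parent gating and sibling mutual exclusion can be illustrated on a toy tree. This sketch makes several assumptions not stated in the text: the tree is encoded as a hypothetical `parent` array with `parent[i] < i`, the surviving sibling is chosen uniformly at random, and magnitude scaling by the parent is omitted.

```python
import numpy as np

def apply_hierarchy(z, parent, seed=0):
    """Enforce parent gating and sibling mutual exclusion on firing patterns.

    z:      (n_samples, n_features) boolean array of raw firings
    parent: parent[i] is the parent index of feature i, or -1 for a root;
            assumes parent[i] < i so parents are finalized before children
    """
    rng = np.random.default_rng(seed)
    z = z.copy()
    parent = np.asarray(parent)
    rows = np.arange(z.shape[0])
    for p in np.unique(parent[parent >= 0]):  # ascending, so top-down
        children = np.flatnonzero(parent == p)
        active = z[:, children] & z[:, [p]]  # a child needs its parent
        # Keep one randomly chosen active sibling per sample (assumed rule).
        scores = np.where(active, rng.random(active.shape), -1.0)
        winner = scores.argmax(axis=1)
        keep = np.zeros_like(active)
        keep[rows, winner] = active[rows, winner]
        z[:, children] = keep
    return z

# Toy tree: root 0 with children 1-4; feature 1 has children 5 and 6.
rng = np.random.default_rng(1)
parent = [-1, 0, 0, 0, 0, 1, 1]
z_raw = rng.random((1000, 7)) < 0.5
z_tree = apply_hierarchy(z_raw, parent)
print(z_tree[:, 1:5].sum(axis=1).max())  # at most one sibling fires
```

Processing parents in ascending index order means a child that loses the sibling draw is already switched off before its own children are gated.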

  • Final Activation Assembly:

The activation for one sample is:

$$a = \mathbf{b} + \sum_{i=1}^{N} c_i\,\mathbf{d}_i$$

with $c_i$ a rectified Gaussian coefficient if $z_i = 1$ and zero otherwise, and $\mathbf{b}$ a bias vector of norm 10.
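The assembly step can be sketched as a single matrix product. The coefficient mean and standard deviation, the random bias direction, and the function name are illustrative assumptions; only the rectified Gaussian form and the bias norm of 10 come from the description above.

```python
import numpy as np

def assemble_activations(z, d, bias, coef_mean=1.0, coef_std=0.5, seed=0):
    """Combine firing patterns, magnitudes, and the dictionary into activations.

    z:    (n_samples, n_features) boolean firing patterns
    d:    (n_features, dim) unit-norm feature vectors
    bias: (dim,) bias vector added to every sample
    """
    rng = np.random.default_rng(seed)
    # Rectified Gaussian magnitudes (mean and std are assumed values).
    magnitudes = np.maximum(rng.normal(coef_mean, coef_std, size=z.shape), 0.0)
    coeffs = np.where(z, magnitudes, 0.0)  # zero wherever the feature is off
    return coeffs @ d + bias, coeffs

rng = np.random.default_rng(0)
n, dim = 32, 16
d = rng.standard_normal((n, dim))
d /= np.linalg.norm(d, axis=1, keepdims=True)
bias = rng.standard_normal(dim)
bias *= 10.0 / np.linalg.norm(bias)  # rescale to norm 10
z = rng.random((5, n)) < 0.1
z[0] = False  # a sample with no active features
a, coeffs = assemble_activations(z, d, bias)
print(np.allclose(a[0], bias))  # an all-off sample reduces to the bias
```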

2. SAE Benchmark Protocol and Architectures

SynthSAEBench-16k specifies both data and an SAE model training regime to serve as a reproducible baseline for architectural comparison:

  • Model and Training Regime:
    • Input dimension $D = 768$, number of dictionary features $N = 16{,}384$, and a fixed SAE width (number of latents)
    • 200 million training samples with batch size 1024; Adam optimizer, learning rate decay over the last training third
  • Architectural Variants:

| SAE Variant | Sparsity Mechanism | Notable Loss/Regularization |
|---------------------|-----------------------------------------|----------------------------------------|
| L1-SAE | ReLU + $L_1$ loss, dynamic penalty coefficient | Target $L_0$; controller tunes the coefficient |
| BatchTopK | Soft top-$k$, batchwise $k$ | Dead-latent auxiliary loss |
| JumpReLU | Thresholded linear unit | As above |
| Matryoshka (prefix) | Nested BatchTopK prefixes, multi-loss | Each prefix must reconstruct |
| MP-SAE | Matching Pursuit: greedy decoding | No encoder matrix, matching residuals |
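Two of the sparsity mechanisms in the table can be sketched as plain activation functions. These are simplified forward passes only, under stated assumptions: `batch_topk` uses a hard batch-level cutoff rather than the soft variant named in the table, `jump_relu` uses a fixed threshold where a trained model would learn one, and both function names are hypothetical.

```python
import numpy as np

def batch_topk(pre, k):
    """Keep the k * batch_size largest positive pre-activations across the batch.

    Individual samples may keep more or fewer than k latents; only the
    batch average is constrained.
    """
    acts = np.maximum(pre, 0.0)
    n_keep = k * acts.shape[0]
    flat = acts.ravel()
    if n_keep >= flat.size:
        return acts
    threshold = np.partition(flat, -n_keep)[-n_keep]  # n_keep-th largest value
    return np.where(acts >= threshold, acts, 0.0)

def jump_relu(pre, theta):
    """Pass a pre-activation through unchanged only if it exceeds theta."""
    return np.where(pre > theta, pre, 0.0)

rng = np.random.default_rng(0)
pre = rng.standard_normal((8, 32))
sparse = batch_topk(pre, k=3)
print(np.count_nonzero(sparse) / pre.shape[0])  # average active latents per sample
jumped = jump_relu(np.array([0.1, 0.5, -1.0]), theta=0.3)
```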

All models use a general objective that combines a reconstruction term, a sparsity regularizer ($L_1$ or $L_0$) on the latent code, and an auxiliary term that penalizes dead latents.
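One plausible way to write an objective of this shape is given below. The symbols are assumptions introduced here for illustration and are not taken from the source: $f$ is the latent code, $\hat{a}$ the reconstruction, $\mathcal{S}$ the sparsity regularizer, $\mathcal{L}_{\mathrm{aux}}$ the dead-latent term, and $\lambda$, $\alpha$ their weights.

```latex
\mathcal{L}(a) = \| a - \hat{a} \|_2^2
  + \lambda \, \mathcal{S}(f)
  + \alpha \, \mathcal{L}_{\mathrm{aux}},
\qquad
\mathcal{S}(f) \in \{ \| f \|_1, \; \| f \|_0 \}
```

Under this reading, architectures such as BatchTopK that enforce sparsity structurally would set $\lambda = 0$ and rely on the activation function instead of a penalty.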

3. Evaluation Metrics and Diagnostic Regime

SynthSAEBench-16k leverages known ground-truth code and dictionary, enabling a multi-faceted metric suite that dissects both standard autoencoding and feature-recovery performance:

  • Reconstruction Metrics
    • Mean squared error (MSE)
    • Explained variance $R^2$
  • Feature-Recovery/Alignment Metrics
    • Mean correlation coefficient (MCC) of decoder columns with true features (Hungarian matching)
    • Feature uniqueness (fraction of unique best matches)
  • Probing and Binary Classification
    • For latent-feature pairs, mean precision, recall, F1, and AUC, interpreting latent activations as feature detectors
  • Sparsity Diagnostics
    • Actual latent $L_0$ (number of nonzero latents)
    • Dead latents (latents never firing on the test set)
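Several of these metrics can be sketched with NumPy and SciPy's `linear_sum_assignment`, which solves the Hungarian matching problem. The sketch makes assumptions the text does not settle: absolute cosine similarity stands in for the correlation coefficient, uniqueness is computed as the number of distinct best-match features divided by the number of latents, and decoder directions are stored as rows.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def explained_variance(x, x_hat):
    """1 - residual sum of squares / total sum of squares around the mean."""
    resid = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()
    return 1.0 - resid / total

def mcc_and_uniqueness(decoder, truth):
    """decoder: (n_latents, dim); truth: (n_features, dim)."""
    dec = decoder / np.linalg.norm(decoder, axis=1, keepdims=True)
    tru = truth / np.linalg.norm(truth, axis=1, keepdims=True)
    sims = np.abs(dec @ tru.T)
    # One-to-one matching that maximizes total similarity.
    rows, cols = linear_sum_assignment(sims, maximize=True)
    mcc = float(sims[rows, cols].mean())
    # Fraction of latents whose best-matching feature is not shared.
    uniqueness = np.unique(sims.argmax(axis=1)).size / sims.shape[0]
    return mcc, uniqueness

def sparsity_stats(latents):
    """Return (mean L0 per sample, number of latents that never fire)."""
    active = latents != 0
    return float(active.sum(axis=1).mean()), int((~active.any(axis=0)).sum())

rng = np.random.default_rng(0)
truth = rng.standard_normal((6, 16))
decoder = 2.0 * truth[rng.permutation(6)]  # perfect recovery up to order and scale
mcc, uniq = mcc_and_uniqueness(decoder, truth)
l0, dead = sparsity_stats(np.array([[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]]))
```

A decoder that recovers the true features up to permutation and scale scores an MCC of 1, which is why MCC can diverge from explained variance: reconstruction quality does not require this alignment.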

4. Empirical Phenomena and Benchmarked Behaviors

SynthSAEBench-16k surfaces several empirical phenomena paralleling observations from LLM-based SAE studies and exposes new failure modes:

  • Reconstruction–Latent Quality Disconnect:

MP-SAEs obtain the highest explained variance (∼0.96) but much lower MCC (∼0.5) and F1 (∼0.6), which is attributed to overfitting to superposition noise. Conversely, Matryoshka SAEs exhibit high MCC/F1 (up to ∼0.88) but substantially worse reconstruction.

  • Precision–Recall–L0 Trade-off:

All architectures display increasing recall and decreasing precision as the enforced $L_0$ grows from 15 to 45. This mirrors findings in LLM neurons: sparser codes select fewer, more precise features, while denser codes increase coverage at the cost of false positives.
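The precision and recall in this trade-off come from treating a latent as a binary detector for its matched feature. A minimal sketch, assuming a latent counts as "predicting" the feature whenever its activation is positive (the function name and that threshold are illustrative choices):

```python
import numpy as np

def detection_scores(latent_acts, feature_on):
    """Precision, recall, and F1 of one latent as a detector of one feature.

    latent_acts: (n_samples,) latent activations
    feature_on:  (n_samples,) boolean ground-truth firing of the feature
    """
    pred = latent_acts > 0
    tp = int(np.sum(pred & feature_on))
    fp = int(np.sum(pred & ~feature_on))
    fn = int(np.sum(~pred & feature_on))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

latent_acts = np.array([0.5, 0.0, 0.2, 0.0, 0.9])
feature_on = np.array([True, True, False, False, True])
precision, recall, f1 = detection_scores(latent_acts, feature_on)
print(precision, recall, f1)
```

In this toy case the latent fires on two of the three true occurrences and once spuriously, so precision and recall are both 2/3.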

  • Matching Pursuit Overfitting:

MP-SAEs improve reconstruction as the hidden dimension decreases (that is, as superposition increases), but their MCC and F1 degrade further. A more flexible encoder can overfit by exploiting spurious correlations created by superposition rather than learning the actual dictionary atoms.
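For readers unfamiliar with the encoder involved, classical matching pursuit can be sketched as below. This is the textbook greedy algorithm, not the MP-SAE training procedure from the paper, and the function name is hypothetical. At each step it picks the dictionary atom most aligned with the current residual and subtracts its contribution.

```python
import numpy as np

def matching_pursuit(x, dictionary, n_steps):
    """Greedy sparse coding of x over unit-norm dictionary rows.

    Returns the coefficient vector and the final residual.
    """
    residual = np.asarray(x, dtype=float).copy()
    codes = np.zeros(dictionary.shape[0])
    for _ in range(n_steps):
        scores = dictionary @ residual      # projection onto every atom
        j = int(np.argmax(np.abs(scores)))  # best-aligned atom
        codes[j] += scores[j]
        residual -= scores[j] * dictionary[j]  # remove its contribution
    return codes, residual

dictionary = np.eye(4)  # orthonormal atoms, so two steps recover x exactly
x = np.array([3.0, 0.0, -2.0, 0.0])
codes, residual = matching_pursuit(x, dictionary, n_steps=2)
print(codes, np.linalg.norm(residual))
```

With non-orthogonal atoms, each step's choice depends on overlaps between atoms, which gives the encoder room to reduce the residual using combinations that do not correspond to the true active features.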

5. Comparison to Supervised Probing

SynthSAEBench-16k enables direct comparison to supervised probes. Logistic regression classifiers trained on SAE latents achieve mean F1 scores around 0.974 and AUC approximately 0.9999 (on the 4096 most frequent features), substantially outperforming all SAE variants. This reaffirms, even in idealized settings with linear ground-truth codes, the consistent disadvantage of unsupervised sparse autoencoding compared to supervised probing for feature recovery.
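A supervised probe of this kind is a logistic regression trained to predict whether one ground-truth feature is active. The sketch below is a minimal NumPy version fitted by gradient descent on toy data; the learning rate, step count, data, and function names are illustrative assumptions, and the input `x` stands for whatever representation is being probed.

```python
import numpy as np

def train_logistic_probe(x, y, lr=0.5, n_steps=500):
    """Fit weights and bias by gradient descent on the mean logistic loss."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted probabilities
        grad = p - y                            # gradient w.r.t. the logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

def probe_accuracy(x, y, w, b):
    """Fraction of samples classified correctly at a 0.5 probability cutoff."""
    return float(((x @ w + b > 0) == (y > 0.5)).mean())

# Toy data: the label depends linearly on two input coordinates.
rng = np.random.default_rng(0)
x = rng.standard_normal((400, 8))
y = (x[:, 0] + 0.5 * x[:, 1] > 0).astype(float)
w, b = train_logistic_probe(x, y)
acc = probe_accuracy(x, y, w, b)
print(f"training accuracy: {acc:.3f}")
```

Unlike an SAE, the probe is given the feature labels during training, which is the source of the advantage described above.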

6. Significance and Research Applications

SynthSAEBench-16k sets a new standard for diagnosing SAE architectures at realistic scale, offering:

  • Ground-truth alignment for quantifying overfitting, superposition, and latent interpretability phenomena
  • Controlled ablations for probing correlation, hierarchy, and Zipfian feature frequency effects
  • Benchmarking under precisely regulated levels of superposition and feature mixing, facilitating objective architectural comparisons before scaling to intractable LLM representations

SynthSAEBench-16k is positioned as a complement to open-vocabulary LLM neuron studies, enabling low-noise hypothesis testing, reproducible failure mode diagnosis, and validation of architectural improvements using synthetic data before transfer to LLMs (Chanin et al., 16 Feb 2026).
