SynthSAEBench-16k: A Benchmark for SAEs
- SynthSAEBench-16k is a synthetic benchmark that generates activation vectors by superimposing ground-truth features to mimic realistic LLM representation properties.
- It employs a controlled sparse, correlated, and hierarchical structure to assess SAE models’ feature recovery, overfitting, and architectural innovations.
- Empirical evaluations reveal trade-offs between reconstruction accuracy and feature alignment, setting a precise baseline for SAE comparisons.
SynthSAEBench-16k defines a standardized large-scale synthetic benchmark for sparse autoencoders (SAEs), designed to address the limitations of previous noisy or unrealistic benchmarks for analyzing the emergent properties and failure modes of SAE models. SynthSAEBench-16k provides ground-truth feature structure, correlations, hierarchy, and superposition phenomena at a scale meaningful for direct comparison to LLM neuron interpretability, enabling precise evaluation of architectural innovations, probing performance, and overfitting behaviors in SAEs (Chanin et al., 16 Feb 2026).
1. Synthetic Data Generation
SynthSAEBench-16k generates activation vectors by superimposing a sparse linear combination of ground-truth feature vectors. This process explicitly encodes several realistic properties seen in LLM representation structure:
- Ground-Truth Dictionary and Superposition:
Feature vectors $f_1, \dots, f_N$, each in $\mathbb{R}^d$, form the ground-truth dictionary. Their mean-maximum superposition overlap is regulated through an orthogonality-promoting loss.
- Sparse, Correlated, and Hierarchical Coefficients:
Binary firing patterns are sampled using a Gaussian copula with covariance $\Sigma = FF^\top + D$, where $F$ is a small-rank random matrix and the diagonal term $D$ ensures nonnegative variances. Marginals $p_i$ follow a Zipfian law with exponent $\alpha$, reflecting feature frequency skew:
$$p_i \propto i^{-\alpha}$$
Hierarchical structure enforces parent gating, mutual exclusion (only one sibling firing per parent), and (optionally) magnitude scaling by parent feature intensity over a forest of 128 trees, branching factor 4, depth up to 3.
- Final Activation Assembly:
The activation for one sample is:
$$x = \sum_{i} c_i f_i + b$$
with $f_i$ the ground-truth feature vectors, $c_i$ a rectified Gaussian coefficient if feature $i$ fires and zero otherwise, and a bias vector $b$ of norm 10.
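The generation steps above can be sketched at toy scale as follows. This is a minimal illustration under stated assumptions, not the benchmark's implementation: the sizes, the Zipf exponent, the magnitude distribution, and the names `build_generator` and `sample_activations` are all hypothetical, the orthogonality-promoting loss is skipped, and the hierarchy is reduced to parent gating (no sibling mutual exclusion or magnitude scaling).

```python
import numpy as np
from statistics import NormalDist

def build_generator(n_features=64, dim=16, rank=4, n_roots=8, branching=4,
                    zipf_exponent=1.0, max_prob=0.3, bias_norm=1.0, seed=0):
    """Toy generator; every default here is an illustrative assumption."""
    rng = np.random.default_rng(seed)
    # Ground-truth dictionary: random unit-norm feature vectors (one per row).
    features = rng.normal(size=(n_features, dim))
    features /= np.linalg.norm(features, axis=1, keepdims=True)
    # Zipfian marginals: p_i proportional to i^(-exponent).
    ranks = np.arange(1, n_features + 1, dtype=float)
    probs = max_prob * ranks ** (-zipf_exponent)
    # Copula thresholds: feature i fires when its latent Gaussian exceeds t_i.
    thresholds = np.array([NormalDist().inv_cdf(1.0 - p) for p in probs])
    # Low-rank-plus-diagonal covariance, rescaled to unit variances.
    factor = 0.3 * rng.normal(size=(n_features, rank))
    cov = factor @ factor.T + np.eye(n_features)
    inv_std = 1.0 / np.sqrt(np.diag(cov))
    chol = np.linalg.cholesky(cov * np.outer(inv_std, inv_std))
    # Simple forest: the first n_roots features are roots, the rest are children.
    parent = np.full(n_features, -1)
    child_ids = np.arange(n_roots, n_features)
    parent[child_ids] = (child_ids - n_roots) // branching
    bias = rng.normal(size=dim)
    bias *= bias_norm / np.linalg.norm(bias)
    return {"features": features, "thresholds": thresholds, "chol": chol,
            "parent": parent, "bias": bias}

def sample_activations(gen, n_samples, seed=1):
    rng = np.random.default_rng(seed)
    n_features = gen["features"].shape[0]
    # Correlated standard normals, thresholded into binary firing patterns.
    latent = rng.normal(size=(n_samples, n_features)) @ gen["chol"].T
    firing = latent > gen["thresholds"]
    # Parent gating: parents have smaller indices, so this pass runs top-down.
    for i in np.flatnonzero(gen["parent"] >= 0):
        firing[:, i] &= firing[:, gen["parent"][i]]
    # Rectified Gaussian magnitudes for firing features, zero otherwise.
    magnitudes = np.maximum(rng.normal(1.0, 0.25, size=firing.shape), 0.0)
    coeffs = firing * magnitudes
    # Superimpose the active features and add the bias.
    x = coeffs @ gen["features"] + gen["bias"]
    return x, coeffs, firing

gen = build_generator()
x, coeffs, firing = sample_activations(gen, n_samples=1000)
```

Note that gating lowers the effective firing rate of child features below their Zipfian marginal; a fuller implementation would have to account for that when calibrating frequencies.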
2. SAE Benchmark Protocol and Architectures
SynthSAEBench-16k specifies both data and an SAE model training regime to serve as a reproducible baseline for architectural comparison:
- Model and Training Regime:
- Input dimension $d$, number of dictionary features $N$, SAE width (number of latents) $L$
- 200 million training samples with batch size 1024; Adam optimizer, learning rate decay over the last training third
- Architectural Variants:
| SAE Variant | Sparsity Mechanism | Notable Loss/Regularization |
|---------------------|-----------------------------------------|----------------------------------------|
| L1-SAE | ReLU + $L_1$ loss, dynamic coefficient $\lambda$ | Target $L_0$, controller tunes $\lambda$ |
| BatchTopK | Soft top-$k$, batchwise | Dead-latent auxiliary loss |
| JumpReLU | Thresholded linear unit | As above |
| Matryoshka (prefix) | Nested BatchTopK prefixes, multi-loss | Each prefix must reconstruct |
| MP-SAE | Matching Pursuit, greedy decoding | No encoder matrix, matching residuals |
All models use the general objective:
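$$\mathcal{L} = \lVert x - \hat{x} \rVert_2^2 + \lambda\,\mathcal{S}(z) + \mathcal{L}_{\text{aux}}$$
where $\hat{x}$ is the reconstruction of the input $x$, $\mathcal{S}$ is an $L_1$ or $L_0$ regularizer weighted by $\lambda$, $\mathcal{L}_{\text{aux}}$ penalizes dead latents, and $z$ denotes the latent code.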
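To make the objective and the sparsity mechanisms concrete, here is a minimal NumPy sketch of a ReLU SAE forward pass, a hard batchwise top-k selection, and a reconstruction-plus-L1 loss. The function names, the tied initialization, and the shapes are assumptions for illustration; the dead-latent auxiliary term is omitted, and the batchwise selection shown is a simplified stand-in rather than the benchmark's exact BatchTopK.

```python
import numpy as np

def relu_sae_forward(x, W_enc, b_enc, W_dec, b_dec):
    """Encode with ReLU, decode linearly. Shapes: x (batch, d), W_enc (d, L), W_dec (L, d)."""
    z = np.maximum((x - b_dec) @ W_enc + b_enc, 0.0)
    x_hat = z @ W_dec + b_dec
    return z, x_hat

def batch_topk(pre_acts, k):
    """Keep the k * batch_size largest positive activations across the whole batch."""
    acts = np.maximum(pre_acts, 0.0)
    n_keep = k * acts.shape[0]
    flat = acts.ravel()
    if n_keep >= flat.size:
        return acts
    # Value of the n_keep-th largest activation in the batch.
    cutoff = np.partition(flat, -n_keep)[-n_keep]
    return np.where(acts >= cutoff, acts, 0.0)

def sae_loss(x, x_hat, z, sparsity_coeff=0.0):
    """Squared reconstruction error plus an L1 penalty on the latent code."""
    mse = np.mean(np.sum((x - x_hat) ** 2, axis=1))
    l1 = np.mean(np.sum(np.abs(z), axis=1))
    return mse + sparsity_coeff * l1

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))
W_enc = 0.1 * rng.normal(size=(16, 32))
W_dec = W_enc.T.copy()          # tied initialization, an illustrative choice
b_enc, b_dec = np.zeros(32), np.zeros(16)

z, x_hat = relu_sae_forward(x, W_enc, b_enc, W_dec, b_dec)
z_topk = batch_topk((x - b_dec) @ W_enc + b_enc, k=4)
loss = sae_loss(x, x_hat, z, sparsity_coeff=1e-3)
```

With `k=4` and a batch of 8, the batchwise selection keeps at most 32 activations in total, so individual samples may use more or fewer than 4 latents while the batch average stays at 4.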
3. Evaluation Metrics and Diagnostic Regime
SynthSAEBench-16k leverages known ground-truth code and dictionary, enabling a multi-faceted metric suite that dissects both standard autoencoding and feature-recovery performance:
- Reconstruction Metrics
- Mean squared error (MSE)
- Explained variance ($R^2$)
- Feature-Recovery/Alignment Metrics
- Mean correlation coefficient (MCC) of decoder columns with true features (Hungarian matching)
- Feature uniqueness (fraction of unique best matches)
- Probing and Binary Classification
- For latent-feature pairs, mean precision, recall, F1, and AUC, interpreting latent activations as feature detectors
- Sparsity Diagnostics
- Actual latent $L_0$ (number of nonzero latents)
- Dead latents (latents never firing on the test set)
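The recovery and sparsity metrics listed above can be sketched as follows. This is an illustrative version under assumptions: absolute cosine similarity is used as the matching score, directions are stored as rows, and the helper names are hypothetical. The Hungarian matching uses SciPy's `linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def _abs_cosine(decoder, features):
    """Absolute cosine similarity between decoder rows and feature rows."""
    d = decoder / np.linalg.norm(decoder, axis=1, keepdims=True)
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    return np.abs(d @ f.T)

def mean_correlation_coefficient(decoder, features):
    """Mean similarity under the optimal one-to-one (Hungarian) matching."""
    sim = _abs_cosine(decoder, features)
    rows, cols = linear_sum_assignment(sim, maximize=True)
    return sim[rows, cols].mean()

def feature_uniqueness(decoder, features):
    """Fraction of latents whose best-matching feature is not shared."""
    best = _abs_cosine(decoder, features).argmax(axis=1)
    return len(np.unique(best)) / len(best)

def explained_variance(x, x_hat):
    residual = np.sum((x - x_hat) ** 2)
    total = np.sum((x - x.mean(axis=0)) ** 2)
    return 1.0 - residual / total

def sparsity_diagnostics(z):
    """Return (mean L0 per sample, number of latents that never fire)."""
    active = z != 0
    return active.sum(axis=1).mean(), int((~active.any(axis=0)).sum())

rng = np.random.default_rng(0)
features = rng.normal(size=(10, 16))
decoder = 2.0 * features[rng.permutation(10)]   # a perfect, rescaled, shuffled dictionary
mcc = mean_correlation_coefficient(decoder, features)
uniqueness = feature_uniqueness(decoder, features)
mean_l0, n_dead = sparsity_diagnostics(np.array([[1.0, 0.0, 0.0], [2.0, 0.0, 3.0]]))
```

Because the toy decoder is a permuted and rescaled copy of the true dictionary, both MCC and uniqueness come out at 1; a learned SAE would generally score lower on both.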
4. Empirical Phenomena and Benchmarked Behaviors
SynthSAEBench-16k surfaces several empirical phenomena paralleling observations from LLM-based SAE studies and exposes new failure modes:
- Reconstruction–Latent Quality Disconnect:
MP-SAEs obtain the highest explained variance (∼0.96) but much lower MCC (∼0.5) and F1 (∼0.6), attributable to overfitting to superposition noise. Conversely, Matryoshka SAEs exhibit high MCC/F1 (up to ∼0.88) but substantially inferior reconstruction.
- Precision–Recall–L0 Trade-off:
All architectures display increasing recall and decreasing precision as the enforced $L_0$ grows from 15 to 45, that is, as the code becomes less sparse. This mirrors findings in LLM neurons: sparser codes select fewer, more precise features, while denser codes increase coverage at the cost of false positives.
- Matching Pursuit Overfitting:
MP-SAEs improve reconstruction as the hidden dimension decreases (superposition increases), but their MCC and F1 degrade further. In other words, more flexible encoders can overfit by exploiting spurious superposition correlations rather than learning the actual dictionary atoms.
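The precision and recall trade-off can be illustrated by scoring a single latent as a binary detector for one ground-truth feature. The sketch below is a hypothetical toy example, not benchmark data: a latent counts as "firing" whenever its activation is positive, and the two hand-written latents stand in for a sparse and a dense code.

```python
import numpy as np

def detection_scores(latent_acts, feature_fires):
    """Precision, recall, and F1 of a latent treated as a feature detector."""
    pred = np.asarray(latent_acts) > 0
    truth = np.asarray(feature_fires).astype(bool)
    tp = np.sum(pred & truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# The true feature fires on the first three of eight samples.
truth = np.array([1, 1, 1, 0, 0, 0, 0, 0])
# A sparse latent misses one true firing but raises no false alarms.
sparse_latent = np.array([0.9, 1.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
# A dense latent catches every firing but also fires on two other samples.
dense_latent = np.array([0.9, 1.1, 0.7, 0.4, 0.2, 0.0, 0.0, 0.0])

sparse_scores = detection_scores(sparse_latent, truth)
dense_scores = detection_scores(dense_latent, truth)
```

Here the sparse latent has perfect precision with recall 2/3, while the dense latent has perfect recall with precision 3/5, which is the same direction of trade-off described above.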
5. Comparison to Supervised Probing
SynthSAEBench-16k enables direct comparison to supervised probes. Logistic regression classifiers trained on the activations with ground-truth feature labels achieve mean F1 scores around 0.974 and AUC approximately 0.9999 (on the 4096 most frequent features), substantially outperforming all SAE variants. This reaffirms that, even in idealized settings with linear ground-truth codes, unsupervised sparse autoencoding is at a consistent disadvantage compared to supervised probing for feature recovery.
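A supervised probe of this kind can be sketched with plain NumPy gradient descent. Everything here is an assumption for illustration: the toy data uses eight orthonormal features with no superposition, the probe targets a single feature, and `train_logistic_probe` is a hypothetical helper rather than the benchmark's probing code.

```python
import numpy as np

def train_logistic_probe(x, y, lr=1.0, steps=2000):
    """Fit a logistic regression probe with full-batch gradient descent."""
    n, dim = x.shape
    w, b = np.zeros(dim), 0.0
    for _ in range(steps):
        logits = np.clip(x @ w + b, -30.0, 30.0)
        p = 1.0 / (1.0 + np.exp(-logits))
        grad = p - y                      # gradient of the log loss w.r.t. the logits
        w -= lr * (x.T @ grad) / n
        b -= lr * grad.mean()
    return w, b

def f1_score(pred, truth):
    tp = np.sum(pred & truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

rng = np.random.default_rng(0)
dim = 8
# Orthonormal toy dictionary (rows of an orthogonal matrix).
features, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
firing = rng.random((4000, dim)) < 0.3
coeffs = firing * np.abs(rng.normal(1.0, 0.1, size=firing.shape))
x = coeffs @ features
y = firing[:, 0].astype(float)            # label: does feature 0 fire?

# Train on the first half, evaluate on the held-out second half.
w, b = train_logistic_probe(x[:2000], y[:2000])
pred = (x[2000:] @ w + b) > 0
probe_f1 = f1_score(pred, y[2000:] > 0.5)
```

The probe sees the ground-truth label during training, which is exactly the advantage an unsupervised SAE lacks; on harder data with superposition and correlated features the gap to perfect F1 would be larger than in this toy setting.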
6. Significance and Research Applications
SynthSAEBench-16k sets a new standard for diagnosing SAE architectures at realistic scale, offering:
- Ground-truth alignment for quantifying overfitting, superposition, and latent interpretability phenomena
- Controlled ablations for probing correlation, hierarchy, and Zipfian feature frequency effects
- Benchmarking under precisely regulated levels of superposition and feature mixing, facilitating objective architectural comparisons before scaling to intractable LLM representations
SynthSAEBench-16k is directly positioned as a complement to open-vocabulary LLM neuron studies, enabling low-noise hypothesis testing, reproducible failure mode diagnosis, and validation of architectural improvements using synthetic data before transfer to LLMs (Chanin et al., 16 Feb 2026).