Sparse Autoencoder Features for Classifications and Transferability (2502.11367v1)

Published 17 Feb 2025 in cs.LG, cs.AI, and cs.CL

Abstract: Sparse Autoencoders (SAEs) provide potentials for uncovering structured, human-interpretable representations in LLMs, making them a crucial tool for transparent and controllable AI systems. We systematically analyze SAE for interpretable feature extraction from LLMs in safety-critical classification tasks. Our framework evaluates (1) model-layer selection and scaling properties, (2) SAE architectural configurations, including width and pooling strategies, and (3) the effect of binarizing continuous SAE activations. SAE-derived features achieve macro F1 > 0.8, outperforming hidden-state and BoW baselines while demonstrating cross-model transfer from Gemma 2 2B to 9B-IT models. These features generalize in a zero-shot manner to cross-lingual toxicity detection and visual classification tasks. Our analysis highlights the significant impact of pooling strategies and binarization thresholds, showing that binarization offers an efficient alternative to traditional feature selection while maintaining or improving performance. These findings establish new best practices for SAE-based interpretability and enable scalable, transparent deployment of LLMs in real-world applications. Full repo: https://github.com/shan23chen/MOSAIC.

Summary

  • The paper introduces a method using Sparse Autoencoders on LLM hidden states to extract sparse, interpretable, token-level features for downstream classification tasks in safety-critical and transferability settings.
  • The best results come from middle-layer features pooled with a summation-then-binarization strategy, which consistently outperforms hidden-state probes and TF-IDF baselines while being more computationally efficient.
  • The study shows strong multilingual transfer with only a 15–20% F1 drop and promising text-to-vision generalization, demonstrating that features from smaller models can predict larger-model behavior (macro F1 > 0.8), enabling scalable oversight.

The paper presents a comprehensive study leveraging Sparse Autoencoders (SAEs) to extract interpretable, token-level features from LLMs for downstream classification tasks under safety-critical and transferability settings. The work systematically evaluates several critical hyperparameters and methodological choices that influence the extraction and utility of SAE-derived representations.

The paper begins by detailing a reproducible pipeline wherein hidden activations are extracted from selected layers in Gemma 2 models (of varying model scales, e.g., 2B, 9B, and 9B-IT). The SAE operates by up-projecting the dense residual stream representations into a higher-dimensional sparse space. This mechanism is designed to yield sparse and, ideally, monosemantic activations that can be used for interpreting model decisions. Key aspects of the pipeline include:

  • Layer and Model Scale Analysis:
    • Activations are extracted from early, middle, and late layers. The results indicate that middle-layer features tend to offer a superior balance by capturing both semantic and syntactic information, with larger models achieving mean macro F1 scores exceeding 0.85.
    • Multiple SAE widths are compared (e.g., 16K vs. 65K for the 2B model, and 16K vs. 131K for the 9B and 9B-IT models), and the paper notes a trade-off in feature discriminability with increasing width, particularly in non-binarized settings.
  • Pooling and Binarization Strategies:
    • Two primary pooling strategies are evaluated: summation of token-level features across the sequence versus selection of the top‑N activations per token.
    • Binarization of the pooled feature vector, applied via a threshold (activations above the threshold are set to 1, others to 0), not only reduces computational overhead but also serves as an implicit feature selection mechanism, acting like a hard thresholding non-linearity.
    • Empirical results reveal that binarized features from full, token-level activations consistently outperform both conventional hidden-state probes and bag-of-words (TF‑IDF) baselines. In certain cases, the token-level top‑N strategy shows marginal performance improvements at the cost of additional computation.
  • Multilingual and Multimodal Transfer:
    • The paper extends the evaluation to multilingual toxicity detection tasks. Experiments compare native training (training and testing on the same language) with cross-lingual transfer (training on English and testing on other languages), and the findings indicate that while native training achieves peak F1 scores—often above 0.99 for English—the SAE features maintain robust performance with only a 15–20% decrease when transferred to other languages.
    • In addition, a preliminary investigation into multimodal transfer involves applying SAEs trained on text inputs to a vision-LLM. In tasks such as CIFAR-100 classification, the extracted SAE features via summation and binarization have demonstrated promising results, suggesting a feasible cross-modal generalization.
  • Behavioral Prediction and Scalable Oversight:
    • A unique contribution is the application of SAE features for action prediction, where smaller models are used to predict the correctness of outputs generated by larger, instruction-tuned models in knowledge-intensive question answering tasks.
    • The experimental design uses logistic regression classifiers to predict binary correctness labels. Remarkably, even features extracted from smaller models (e.g., Gemma 2-2B) achieve macro F1 scores above 80% in predicting the behavior of the larger models, in some cases matching or outperforming the larger models’ own features.
    • This result underlines the potential for scalable oversight—in which less complex, interpretable models monitor the behavior of more capable systems—thereby enhancing safety and accountability.
  • Comparative Baselines and Feature Selection Methods:
    • In addition to SAE features, the paper benchmarks against hidden-state representations and a top‑N mean-difference baseline. SAE-based representations demonstrate superior peak performance, though the hidden-state proxy offers lower variance.
    • Extensive experiments across varying training data sampling rates confirm the robustness of SAE-based features—even under data-scarce conditions—while consistently outperforming simpler methods such as TF‑IDF and mean-difference feature selection.

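The summation-then-binarization pooling described above can be sketched as follows. This is an illustrative sketch only: the function name, default threshold, and toy activation matrix are invented for the example and do not come from the paper's repository.

```python
import numpy as np

def pool_and_binarize(token_acts: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    """Sum token-level SAE activations over the sequence, then binarize.

    token_acts: (seq_len, sae_width) array of sparse SAE activations.
    Returns a (sae_width,) binary feature vector.
    """
    pooled = token_acts.sum(axis=0)                   # summation pooling over tokens
    return (pooled > threshold).astype(np.float32)    # implicit feature selection

# Toy example: 4 tokens, SAE width 6 (values are made up)
acts = np.array([
    [0.0, 1.2, 0.0, 0.0, 0.3, 0.0],
    [0.0, 0.8, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.1, 0.0],
])
features = pool_and_binarize(acts)
print(features)  # features 0, 1, and 4 fire; the rest stay zero
```

Note that after binarization the feature vector records only *which* SAE features fired anywhere in the sequence, which is what makes it act as a cheap feature-selection step for the downstream classifier.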
Overall, the paper establishes that a summation-then-binarization strategy for SAE features not only enhances interpretability but also improves classification performance across diverse safety-centric, multilingual, and multimodal applications. The work provides clear guidelines for selecting model layers, configuring SAE widths, and choosing pooling methods, thereby offering valuable insights for researchers seeking transparent and scalable approaches for model introspection and oversight in critical application domains.
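As a rough illustration of the oversight setup, the sketch below trains a minimal logistic-regression probe on binarized feature vectors to predict binary correctness labels. All data, dimensions, and hyperparameters here are synthetic stand-ins, not the paper's actual Gemma-derived SAE features.

```python
import numpy as np

def train_logistic_probe(X, y, lr=0.1, epochs=500):
    """Tiny logistic-regression probe trained with full-batch gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # sigmoid predictions
        grad_w = X.T @ (p - y) / len(y)
        grad_b = (p - y).mean()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = np.random.default_rng(0)
# Hypothetical stand-in data: binarized SAE features from a small model
# (n_examples x sae_width) and binary correctness labels for a larger
# model's answers, generated here from a random linear rule.
X = rng.integers(0, 2, size=(400, 64)).astype(float)
w_true = rng.normal(size=64)
y = ((X - 0.5) @ w_true > 0).astype(float)

w, b = train_logistic_probe(X[:300], y[:300])
acc = (((X[300:] @ w + b) > 0) == (y[300:] > 0)).mean()
print(f"held-out accuracy: {acc:.2f}")
```

In the paper's setting the probe's inputs would be the binarized SAE features and the labels would be graded correctness of the larger model's answers; the point of the sketch is only that the oversight classifier itself is a simple, interpretable linear model.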
