Measuring Feature Sparsity in Language Models (2310.07837v2)
Abstract: Recent works have proposed that activations in LLMs can be modelled as sparse linear combinations of vectors corresponding to features of the input text. Under this assumption, these works aimed to reconstruct the feature directions using sparse coding. We develop metrics to assess the success of these sparse coding techniques and to test the validity of the linearity and sparsity assumptions. We show that our metrics can predict the level of sparsity on synthetic sparse linear activations, and can distinguish sparse linear data from several other distributions. We use our metrics to measure the level of sparsity in several LLMs. We find evidence that LLM activations can be accurately modelled by sparse linear combinations of features, significantly more so than control datasets. We also show that model activations appear to be sparsest in the first and final layers.
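The setup the abstract describes can be made concrete with a small sketch: generate synthetic activations as sparse linear combinations of ground-truth feature directions, then recover sparse codes with a standard sparse coding solver. The sketch below is illustrative only, not the paper's method or metrics; all sizes (`d`, `n_feats`, `k`) and the choice of ISTA (iterative soft-thresholding) as the solver are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_feats, n_samples, k = 64, 256, 500, 5  # hypothetical sizes

# Ground-truth overcomplete dictionary of unit-norm feature directions.
D = rng.normal(size=(n_feats, d))
D /= np.linalg.norm(D, axis=1, keepdims=True)

# Each synthetic "activation" is a sparse linear combination of k features.
codes = np.zeros((n_samples, n_feats))
for i in range(n_samples):
    active = rng.choice(n_feats, size=k, replace=False)
    codes[i, active] = rng.uniform(0.5, 1.5, size=k)
X = codes @ D

# Recover sparse codes with ISTA, minimising
#   0.5 * ||X - Z @ D||^2 + lam * ||Z||_1   over Z.
lam = 0.05
eta = 1.0 / np.linalg.norm(D @ D.T, 2)  # step size from the Lipschitz constant
Z = np.zeros_like(codes)
for _ in range(300):
    Z = Z - eta * (Z @ D - X) @ D.T                           # gradient step
    Z = np.sign(Z) * np.maximum(np.abs(Z) - eta * lam, 0.0)   # soft threshold

rel_err = np.linalg.norm(Z @ D - X) / np.linalg.norm(X)
print(f"true avg active features per sample: {(codes != 0).sum(axis=1).mean()}")
print(f"relative reconstruction error: {rel_err:.3f}")
```

In the noise-free case a small L1 penalty suffices; on real LLM activations the dictionary itself is unknown and must be learned jointly (e.g. via dictionary learning or sparse autoencoders), which is the regime the paper's metrics are designed to evaluate.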