
Measuring Feature Sparsity in Language Models (2310.07837v2)

Published 11 Oct 2023 in cs.LG

Abstract: Recent works have proposed that activations in LLMs can be modelled as sparse linear combinations of vectors corresponding to features of input text. Under this assumption, these works aimed to reconstruct feature directions using sparse coding. We develop metrics to assess the success of these sparse coding techniques and test the validity of the linearity and sparsity assumptions. We show our metrics can predict the level of sparsity on synthetic sparse linear activations, and can distinguish between sparse linear data and several other distributions. We use our metrics to measure levels of sparsity in several LLMs. We find evidence that LLM activations can be accurately modelled by sparse linear combinations of features, significantly more so than control datasets. We also show that model activations appear to be sparsest in the first and final layers.


Summary

  • The paper introduces novel metrics, including normalized loss and average coefficient norm, to quantify feature sparsity in language model activations.
  • It applies these metrics to synthetic data and transformer models like BERT and GPT, revealing significant layer-wise sparsity variations.
  • The findings enhance model interpretability and offer practical insights for developing robust, explainable neural architectures.

An Analysis of Feature Sparsity in LLMs

The paper "Measuring Feature Sparsity in LLMs" by Deng, Tao, and Benton offers a quantitative exploration into the sparsity and linearity assumptions within transformer-based LLMs. The paper is grounded in the hypothesis that model activations can be represented as sparse linear combinations of input text features, providing a theoretical framework for interpreting these activations and suggesting a systematic method for assessing the validity of this assumption. By introducing robust metrics and applying them to various LLMs, the paper advances the methodology of understanding internal model representations.

The authors propose several novel metrics for measuring how well sparse coding techniques reconstruct feature directions. These include normalized loss and average coefficient norm, which are designed to reliably gauge the level of sparsity in neural network activations. The work specifically addresses two fundamental assumptions: the linear representation hypothesis, which stipulates that neural activations can be expressed linearly in terms of feature vectors, and the sparsity hypothesis, which asserts that only a small subset of features is active for any given input.
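To make the metrics concrete, here is a minimal sketch of how quantities of this kind could be computed with off-the-shelf dictionary learning; the specific normalizations, the use of scikit-learn's MiniBatchDictionaryLearning, and all hyperparameters are assumptions for illustration rather than the paper's implementation.

```python
# Illustrative sketch only: the paper does not publish this code, and its exact
# metric definitions may differ. Here "normalized loss" is taken to be the
# reconstruction MSE of a learned sparse dictionary divided by the MSE of a
# mean-only baseline, and "average coefficient norm" is the mean L1 norm of
# the sparse codes rescaled by activation magnitude.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning


def sparsity_metrics(X, n_components=512, alpha=0.1):
    """X: (n_samples, d_model) matrix of activation vectors. Hyperparameters are placeholders."""
    dl = MiniBatchDictionaryLearning(n_components=n_components, alpha=alpha)
    codes = dl.fit_transform(X)          # sparse coefficients, shape (n_samples, n_components)
    X_hat = codes @ dl.components_       # reconstruction from the learned dictionary atoms

    # Reconstruction error relative to predicting the mean activation:
    # 0 = perfect reconstruction, 1 = no better than the baseline.
    normalized_loss = np.mean((X - X_hat) ** 2) / np.mean((X - X.mean(axis=0)) ** 2)

    # Mean L1 norm of the codes, scaled by activation norm so values are
    # comparable across layers and models.
    avg_coeff_norm = np.mean(np.abs(codes).sum(axis=1) / np.linalg.norm(X, axis=1))

    return normalized_loss, avg_coeff_norm
```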

Empirically, the authors demonstrate the utility of their metrics by applying them to synthetic datasets with known sparsity levels. They show that the normalized loss and average coefficient norm metrics accurately predict the degree of sparsity in synthetic data, outperforming other candidate metrics such as the number of non-zero entries. The experiments further validate the metrics' ability to distinguish data generated by a sparse linear process from several other, non-sparse distributions. These findings underpin the subsequent analysis of real language models, suggesting that sparse coding can meaningfully decompose and interpret model activations.
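As a rough illustration of this validation setup, the sketch below generates activations as sparse linear combinations of random ground-truth feature directions with a known average number of active features; the dimensions, the Poisson choice of how many features fire, and the exponential coefficient distribution are assumptions, not the paper's exact generative process.

```python
import numpy as np


def synthetic_sparse_activations(n_samples=10_000, n_features=2048, d_model=256,
                                 avg_active=8, seed=0):
    """Generate activations that are, by construction, sparse linear combinations
    of random unit-norm feature directions (an overcomplete set: n_features > d_model)."""
    rng = np.random.default_rng(seed)
    F = rng.normal(size=(n_features, d_model))
    F /= np.linalg.norm(F, axis=1, keepdims=True)

    X = np.zeros((n_samples, d_model))
    for i in range(n_samples):
        k = max(1, rng.poisson(avg_active))           # how many features are active
        idx = rng.choice(n_features, size=k, replace=False)
        coeffs = rng.exponential(scale=1.0, size=k)   # positive coefficients (an assumption)
        X[i] = coeffs @ F[idx]
    return X


# Sweeping avg_active and checking that sparsity_metrics(X) varies monotonically
# with it (and separates X from a dense Gaussian control) mirrors the kind of
# validation described above.
```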

Applying their metrics to transformer models such as BERT and GPT variants, the authors characterize how feature sparsity is distributed across model layers. Activations appear sparsest in the first and final layers: the early-layer result is consistent with the view that initial layers capture fundamental linguistic features, while the late-layer rise may reflect a filtering process that concentrates key predictive features.
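For the language model experiments, activations must be collected layer by layer. A minimal sketch using the Hugging Face transformers API is below, with GPT-2, the toy input, and the token-level pooling all chosen for illustration rather than taken from the paper.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

texts = ["The quick brown fox jumps over the lazy dog."]  # placeholder corpus
batch = tok(texts, return_tensors="pt")

with torch.no_grad():
    out = model(**batch)

# out.hidden_states is a tuple: the embedding output followed by one tensor per
# transformer layer, each of shape (batch, seq_len, d_model). Treat every token
# position as one activation sample for the sparsity metrics sketched earlier.
per_layer = [h.reshape(-1, h.shape[-1]).numpy() for h in out.hidden_states]
# for layer_idx, X in enumerate(per_layer):
#     print(layer_idx, sparsity_metrics(X))
```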

The implications of this research span both theoretical and practical domains. Theoretically, it strengthens the case that neural representations are approximately linear and sparse, in line with observations from studies of individual neuron function. Practically, it opens avenues for improved model interpretability and debugging, which are critical for deploying models in sensitive real-world applications. The findings could guide the development of more interpretable architectures and motivate further work on automated feature extraction based on sparse representations.

Looking forward, these results prompt questions about integrating sparse coding techniques into model training, potentially enabling models to learn more disentangled and interpretable features by design. A better understanding of internal representations could also help align model behavior with desired or ethical outcomes, a key challenge as LLMs are deployed more widely. The methodology and metrics introduced in this paper provide a rigorous framework for future research on interpreting larger models and more complex, real-world datasets.
