The Geometry of Concepts: Sparse Autoencoder Feature Structure (2410.19750v2)

Published 10 Oct 2024 in q-bio.NC, cs.AI, and cs.LG

Abstract: Sparse autoencoders have recently produced dictionaries of high-dimensional vectors corresponding to the universe of concepts represented by LLMs. We find that this concept universe has interesting structure at three levels: 1) The "atomic" small-scale structure contains "crystals" whose faces are parallelograms or trapezoids, generalizing well-known examples such as (man-woman-king-queen). We find that the quality of such parallelograms and associated function vectors improves greatly when projecting out global distractor directions such as word length, which is efficiently done with linear discriminant analysis. 2) The "brain" intermediate-scale structure has significant spatial modularity; for example, math and code features form a "lobe" akin to functional lobes seen in neural fMRI images. We quantify the spatial locality of these lobes with multiple metrics and find that clusters of co-occurring features, at coarse enough scale, also cluster together spatially far more than one would expect if feature geometry were random. 3) The "galaxy" scale large-scale structure of the feature point cloud is not isotropic, but instead has a power law of eigenvalues with steepest slope in middle layers. We also quantify how the clustering entropy depends on the layer.

Summary

  • The paper demonstrates that sparse autoencoders reveal multi-scale geometric structures—atomic, brain, and galaxy scales—in large-scale language models.
  • It employs techniques like LDA for distractor removal and geometric clustering to enhance interpretability and functional localization of learned features.
  • The study challenges isotropic assumptions by uncovering a power-law eigenvalue distribution, offering a framework to understand hierarchical concept representations.

Sparse Autoencoder Feature Structure in LLMs

The paper "The Geometry of Concepts: Sparse Autoencoder Feature Structure" investigates the latent geometric structure of concepts within large-scale LLMs, analyzed through sparse autoencoders (SAEs). Sparse autoencoders have gained attention for revealing interpretable features within neural network models and provide a window into understanding the activation space as a higher-dimensional geometric structure. This research identifies three distinctive spatial scales within this feature space: atomic, brain, and galaxy scales, each corresponding to levels of complexity in interpretability and functional localization.

At the atomic scale, the paper finds small geometric structures, referred to as "crystals", within SAE feature space. Their faces are parallelograms or trapezoids that encode semantic relations, generalizing well-known word-embedding analogies such as (man:woman::king:queen). The paper shows that the quality of such parallelograms, and of the associated function vectors, improves markedly once semantically irrelevant global "distractor" directions such as word length are projected out, which is done efficiently with linear discriminant analysis (LDA). The discovery and manipulation of these crystals suggest a general framework for detecting concept analogies within the learned representations of LLMs.

The intermediate, or brain, scale concerns modular structure among learned features with related function, paralleling the functional organization of biological brains. The research identifies clusters of features that tend to co-activate in similar contexts and shows that these clusters are also geometrically close in feature space, forming what the authors call functional lobes. Using co-occurrence measures such as the phi coefficient alongside geometric clustering techniques, the authors quantify this spatial modularity on document-level data, distinguishing, for instance, a lobe of code and math features from features active on narrative text.

At the galaxy scale, the point cloud formed by the full set of SAE features is anisotropic, contrary to what an isotropic Gaussian arrangement would predict. The eigenvalues of the point cloud's covariance matrix follow a power law, with the steepest slope in the model's middle layers. The existence of such a power law indicates a non-random, organized arrangement of features, with the middle layers acting as a representational bottleneck, consistent with theories of hierarchical representation learning.

These empirical findings offer insight into how complex concepts are structured and stored within neural networks and their sparse autoencoder dictionaries. The paper challenges the assumption of an isotropic feature distribution and provides evidence that spatial modularity matters for how learned concepts are stored and retrieved. As researchers continue to probe large-scale LLMs, these results provide a foundation for future inquiry into the computational mechanisms underlying human-like linguistic comprehension and abstraction in artificial systems. Promising directions include refining techniques for distractor removal and extending the analysis to other architectures and domains, with the aim of building models that are more interpretable and more effective in practice.
