The Geometry of Concepts: Sparse Autoencoder Feature Structure (2410.19750v2)
Abstract: Sparse autoencoders have recently produced dictionaries of high-dimensional vectors corresponding to the universe of concepts represented by LLMs. We find that this concept universe has interesting structure at three levels: 1) The "atomic" small-scale structure contains "crystals" whose faces are parallelograms or trapezoids, generalizing well-known examples such as (man-woman-king-queen). We find that the quality of such parallelograms, and of the associated function vectors, improves greatly when global distractor directions such as word length are projected out, which linear discriminant analysis does efficiently. 2) The "brain" intermediate-scale structure shows significant spatial modularity; for example, math and code features form a "lobe" akin to the functional lobes seen in fMRI images of brains. We quantify the spatial locality of these lobes with multiple metrics and find that, at a coarse enough scale, clusters of co-occurring features also cluster together spatially, far more than one would expect if feature geometry were random. 3) The large-scale "galaxy" structure of the feature point cloud is not isotropic, but instead exhibits a power law of eigenvalues whose slope is steepest in middle layers. We also quantify how the clustering entropy depends on the layer.
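The abstract's first claim is concrete enough to sketch in code. Below is a minimal Python illustration of a parallelogram test and the LDA-based distractor projection the abstract describes; the decoder vectors, distractor label, and dimensions are random stand-ins for illustration, not values or code from the paper.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
d = 64  # assumed embedding dimension

# Stand-in decoder vectors for a candidate "crystal" (real ones would come
# from a trained SAE's decoder matrix).
man, woman, king, queen = rng.normal(size=(4, d))

def parallelogram_error(a, b, c, e):
    """Relative mismatch between the difference vectors (a - b) and (c - e).

    Zero means the four points form an exact parallelogram."""
    gap = (a - b) - (c - e)
    return np.linalg.norm(gap) / max(np.linalg.norm(a - b), np.linalg.norm(c - e))

# Fit LDA to predict a distractor attribute (e.g., a word-length bucket) from
# the vectors, then remove the learned discriminant axis from each vector.
X = rng.normal(size=(500, d))        # stand-in feature vectors
y = (X[:, 0] > 0).astype(int)        # stand-in binary distractor label
lda = LinearDiscriminantAnalysis().fit(X, y)
axis = lda.coef_[0] / np.linalg.norm(lda.coef_[0])

def project_out(v, axis):
    """Remove the component of v along a unit vector `axis`."""
    return v - (v @ axis) * axis

cleaned = [project_out(v, axis) for v in (man, woman, king, queen)]
print("raw error:    ", parallelogram_error(man, woman, king, queen))
print("cleaned error:", parallelogram_error(*cleaned))
```

On real SAE decoder vectors, a drop in the cleaned error relative to the raw error would correspond to the improvement the abstract reports.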
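For the intermediate "brain" scale, the abstract does not name its locality metrics here; one plausible choice is to cluster features by decoder geometry and, separately, by co-occurrence, then compare the two partitions with adjusted mutual information. The sketch below assumes that approach and again uses random stand-in data, so the printed score is only a chance baseline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
n_features, d_model, n_docs = 2000, 64, 500

decoder = rng.normal(size=(n_features, d_model))  # stand-in decoder directions
firing = rng.random(size=(n_features, n_docs))    # stand-in per-document firing rates

# Cluster features two independent ways: by where their decoder vectors
# point (spatial geometry) and by which documents they fire on (function).
spatial = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(normalize(decoder))
functional = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(firing)

# AMI is ~0 when the two clusterings are unrelated (as here, with random
# stand-ins); genuinely "lobed" geometry should score well above chance.
print("adjusted mutual information:", adjusted_mutual_info_score(functional, spatial))
```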
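For the "galaxy" scale, a natural reading of "a power law of eigenvalues" is a log-log-linear decay in the ranked eigenvalues of the point cloud's covariance. The sketch below fits that slope; the data, dimensions, and fitting window are assumptions for illustration, not the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(5000, 256))  # stand-in: one decoder vector per row

# Eigenvalues of the point cloud's covariance, largest first.
cov = np.cov(points, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]

# A power law appears as a straight line in log-log coordinates; fit its
# slope over the bulk of the spectrum (extreme ranks are noisy).
ranks = np.arange(1, eigvals.size + 1)
keep = slice(1, 200)  # arbitrary bulk window for this sketch
slope, _ = np.polyfit(np.log(ranks[keep]), np.log(eigvals[keep]), 1)
print(f"fitted log-log slope: {slope:.3f}")  # near 0 for isotropic noise
```

A steeper (more negative) fitted slope indicates a more anisotropic point cloud; the abstract reports this steepness peaking in middle layers.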