The Geometry of Concepts: Sparse Autoencoder Feature Structure (2410.19750v2)
Abstract: Sparse autoencoders have recently produced dictionaries of high-dimensional vectors corresponding to the universe of concepts represented by LLMs. We find that this concept universe has interesting structure at three levels: 1) The "atomic" small-scale structure contains "crystals" whose faces are parallelograms or trapezoids, generalizing well-known examples such as (man-woman-king-queen). We find that the quality of such parallelograms, and of the associated function vectors, improves greatly when global distractor directions such as word length are projected out, which linear discriminant analysis does efficiently. 2) The "brain" intermediate-scale structure shows significant spatial modularity; for example, math and code features form a "lobe" akin to the functional lobes seen in fMRI images of brains. We quantify the spatial locality of these lobes with multiple metrics and find that, at a coarse enough scale, clusters of co-occurring features also cluster together spatially, far more than one would expect if feature geometry were random. 3) The large-scale "galaxy" structure of the feature point cloud is not isotropic, but instead exhibits a power law of eigenvalues whose slope is steepest in middle layers. We also quantify how the clustering entropy depends on the layer.
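The abstract's first claim is concrete enough to sketch in code. Below is a minimal Python illustration of a parallelogram test and the LDA-based distractor projection the abstract describes; the decoder vectors, distractor label, and dimensions are random stand-ins for illustration, not values or code from the paper.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
d = 64  # assumed embedding dimension

# Stand-in decoder vectors for a candidate "crystal" (real ones would come
# from a trained SAE's decoder matrix).
man, woman, king, queen = rng.normal(size=(4, d))

def parallelogram_error(a, b, c, e):
    """Relative mismatch between the difference vectors (a - b) and (c - e).

    Zero means the four points form an exact parallelogram."""
    gap = (a - b) - (c - e)
    return np.linalg.norm(gap) / max(np.linalg.norm(a - b), np.linalg.norm(c - e))

# Fit LDA to predict a distractor attribute (e.g., a word-length bucket) from
# the vectors, then remove the learned discriminant axis from each vector.
X = rng.normal(size=(500, d))        # stand-in feature vectors
y = (X[:, 0] > 0).astype(int)        # stand-in binary distractor label
lda = LinearDiscriminantAnalysis().fit(X, y)
axis = lda.coef_[0] / np.linalg.norm(lda.coef_[0])

def project_out(v, axis):
    """Remove the component of v along a unit vector `axis`."""
    return v - (v @ axis) * axis

cleaned = [project_out(v, axis) for v in (man, woman, king, queen)]
print("raw error:    ", parallelogram_error(man, woman, king, queen))
print("cleaned error:", parallelogram_error(*cleaned))
```

On real SAE decoder vectors, a drop in the cleaned error relative to the raw error would correspond to the improvement the abstract reports.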
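For the intermediate "brain" scale, the abstract does not name its locality metrics here; one plausible choice is to cluster features by decoder geometry and, separately, by co-occurrence, then compare the two partitions with adjusted mutual information. The sketch below assumes that approach and again uses random stand-in data, so the printed score is only a chance baseline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
n_features, d_model, n_docs = 2000, 64, 500

decoder = rng.normal(size=(n_features, d_model))  # stand-in decoder directions
firing = rng.random(size=(n_features, n_docs))    # stand-in per-document firing rates

# Cluster features two independent ways: by where their decoder vectors
# point (spatial geometry) and by which documents they fire on (function).
spatial = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(normalize(decoder))
functional = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(firing)

# AMI is ~0 when the two clusterings are unrelated (as here, with random
# stand-ins); genuinely "lobed" geometry should score well above chance.
print("adjusted mutual information:", adjusted_mutual_info_score(functional, spatial))
```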
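For the "galaxy" scale, a natural reading of "a power law of eigenvalues" is a log-log-linear decay in the ranked eigenvalues of the point cloud's covariance. The sketch below fits that slope; the data, dimensions, and fitting window are assumptions for illustration, not the paper's procedure.

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(5000, 256))  # stand-in: one decoder vector per row

# Eigenvalues of the point cloud's covariance, largest first.
cov = np.cov(points, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(cov))[::-1]

# A power law appears as a straight line in log-log coordinates; fit its
# slope over the bulk of the spectrum (extreme ranks are noisy).
ranks = np.arange(1, eigvals.size + 1)
keep = slice(1, 200)  # arbitrary bulk window for this sketch
slope, _ = np.polyfit(np.log(ranks[keep]), np.log(eigvals[keep]), 1)
print(f"fitted log-log slope: {slope:.3f}")  # near 0 for isotropic noise
```

A steeper (more negative) fitted slope indicates a more anisotropic point cloud; the abstract reports this steepness peaking in middle layers.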