- The paper argues that, at the loss optimum, neural networks allocate feature capacity so that the most important features are represented monosemantically, less important ones polysemantically, and the least important ones are ignored.
- The authors develop a quadratic toy model and numerical phase diagrams to reveal how shifts in sparsity influence the transition between ignored, polysemantic, and monosemantic features.
- Empirical results validate the theoretical predictions, offering practical insights for designing more interpretable and efficient neural network models.
Polysemanticity and Capacity in Neural Networks: An Insightful Overview
The paper "Polysemanticity and Capacity in Neural Networks" by Scherlis et al. explores understanding how neural networks allocate capacity to features and the implications for polysemanticity, a phenomenon where individual neurons represent multiple unrelated input features.
Key Concepts and Hypotheses
The authors propose analyzing polysemanticity through the lens of feature capacity, defined as the fractional dimension each feature consumes in the embedding space. They hypothesize that the optimal allocation of capacity represents the most important features monosemantically, represents less important ones polysemantically, and ignores the least important features. They further posit that polysemanticity becomes more prevalent as input kurtosis or sparsity increases, and that its extent depends on the architecture.
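To make the capacity notion concrete, here is a minimal NumPy sketch of a per-feature capacity computation, assuming the metric C_i = ||W_i||^4 / sum_j (W_i · W_j)^2, where W_i is feature i's embedding vector; the helper name `feature_capacities` is ours, not the paper's. Under this definition a feature whose embedding is orthogonal to all others gets capacity 1 (monosemantic), a shared direction splits capacity, and capacities sum to at most the embedding dimension.

```python
import numpy as np

def feature_capacities(W: np.ndarray) -> np.ndarray:
    """Per-feature capacity for an embedding matrix W of shape (n_features, d_embed).

    Assumes the definition C_i = ||W_i||^4 / sum_j (W_i . W_j)^2, which is 1 when
    feature i's embedding is orthogonal to all others and shrinks as the feature
    shares directions with other features (or is dropped entirely).
    """
    gram = W @ W.T                      # pairwise dot products (W_i . W_j)
    norms4 = np.diag(gram) ** 2         # ||W_i||^4
    denom = (gram ** 2).sum(axis=1)     # sum_j (W_i . W_j)^2
    return np.divide(norms4, denom, out=np.zeros_like(norms4), where=denom > 0)

# Example: 3 features embedded in 2 dimensions.
W = np.array([[1.0, 0.0],    # shares a direction with feature 2 -> C = 0.5
              [0.0, 1.0],    # orthogonal to everything else       -> C = 1.0
              [1.0, 0.0]])   # shares a direction with feature 0   -> C = 0.5
print(feature_capacities(W))  # ~[0.5, 1.0, 0.5]; capacities sum to d_embed = 2
```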
Analytical Toy Model
In their analytical exploration, the authors develop a quadratic toy model in which capacity allocation is treated as a constrained optimization problem. They show that the loss-minimizing allocation balances the marginal benefit of granting each feature additional capacity against the shared budget set by the embedding dimension. Key results from this model (illustrated in the optimization sketch after the list) include:
- Features are either ignored, monosemantically represented, or polysemantically represented based on their importance.
- When features are represented polysemantically, capacity is allocated so that the marginal loss reduction per unit of additional capacity is equal across those features.
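The sketch below illustrates the structure of this constrained problem: minimize a sum of per-feature losses over capacities C_i, subject to 0 ≤ C_i ≤ 1 and sum_i C_i ≤ d. The per-feature loss used here is an assumption chosen only so that all three regimes appear; the paper derives its own loss curve from the quadratic model.

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative per-feature loss, decreasing in capacity with diminishing returns.
# (Assumed form for illustration; not the paper's exact quadratic-model loss.)
importances = np.array([4.0, 2.0, 1.0, 0.5, 0.1])
d_embed = 2.5  # total capacity budget: sum_i C_i <= embedding dimension

def total_loss(C):
    return np.sum(importances * ((1 - C) + 0.5 * (1 - C) ** 2))

n = len(importances)
res = minimize(
    total_loss,
    x0=np.full(n, d_embed / n),
    bounds=[(0.0, 1.0)] * n,  # each feature's capacity lies in [0, 1]
    constraints=[{"type": "ineq", "fun": lambda C: d_embed - C.sum()}],
)
print(np.round(res.x, 3))  # ~[1.0, 1.0, 0.5, 0.0, 0.0]
# The two most important features saturate at C = 1 (monosemantic), the two
# least important are driven to C = 0 (ignored), and the middle feature ends
# up polysemantic, with its marginal loss reduction equal to the shared
# shadow price of capacity.
```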
Numerical Phase Diagrams
The paper constructs phase diagrams from numerical experiments using the quadratic model as well as other nonlinearities such as ReLU and GeLU. The phase diagrams categorize features as ignored, polysemantically represented, or monosemantically represented based on their importance and sparsity (a sketch of this kind of sweep follows the list). Significant findings include:
- At high sparsity, features smoothly transition from ignored to polysemantic to monosemantic representation.
- At low sparsity, features sharply transition from ignored to monosemantic representation without an intermediate polysemantic phase.
- Empirical results align well with theoretical predictions, validating the phase diagram model.
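The following PyTorch sketch shows the kind of importance/sparsity sweep that could produce such a phase diagram. The toy model form y = ReLU(WᵀWx + b), the data distribution, the importance weighting, and the classification thresholds are all our assumptions, not necessarily the paper's exact setup; it reuses the `feature_capacities` helper from the earlier sketch.

```python
import torch

def train_toy_model(importances, sparsity, d_embed=2, n_feat=5,
                    steps=3000, batch=1024, lr=1e-2, seed=0):
    """Train a tiny ReLU model y = ReLU(W^T W x + b) with importance-weighted
    reconstruction loss and return the learned embedding (n_feat, d_embed)."""
    torch.manual_seed(seed)
    W = torch.randn(d_embed, n_feat, requires_grad=True)
    b = torch.zeros(n_feat, requires_grad=True)
    opt = torch.optim.Adam([W, b], lr=lr)
    imp = torch.tensor(importances)
    for _ in range(steps):
        # Sparse inputs: each feature is active with probability (1 - sparsity).
        mask = (torch.rand(batch, n_feat) > sparsity).float()
        x = mask * torch.rand(batch, n_feat)
        y = torch.relu(x @ W.T @ W + b)
        loss = (imp * (x - y) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
    return W.detach().numpy().T

def classify(capacity, eps=0.05):
    if capacity < eps:
        return "ignored"
    if capacity > 1 - eps:
        return "monosemantic"
    return "polysemantic"

# Sweep one probe feature's importance at fixed sparsity and read off its phase;
# repeating this over a grid of (importance, sparsity) yields a phase diagram.
for imp in [0.02, 0.2, 1.0, 5.0]:
    W = train_toy_model([imp, 1.0, 1.0, 1.0, 1.0], sparsity=0.9)
    C = feature_capacities(W)  # capacity metric from the earlier sketch
    print(f"importance={imp:>4}: C_0={C[0]:.2f} -> {classify(C[0])}")
```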
Geometry of Embedding Matrices
The authors also study the geometry of efficient embedding matrices, i.e. those that fully utilize the available embedding dimensions. They establish that efficient matrices exhibit a block-semi-orthogonal structure, meaning features are distributed across orthogonal subspaces within the embedding space (a small numerical illustration follows the list). Key insights include:
- Large blocks in the embedding matrices offer significant flexibility in allocating capacity but render embeddings less interpretable due to higher polysemanticity.
- Small blocks constrain capacity allocation but allow individual embedding vectors more freedom in terms of length, potentially aiding interpretability.
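As a small numerical illustration of what such a structure can look like (an assumed example, again using the `feature_capacities` helper from above): a 1x1 block gives one feature its own axis with an arbitrary length, while a 3x2 block packs three unit vectors at 120 degrees into a shared 2-D subspace, so each of those features gets capacity 2/3.

```python
import numpy as np

# A block-semi-orthogonal embedding: feature groups occupy mutually orthogonal
# subspaces, and each block is semi-orthogonal up to scale.
lone = np.array([[0.7, 0.0, 0.0]])                   # 1x1 block: own axis, any length
trio = np.array([[0.0, 1.0, 0.0],                    # 3x2 block: unit vectors at
                 [0.0, -0.5,  np.sqrt(3) / 2],       # 120 degrees sharing a 2-D
                 [0.0, -0.5, -np.sqrt(3) / 2]])      # subspace
W = np.vstack([lone, trio])                          # (4 features, 3 dims)

block = trio[:, 1:]                                  # the 3x2 block
print(np.round(block.T @ block, 3))                  # proportional to I_2 (semi-orthogonal)
print(np.round(feature_capacities(W), 3))            # [1.0, 0.667, 0.667, 0.667]
# Capacities within the semi-orthogonal block are equal (block dim / block size),
# and block capacities sum to the full embedding dimension: 1 + 2 = 3.
```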
Practical and Theoretical Implications
The findings suggest practical strategies for managing and controlling polysemanticity. For example, one might adjust the activation functions or architecture to influence the size and structure of blocks in the embedding matrix, thereby controlling how features are represented. On the theoretical side, a more nuanced understanding of how neural networks distribute capacity could lead to more interpretable and efficient models.
Future Directions
Future research could further explore variations in model architectures and activation functions as levers for managing polysemanticity. Additionally, applying these insights to more complex, real-world models beyond toy examples would test the broader applicability of the findings.
Conclusion
This paper offers significant insights into the phenomenon of polysemanticity through a novel framework of capacity in neural networks. By rigorously analyzing both theoretical models and empirical results, the authors lay a foundation for more interpretable neural network designs and control over feature representation. This work marks a meaningful step towards understanding and potentially mitigating the complexity of neural network interpretation.