What Causes Polysemanticity? An Alternative Origin Story of Mixed Selectivity from Incidental Causes (2312.03096v3)

Published 5 Dec 2023 in cs.LG, cs.AI, and cs.NE

Abstract: Polysemantic neurons -- neurons that activate for a set of unrelated features -- have been seen as a significant obstacle towards interpretability of task-optimized deep networks, with implications for AI safety. The classic origin story of polysemanticity is that the data contains more "features" than neurons, such that learning to perform a task forces the network to co-allocate multiple unrelated features to the same neuron, endangering our ability to understand networks' internal processing. In this work, we present a second and non-mutually exclusive origin story of polysemanticity. We show that polysemanticity can arise incidentally, even when there are ample neurons to represent all features in the data, a phenomenon we term "incidental polysemanticity". Using a combination of theory and experiments, we show that incidental polysemanticity can arise due to multiple reasons including regularization and neural noise; this incidental polysemanticity occurs because random initialization can, by chance alone, initially assign multiple features to the same neuron, and the training dynamics then strengthen such overlap. Our paper concludes by calling for further research quantifying the performance-polysemanticity tradeoff in task-optimized deep neural networks to better understand to what extent polysemanticity is avoidable.


Summary

  • The paper demonstrates that polysemanticity can arise incidentally from $\ell_1$ regularization and noise, even in networks with ample neurons.
  • It combines theoretical proofs with numerical simulations to show how sparsity in activations leads to overlapping feature representations.
  • The findings have practical implications for improving interpretability methods and enhancing AI safety by addressing unexpected neuron behaviors.

Understanding Incidental Polysemanticity in Deep Networks

This essay provides an expert-level overview of the paper titled "What Causes Polysemanticity? An Alternative Origin Story of Mixed Selectivity from Incidental Causes" by Victor Lecomte et al. The paper explores the origins of polysemantic neurons in task-optimized deep neural networks and offers an alternative to the widely accepted hypothesis that polysemanticity arises solely from constraints on neuron count.

Main Contributions

The authors propose that polysemanticity can also emerge incidentally due to non-task-specific factors, even when the network has a sufficient number of neurons to represent all features individually. They term this phenomenon "incidental polysemanticity" and identify regularization and neural noise as two factors that can induce this effect.

Theoretical and Empirical Analysis

The paper's primary contributions involve the theoretical exploration and empirical validation of incidental polysemanticity. The authors use a combination of mathematical proofs and computational experiments to demonstrate how incidental polysemanticity arises under various conditions.

Regularization-Induced Polysemanticity

Theoretical models show that one cause of incidental polysemanticity is $\ell_1$ regularization. The paper examines a simplified nonlinear autoencoder trained with an $\ell_1$ penalty on its hidden activations. The penalty encourages sparsity and induces a "winner-take-all" dynamic in which each feature is pushed toward being represented by a single neuron. Because random initialization can, by chance alone, give two features their strongest overlap with the same neuron, that neuron can end up "winning" both, producing a polysemantic neuron.
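
To make the setup concrete, here is a minimal sketch of such a toy autoencoder in PyTorch: a tied-weight, one-hidden-layer autoencoder on one-hot features with an $\ell_1$ penalty on the hidden activations, followed by a crude polysemanticity count. The layer sizes, learning rate, penalty strength, and threshold below are illustrative assumptions, not values taken from the paper.

```python
import torch

n_features, n_neurons = 64, 256      # "ample neurons": m > n (illustrative sizes)
W = torch.nn.Parameter(0.1 * torch.randn(n_features, n_neurons))
opt = torch.optim.SGD([W], lr=0.1)
lam = 3e-3                           # l1 strength; an assumed value, not the paper's

X = torch.eye(n_features)            # each input is a single one-hot feature

for _ in range(5000):
    H = torch.relu(X @ W)            # hidden code, shape (n_features, n_neurons)
    X_hat = H @ W.T                  # tied-weight decoder
    loss = ((X_hat - X) ** 2).mean() + lam * H.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Crude polysemanticity count: a neuron is flagged if its weight column has a
# large entry for two or more features (the 0.3 threshold is arbitrary).
strong_per_neuron = (W.detach().abs() > 0.3).sum(dim=0)
print("polysemantic neurons:", int((strong_per_neuron >= 2).sum()))
```

With $m > n$ there are enough neurons for every feature to have its own, so any neuron flagged here as "strong" for two or more features corresponds to the incidental overlap described above rather than a capacity constraint.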

The authors compute the expected number of polysemantic neurons analytically, showing that the incidence of polysemanticity can be significant even when many neurons are available, as long as regularization is non-negligible.
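
A back-of-the-envelope version of this counting argument (a heuristic consistent with the abstract's description and with the $n^2/m$ scaling reported in the simulations below, not the paper's formal statement): if each of the $n$ features is initially "claimed" by whichever of the $m$ neurons it overlaps with most at random initialization, and the winner-take-all dynamics preserve those assignments, then

$$\Pr[\text{two given features claim the same neuron}] \approx \frac{1}{m}, \qquad \mathbb{E}[\#\text{collisions}] \approx \binom{n}{2}\cdot\frac{1}{m} = \Theta\!\left(\frac{n^{2}}{m}\right).$$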

Noise-Induced Polysemanticity

Another source of incidental polysemanticity is noise in the hidden layer. The paper examines how different noise distributions affect the learned representations: noise with negative excess kurtosis (for example, bipolar $\pm\sigma$ noise, whose excess kurtosis is $-2$) encourages sparsity in the hidden representations and thereby promotes polysemanticity. The analysis shows that under such noise, neurons that initially represent disjoint sets of features can come to overlap in their activation patterns, producing polysemantic neurons.
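
For intuition about the kurtosis condition, the snippet below (an illustration, not code from the paper) estimates the excess kurtosis of three common zero-mean noise distributions. Only distributions with negative excess kurtosis, such as uniform and especially bipolar noise, satisfy the sparsity-promoting condition; Gaussian noise sits at exactly zero.

```python
import numpy as np

def excess_kurtosis(x: np.ndarray) -> float:
    """Sample excess kurtosis: E[(x - mean)^4] / Var(x)^2 - 3."""
    x = x - x.mean()
    return float((x ** 4).mean() / (x ** 2).mean() ** 2 - 3.0)

rng = np.random.default_rng(0)
n, sigma = 1_000_000, 1.0

samples = {
    "gaussian": rng.normal(0.0, sigma, n),               # excess kurtosis ~ 0
    "uniform":  rng.uniform(-sigma, sigma, n),            # ~ -1.2
    "bipolar":  sigma * rng.choice([-1.0, 1.0], size=n),  # exactly -2 in the limit
}

for name, x in samples.items():
    print(f"{name:8s} excess kurtosis ~ {excess_kurtosis(x):+.2f}")
```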

Numerical Simulations

The empirical results in the paper confirm the theoretical predictions. For example, autoencoders trained with both $\ell_1$ regularization and bipolar noise display a number of polysemantic neurons on the order of $n^2/m$, where $n$ is the number of features and $m$ is the number of neurons.

These findings are consistent across different initializations and configurations, establishing the robustness of the claims. The experiments highlight that increasing the regularization strength $\lambda$ or the noise standard deviation $\sigma$ accelerates the sparsification process, thereby raising the probability of incidental polysemanticity.
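
As a companion to the $\ell_1$ sketch above, the following variant (hyperparameters are again illustrative assumptions, not the paper's values) injects bipolar noise into the hidden code during training instead of applying the penalty, and reports the fraction of neurons that end up strongly tied to two or more features; the paper's experiments also study the two mechanisms in combination.

```python
import torch

n_features, n_neurons, sigma = 64, 256, 0.1   # illustrative sizes and noise scale
W = torch.nn.Parameter(0.1 * torch.randn(n_features, n_neurons))
opt = torch.optim.SGD([W], lr=0.1)
X = torch.eye(n_features)

for _ in range(5000):
    H = torch.relu(X @ W)
    # Bipolar noise: each hidden unit is perturbed by +sigma or -sigma.
    noise = sigma * (2.0 * torch.randint(0, 2, H.shape).float() - 1.0)
    X_hat = (H + noise) @ W.T                 # decode from the noisy code
    loss = ((X_hat - X) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Same crude polysemanticity measure as before, reported as a fraction.
strong_per_neuron = (W.detach().abs() > 0.3).sum(dim=0)
print("fraction polysemantic:", float((strong_per_neuron >= 2).float().mean()))
```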

Implications and Future Work

The paper challenges the traditional view that polysemanticity is primarily a result of network capacity limitations, suggesting that training dynamics and incidental factors play a crucial role. This has practical and theoretical implications for mechanistic interpretability and AI safety:

  1. Mechanistic Interpretability: The research underscores the need for interpretability methods that account for incidental polysemanticity. Existing techniques might overlook the contributions of regularization and noise, leading to incomplete or misleading interpretations.
  2. AI Safety: Understanding the roots of polysemanticity aids in designing safer AI systems by helping researchers predict and control the conditions under which complex and potentially hazardous behaviors might arise.

Future Directions

The authors propose several interesting directions for future research:

  1. Performance-Polysemanticity Tradeoff: Quantifying the tradeoff between model performance and the degree of polysemanticity could shed light on whether incidental polysemanticity can be mitigated without significantly compromising task performance.
  2. Distinguishing Polysemanticity Types: Developing techniques to differentiate between necessary and incidental polysemanticity could improve our understanding of neuron activation patterns and enhance model diagnostics.
  3. Intervention Strategies: Investigating interventions during training—such as neuron duplication and minor weight perturbations—could provide practical methods for reducing incidental polysemanticity.
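
As a purely hypothetical illustration of what the third direction could look like in the tied-weight toy model sketched earlier (this is not a procedure from the paper): a polysemantic neuron's weight column can be split across an unused neuron, with a small random perturbation so that subsequent training keeps the two features apart.

```python
import torch

def split_polysemantic_neuron(W: torch.Tensor, k: int, spare: int,
                              threshold: float = 0.3, eps: float = 1e-2) -> torch.Tensor:
    """Hypothetical intervention for the tied-weight toy model above.

    W has shape (n_features, n_neurons). All but the strongest feature
    component of neuron `k` is moved onto an unused neuron `spare`, and both
    columns receive a small random perturbation to break symmetry.
    """
    W = W.detach().clone()
    col = W[:, k]
    strong = col.abs() > threshold
    if strong.sum() < 2:
        return W                        # neuron k is not polysemantic; do nothing
    keep = col.abs().argmax()           # the feature neuron k keeps
    move = strong.clone()
    move[keep] = False
    W[:, spare] = torch.where(move, col, torch.zeros_like(col))
    W[move, k] = 0.0
    W[:, [k, spare]] += eps * torch.randn(W.shape[0], 2)
    return W
```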

In conclusion, "What Causes Polysemanticity?" by Victor Lecomte et al. provides a novel and detailed examination of polysemanticity in deep neural networks, attributing it not only to architectural constraints but also to incidental factors inherent in the training process. This work opens new avenues for both theoretical explorations and practical applications in understanding and enhancing the interpretability of deep models.