
Using Degeneracy in the Loss Landscape for Mechanistic Interpretability (2405.10927v2)

Published 17 May 2024 in cs.LG

Abstract: Mechanistic Interpretability aims to reverse engineer the algorithms implemented by neural networks by studying their weights and activations. An obstacle to reverse engineering neural networks is that many of the parameters inside a network are not involved in the computation being implemented by the network. These degenerate parameters may obfuscate internal structure. Singular learning theory teaches us that neural network parameterizations are biased towards being more degenerate, and parameterizations with more degeneracy are likely to generalize further. We identify 3 ways that network parameters can be degenerate: linear dependence between activations in a layer; linear dependence between gradients passed back to a layer; ReLUs which fire on the same subset of datapoints. We also present a heuristic argument that modular networks are likely to be more degenerate, and we develop a metric for identifying modules in a network that is based on this argument. We propose that if we can represent a neural network in a way that is invariant to reparameterizations that exploit the degeneracies, then this representation is likely to be more interpretable, and we provide some evidence that such a representation is likely to have sparser interactions. We introduce the Interaction Basis, a tractable technique to obtain a representation that is invariant to degeneracies from linear dependence of activations or Jacobians.


Summary

  • The paper argues that lower Local Learning Coefficient (LLC) values indicate greater degeneracy, which is associated with better generalization and more broadly applicable algorithms.
  • The paper proposes a behavioral loss tailored to the network under study, together with a finite-data view of SLT, to account for neuron misalignment and noise in realistic training scenarios.
  • The paper introduces the Interaction Basis to diagonalize interactions between layers, encouraging clearer modularity and improved interpretability.

Understanding Mechanistic Interpretability and Degeneracy in Neural Networks

Introduction

Mechanistic Interpretability aims to demystify the inner workings of neural networks by understanding the algorithms they implement. Recent research highlights a persistent challenge in this field: neurons within these networks often respond to a variety of unrelated inputs, and the apparent circuits within the models typically lack clear, well-defined boundaries. One underlying factor contributing to this murkiness is the degeneracy of neural networks, meaning that various sets of parameters might achieve the same functionality. This paper investigates the implications of this degeneracy and proposes methods to mitigate its obstructive influence on interpretability.

Singular Learning Theory (SLT) and Effective Parameter Count

In neural networks, degeneracy means that many different parameter combinations yield the same loss. This is especially clear near the network's global minimum, where whole sets of parameters can lie within a broad, flat "basin" of the loss landscape. Singular Learning Theory (SLT) quantifies this degeneracy through the Local Learning Coefficient (LLC).

Key Points:

  • Loss Landscape & Degeneracy: Networks with lower LLC values are more degenerate, indicating these networks generalize better and implement more universal algorithms.
  • Behavioral Loss: To tackle issues of neuron misalignment in real networks, the authors propose the usage of behavioral loss tailored to the specific network under observation.
  • Finite Data SLT: By adjusting the focus from infinite to finite data, the paper elucidates how degeneracies account for noisy variations in practical training scenarios.
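
As background (a standard characterization from singular learning theory rather than a result specific to this paper, stated in generic notation), the LLC measures how the volume of near-optimal parameters scales around a minimum, and so acts as an effective parameter count:

```latex
% Volume-scaling characterization of the local learning coefficient (LLC).
% Standard SLT background; the notation is generic, not taken from the paper.
\[
  V(\epsilon) \;=\; \int_{B(w^*)} \mathbf{1}\!\left[\, L(w) - L(w^*) < \epsilon \,\right] dw
  \;\sim\; c\,\epsilon^{\lambda}
  \qquad (\epsilon \to 0,\ \text{up to log factors}),
\]
\[
  \mathbb{E}[F_n] \;\approx\; n\, L(w^*) \;+\; \lambda \log n ,
\]
% where F_n is the Bayesian free energy after n datapoints. For a
% non-degenerate quadratic minimum, lambda = d/2 (half the parameter count);
% degeneracy pushes lambda below d/2, which is why a lower LLC corresponds
% to a lower effective parameter count and, per SLT, better generalization.
```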

Internal Structures Contributing to Degeneracy

The researchers identify three internal network structures that produce reparameterization freedom and thereby lower the network's effective parameter count:

  1. Activation Vectors: If the activation vectors in a layer do not span the layer's full dimension, there is redundancy.
  2. Jacobians: Similarly, if the gradients passed back to a layer (its Jacobians) do not span the full space, there is reparameterization freedom.
  3. Synchronized Nonlinearities: Neurons that fire on exactly the same subset of datapoints add a further source of degeneracy, which is especially relevant for ReLU networks.

Examples:

  • Low Dimensional Activations: Linear dependencies within the activation space of hidden layers can result in free parameters that don't contribute to network functionality.
  • Synchronized Neurons: If neurons within a layer always fire on the same datapoints, they introduce further redundancy (see the sketch below).
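
Below is a minimal numpy sketch of how the first and third sources of degeneracy could be probed empirically for a single ReLU layer; the Jacobian case (item 2) would be handled analogously, with gradients in place of activations. The names (`degeneracy_diagnostics`, `acts`, `tol`) are illustrative, not taken from the paper.

```python
import numpy as np

def degeneracy_diagnostics(acts, tol=1e-6):
    """Illustrative checks for two sources of degeneracy discussed above.

    acts: (n_datapoints, n_neurons) pre-activation matrix for one hidden
    ReLU layer, collected over a dataset (hypothetical name and shape).
    """
    # 1. Low-dimensional activations: if the activations do not span the
    #    full layer dimension, the Gram matrix is rank-deficient.
    gram = acts.T @ acts / acts.shape[0]
    eigvals = np.linalg.eigvalsh(gram)
    effective_rank = int(np.sum(eigvals > tol * eigvals.max()))

    # 3. Synchronized ReLUs: neurons that are "on" (pre-activation > 0) for
    #    exactly the same datapoints share a firing pattern, which permits
    #    reparameterization within each group.
    firing = acts > 0                        # boolean firing pattern
    _, group_ids = np.unique(firing.T, axis=0, return_inverse=True)
    n_groups = int(group_ids.max()) + 1      # number of distinct patterns

    return {
        "layer_width": acts.shape[1],
        "effective_rank": effective_rank,
        "synchronized_groups": n_groups,
    }

# On random data we expect full rank and no synchronized neurons
# (synchronized_groups == layer_width).
rng = np.random.default_rng(0)
print(degeneracy_diagnostics(rng.normal(size=(1000, 32))))
```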

Interaction Sparsity from Parameterization Invariance

The authors propose that a representation invariant to the reparameterizations permitted by these degeneracies is likely to be more interpretable, with sparser interactions between components. Two kinds of structure can be exploited to achieve this:

  1. Low Dimensional Activations: Sparsify network interactions by identifying and ignoring directions with little or no variance (made precise in the note after this list).
  2. Synchronized Neurons: Further prune interactions by recognizing pairs or blocks of neurons with synchronized firing patterns.
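
To make point 1 concrete (standard linear algebra, with generic notation rather than the paper's): if a layer's activations always lie in an r-dimensional subspace, only r directions can carry interactions to the next layer.

```latex
% Suppose the activations a of a layer always lie in an r-dimensional
% subspace of R^d, so a = U z with U in R^{d x r} an orthonormal basis of
% that subspace. For the next layer's weight matrix W (d_out x d):
\[
  W a \;=\; (W U)\, z ,
\]
% so the d_out x d interaction matrix W can be replaced by the smaller
% d_out x r matrix WU without changing the outputs on the data distribution;
% the remaining d_out x (d - r) components are free, degenerate parameters.
```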

Modularity and Degeneracy

One intriguing hypothesis is that more modular networks have lower LLCs. When modules are sparsely connected, the free directions within each module remain independent, so the degeneracies of the modules add up (see the note after this list). The upshot is that modular structures can carry more degeneracy, which should make the network easier to understand:

  1. Non-Interacting Modules: Modules that do not share variables maintain independent free directions.
  2. Interaction Strength: The strength of interaction between modules can be analyzed through a logarithmic measure, which helps identify modular structure more reliably.
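
The heuristic behind item 1 can be related to a standard fact from singular learning theory (stated here as background in generic notation, not as the paper's own derivation): learning coefficients add across non-interacting parameter groups.

```latex
% If the (behavioral) loss decomposes over disjoint parameter groups,
%   L(w_1, ..., w_k) = L_1(w_1) + ... + L_k(w_k),
% then the local learning coefficient at a joint minimum decomposes too:
\[
  \lambda \;=\; \sum_{i=1}^{k} \lambda_i .
\]
% The degeneracies of the individual modules therefore accumulate, so a
% network that splits into weakly interacting modules can end up with a
% lower total LLC than a comparable densely connected network.
```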

The Interaction Basis

Finally, the idea of the "Interaction Basis" is introduced to represent neural networks in a form that is robust against reparameterization due to low-rank activations or Jacobians. This involves transforming the basis in each layer to:

  1. Eliminate Directions with Zero Eigenvalues: Exclude directions corresponding to zero eigenvalues of the Gram matrices of both the activations and the Jacobians.
  2. Diagonalize Interactions: Simplify the transitions between layers to produce bases aligned with the network's principal components.
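
Below is a minimal sketch, under simplifying assumptions, of the first step for a single layer: rotate the activations into the eigenbasis of their Gram matrix, drop near-zero eigendirections, and transform the weights reading from that layer accordingly. It covers only the activation/Gram-matrix part; the paper's full Interaction Basis also uses Jacobians and diagonalizes interactions, which is not reproduced here. All names (`gram_basis`, `acts`, `next_weight`) are hypothetical.

```python
import numpy as np

def gram_basis(acts, next_weight, tol=1e-6):
    """Rotate a layer into the eigenbasis of its activation Gram matrix and
    drop near-zero directions (sketch of one ingredient, not the paper's
    full construction).

    acts:        (n_datapoints, d) activations of layer l
    next_weight: (d_out, d) weight matrix reading from layer l
    """
    gram = acts.T @ acts / acts.shape[0]        # (d, d) Gram matrix
    eigvals, eigvecs = np.linalg.eigh(gram)     # ascending eigenvalues
    keep = eigvals > tol * eigvals.max()        # discard ~zero directions

    U = eigvecs[:, keep]                        # (d, r) retained basis
    acts_new = acts @ U                         # coordinates in the new basis
    weight_new = next_weight @ U                # transformed interactions

    # The next layer's computation is unchanged on the data distribution:
    # next_weight @ a == weight_new @ (U.T @ a) whenever a lies in the
    # retained subspace, which it does up to the dropped ~zero directions.
    return acts_new, weight_new

# Toy check: rank-3 activations embedded in a width-8 layer.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
embed = rng.normal(size=(3, 8))
acts = latent @ embed                           # activations span 3 directions
W = rng.normal(size=(4, 8))
acts_new, W_new = gram_basis(acts, W)
print(acts_new.shape, W_new.shape)              # (500, 3) (4, 3)
np.testing.assert_allclose(acts @ W.T, acts_new @ W_new.T, atol=1e-6)
```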

Implications and Future Directions

Practical: The interaction basis technique could sharpen our understanding and visualization of how neural networks process information by stripping out redundant parameters.

Theoretical: The findings open avenues for more extensive studies on how network modularity and degeneracy interplay, influencing the ease of network interpretability.

Speculative Future: There is an unexplored potential in applying these techniques to larger, more complex networks, such as state-of-the-art LLMs, potentially making them more interpretable and efficient.

By carefully examining the internal degeneracies and their implications, this paper takes significant steps toward making neural networks more transparent and interpretable. The full impact of these findings will likely unfold as these concepts are further validated and extended in future research.
