Using Degeneracy in the Loss Landscape for Mechanistic Interpretability (2405.10927v2)
Abstract: Mechanistic Interpretability aims to reverse engineer the algorithms implemented by neural networks by studying their weights and activations. An obstacle to reverse engineering neural networks is that many of the parameters inside a network are not involved in the computation being implemented by the network. These degenerate parameters may obfuscate internal structure. Singular learning theory teaches us that neural network parameterizations are biased towards being more degenerate, and that parameterizations with more degeneracy are likely to generalize further. We identify three ways that network parameters can be degenerate: linear dependence between activations in a layer; linear dependence between gradients passed back to a layer; and ReLUs which fire on the same subset of datapoints. We also present a heuristic argument that modular networks are likely to be more degenerate, and we develop a metric for identifying modules in a network that is based on this argument. We propose that if we can represent a neural network in a way that is invariant to reparameterizations that exploit the degeneracies, then this representation is likely to be more interpretable, and we provide some evidence that such a representation is likely to have sparser interactions. We introduce the Interaction Basis, a tractable technique to obtain a representation that is invariant to degeneracies arising from linear dependence of activations or Jacobians.
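As a simplified illustration of the first kind of degeneracy mentioned above (linear dependence between activations in a layer), the sketch below uses an SVD of a layer's activation matrix over a dataset to separate the directions actually spanned by the data from the directions along which the activations are linearly dependent. The function name, tolerance, and use of plain NumPy are illustrative assumptions; this is a minimal sketch of how such degenerate directions can be detected, not the paper's Interaction Basis procedure itself.

```python
import numpy as np

def split_activation_directions(acts: np.ndarray, tol: float = 1e-6):
    """Split a layer's activation space into directions spanned by the data
    and directions along which the activations are linearly dependent.

    acts: array of shape (n_datapoints, n_neurons), the activations of one
    layer collected over a dataset.
    Returns (live, degenerate, singular_values); the rows of `live` and
    `degenerate` are orthonormal directions in activation space.
    """
    # SVD of the activation matrix. Right singular vectors with (near-)zero
    # singular values are directions that no datapoint's activation vector
    # has a component along; downstream weight components that read from
    # these directions never affect the output on this dataset, so they are
    # degenerate parameters.
    _, s, vt = np.linalg.svd(acts, full_matrices=True)
    n_neurons = acts.shape[1]
    s_full = np.zeros(n_neurons)
    s_full[: s.shape[0]] = s
    threshold = tol * max(float(s_full.max()), 1e-12)
    live = vt[s_full > threshold]
    degenerate = vt[s_full <= threshold]
    return live, degenerate, s_full
```

Projecting the layer's activations (and the downstream weights that read from them) onto the `live` directions then gives a representation that is unchanged by reparameterizations acting purely within the degenerate subspace.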