Towards Generating Informative Textual Description for Neurons in Language Models (2401.16731v1)
Abstract: Recent developments in transformer-based large language models (LLMs) have allowed them to capture a wide variety of world knowledge that can be adapted to downstream tasks with limited resources. However, what pieces of information these models understand is unclear, and the neuron-level contributions to identifying that information are largely unknown. Conventional approaches to neuron explainability either depend on a finite set of pre-defined descriptors or require manual annotations to train a secondary model that can then explain the neurons of the primary model. In this paper, we take BERT as an example, remove these constraints, and propose a novel, scalable framework that ties textual descriptions to neurons. We leverage the potential of generative LLMs to discover human-interpretable descriptors present in a dataset and use an unsupervised approach to explain neurons with these descriptors. Through various qualitative and quantitative analyses, we demonstrate the effectiveness of this framework in generating useful, data-specific descriptors with little human involvement and in identifying the neurons that encode them. In particular, our experiments show that the proposed approach achieves 75% precision@2 and 50% recall@2.
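To make the pipeline in the abstract concrete, here is a minimal sketch, not the paper's implementation: it assumes neuron activations from a model such as BERT and a binary matrix marking which LLM-generated descriptors apply to each input, ranks descriptors per neuron by activation correlation (one simple unsupervised matching criterion), and computes the precision@k and recall@k metrics quoted above. All function names and the toy data are hypothetical.

```python
# Hypothetical sketch: match neurons to candidate textual descriptors
# without supervision, then score the ranking with precision@k / recall@k.
import numpy as np

def rank_descriptors(neuron_acts: np.ndarray, desc_presence: np.ndarray) -> np.ndarray:
    """Rank descriptors for each neuron by Pearson correlation between the
    neuron's activation over inputs and a binary descriptor-presence signal.

    neuron_acts:   (n_inputs, n_neurons) activations, e.g. from BERT.
    desc_presence: (n_inputs, n_descriptors) 0/1 matrix saying whether an
                   LLM-generated descriptor applies to each input.
    Returns (n_neurons, n_descriptors) descriptor indices, best first.
    """
    a = (neuron_acts - neuron_acts.mean(0)) / (neuron_acts.std(0) + 1e-8)
    d = (desc_presence - desc_presence.mean(0)) / (desc_presence.std(0) + 1e-8)
    corr = a.T @ d / len(a)            # (n_neurons, n_descriptors)
    return np.argsort(-corr, axis=1)   # sort by descending correlation

def precision_recall_at_k(ranked, gold, k=2):
    """gold: one set of correct descriptor indices per neuron."""
    prec, rec = [], []
    for row, g in zip(ranked, gold):
        hits = len(set(row[:k]) & g)
        prec.append(hits / k)
        rec.append(hits / len(g) if g else 0.0)
    return float(np.mean(prec)), float(np.mean(rec))

# Toy usage with synthetic data: neuron 0 is made to track descriptor 3.
rng = np.random.default_rng(0)
acts = rng.normal(size=(200, 4))           # 200 inputs, 4 neurons
desc = rng.integers(0, 2, size=(200, 6))   # 6 candidate descriptors
acts[:, 0] += 2 * desc[:, 3]
ranked = rank_descriptors(acts, desc)
p, r = precision_recall_at_k(ranked, gold=[{3}, {0}, {1}, {5}], k=2)
print(f"precision@2={p:.2f}, recall@2={r:.2f}")
```

In this sketch, precision@2 is the fraction of a neuron's top-2 descriptors that are correct, and recall@2 is the fraction of a neuron's correct descriptors recovered in its top 2, averaged over neurons; activation correlation stands in for whatever matching score the framework actually uses.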