Sparsity-Guided Holistic Explanation for LLMs with Interpretable Inference-Time Intervention (2312.15033v1)
Abstract: Large language models (LLMs) have achieved unprecedented breakthroughs across natural language processing domains. However, their enigmatic "black-box" nature remains a significant challenge for interpretability, hampering transparent and accountable applications. Past approaches, such as attention visualization, pivotal subnetwork extraction, and concept-based analyses, offer some insight, but they typically focus on either local or global explanations within a single dimension and fall short of providing comprehensive clarity. In response, we propose a novel methodology anchored in sparsity-guided techniques, aiming at a holistic interpretation of LLMs. Our framework, termed SparseCBM, innovatively integrates sparsity to elucidate three intertwined layers of interpretation: the input, subnetwork, and concept levels. In addition, the newly introduced dimension of interpretable inference-time intervention facilitates dynamic adjustments to the model during deployment. Through rigorous empirical evaluations on real-world datasets, we demonstrate that SparseCBM delivers a profound understanding of LLM behaviors, setting it apart in both interpreting and ameliorating model inaccuracies. Code is provided in the supplementary material.
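To make the described pipeline concrete, below is a minimal PyTorch sketch of a concept-bottleneck classifier with a sparsified backbone and an intervention on concept activations at inference time. The toy backbone, layer dimensions, pruning criterion (L1 magnitude pruning via `torch.nn.utils.prune`), and the `concept_override` mechanism are all illustrative assumptions for this sketch, not the authors' SparseCBM implementation (which is provided in their supplementary material).

```python
# Illustrative sketch only: a concept-bottleneck classifier with a pruned
# backbone and inference-time intervention on concept activations.
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.utils.prune as prune


class ConceptBottleneckClassifier(nn.Module):
    """Backbone -> concept activations -> label prediction."""

    def __init__(self, backbone: nn.Module, hidden_dim: int,
                 num_concepts: int, num_labels: int):
        super().__init__()
        self.backbone = backbone                      # stands in for an LLM encoder
        self.concept_head = nn.Linear(hidden_dim, num_concepts)
        self.label_head = nn.Linear(num_concepts, num_labels)

    def forward(self, x: torch.Tensor,
                concept_override: Optional[torch.Tensor] = None):
        h = self.backbone(x)                          # pooled representation
        concepts = torch.sigmoid(self.concept_head(h))
        if concept_override is not None:              # inference-time intervention:
            concepts = concept_override               # swap in edited concept values
        return concepts, self.label_head(concepts)


# Toy backbone standing in for a pretrained encoder (assumption for this sketch).
backbone = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 64))
model = ConceptBottleneckClassifier(backbone, hidden_dim=64,
                                    num_concepts=8, num_labels=2)

# Sparsity-guided subnetwork extraction: unstructured magnitude pruning of the
# backbone weights (one common criterion; the paper may use a different scheme).
for module in backbone.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)

x = torch.randn(4, 128)
concepts, logits = model(x)

# Intervention example: switch concept 3 off for every input and re-predict,
# e.g., to test whether that concept drives a wrong label.
edited = concepts.clone()
edited[:, 3] = 0.0
_, edited_logits = model(x, concept_override=edited)
print(logits.argmax(-1).tolist(), edited_logits.argmax(-1).tolist())
```

In this sketch, interpretation and correction share one interface: the concept vector is both the explanation a user reads and the handle through which a deployed model can be adjusted without retraining.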
- Zhen Tan
- Tianlong Chen
- Zhenyu Zhang
- Huan Liu