Usable XAI: 10 Strategies Towards Exploiting Explainability in the LLM Era (2403.08946v1)

Published 13 Mar 2024 in cs.LG, cs.CL, and cs.CY

Abstract: Explainable AI (XAI) refers to techniques that provide human-understandable insights into the workings of AI models. Recently, the focus of XAI has been extended towards LLMs, which are often criticized for their lack of transparency. This extension calls for a significant transformation in XAI methodologies for two reasons. First, many existing XAI methods cannot be directly applied to LLMs due to their complexity and advanced capabilities. Second, as LLMs are increasingly deployed across diverse industry applications, the role of XAI shifts from merely opening the "black box" to actively enhancing the productivity and applicability of LLMs in real-world settings. Meanwhile, unlike traditional machine learning models that are passive recipients of XAI insights, the distinct abilities of LLMs can reciprocally enhance XAI. Therefore, in this paper, we introduce Usable XAI in the context of LLMs by analyzing (1) how XAI can benefit LLMs and AI systems, and (2) how LLMs can contribute to the advancement of XAI. We introduce 10 strategies, presenting the key techniques for each and discussing their associated challenges. We also provide case studies demonstrating how to obtain and leverage explanations. The code used in this paper can be found at: https://github.com/JacksonWuxs/UsableXAI_LLM.
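As a hedged illustration of what "obtaining an explanation" for an LLM can look like in practice (this is not the paper's case-study code; the model name `gpt2`, the prompt, and the gradient-times-input heuristic are assumptions chosen for brevity), the sketch below scores each input token's contribution to a causal language model's next-token prediction:

```python
# Minimal sketch: token-level saliency for a causal LM via gradient x input.
# Illustrative only; model choice and scoring heuristic are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

# Embed tokens manually so gradients can flow back to the input embeddings.
embeddings = model.get_input_embeddings()(inputs["input_ids"]).detach()
embeddings.requires_grad_(True)

outputs = model(inputs_embeds=embeddings, attention_mask=inputs["attention_mask"])
next_token_logits = outputs.logits[0, -1]        # logits for the next token
predicted_id = next_token_logits.argmax()
next_token_logits[predicted_id].backward()       # gradient of the top logit

# Saliency per input token: L2 norm of (gradient * embedding), a common heuristic.
saliency = (embeddings.grad * embeddings).norm(dim=-1).squeeze(0)
for tok, score in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), saliency):
    print(f"{tok:>12s}  {score.item():.4f}")
print("predicted next token:", tokenizer.decode([predicted_id.item()]))
```

Such attribution scores are one of several explanation forms the paper surveys; others (e.g., self-explanations elicited by prompting, or probing of internal states) do not require gradient access at all.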

Authors (11)
  1. Xuansheng Wu (21 papers)
  2. Haiyan Zhao (42 papers)
  3. Yaochen Zhu (23 papers)
  4. Yucheng Shi (30 papers)
  5. Fan Yang (877 papers)
  6. Tianming Liu (161 papers)
  7. Xiaoming Zhai (48 papers)
  8. Wenlin Yao (38 papers)
  9. Jundong Li (126 papers)
  10. Mengnan Du (90 papers)
  11. Ninghao Liu (98 papers)