Word Importance Explains How Prompts Affect Language Model Outputs (2403.03028v1)
Abstract: The emergence of LLMs has revolutionized numerous applications across industries. However, their "black box" nature often hinders the understanding of how they make specific decisions, raising concerns about their transparency, reliability, and ethical use. This study presents a method to improve the explainability of LLMs by varying individual words in prompts to uncover their statistical impact on the model outputs. This approach, inspired by permutation importance for tabular data, masks each word in the system prompt and evaluates its effect on the outputs based on available text scores aggregated over multiple user inputs. Unlike classical attention, word importance measures the impact of prompt words on arbitrarily defined text scores, which enables decomposing the importance of words into specific measures of interest, such as bias, reading level, and verbosity. This procedure also enables measuring impact when attention weights are not available. To test the fidelity of this approach, we add different suffixes to multiple system prompts and compare the subsequent generations across different LLMs. Results show that word importance scores are closely related to the expected suffix importances for multiple scoring functions.
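The masking procedure described in the abstract can be illustrated with a minimal sketch. The snippet below assumes a user-supplied `generate(system_prompt, user_input)` LLM call and uses verbosity (word count) as a stand-in text score; these names and the scoring choice are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of prompt word importance via masking, assuming a
# user-supplied `generate(system_prompt, user_input) -> str` LLM call.

from statistics import mean


def score_text(text: str) -> float:
    # Example text score: verbosity measured as word count.
    # Any scalar text score (reading level, bias metric, ...) could be used.
    return len(text.split())


def word_importance(system_prompt, user_inputs, generate, score=score_text):
    """Estimate each prompt word's impact on the chosen text score.

    For every word in the system prompt, mask (remove) it, regenerate
    outputs for all user inputs, and compare the mean score against the
    baseline obtained with the full prompt.
    """
    words = system_prompt.split()

    # Baseline: mean score with the unmodified system prompt,
    # aggregated over multiple user inputs.
    baseline = mean(score(generate(system_prompt, u)) for u in user_inputs)

    importances = {}
    for i, word in enumerate(words):
        # Mask the i-th word by dropping it from the prompt.
        masked_prompt = " ".join(words[:i] + words[i + 1:])
        masked = mean(score(generate(masked_prompt, u)) for u in user_inputs)
        # Importance = change in the aggregated score when the word is removed.
        importances[word] = baseline - masked
    return importances
```

A large positive importance for a word means removing it lowers the chosen score across user inputs; repeating the procedure with different scoring functions decomposes a word's influence into separate measures of interest.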