LLM-Generated Black-box Explanations Can Be Adversarially Helpful (2405.06800v3)

Published 10 May 2024 in cs.CL

Abstract: LLMs are becoming vital tools that help us solve and understand complex problems by acting as digital assistants. LLMs can generate convincing explanations even when given only the inputs and outputs of these problems, i.e., in a "black-box" approach. However, our research uncovers a hidden risk tied to this approach, which we call adversarial helpfulness. This occurs when an LLM's explanations make a wrong answer look right, potentially leading people to trust incorrect solutions. In this paper, we show that this issue affects not just humans but also LLM evaluators. Digging deeper, we identify and examine key persuasive strategies employed by LLMs. Our findings reveal that these models employ strategies such as reframing the question, expressing an elevated level of confidence, and cherry-picking evidence to paint misleading answers in a credible light. To examine whether LLMs can navigate complex structured knowledge when generating adversarially helpful explanations, we create a special task based on navigating through graphs. Most LLMs are unable to find alternative paths along simple graphs, indicating that their misleading explanations are not produced purely by logical deduction over complex knowledge. These findings shed light on the limitations of the black-box explanation setting and allow us to provide advice on the safe usage of LLMs.
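The alternative-path probe mentioned in the abstract can be illustrated with a small sketch. This is not the paper's actual benchmark; it is a hypothetical reconstruction of the kind of test described: given a simple directed graph, can a solver find a route from a source to a target that avoids a given node? A classical breadth-first search handles this easily, which is the baseline against which the LLMs' failures are notable.

```python
from collections import deque

def find_path(graph, start, goal, banned=frozenset()):
    """Breadth-first search for a path from start to goal that
    avoids every node in `banned`. Returns the path, or None."""
    if start in banned or goal in banned:
        return None
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nbr in graph.get(node, []):
            if nbr not in seen and nbr not in banned:
                seen.add(nbr)
                queue.append(path + [nbr])
    return None

# Toy graph: two routes from A to D, one through B and one through C.
graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": [],
}

direct = find_path(graph, "A", "D")                # -> ["A", "B", "D"]
detour = find_path(graph, "A", "D", banned={"B"})  # -> ["A", "C", "D"]
```

A model that genuinely reasons over graph structure should produce the detour when the direct route is excluded; the paper reports that most LLMs fail at this even on simple graphs.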

Authors (4)
  1. Rohan Ajwani
  2. Shashidhar Reddy Javaji
  3. Frank Rudzicz
  4. Zining Zhu
Citations (2)