
Characterizing Large Language Models as Rationalizers of Knowledge-intensive Tasks (2311.05085v2)

Published 9 Nov 2023 in cs.CL and cs.AI

Abstract: LLMs are proficient at generating fluent text with minimal task-specific supervision. Yet, their ability to provide well-grounded rationalizations for knowledge-intensive tasks remains under-explored. Such tasks, like commonsense multiple-choice questions, require rationales based on world knowledge to support predictions and refute alternate options. We consider the task of generating knowledge-guided rationalization in natural language by using expert-written examples in a few-shot manner. Surprisingly, crowd workers preferred knowledge-grounded rationales over crowdsourced rationalizations, citing their factuality, sufficiency, and comprehensive refutations. Although LLM-generated rationales were preferable, further improvements in conciseness and novelty are required. In another study, we show how rationalization of incorrect model predictions erodes humans' trust in LLM-generated rationales. Motivated by these observations, we create a two-stage pipeline to review task predictions and eliminate potentially incorrect decisions before rationalization, enabling trustworthy rationale generation.
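
The two-stage review-then-rationalize pipeline the abstract describes can be pictured with a short sketch. The code below is a minimal illustration, not the paper's implementation: the `ask_llm` callable, the prompt wording, and the single few-shot exemplar are all invented assumptions standing in for whatever model, prompts, and expert-written examples the authors actually used.

```python
# Minimal sketch of a two-stage "review, then rationalize" pipeline.
# Stage 1 reviews the model's task prediction; stage 2 generates a
# few-shot, knowledge-guided rationale only for predictions that pass.
# All names and prompt text here are illustrative assumptions.
from typing import Callable, Optional

REVIEW_PROMPT = (
    "Question: {question}\n"
    "Candidate answer: {answer}\n"
    "Is the candidate answer correct? Reply YES or NO."
)

# Expert-written exemplars would go here; one invented example is shown.
RATIONALE_PROMPT = (
    "Question: Where would you store uncooked crab meat?\n"
    "Answer: refrigerator\n"
    "Rationale: Raw seafood spoils quickly at room temperature, so it is\n"
    "kept refrigerated; a cupboard or countertop would let it spoil.\n\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Rationale:"
)

def rationalize(
    question: str,
    predicted_answer: str,
    ask_llm: Callable[[str], str],  # any text-completion function
) -> Optional[str]:
    # Stage 1: review the prediction; abstain if it looks wrong, since
    # rationalizing incorrect predictions erodes users' trust.
    verdict = ask_llm(
        REVIEW_PROMPT.format(question=question, answer=predicted_answer)
    )
    if not verdict.strip().upper().startswith("YES"):
        return None  # withhold the rationale rather than justify a likely error

    # Stage 2: few-shot rationalization conditioned on expert-written
    # exemplars that support the answer and refute the alternatives.
    return ask_llm(
        RATIONALE_PROMPT.format(question=question, answer=predicted_answer)
    )
```

In this framing, abstaining (returning `None`) is the mechanism that keeps rationale generation trustworthy: a rationale is only ever produced for predictions the review stage endorses.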

Authors (5)
  1. Aditi Mishra (7 papers)
  2. Sajjadur Rahman (16 papers)
  3. Hannah Kim (19 papers)
  4. Kushan Mitra (4 papers)
  5. Estevam Hruschka (23 papers)
Citations (4)