RELIC: Investigating Large Language Model Responses using Self-Consistency (2311.16842v2)
Abstract: Large language models (LLMs) are notorious for blending fact with fiction and generating non-factual content, known as hallucinations. To address this challenge, we propose an interactive system that helps users gain insight into the reliability of generated text. Our approach is based on the idea that the self-consistency of multiple samples generated by the same LLM reflects its confidence in the individual claims those samples contain. Building on this idea, we design RELIC, an interactive system that enables users to investigate and verify semantic-level variations across multiple long-form responses, helping them recognize potentially inaccurate information in the generated text and make the necessary corrections. In a user study with ten participants, we demonstrate that our approach helps users better verify the reliability of generated text. We further summarize the design implications and lessons learned from this research for future studies of reliable human-LLM interaction.
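The core idea can be illustrated with a short sketch: sample several alternative responses from the same LLM, decompose the main response into atomic claims, and score each claim by the fraction of alternative samples that semantically support it. The snippet below is a minimal illustration of this self-consistency scoring, not RELIC's actual implementation: the LLM sampling call is omitted, and `naive_support` is a hypothetical token-overlap stand-in for the semantic-level check (in practice, e.g., an NLI or entailment model).

```python
from typing import Callable, List


def self_consistency_scores(
    claims: List[str],
    samples: List[str],
    claim_supported: Callable[[str, str], bool],
) -> List[float]:
    """For each claim from the main response, return the fraction of
    alternative samples that support it. Low scores flag claims the
    model is inconsistent about, i.e., potential hallucinations."""
    if not samples:
        return [0.0] * len(claims)
    return [
        sum(claim_supported(claim, s) for s in samples) / len(samples)
        for claim in claims
    ]


def naive_support(claim: str, sample: str) -> bool:
    # Crude stand-in for semantic entailment: a token-overlap heuristic.
    claim_tokens = set(claim.lower().strip(".").split())
    sample_tokens = set(sample.lower().replace(",", " ").split())
    return len(claim_tokens & sample_tokens) / len(claim_tokens) > 0.6


if __name__ == "__main__":
    # Toy data: the second claim contradicts every alternative sample.
    claims = ["Marie Curie won two Nobel Prizes.", "She was born in 1900."]
    samples = [
        "Marie Curie won two Nobel Prizes and was born in 1867.",
        "Curie, born 1867, won two Nobel Prizes.",
        "Marie Curie won two Nobel Prizes.",
    ]
    print(self_consistency_scores(claims, samples, naive_support))
    # -> [1.0, 0.0]: the consistent claim scores high, the
    #    inconsistent (and incorrect) birth year scores low.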
Authors: Furui Cheng, Vilém Zouhar, Simran Arora, Mrinmaya Sachan, Hendrik Strobelt, Mennatallah El-Assady