One vs. Many: Comprehending Accurate Information from Multiple Erroneous and Inconsistent AI Generations (2405.05581v1)

Published 9 May 2024 in cs.HC, cs.AI, and cs.CL

Abstract: As LLMs are nondeterministic, the same input can generate different outputs, some of which may be incorrect or hallucinated. If run again, the LLM may correct itself and produce the correct answer. Unfortunately, most LLM-powered systems resort to single results which, correct or not, users accept. Having the LLM produce multiple outputs may help identify disagreements or alternatives. However, it is not obvious how the user will interpret conflicts or inconsistencies. To this end, we investigate how users perceive the AI model and comprehend the generated information when they receive multiple, potentially inconsistent, outputs. Through a preliminary study, we identified five types of output inconsistencies. Based on these categories, we conducted a study (N=252) in which participants were given one or more LLM-generated passages to an information-seeking question. We found that inconsistency within multiple LLM-generated outputs lowered the participants' perceived AI capacity, while also increasing their comprehension of the given information. Specifically, we observed that this positive effect of inconsistencies was most significant for participants who read two passages, compared to those who read three. Based on these findings, we present design implications that, instead of regarding LLM output inconsistencies as a drawback, we can reveal the potential inconsistencies to transparently indicate the limitations of these models and promote critical LLM usage.

Understanding AI Output Variance: Insights from Multiple Responses

The Impact of Multiple AI Outputs

Imagine you're using an LLM like ChatGPT to answer a complex question. Typically, you'd get one response and take it at face value. But what if you received multiple, potentially conflicting answers? Does this make you trust the AI less, or does it push you to dig deeper into the topic? The researchers explored these questions by examining how the number of AI-generated responses and their consistency influence users' perception of the AI's reliability and their understanding of the information presented.
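To make the setup concrete, here is a minimal sketch (not from the paper) of how an application might sample several responses to the same question and notice when they disagree. It assumes the `openai` Python client, an API key in the environment, and a placeholder model name; the study itself examined how readers handle such passages, not this particular pipeline.

```python
# Sketch: sample several completions for one question and flag disagreement.
# Assumes the `openai` package and an OPENAI_API_KEY in the environment;
# the model name is a placeholder.
from openai import OpenAI

client = OpenAI()

def sample_answers(question: str, n: int = 3) -> list[str]:
    """Request n independent completions for the same prompt."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",          # placeholder model name
        messages=[{"role": "user", "content": question}],
        n=n,                          # ask for n samples in one call
        temperature=1.0,              # nonzero temperature -> outputs can differ
    )
    return [choice.message.content for choice in response.choices]

def normalized(text: str) -> str:
    """Crude normalization so trivially different strings compare equal."""
    return " ".join(text.lower().split())

answers = sample_answers("Who was the first person to reach the South Pole?")
if len({normalized(a) for a in answers}) > 1:
    print("The samples disagree; showing all of them may prompt closer reading:")
else:
    print("All samples agree (which does not guarantee they are correct):")
for i, a in enumerate(answers, 1):
    print(f"--- Passage {i} ---\n{a}\n")
```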

Study Summary

Participants were divided into groups that saw one, two, or three AI-generated passages in response to an information-seeking question, with varying degrees of consistency between the passages. The study measured changes in participants' perception of the AI's capability (perceived AI capacity) and their ability to understand the information provided (comprehension).

Key Findings on Perceived AI Capacity and Comprehension

  • Perceived AI Capacity: Inconsistencies between the passages generally decreased participants' trust in the AI. Interestingly, when given three passages, participants tended to rely on the majority answer even when it was incorrect, suggesting that more outputs do not necessarily lead to better judgments.
  • Comprehension: Participants who received two slightly conflicting passages tended to understand the material better than those who received either one or three passages. This suggests that a moderate level of conflict may encourage deeper engagement with the content without overwhelming the reader.

Surprising Insights

The two-passage setup not only reduced blind trust in AI-generated content but also encouraged a more thorough evaluation of the information. However, the paper found that too many outputs (as in the three-passage condition) can lead to confusion or to reliance on a potentially misleading majority answer.
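That reliance amounts to informal majority voting by the reader. A minimal sketch of such aggregation (not something the paper implements; the answers below are made up) shows why it can fail: if two of three sampled passages share the same wrong answer, the "majority" is wrong.

```python
# Sketch: majority voting over several sampled answers, illustrating how a
# consistent-but-wrong majority can dominate. The answers below are made up.
from collections import Counter

sampled_answers = [
    "Roald Amundsen reached the South Pole first, in 1911.",
    "Robert Falcon Scott was the first to reach the South Pole.",  # wrong
    "Robert Falcon Scott reached the South Pole first.",           # wrong
]

def answer_key(text: str) -> str:
    """Toy normalization: vote on the surname mentioned, if any."""
    lowered = text.lower()
    for name in ("amundsen", "scott"):
        if name in lowered:
            return name
    return lowered

votes = Counter(answer_key(a) for a in sampled_answers)
majority, count = votes.most_common(1)[0]
print(f"Majority answer: {majority!r} ({count} of {len(sampled_answers)} samples)")
# Here the majority ("scott") is wrong, mirroring the finding that readers
# given three passages tended to trust the majority even when it was incorrect.
```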

Implications for AI Design and Interaction

The findings suggest several design strategies for AI and machine learning systems:

  1. Presenting Multiple Perspectives: Offering two varying responses could foster more critical assessment of and engagement with AI-generated content; a minimal sketch of how differences between two responses might be surfaced follows this list.
  2. Transparency: Clearly indicating when responses are generated from AI and explaining why discrepancies may occur can help manage expectations and encourage a more analytical approach to AI interactions.
  3. Cognitive Load Management: Care must be taken not to overwhelm users with too much information, which could reduce the effectiveness of the AI interaction.
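For the first two points, one lightweight way an interface could surface where two responses diverge is a plain textual diff, sketched below. This uses only the Python standard library; it is an illustrative assumption rather than a technique from the paper, and a real system might prefer semantic comparisons (e.g., NLI-based consistency checks) over string diffs.

```python
# Sketch: highlight where two generated passages diverge using a word-level diff.
# Standard library only; a real interface might use semantic rather than
# string-level comparison.
import difflib

passage_a = (
    "The Great Wall of China was built over many centuries, "
    "with major construction during the Ming dynasty."
)
passage_b = (
    "The Great Wall of China was built over many centuries, "
    "with major construction during the Qin dynasty."
)

diff = difflib.unified_diff(
    passage_a.split(), passage_b.split(),
    fromfile="response 1", tofile="response 2", lineterm="",
)
for line in diff:
    print(line)
# Words prefixed with '-' or '+' mark where the two responses disagree;
# surfacing them could cue the reader to verify that specific detail.
```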

Future Research Directions

The paper prompts several questions for future research:

  • Beyond Text-Based Responses: Would these findings hold true for other forms of AI-generated content, such as images or videos?
  • Long-Term Interaction Effects: How does repeated exposure to consistent vs. inconsistent AI responses affect user trust and comprehension over time?
  • Impact of Initial Expectations: How does a user's prior belief about an AI's accuracy affect their response to consistency or lack thereof in AI outputs?

Understanding these dynamics can further refine how we design interactive AI systems that are both helpful and trustworthy, enhancing the human-AI interaction experience. Additionally, as AI continues to integrate into various aspects of daily life, adapting these findings to different contexts and user needs will be crucial in developing versatile, reliable AI tools.

Authors (7)
  1. Yoonjoo Lee
  2. Kihoon Son
  3. Tae Soo Kim
  4. Jisu Kim
  5. John Joon Young Chung
  6. Eytan Adar
  7. Juho Kim