
HILL: A Hallucination Identifier for Large Language Models (2403.06710v1)

Published 11 Mar 2024 in cs.HC

Abstract: LLMs are prone to hallucinations, i.e., nonsensical, unfaithful, and undesirable text. Users tend to overrely on LLMs and the corresponding hallucinations, which can lead to misinterpretations and errors. To tackle the problem of overreliance, we propose HILL, the "Hallucination Identifier for LLMs". First, we identified design features for HILL using a Wizard of Oz approach with nine participants. Subsequently, we implemented HILL based on the identified design features and evaluated its interface design by surveying 17 participants. Further, we investigated HILL's ability to identify hallucinations using an existing question-answering dataset and five user interviews. We find that HILL can correctly identify and highlight hallucinations in LLM responses, which enables users to handle LLM responses with more caution. With this, we propose an easy-to-implement adaptation to existing LLMs and demonstrate the relevance of user-centered design of AI artifacts.
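
To give a concrete picture of what "identify and highlight hallucinations in LLM responses" could look like in practice, the following is a minimal, hypothetical Python sketch; it is not the authors' HILL implementation. It assumes a placeholder sentence-level scorer (score_sentence) that checks a response against a small evidence corpus and flags low-confidence sentences; a real system would substitute a stronger grounding or consistency check.

```python
# Illustrative sketch only: NOT the HILL system described in the paper.
# It shows one plausible shape for a hallucination-highlighting layer that
# post-processes an LLM response sentence by sentence.
import re


def split_sentences(response: str) -> list[str]:
    # Naive sentence splitter; sufficient for this sketch.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", response) if s.strip()]


def score_sentence(sentence: str, evidence: list[str]) -> float:
    # Placeholder scorer: fraction of content words found in the evidence text.
    # A real detector would use a stronger signal (entailment model,
    # retrieval-based grounding, or sampling-based self-consistency).
    words = {w.lower() for w in re.findall(r"[A-Za-z]{4,}", sentence)}
    if not words:
        return 1.0
    evidence_text = " ".join(evidence).lower()
    supported = sum(1 for w in words if w in evidence_text)
    return supported / len(words)


def highlight_hallucinations(response: str, evidence: list[str],
                             threshold: float = 0.5) -> str:
    # Mark low-confidence sentences so a UI can render them as warnings
    # instead of rejecting the whole response.
    out = []
    for sent in split_sentences(response):
        conf = score_sentence(sent, evidence)
        out.append(f"[POSSIBLE HALLUCINATION] {sent}" if conf < threshold else sent)
    return " ".join(out)


if __name__ == "__main__":
    evidence = ["The SQuAD 2.0 dataset contains unanswerable questions "
                "written by crowdworkers."]
    answer = ("SQuAD 2.0 contains unanswerable questions written by crowdworkers. "
              "It was released in 1995 by the Zorblax Institute.")
    print(highlight_hallucinations(answer, evidence))
```

Operating at sentence granularity mirrors the paper's emphasis on highlighting specific parts of a response rather than discarding it, so users can calibrate their reliance selectively.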

Authors (7)
  1. Florian Leiser (3 papers)
  2. Sven Eckhardt (2 papers)
  3. Valentin Leuthe (1 paper)
  4. Merlin Knaeble (1 paper)
  5. Alexander Maedche (5 papers)
  6. Gerhard Schwabe (15 papers)
  7. Ali Sunyaev (16 papers)
Citations (8)

