Investigating the Role of Prompting and External Tools in Hallucination Rates of Large Language Models (2410.19385v1)

Published 25 Oct 2024 in cs.CL and cs.AI

Abstract: LLMs are powerful computational models trained on extensive corpora of human-readable text, enabling them to perform general-purpose language understanding and generation. LLMs have garnered significant attention in both industry and academia due to their exceptional performance across various NLP tasks. Despite these successes, LLMs often produce inaccuracies, commonly referred to as hallucinations. Prompt engineering, the process of designing and formulating instructions for LLMs to perform specific tasks, has emerged as a key approach to mitigating hallucinations. This paper provides a comprehensive empirical evaluation of different prompting strategies and frameworks aimed at reducing hallucinations in LLMs. Various prompting techniques are applied to a broad set of benchmark datasets to assess the accuracy and hallucination rate of each method. Additionally, the paper investigates the influence of tool-calling agents (LLMs augmented with external tools to enhance their capabilities beyond language generation) on hallucination rates in the same benchmarks. The findings demonstrate that the optimal prompting technique depends on the type of problem, and that simpler techniques often outperform more complex methods in reducing hallucinations. Furthermore, it is shown that LLM agents can exhibit significantly higher hallucination rates due to the added complexity of external tool usage.

Investigating Prompting and External Tools in LLM Hallucinations

The paper "Investigating the Role of Prompting and External Tools in Hallucination Rates of LLMs" explores methods to reduce inaccuracies, known as hallucinations, in LLMs. This research is pertinent due to the increasing deployment of LLMs across various applications, where hallucinations can lead to significant misinformation, particularly in sensitive domains like politics or medicine.

Hallucinations in LLMs

LLMs are known for their linguistic capabilities but suffer from hallucinations: outputs that are unfaithful to real-world facts or to the given context. These hallucinations fall into two categories. Factual hallucinations comprise factual inconsistencies and unsupported fabrications, while faithfulness hallucinations involve deviations from the prompt's instructions, the provided context, or logical consistency.

Prompt Engineering Techniques

The paper evaluates multiple prompting techniques designed to mitigate hallucinations:

  1. Self-Consistency (SC): This technique samples multiple completions and takes a majority vote over their final answers to enhance reliability (a minimal sketch follows this list). It was particularly effective at a higher temperature (0.8) on the GSM8K benchmark, which involves mathematical reasoning.
  2. Chain-of-Thought (CoT) and Tree-of-Thought (ToT): These approaches break down problem-solving into logical steps. The results indicate that while they improve reasoning, SC performed better, as its sampling-and-voting scheme balances creative exploration with answer accuracy.
  3. Chat Protect (CP): By discarding answers in which contradictions are detected across multiple samples, CP achieved the highest accuracy on the TriviaQA benchmark. Its performance improved at higher temperatures, where the more diverse samples make contradictions easier to detect, reducing the number of hallucinated answers.
  4. Knowledge Graph-based Retrofitting (KGR) and DuckDuckGo Augmentation (DDGA): The DDGA method, which augments queries with real-time information retrieved from the internet, increased the number of correct answers. However, incorporating the Wikidata knowledge graph in KGR did not yield similar benefits, owing to limitations of the implementation.
  5. Multiagent Debate (MAD): This interactive approach showed promising results, particularly on the MMLU benchmark with diverse subjects. The debate model allows for refining answers through agent interaction.
  6. Reflection and Chain-of-Verification (CoVe): These methods involve critical feedback and verification questions to confirm the consistency of responses. CoVe resulted in high accuracy but at the expense of reduced question coverage.
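
To make the sampling-and-voting idea behind SC concrete (see item 1 above), the following is a minimal Python sketch of Self-Consistency. The `call_llm` and `extract_final_answer` helpers are hypothetical placeholders for whatever model client and answer parser an implementation would actually use; this illustrates the general pattern, not the paper's own code.

```python
# Minimal sketch of Self-Consistency: sample several chain-of-thought
# completions at a nonzero temperature and keep the answer that wins a
# majority vote over the sampled final answers.
from collections import Counter


def call_llm(prompt: str, temperature: float) -> str:
    """Hypothetical wrapper around an LLM chat-completion client."""
    raise NotImplementedError("plug in your model client here")


def extract_final_answer(completion: str) -> str:
    """Hypothetical parser that pulls the final answer out of a
    chain-of-thought completion (e.g. the text after 'The answer is')."""
    return completion.strip().splitlines()[-1]


def self_consistency(question: str, n_samples: int = 10,
                     temperature: float = 0.8) -> str:
    prompt = f"{question}\nLet's think step by step."
    answers = [
        extract_final_answer(call_llm(prompt, temperature))
        for _ in range(n_samples)
    ]
    # Majority vote: the most common final answer is returned.
    answer, _count = Counter(answers).most_common(1)[0]
    return answer
```

The temperature of 0.8 mirrors the setting the paper found effective on GSM8K; in practice both the sample count and temperature would be tuned per task.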

Influence of External Tools

The paper highlights that augmenting LLMs with external tools, for example via the ReAct framework, introduces complexity that can increase hallucination rates, particularly in less powerful models. The research underscores that simpler architectures often outperform more intricate setups because they place a lighter reasoning burden on the model, as the sketch below illustrates.
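
As a rough illustration of where this added complexity enters, here is a minimal ReAct-style agent loop in Python. The `call_llm` client, the `search[...]` action format, and the `web_search` tool are assumptions made for the sketch; each extra step (emitting a well-formed action, folding the observation back into the context) is a point where a hallucination can be introduced and then built upon.

```python
# Rough sketch of a ReAct-style loop: the model alternates free-text
# "Thought"/"Action" steps with tool observations until it emits an answer.
def call_llm(transcript: str) -> str:
    """Hypothetical LLM call returning the next Thought/Action/Answer line."""
    raise NotImplementedError("plug in your model client here")


def web_search(query: str) -> str:
    """Hypothetical search tool (e.g. a DuckDuckGo wrapper)."""
    raise NotImplementedError("plug in your search tool here")


def react_agent(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(transcript)
        transcript += step + "\n"
        if step.startswith("Answer:"):
            # The model has committed to a final answer.
            return step.removeprefix("Answer:").strip()
        if step.startswith("Action: search["):
            # Parse the tool call and feed the observation back to the model.
            query = step[len("Action: search["):].rstrip("]")
            transcript += f"Observation: {web_search(query)}\n"
    return "No answer produced within the step budget."
```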

Implications and Future Directions

The findings of this paper suggest that the effectiveness of a mitigation strategy depends strongly on the type of task. SC is particularly effective for reasoning tasks, while strategies like CP are better suited to factual question answering. The introduction of external tools requires careful handling to avoid degrading performance.

Future research could explore combinations of strategies or dynamic adjustment techniques, such as adaptive temperature settings in SC, tailored to specific tasks. Moreover, the exploration of larger, more powerful models with external tools presents another avenue for research, potentially reducing the hallucination rates observed with smaller models.

This paper provides valuable insights into mitigating LLM hallucinations using prompting techniques and highlights the nuanced considerations when employing external tools in AI systems. The results significantly contribute to developing more reliable AI applications across a broad range of domains.

References (20)
  1. X. Amatriain. 2024. Measuring and Mitigating Hallucinations in Large Language Models: A Multifaceted Approach.
  2. Training Verifiers to Solve Math Word Problems. arXiv:2110.14168 [cs.LG]
  3. Chain-of-Verification Reduces Hallucination in Large Language Models. arXiv:2309.11495 [cs.CL]
  4. B. H. Dowden. 1993. Logical reasoning. Wadsworth, Sacramento.
  5. Improving Factuality and Reasoning in Language Models through Multiagent Debate. arXiv:2305.14325 [cs.CL] Preprint.
  6. Mitigating Large Language Model Hallucinations via Autonomous Knowledge Graph-based Retrofitting. arXiv:2311.13314 [cs.CL] https://arxiv.org/abs/2311.13314
  7. Measuring Massive Multitask Language Understanding. arXiv:2009.03300 [cs.CY] https://arxiv.org/abs/2009.03300
  8. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. arXiv:2311.05232 [cs.CL]
  9. Survey of Hallucination in Natural Language Generation. Comput. Surveys 55, 12 (March 2023), 1–38. https://doi.org/10.1145/3571730
  10. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Vancouver, Canada, 1601–1611.
  11. Large Language Models: A Survey. arXiv:2402.06196 [cs.CL]
  12. M. Minsky. 1988. The Society of Mind. Simon & Schuster, New York. 97–101 pages.
  13. O. Mortensen. 2024. How many users does ChatGPT have? Statistics & facts (2024). https://seo.ai/blog/how-many-users-does-chatgpt-have#:~:text=How%20Many%20Users%20on%20ChatGPT,boasts%20approximately%20180.5%20million%20users. Accessed: 24 September 2024.
  14. Self-contradictory Hallucinations of Large Language Models: Evaluation, Detection and Mitigation. In The Twelfth International Conference on Learning Representations. OpenReview, Virtual/Online. https://openreview.net/forum?id=EmQSOi1X2f
  15. Attention is All you Need. In Advances in Neural Information Processing Systems, I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc. https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
  16. D. Vrandecic and M. Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Commun. ACM 57, 10 (2014), 78–85.
  17. Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171 [cs.CL]
  18. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903 [cs.CL]
  19. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601 [cs.CL]
  20. ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629 [cs.CL] https://arxiv.org/abs/2210.03629
Authors (2)
  1. Liam Barkley (1 paper)
  2. Brink van der Merwe (6 papers)