Measuring and Reducing LLM Hallucination without Gold-Standard Answers (2402.10412v2)

Published 16 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: LLM hallucination, i.e., generating factually incorrect yet seemingly convincing answers, is currently a major threat to the trustworthiness and reliability of LLMs. The first step towards solving this complicated problem is to measure it. However, existing hallucination metrics require a benchmark dataset with gold-standard answers, i.e., "best" or "correct" answers written by humans. Such requirements make hallucination measurement costly and prone to human errors. In this work, we propose Factualness Evaluations via Weighting LLMs (FEWL), an innovative hallucination metric specifically designed for the scenario in which gold-standard answers are absent. FEWL leverages the answers from off-the-shelf LLMs that serve as a proxy for gold-standard answers. The key challenge is how to quantify the expertise of reference LLMs resourcefully. We show that FEWL has certain theoretical guarantees and demonstrate empirically that it gives more accurate hallucination measures than naively using reference LLMs. We also show how to leverage FEWL to reduce hallucination through both in-context learning and supervised fine-tuning. Extensive experimental results on the TruthfulQA, CHALE, and HaluEval datasets demonstrate the effectiveness of FEWL.
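The abstract does not spell out FEWL's exact formulation, but the core idea it describes is to score a candidate answer by its agreement with answers from several off-the-shelf reference LLMs, weighting each reference by an estimate of its expertise. The snippet below is a minimal sketch of that idea only, not the authors' method: the function names, the agreement measure, and the assumption that expertise weights are already available are all illustrative.

```python
from typing import Callable, List


def fewl_style_score(
    candidate_answer: str,
    reference_answers: List[str],      # answers from off-the-shelf reference LLMs
    expertise_weights: List[float],    # assumed per-reference expertise estimates
    agreement: Callable[[str, str], float],  # similarity scorer returning a value in [0, 1]
) -> float:
    """Expertise-weighted agreement between a candidate answer and reference-LLM answers."""
    assert len(reference_answers) == len(expertise_weights)
    total = sum(expertise_weights)
    if total == 0:
        return 0.0
    weighted = sum(
        w * agreement(candidate_answer, ref)
        for ref, w in zip(reference_answers, expertise_weights)
    )
    return weighted / total


if __name__ == "__main__":
    # Toy agreement function: token-level Jaccard similarity.
    def jaccard(a: str, b: str) -> float:
        sa, sb = set(a.lower().split()), set(b.lower().split())
        return len(sa & sb) / len(sa | sb) if (sa | sb) else 0.0

    refs = [
        "Paris is the capital of France.",
        "The capital of France is Paris.",
    ]
    weights = [0.7, 0.3]  # hypothetical expertise estimates for two reference LLMs
    print(fewl_style_score("Paris is the capital of France.", refs, weights, jaccard))
```

In practice, the toy Jaccard scorer would be replaced by a stronger semantic agreement measure, and the expertise weights would come from the paper's weighting procedure rather than being supplied by hand.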

