Groundedness in Retrieval-augmented Long-form Generation: An Empirical Study (2404.07060v1)

Published 10 Apr 2024 in cs.CL and cs.LG

Abstract: We present an empirical study of groundedness in long-form question answering (LFQA) by retrieval-augmented LLMs. In particular, we evaluate whether every generated sentence is grounded in the retrieved documents or the model's pre-training data. Across 3 datasets and 4 model families, our findings reveal that a significant fraction of generated sentences are consistently ungrounded, even when those sentences contain correct ground-truth answers. Additionally, we examine the impacts of factors such as model size, decoding strategy, and instruction tuning on groundedness. Our results show that while larger models tend to ground their outputs more effectively, a significant portion of correct answers remains compromised by hallucinations. This study provides novel insights into the groundedness challenges in LFQA and underscores the necessity for more robust mechanisms in LLMs to mitigate the generation of ungrounded content.

Groundedness in Retrieval-augmented Long-form Generation: An Empirical Study

The paper provides a methodical exploration of groundedness in retrieval-augmented LLMs tasked with long-form question answering (LFQA). It examines whether these models faithfully ground their generated responses in the provided documents, or whether they hallucinate even when producing answers that match the ground truth.

The paper meticulously evaluates the grounding of individual sentences within model outputs across multiple datasets and model families, drawing a careful distinction between grounding in the retrieved documents and grounding in the pre-training corpus. Notably, it finds that even when LLMs generate factually correct sentences, a considerable portion remains ungrounded in either source. This raises critical questions about the internal mechanisms that enable or inhibit proper grounding in such models.

Significant findings include the observation that larger models generally produce more grounded responses. Nevertheless, the results demonstrate that model size alone is insufficient to eliminate ungrounded statements: the issue persists even in the largest models studied, such as Falcon 180B, where up to 25% of seemingly correct outputs are derived from hallucinated content. This points to the need for strategies beyond simply increasing model size, a crucial insight for practitioners aiming to enhance LLM reliability.

The paper also rigorously assesses how factors such as decoding strategy and instruction tuning affect groundedness. The results indicate that beam search decoding, unlike greedy decoding or nucleus sampling, consistently yields outputs more firmly anchored in the provided context, suggesting that this strategy may inherently facilitate better alignment with the source material. Instruction tuning likewise plays a positive role, considerably enhancing groundedness.
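To make the decoding comparison concrete, the following is a minimal sketch of the three strategies using the Hugging Face transformers generation API; the model name, prompt format, and generation settings are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch: comparing greedy, beam-search, and nucleus-sampling decoding with a
# Hugging Face causal LM. Model and prompt are placeholders, not the paper's setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tiiuae/falcon-7b"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Context: <retrieved passages>\nQuestion: <question>\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding: always take the highest-probability next token.
greedy = model.generate(**inputs, max_new_tokens=200, do_sample=False)

# Beam search: track several candidate continuations and return the best-scoring one.
beam = model.generate(**inputs, max_new_tokens=200, num_beams=4, do_sample=False)

# Nucleus (top-p) sampling: sample from the smallest token set covering 90% of the probability mass.
nucleus = model.generate(**inputs, max_new_tokens=200, do_sample=True, top_p=0.9, top_k=0)

for name, out in [("greedy", greedy), ("beam", beam), ("nucleus", nucleus)]:
    print(name, tokenizer.decode(out[0], skip_special_tokens=True), sep="\n")
```

One plausible reading of the beam-search result is that searching for a high-likelihood sequence keeps the output closer to the conditioning context, whereas sampling injects randomness that can drift away from it.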

From a methodological standpoint, the authors adopted a mixed retrieval strategy, combining retrieval from external documents with a post-generation search across the pre-training corpus. The analysis employs a natural language inference (NLI) based grounding model to check whether each output sentence is supported either by the retrieved documents or by documents from the pre-training corpus, illuminating the intertwined relationship between groundedness and model pre-training.
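As an illustration of how such a sentence-level grounding check can be run in practice, the sketch below splits an answer into sentences and asks an off-the-shelf NLI cross-encoder whether any supporting document entails each sentence. The specific model, sentence splitter, and decision rule are assumptions for illustration, not the grounding model used in the paper.

```python
# Sketch: sentence-level groundedness check with an off-the-shelf NLI cross-encoder.
# The model choice and entailment-based decision rule are illustrative assumptions.
from nltk import sent_tokenize                      # requires: nltk.download("punkt")
from sentence_transformers import CrossEncoder

nli = CrossEncoder("cross-encoder/nli-deberta-v3-base")
LABELS = ["contradiction", "entailment", "neutral"]  # this model's label order

def grounded_sentences(answer: str, documents: list[str]) -> list[tuple[str, bool]]:
    """Mark each answer sentence as grounded if at least one document entails it."""
    results = []
    for sent in sent_tokenize(answer):
        pairs = [(doc, sent) for doc in documents]   # premise = document, hypothesis = sentence
        scores = nli.predict(pairs)                  # logits of shape (num_docs, 3)
        labels = [LABELS[int(row.argmax())] for row in scores]
        results.append((sent, "entailment" in labels))
    return results

# Usage: check a generated answer against retrieved passages; the same check can be
# repeated against passages retrieved from the pre-training corpus, as in the paper.
docs = ["The Eiffel Tower was completed in 1889 and stands in Paris."]
answer = "The Eiffel Tower is in Paris. It was finished in 1925."
for sent, ok in grounded_sentences(answer, docs):
    print("GROUNDED  " if ok else "UNGROUNDED", sent)
```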

In discussing the broader implications, the paper emphasizes the necessity for more robust mitigation mechanisms against hallucination in LLMs. The potential of developing more sophisticated retrieval-augmented frameworks or fine-tuning strategies to enhance sentence-level alignment with factual data is immense and could significantly impact applications demanding high veracity, such as academic research synthesis and automated Q&A systems.

Looking forward, the research opens avenues for exploring specialized decoding strategies or post-processing corrections that verify grounding, aiming for methodologies that can effectively counteract the inherent limitations of existing LLMs. Practitioners might benefit from adaptive models that dynamically verify grounding during generation rather than post hoc.
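A minimal sketch of what such on-the-fly verification could look like is given below: generation proceeds one sentence at a time, and a candidate sentence is kept only if a grounding check against the retrieved documents passes. Both generate_sentence and entails are hypothetical placeholders (for example, a single decoding step and the NLI check sketched above); this is not a method proposed in the paper.

```python
# Hypothetical sketch: verify grounding during generation instead of post hoc.
# `generate_sentence` and `entails` are placeholder callables, not a real API.
def grounded_decode(prompt, documents, generate_sentence, entails,
                    max_sentences=10, max_retries=3):
    """Build an answer sentence by sentence, keeping only grounded continuations."""
    answer = []
    for _ in range(max_sentences):
        for _ in range(max_retries):
            sent = generate_sentence(prompt, " ".join(answer))
            if sent is None:                                   # model signalled it is done
                return " ".join(answer)
            if any(entails(doc, sent) for doc in documents):   # grounded: accept and continue
                answer.append(sent)
                break
        else:
            break                                              # no grounded continuation found; stop early
    return " ".join(answer)
```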

In conclusion, this paper provides empirical grounding to the challenges and dynamics of grounded content generation in retrieval-augmented LLMs. It underscores the essential nature of continued research and development in this domain to ensure the dependable deployment of powerful LLMs across various real-world applications.

Authors (1)
  1. Alessandro Stolfo