
MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents (2404.10774v2)

Published 16 Apr 2024 in cs.CL and cs.AI

Abstract: Recognizing if LLM output can be grounded in evidence is central to many tasks in NLP: retrieval-augmented generation, summarization, document-grounded dialogue, and more. Current approaches to this kind of fact-checking are based on verifying each piece of a model generation against potential evidence using an LLM. However, this process can be very computationally expensive, requiring many calls to a model to check a single response. In this work, we show how to build small fact-checking models that have GPT-4-level performance but for 400x lower cost. We do this by constructing synthetic training data with GPT-4, which involves creating realistic yet challenging instances of factual errors via a structured generation procedure. Training on this data teaches models to check each fact in the claim and recognize synthesis of information across sentences. For evaluation, we unify datasets from recent work on fact-checking and grounding LLM generations into a new benchmark, LLM-AggreFact. Our best system MiniCheck-FT5 (770M parameters) outperforms all systems of comparable size and reaches GPT-4 accuracy. We release LLM-AggreFact, code for data synthesis, and models.

Efficient and Effective Fact-Checking for Grounding LLM Generations

Introduction

LLMs generate fluent, contextually relevant text across many tasks, including document summarization and dialogue generation. Nevertheless, they often produce content that appears plausible but is not supported by the available evidence, a phenomenon known as "hallucination." Detecting such unsupported content in a scalable and cost-effective manner remains an open challenge in NLP.

This work introduces a methodology that sharply reduces the computational and financial overhead of LLM-based fact-checking without sacrificing performance. By constructing a synthetic dataset that captures complex, realistic factual errors and using it to train much smaller models, the authors build a system, MiniCheck, that matches GPT-4 accuracy at roughly 400 times lower cost.

The MiniCheck Fact-Checking Models

MiniCheck, the proposed system, addresses the limitations of prior fact-checking approaches primarily through its training data. Its core idea is a structured procedure for generating synthetic training examples that contain realistic but challenging factual errors. The resulting data reflects the kinds of mistakes LLMs actually make, from subtle misstatements to outright factual errors, including claims whose verification requires synthesizing information across multiple sentences of the grounding document.
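
As a rough, hedged illustration of this style of error synthesis (not the paper's actual prompts or pipeline), the sketch below asks GPT-4 to corrupt a supported claim so that it is no longer grounded in its document, yielding paired supported/unsupported training examples; the prompt wording and model name are assumptions for illustration.

```python
# Hedged sketch: create an "unsupported" variant of a grounded claim.
# The prompt and pairing scheme are illustrative, not the paper's exact procedure.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def make_unsupported_claim(document: str, supported_claim: str) -> str:
    """Ask the LLM to introduce a subtle factual error into a supported claim."""
    prompt = (
        "Document:\n" + document + "\n\n"
        "Claim (fully supported by the document):\n" + supported_claim + "\n\n"
        "Rewrite the claim so that it is no longer supported by the document. "
        "Keep the wording plausible, change only one or two facts, and prefer a fact "
        "that requires combining information from multiple sentences to verify."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    return response.choices[0].message.content.strip()


# Each document then yields a (claim, supported) and a (corrupted claim, unsupported)
# pair for training the small fact-checking model.
```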

MiniCheck's best-performing model is built on the Flan-T5 architecture and fine-tuned on the synthetic dataset together with standard entailment data. This combination lets the model handle the nuances of LLM-generated claims while retaining the broader entailment-detection ability required for effective fact-checking.
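
A minimal sketch of how such a fine-tuned Flan-T5 checker might be applied at inference time follows; the checkpoint name, the "Document: ... Claim: ..." input template, and the yes/no output labels are assumptions for illustration rather than the released model's exact interface.

```python
# Hedged sketch: score a (document, claim) pair with a Flan-T5-style fact-checker.
# Checkpoint name, input template, and label tokens are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "your-org/minicheck-style-flan-t5-large"  # hypothetical checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)
model.eval()


def is_claim_supported(document: str, claim: str) -> bool:
    """Return True if the checker judges the claim to be grounded in the document."""
    text = f"Document: {document}\nClaim: {claim}\nIs the claim supported by the document?"
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=4)
    answer = tokenizer.decode(output_ids[0], skip_special_tokens=True).strip().lower()
    return answer.startswith("yes")


print(is_claim_supported("The meeting was moved to Tuesday at 3pm.",
                         "The meeting now takes place on Tuesday afternoon."))
```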

LLM-AggreFact: A New Factual Evaluation Benchmark

To benchmark the proficiency of fact-checking models, including MiniCheck, the paper introduces LLM-AggreFact — a comprehensive dataset amalgamating various tasks that necessitate evidence grounding. This benchmark encompasses a diverse array of domains from healthcare to news, alongside a mixture of closed-book and grounded generation settings, offering a rigorous testing ground for fact-checking systems.

Evaluation on LLM-AggreFact shows that MiniCheck outperforms previous systems of comparable size by a significant margin in balanced accuracy. In particular, MiniCheck-FT5 (770M parameters) reaches accuracy comparable to GPT-4 while being far cheaper and faster to run.
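
Balanced accuracy, the benchmark's headline metric, averages recall over the supported and unsupported classes. A small sketch of computing it per dataset and macro-averaging across subsets follows; the subset names and labels are placeholders for illustration, not LLM-AggreFact results.

```python
# Hedged sketch: per-dataset balanced accuracy plus a macro-average across subsets.
# Subset names and gold/predicted labels are placeholders, not real benchmark data.
from sklearn.metrics import balanced_accuracy_score

predictions = {
    "subset_a": ([1, 0, 1, 1], [1, 0, 0, 1]),   # (gold, predicted) support labels
    "subset_b": ([0, 1, 1, 0], [0, 1, 1, 1]),
    "subset_c": ([1, 1, 0, 0], [1, 0, 0, 0]),
}

per_dataset = {
    name: balanced_accuracy_score(gold, pred)
    for name, (gold, pred) in predictions.items()
}
macro_average = sum(per_dataset.values()) / len(per_dataset)

for name, score in per_dataset.items():
    print(f"{name}: {score:.3f}")
print(f"Average balanced accuracy: {macro_average:.3f}")
```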

Implications and Future Directions

The findings presented carry both practical and theoretical implications for the development and deployment of LLMs. Practically, MiniCheck offers a viable solution for integrating robust fact-checking mechanisms into LLM applications without incurring prohibitive costs. Theoretically, the use of synthetic data for training fact-checkers opens new avenues for model training, particularly in scenarios where error types are complex and diverse.
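
As a hedged illustration of such an integration, a retrieval-augmented pipeline could flag unsupported sentences in a generated answer before surfacing it. The sketch below reuses the illustrative is_claim_supported() helper from above (here imported from a hypothetical my_checker module), and the sentence splitting is deliberately naive.

```python
# Hedged sketch: sentence-level grounding check for a RAG-style answer.
# my_checker is a hypothetical module wrapping the is_claim_supported() sketch above.
import re

from my_checker import is_claim_supported


def flag_unsupported_sentences(retrieved_docs: list[str], answer: str) -> list[str]:
    """Return the sentences of the answer that the checker deems ungrounded."""
    evidence = "\n\n".join(retrieved_docs)
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    return [s for s in sentences if not is_claim_supported(evidence, s)]


retrieved_docs = ["The meeting was moved to Tuesday at 3pm in Room 204."]
generated_answer = "The meeting is on Tuesday afternoon. It will be held in Room 310."

for sentence in flag_unsupported_sentences(retrieved_docs, generated_answer):
    print("Not grounded in the retrieved evidence:", sentence)
```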

Speculatively, as LLMs continue to evolve, efficient and effective fact-checking will only become more critical. Future research may extend the MiniCheck approach to multilingual settings, tackle multi-document reasoning for more comprehensive fact-checking, and further optimize the trade-off between model size, accuracy, and operational cost.

Conclusion

Through careful synthetic data generation and comprehensive benchmarking, this work advances the state of fact-checking for LLM-generated content. MiniCheck demonstrates that accurate fact-checking does not require prohibitive computational cost, offering a practical path for researchers and practitioners aiming to improve the reliability of LLM outputs across a wide range of applications.

Authors (3)
  1. Liyan Tang (12 papers)
  2. Philippe Laban (40 papers)
  3. Greg Durrett (117 papers)
Citations (46)