GenAudit: Fixing Factual Errors in Language Model Outputs with Evidence (2402.12566v2)
Abstract: LLMs can generate factually incorrect statements even when provided access to reference documents. Such errors can be dangerous in high-stakes applications (e.g., document-grounded QA for healthcare or finance). We present GenAudit, a tool intended to assist in fact-checking LLM responses for document-grounded tasks. GenAudit suggests edits to the LLM response by revising or removing claims that are not supported by the reference document, and it also presents evidence from the reference for facts that do appear to have support. We train models to perform these tasks and design an interactive interface that presents the suggested edits and evidence to users. A comprehensive evaluation by human raters shows that GenAudit can detect errors in the outputs of 8 different LLMs summarizing documents from diverse domains. To ensure that most errors are flagged by the system, we propose a method that increases error recall while minimizing the impact on precision. We release our tool (GenAudit) and fact-checking model for public use.
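To make the abstract's interface concrete, here is a minimal sketch of the input/output contract a GenAudit-style checker exposes: every response sentence comes back either with supporting evidence indices or with a suggested revision or removal. The `CheckedClaim` structure and the word-overlap heuristic are illustrative stand-ins, not the paper's trained model.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class CheckedClaim:
    original: str            # sentence from the LLM response
    revision: Optional[str]  # suggested rewrite; None means "remove the claim"
    evidence: List[int] = field(default_factory=list)  # indices of supporting reference sentences

def check_response(reference_sents: List[str], response_sents: List[str]) -> List[CheckedClaim]:
    """Toy stand-in for the trained checker: call a claim supported when it
    shares at least half of its content words with some reference sentence."""
    results = []
    for sent in response_sents:
        words = {w.lower().strip(".,") for w in sent.split() if len(w) > 3}
        support = [
            i for i, ref in enumerate(reference_sents)
            if words and len(words & {w.lower().strip(".,") for w in ref.split()}) >= max(1, len(words) // 2)
        ]
        if support:
            results.append(CheckedClaim(sent, sent, support))  # supported: keep sentence, attach evidence
        else:
            results.append(CheckedClaim(sent, None))           # unsupported: suggest removal
    return results
```

A trained model would replace the overlap heuristic; the point here is only the contract, which is what the interactive interface renders as highlighted evidence and proposed edits.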
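The abstract also mentions a method for raising error recall with only a small cost to precision. The paper's specific procedure is not reproduced here, but a common realization is to tune the checker's decision threshold on held-out data. A minimal sketch, assuming the model outputs a per-claim error probability (all names hypothetical):

```python
import numpy as np

def pick_threshold(error_probs: np.ndarray, has_error: np.ndarray, min_precision: float = 0.8):
    """Sweep candidate thresholds on validation data and return the one that
    maximizes error recall subject to a floor on precision.
    error_probs: model's per-claim probability that the claim is erroneous.
    has_error:   1 if the claim truly contains an error, else 0."""
    best_t, best_recall = None, -1.0
    for t in np.unique(error_probs):
        flagged = error_probs >= t  # flag a claim as erroneous at this threshold
        tp = int(np.sum(flagged & (has_error == 1)))
        fp = int(np.sum(flagged & (has_error == 0)))
        fn = int(np.sum(~flagged & (has_error == 1)))
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision >= min_precision and recall > best_recall:
            best_t, best_recall = t, recall
    return best_t, best_recall
```

Lowering the threshold flags more claims (higher recall) at the expense of more false alarms; constraining the sweep by a precision floor is one way to trade the two off, in the spirit of the method the abstract describes.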
Authors: Kundan Krishna, Sanjana Ramprasad, Prakhar Gupta, Byron C. Wallace, Jeffrey P. Bigham, Zachary C. Lipton