Improving Model Factuality with Fine-grained Critique-based Evaluator (2410.18359v3)
Abstract: Factuality evaluation aims to detect factual errors produced by LMs and thereby guide the development of more factual models. Toward this goal, we train a factuality evaluator, FenCE, that provides LM generators with claim-level factuality feedback. We conduct data augmentation on a combination of public judgment datasets to train FenCE to (1) generate textual critiques along with scores and (2) make claim-level judgments based on diverse source documents obtained with various tools. We then present a framework that leverages FenCE to improve the factuality of LM generators by constructing training data. Specifically, we generate a set of candidate responses, use FenCE to revise and score each response without introducing lesser-known facts, and train the generator by preferring highly scored revised responses. Experiments show that our data augmentation methods improve the evaluator's accuracy by 2.9% on LLM-AggreFact. With FenCE, we improve Llama2-7B-chat's and Llama3-8B-chat's factuality rates by 16.86% and 14.45% on FActScore, outperforming state-of-the-art factuality finetuning methods by 8.83% and 6.96%, respectively.
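For concreteness, the data construction loop described in the abstract can be sketched as follows. This is a minimal sketch, not the authors' implementation: `generate`, `fence`, and `fence_revise` are hypothetical callables standing in for the generator's sampler, FenCE's claim-level scoring-plus-critique step, and FenCE's critique-guided revision step, and the resulting pairs are assumed to feed a preference-tuning method such as DPO.

```python
# Minimal sketch of FenCE-based training-data construction. All helper names
# below are hypothetical placeholders, not the paper's API:
#   generate(prompt, k)                      -> k candidate responses
#   fence(prompt, response)                  -> (factuality score, textual critique)
#   fence_revise(prompt, response, critique) -> revision that fixes flagged claims
#                                               without introducing lesser-known facts

def build_preference_pairs(prompts, generate, fence, fence_revise, k=4):
    pairs = []
    for prompt in prompts:
        scored = []
        for response in generate(prompt, k):
            score, critique = fence(prompt, response)        # claim-level judgment
            revised = fence_revise(prompt, response, critique)
            rev_score, _ = fence(prompt, revised)            # re-score the revision
            # Keep whichever variant the evaluator rates more factual.
            scored.append(max((score, response), (rev_score, revised),
                              key=lambda t: t[0]))
        scored.sort(key=lambda t: t[0], reverse=True)
        # Prefer the highest-scored (typically revised) response over the
        # lowest-scored one, yielding triples for preference tuning (e.g., DPO).
        if scored[0][0] > scored[-1][0]:
            pairs.append({"prompt": prompt,
                          "chosen": scored[0][1],
                          "rejected": scored[-1][1]})
    return pairs
```

Taking the higher-scored of each (original, revised) pair reflects the abstract's constraint: a revision is only preferred when the evaluator actually judges it more factual than the original.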
- MS MARCO: A human generated machine reading comprehension dataset. Preprint, arXiv:1611.09268.
- FacTool: Factuality detection in generative AI – a tool augmented framework for multi-task and multi-domain scenarios. Preprint, arXiv:2307.13528.
- Visual programming for step-by-step text-to-image generation and evaluation. In Thirty-seventh Conference on Neural Information Processing Systems.
- DoLa: Decoding by contrasting layers improves factuality in large language models. In The Twelfth International Conference on Learning Representations.
- UltraFeedback: Boosting language models with scaled AI feedback. Preprint, arXiv:2310.01377.
- Wizard of Wikipedia: Knowledge-powered conversational agents. Preprint, arXiv:1811.01241.
- FaithDial: A faithful benchmark for information-seeking dialogue. Transactions of the Association for Computational Linguistics, 10.
- Evaluating attribution in dialogue systems: The BEGIN benchmark. Transactions of the Association for Computational Linguistics, 10.
- Does fine-tuning LLMs on new knowledge encourage hallucinations? Preprint, arXiv:2405.05904.
- Understanding finetuning for factual knowledge extraction. In Forty-first International Conference on Machine Learning.
- CRITIC: Large language models can self-correct with tool-interactive critiquing. In The Twelfth International Conference on Learning Representations.
- Q²: Evaluating factual consistency in knowledge-grounded dialogues via question generation and question answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Large language models can self-improve. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1051–1068, Singapore. Association for Computational Linguistics.
- TIGERScore: Towards building explainable metric for all text generation tasks. Preprint, arXiv:2310.00752.
- Ever: Mitigating hallucination in large language models through real-time verification and rectification. Preprint, arXiv:2311.09114.
- Unfamiliar finetuning examples control how language models hallucinate. Preprint, arXiv:2403.05612.
- Prometheus: Inducing fine-grained evaluation capability in language models. In The Twelfth International Conference on Learning Representations.
- Internet-augmented language models through few-shot prompting for open-domain question answering. Preprint, arXiv:2203.05115.
- Neural text generation from structured data with application to the biography domain. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
- Generative judge for evaluating alignment. In The Twelfth International Conference on Learning Representations.
- The dawn after the dark: An empirical study on factuality hallucination in large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10879–10899, Bangkok, Thailand. Association for Computational Linguistics.
- HaluEval: A large-scale hallucination evaluation benchmark for large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
- Inference-time intervention: Eliciting truthful answers from a language model. In Proceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY, USA.
- AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval.
- Self-refine: Iterative refinement with self-feedback. In Thirty-seventh Conference on Neural Information Processing Systems.
- On faithfulness and factuality in abstractive summarization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1906–1919, Online. Association for Computational Linguistics.
- Teaching language models to support answers with verified quotes. Preprint, arXiv:2203.11147.
- FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12076–12100, Singapore. Association for Computational Linguistics.
- Fine-grained hallucination detection and editing for language models. In First Conference on Language Modeling.
- Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
- Large dual encoders are generalizable retrievers. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics.
- RAGTruth: A hallucination corpus for developing trustworthy retrieval-augmented language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.
- Training language models to follow instructions with human feedback. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS ’22. Curran Associates Inc.
- Understanding factuality in abstractive summarization with FRANK: A benchmark for factuality metrics. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4812–4829, Online. Association for Computational Linguistics.
- TALM: Tool augmented language models. Preprint, arXiv:2205.12255.
- Tool learning with foundation models. Preprint, arXiv:2304.08354.
- ToolLLM: Facilitating large language models to master 16000+ real-world APIs. Preprint, arXiv:2307.16789.
- Direct preference optimization: Your language model is secretly a reward model. Preprint, arXiv:2305.18290.
- Self-critiquing models for assisting human evaluators. Preprint, arXiv:2206.05802.
- Toolformer: Language models can teach themselves to use tools. In Thirty-seventh Conference on Neural Information Processing Systems.
- Get your vitamin C! Robust fact verification with contrastive evidence. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics.
- Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.
- Trusting your evidence: Hallucinate less with context-aware decoding. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pages 783–791, Mexico City, Mexico. Association for Computational Linguistics.
- MiniCheck: Efficient fact-checking of LLMs on grounding documents. Preprint, arXiv:2404.10774.
- ToolAlpaca: Generalized tool learning for language models with 3000 simulated cases. Preprint, arXiv:2306.05301.
- Fine-tuning language models for factuality. In The Twelfth International Conference on Learning Representations.
- Foundational autoraters: Taming large language models for better automatic evaluation. Preprint, arXiv:2407.10817.
- Asking and answering questions to evaluate the factual consistency of summaries. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5008–5020, Online. Association for Computational Linguistics.
- Shepherd: A critic for language model generation. Preprint, arXiv:2308.04592.
- Long-form factuality in large language models. Preprint, arXiv:2403.18802.
- Generating sequences by learning to self-correct. In The Eleventh International Conference on Learning Representations.
- DocLens: Multi-aspect fine-grained evaluation for medical text generation. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics.
- Pride and prejudice: LLM amplifies self-bias in self-refinement. Preprint, arXiv:2402.11436.
- Siren’s song in the AI ocean: A survey on hallucination in large language models. Preprint, arXiv:2309.01219.
- A dataset for document grounded conversations. Preprint, arXiv:1809.07358.
- Fine-tuning language models from human preferences. Preprint, arXiv:1909.08593.