Fine-Grained Self-Endorsement Improves Factuality and Reasoning (2402.15631v1)
Abstract: This work studies improving LLM generations at inference time by mitigating fact-conflicting hallucinations. In particular, we propose a self-endorsement framework that leverages fine-grained, fact-level comparisons across multiple sampled responses. Compared with prior ensemble methods (Wang et al., 2022; Chen et al., 2023) that perform response-level selection, our approach better alleviates hallucinations, especially for long-form generation tasks. It can broadly benefit smaller and open-source LLMs, as it mainly relies on simple content-based comparisons. Experiments on Biographies show that our method effectively improves the factuality of generations with simple and intuitive prompts across LLMs of different scales. In addition, comprehensive analyses on TriviaQA and GSM8K demonstrate the potential of self-endorsement for broader applications.
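The abstract only describes the framework at a high level, but the core loop (sample several responses, check each fact against the other samples, and keep only well-endorsed facts when composing the final answer) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: `generate`, `split_into_facts`, and `is_supported` are assumed placeholders for an arbitrary black-box LLM call, fact extraction, and a yes/no consistency check, respectively.

```python
# Hypothetical sketch of fact-level self-endorsement (not the authors' exact code).
from typing import Callable, List


def split_into_facts(response: str) -> List[str]:
    # Placeholder: treat each sentence as one atomic fact.
    return [s.strip() for s in response.split(".") if s.strip()]


def is_supported(fact: str, other_response: str, generate: Callable[[str], str]) -> bool:
    # Placeholder verification prompt; any yes/no consistency judgment would do here.
    prompt = (
        f"Context: {other_response}\n"
        f"Is the following statement supported by the context? {fact}\n"
        "Answer yes or no:"
    )
    return generate(prompt).strip().lower().startswith("yes")


def self_endorse(
    question: str,
    generate: Callable[[str], str],
    n_samples: int = 5,
    threshold: float = 0.5,
) -> str:
    # 1) Sample several candidate responses for the same question.
    samples = [generate(question) for _ in range(n_samples)]

    # 2) Score each fact by how many of the *other* samples endorse it.
    endorsed_facts = []
    for i, response in enumerate(samples):
        others = samples[:i] + samples[i + 1:]
        for fact in split_into_facts(response):
            votes = sum(is_supported(fact, other, generate) for other in others)
            if votes / max(len(others), 1) >= threshold:
                endorsed_facts.append(fact)

    # 3) Regenerate a final answer conditioned only on well-endorsed facts.
    facts_block = "\n".join(f"- {f}" for f in endorsed_facts)
    final_prompt = (
        "Using only these facts, answer the question.\n"
        f"Facts:\n{facts_block}\n"
        f"Question: {question}"
    )
    return generate(final_prompt)
```

In this sketch, fact-level voting replaces the response-level selection used by self-consistency-style ensembles; the exact prompts, fact splitter, and endorsement threshold are assumptions.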
- GPT-4 technical report. arXiv preprint arXiv:2303.08774.
- Universal self-consistency for large language model generation. arXiv preprint arXiv:2311.17311.
- DoLa: Decoding by contrasting layers improves factuality in large language models. arXiv preprint arXiv:2309.03883.
- Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168.
- Chain-of-verification reduces hallucination in large language models.
- Attributed text generation via post-hoc research and revision. arXiv preprint arXiv:2210.08726.
- SimCSE: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894–6910.
- On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321–1330. PMLR.
- Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- Large language models can self-improve. arXiv preprint arXiv:2210.11610.
- Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2704–2713.
- Mistral 7B. arXiv preprint arXiv:2310.06825.
- TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551.
- Factuality enhanced language models for open-ended text generation. Advances in Neural Information Processing Systems, 35:34586–34599.
- Inference-time intervention: Eliciting truthful answers from a language model. arXiv preprint arXiv:2306.03341.
- Let’s verify step by step. arXiv preprint arXiv:2305.20050.
- Evaluating verifiability in generative search engines. arXiv preprint arXiv:2304.09848.
- Self-refine: Iterative refinement with self-feedback. arXiv preprint arXiv:2303.17651.
- When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9802–9822.
- SelfCheckGPT: Zero-resource black-box hallucination detection for generative large language models. arXiv preprint arXiv:2303.08896.
- FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. arXiv preprint arXiv:2305.14251.
- Self-contradictory hallucinations of large language models: Evaluation, detection and mitigation. arXiv preprint arXiv:2305.15852.
- Check your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv preprint arXiv:2302.12813.
- Reflexion: Language agents with verbal reinforcement learning. In Thirty-seventh Conference on Neural Information Processing Systems.
- Fine-tuning language models for factuality. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Math-Shepherd: Verify and reinforce LLMs step-by-step without human annotations. CoRR, abs/2312.08935.
- Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171.
- EVA-KELLM: A new benchmark for evaluating knowledge editing of LLMs. arXiv preprint arXiv:2308.09954.
- Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs.
- Large language models as optimizers. arXiv preprint arXiv:2309.03409.
- BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.
- Siren’s song in the AI ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219.
- Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. arXiv preprint arXiv:2306.05685.