Ever: Mitigating Hallucination in Large Language Models through Real-Time Verification and Rectification (2311.09114v2)
Abstract: Large language models (LLMs) have demonstrated remarkable proficiency in generating fluent text, but they often produce inaccurate or hallucinated content. This issue affects both non-retrieval-based and retrieval-augmented generation, and existing post-hoc rectification methods may fail to address hallucination errors that accumulate through the "snowballing" issue, especially in reasoning tasks. To tackle these challenges, we introduce a novel approach called Real-time Verification and Rectification (Ever). Instead of waiting until the end of the generation process to rectify hallucinations, Ever employs a real-time, step-wise generation and rectification strategy, detecting and correcting hallucinations as they occur during generation. Compared with both retrieval-based and non-retrieval-based baselines, Ever demonstrates a significant improvement in generating trustworthy and factually accurate text across a diverse range of tasks, including short-form QA, biography generation, and multi-hop reasoning.
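To make the step-wise strategy concrete, here is a minimal Python sketch of the generate-verify-rectify loop the abstract describes. All helpers (`generate_next_sentence`, `retrieve_evidence`, `is_supported`, `rectify_sentence`) are hypothetical placeholders standing in for model and retrieval calls, not the paper's actual implementation; the point is the control flow, in which each sentence is verified, and rectified if needed, before it is committed, so errors cannot snowball into later steps.

```python
"""Minimal sketch of an Ever-style real-time verification and rectification loop.

All helper functions below are hypothetical stubs, not the authors' API:
plug in real LLM, retrieval, and fact-checking calls to use this pattern.
"""

from typing import List, Optional


def generate_next_sentence(prompt: str, prefix: List[str]) -> Optional[str]:
    """Stub: ask the LLM for the next sentence given the committed prefix."""
    return None  # returning None signals end of generation


def retrieve_evidence(sentence: str) -> List[str]:
    """Stub: retrieve passages relevant to the candidate sentence."""
    return []


def is_supported(sentence: str, evidence: List[str]) -> bool:
    """Stub: check each claim in the sentence against the evidence."""
    return True


def rectify_sentence(sentence: str, evidence: List[str]) -> str:
    """Stub: rewrite the sentence so it is grounded in the evidence."""
    return sentence


def ever_generate(prompt: str, max_sentences: int = 20, max_retries: int = 2) -> str:
    """Generate sentence by sentence, verifying and rectifying each step
    before it is appended, rather than editing the full draft post hoc."""
    output: List[str] = []
    for _ in range(max_sentences):
        sentence = generate_next_sentence(prompt, output)
        if sentence is None:
            break
        for _ in range(max_retries):
            evidence = retrieve_evidence(sentence)
            if is_supported(sentence, evidence):
                break
            sentence = rectify_sentence(sentence, evidence)
        output.append(sentence)  # only verified (or rectified) text is committed
    return " ".join(output)
```

The key design choice, under these assumptions, is that verification gates each step: unlike post-hoc methods that revise a complete draft, an unsupported sentence is corrected before later sentences can condition on it.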
Authors: Haoqiang Kang, Juntong Ni, Huaxiu Yao