
Sequence-Level Certainty Reduces Hallucination In Knowledge-Grounded Dialogue Generation (2310.18794v3)

Published 28 Oct 2023 in cs.CL and cs.AI

Abstract: In this work, we propose sequence-level certainty as a common theme over hallucination in Knowledge Grounded Dialogue Generation (KGDG). We explore the correlation between the level of hallucination in model responses and two types of sequence-level certainty: probabilistic certainty and semantic certainty. Empirical results reveal that higher levels of both types of certainty in model responses are correlated with lower levels of hallucination. We further propose Certainty-based Response Ranking (CRR), a decoding-time hallucination mitigation method that samples several response candidates, ranks them based on sequence-level certainty, and outputs the response with the highest certainty level. Aligning with our definitions of sequence-level certainty, we design 2 types of CRR approaches: Probabilistic CRR (P-CRR) and Semantic CRR (S-CRR). P-CRR ranks individually sampled model responses using the arithmetic mean log-probability of the entire sequence. S-CRR approaches certainty estimation from meaning-space, and ranks model response candidates based on their semantic certainty level as measured by an entailment-based Agreement Score (AS). Through extensive experiments across 3 KGDG datasets, 3 decoding methods, and 4 KGDG models, we validate the effectiveness of CRR for reducing hallucination in the KGDG task.

Sequence-Level Certainty Reduces Hallucination in Knowledge-Grounded Dialogue Generation

The paper "Sequence-Level Certainty Reduces Hallucination In Knowledge-Grounded Dialogue Generation" offers an insightful investigation into the mitigation of hallucinations within Knowledge Grounded Dialogue Generation (KGDG) systems by exploring the concept of sequence-level certainty. The research delineates two distinct forms of sequence-level certainty: probabilistic and semantic certainty. The paper establishes a correlation between these certainties and hallucination levels in dialogue models. Additionally, the authors introduce Certainty-based Response Ranking (CRR) decoding methods, Probabilistic CRR (P-CRR), and Semantic CRR (S-CRR), which demonstrate efficacy in reducing hallucinations across multiple datasets and models.

Key Findings

  1. Sequence-Level Certainty and Hallucination: The research demonstrates that higher sequence-level certainty correlates with less hallucination in model outputs. Probabilistic certainty is measured as the arithmetic mean log-probability of the entire sequence, whereas semantic certainty relies on an entailment-based Agreement Score (AS) computed among candidate responses.
  2. CRR Decoding Methods: Two decoding strategies are introduced. P-CRR ranks candidate responses by probabilistic certainty, effectively prioritizing more likely sequences; S-CRR ranks them by entailment-based semantic agreement, focusing on the semantic reliability of the generated sequences (a minimal sketch of both appears after this list).
  3. Empirical Validation: Extensive experimentation across three KGDG datasets, three decoding methods, and four models (GPT2-small, GPT2-medium, T5-base, and OpenLlama-3B) substantiates the effectiveness of both CRR methods. Results consistently show that outputs generated with CRR exhibit significantly lower hallucination rates than those from traditional decoding methods.
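
To make the two ranking criteria concrete, here is a minimal sketch of both scoring functions, written in the spirit of the paper rather than as the authors' released code: the generator and NLI checkpoints ("gpt2", "roberta-large-mnli") and the mutual-entailment form of the Agreement Score are illustrative assumptions.

```python
# Minimal CRR sketch: score sampled candidates by sequence-level certainty and
# return the most certain one. Model choices and the exact Agreement Score
# definition are assumptions for illustration, not the paper's precise setup.
import torch
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

gen_tok = AutoTokenizer.from_pretrained("gpt2")
gen_lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()
nli_tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli_lm = AutoModelForSequenceClassification.from_pretrained(
    "roberta-large-mnli").eval()


def mean_log_prob(prompt: str, response: str) -> float:
    """P-CRR score: arithmetic mean log-probability of the response tokens.

    Assumes the prompt's tokenization is a prefix of the full tokenization,
    which holds when the response starts at a clean whitespace boundary.
    """
    prompt_len = gen_tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = gen_tok(prompt + response, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = gen_lm(full_ids).logits
    # Token t is predicted from position t-1, so shift logits against targets.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = log_probs.gather(2, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prompt_len - 1:].mean().item()  # response tokens only


def entails(premise: str, hypothesis: str) -> bool:
    """True if the NLI model predicts entailment (label 2 for this checkpoint)."""
    inputs = nli_tok(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return nli_lm(**inputs).logits.argmax(-1).item() == 2


def agreement_score(candidate: str, others: list[str]) -> float:
    """S-CRR proxy: fraction of other candidates mutually entailed by this one."""
    agree = sum(entails(candidate, o) and entails(o, candidate) for o in others)
    return agree / max(len(others), 1)


def crr_rank(prompt: str, candidates: list[str], mode: str = "p") -> str:
    """Output the candidate with the highest sequence-level certainty."""
    if mode == "p":  # Probabilistic CRR
        scores = [mean_log_prob(prompt, c) for c in candidates]
    else:  # Semantic CRR
        scores = [agreement_score(c, candidates[:i] + candidates[i + 1:])
                  for i, c in enumerate(candidates)]
    return candidates[max(range(len(candidates)), key=scores.__getitem__)]
```

Because P-CRR can reuse log-probabilities already computed during sampling, while S-CRR requires pairwise entailment checks that grow quadratically with the number of candidates, the relative overhead of the two variants can differ considerably in practice.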

Implications and Future Directions

The insights from this paper are significant for the development of more reliable and coherent dialogue systems. The proposed methods address hallucination, a critical hurdle to deploying models in practical applications. By integrating sequence-level certainty, future dialogue generation models can better align generated responses with input knowledge, thus enhancing user satisfaction and trust.

Potential future research avenues include integrating CRR methods with more advanced generative models, such as larger pre-trained LLMs or models with multi-modal inputs. Extending these concepts to other NLG tasks, such as abstractive summarization or machine translation, could also prove beneficial. Further work could refine CRR approaches to improve computational efficiency, given the added resource demands of sampling and ranking multiple candidates.

Overall, this paper provides a thorough approach to reducing hallucination in dialogue systems, offering a pathway for researchers to develop more accurate and faithful language generation models.

Authors (4)
  1. Yixin Wan
  2. Fanyou Wu
  3. Weijie Xu
  4. Srinivasan H. Sengamedu