UHGEval: Benchmarking the Hallucination of Chinese Large Language Models via Unconstrained Generation (2311.15296v3)

Published 26 Nov 2023 in cs.CL

Abstract: LLMs have emerged as pivotal contributors in contemporary natural language processing and are increasingly being applied across a diverse range of industries. However, these large-scale probabilistic statistical models cannot currently ensure the requisite quality in professional content generation. These models often produce hallucinated text, compromising their practical utility in professional contexts. To assess the authentic reliability of LLMs in text generation, numerous initiatives have developed benchmark evaluations for hallucination phenomena. Nevertheless, these benchmarks frequently utilize constrained generation techniques due to cost and temporal constraints. These techniques encompass the use of directed hallucination induction and strategies that deliberately alter authentic text to produce hallucinations. These approaches are not congruent with the unrestricted text generation demanded by real-world applications. Furthermore, a well-established Chinese-language dataset dedicated to the evaluation of hallucinations in text generation is presently lacking. Consequently, we have developed an Unconstrained Hallucination Generation Evaluation (UHGEval) benchmark, designed to compile outputs produced with minimal restrictions by LLMs. Concurrently, we have established a comprehensive benchmark evaluation framework to aid subsequent researchers in undertaking scalable and reproducible experiments. We have also executed extensive experiments, evaluating prominent Chinese LLMs and the GPT series models to derive professional performance insights regarding hallucination challenges.

Benchmarking Hallucinations in Chinese LLMs with UHGEval

The paper presents an in-depth exploration of large language models' (LLMs') tendency to produce hallucinated text, which limits their reliability in professional applications. Addressing a critical gap in the assessment of hallucinatory phenomena, the authors introduce UHGEval, an Unconstrained Hallucination Generation Evaluation benchmark that specifically targets Chinese LLMs' outputs within a news context.

Overview and Methodological Contributions

The paper critiques existing benchmarks that utilize constrained generation techniques due to cost and time constraints, arguing that these methods fall short of reflecting real-world applications where generation is typically unrestricted. Such benchmarks often employ induced hallucination or alter legitimate texts to fabricate hallucinations, which may not accurately mimic genuine creative or generative errors.

A significant contribution of this work is the introduction of a comprehensive Chinese-language dataset that captures hallucinations in an unconstrained manner. The dataset comprises over 5,000 annotated items derived from historical news articles, categorized into document-intensive, number-intensive, knowledge-intensive, and general news. This categorization recognizes the different ways hallucinations might manifest across various content types.
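To make this categorization concrete, the sketch below shows how a single annotated item might be represented. The field names and values are hypothetical, chosen for exposition rather than taken from the released dataset's actual schema.

```python
# Illustrative (hypothetical) structure for a single UHGEval-style item.
# Field names are assumptions for exposition; the released dataset may
# use different keys and additional metadata.
example_item = {
    "id": "num_000001",                      # unique identifier
    "category": "number-intensive",          # one of the four news categories
    "news_beginning": "...",                 # real opening of the source article
    "hallucinated_continuation": "...",      # LLM continuation flagged as hallucinated
    "real_continuation": "...",              # the article's actual continuation
    "annotations": ["<keyword>: <reason>"],  # human-checked labels per keyword
}
```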

Dataset Construction

The dataset construction includes a two-stage annotation process: a hallucination ranking followed by automatic labeling and human rechecking. The hallucination ranking algorithm prioritizes fluency and the likelihood of hallucination, selecting text candidates that strike a balance between coherence and hallucinatory potential. The authors propose the keyword precision (kwPrec) metric as a superior alternative to traditional BLEU and ROUGE scoring methods, arguing it better identifies essential fact-related inaccuracies.
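As a rough illustration of the kwPrec idea, the sketch below computes the fraction of keywords extracted from a generated continuation that also appear in the reference text. The use of jieba's TF-IDF keyword extraction and the `top_k` cutoff are assumptions made for this sketch; the paper's actual keyword extraction and matching rules may differ.

```python
# Minimal sketch of a keyword-precision (kwPrec) style metric, assuming
# jieba's TF-IDF tags as a stand-in for the paper's keyword extraction.
import jieba.analyse

def kw_precision(generated: str, reference: str, top_k: int = 20) -> float:
    """Fraction of keywords in the generated text that also occur in the reference."""
    keywords = jieba.analyse.extract_tags(generated, topK=top_k)
    if not keywords:
        return 0.0
    matched = sum(1 for kw in keywords if kw in reference)
    return matched / len(keywords)
```

Unlike BLEU's uniform n-gram overlap, a keyword-based precision of this kind concentrates on fact-bearing tokens, which is why the authors argue it is better suited to flagging factual inaccuracies.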

The authors ensured the inclusivity of multiple LLMs in dataset creation, incorporating five Chinese models to generate hallucinations, thus providing greater diversity and reducing the risk of model-specific biases.

Evaluation Framework

The proposed evaluation framework includes discriminative, selective, and generative evaluations. Discriminative evaluation asks LLMs to judge whether a given continuation contains hallucinations; selective evaluation requires choosing between candidate continuations with and without hallucinations; and generative evaluation analyzes LLM-generated continuations for hallucinated content using reference-based techniques.
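The sketch below shows, under stated assumptions, how these three modes could be wired up. Here `query_llm` is a hypothetical callable standing in for whatever API the evaluated model exposes, and the prompts are illustrative rather than the framework's actual templates.

```python
# Hedged sketch of the three evaluation modes; prompts, parsing, and the
# `query_llm` interface are assumptions, not UHGEval's real implementation.
from typing import Callable, Optional

QueryFn = Callable[[str], str]

def discriminative_eval(query_llm: QueryFn, continuation: str) -> bool:
    """Ask the model whether a single continuation contains hallucinated facts."""
    reply = query_llm(
        "Does the following news continuation contain hallucinated facts? "
        f"Answer yes or no.\n{continuation}"
    )
    return reply.strip().lower().startswith("yes")

def selective_eval(query_llm: QueryFn, option_a: str, option_b: str) -> Optional[str]:
    """Ask the model which continuation is accurate; returns "A", "B", or None if unparsed."""
    reply = query_llm(
        "One of the following continuations contains hallucinated facts. "
        "Reply with only the letter of the accurate one, A or B.\n"
        f"A: {option_a}\nB: {option_b}"
    )
    first = reply.strip().upper()[:1]
    return first if first in ("A", "B") else None

def generative_eval(scorer: Callable[[str, str], float],
                    generated: str, reference: str) -> float:
    """Score a free-form continuation against a reference, e.g. with kwPrec."""
    return scorer(generated, reference)
```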

In the empirical analysis, the authors tested eight prominent Chinese LLMs and three GPT series models, offering significant insights into their hallucination dynamics. Notably, they report that most models performed better on number-intensive and general news than on the other categories, indicating that hallucination behavior varies markedly with content type.

Discussion and Implications

The findings suggest that Chinese LLMs, especially domain-specific ones like Xinyu2-70B, excel in selective evaluation, indicative of their robustness in narrower contexts such as news domains. Generative evaluation posed challenges, revealing the inherent difficulty for LLMs in producing factually correct and coherent continuations without hallucination.

By aligning evaluative tasks with real-world scenarios, this research underscores the need to improve LLM training methods, potentially guiding enhancements in knowledge integration and retrieval processes. It also indicates that leveraging domain-specific knowledge can significantly improve the factual accuracy of LLM outputs.

Conclusion and Future Directions

This paper's contributions in constructing and evaluating Chinese LLMs through UHGEval establish a rigorous standard for assessing hallucinated content generation without constraints, paving the way for more reliable LLM applications in professional fields such as journalism and academia. Future research will likely focus on expanding this benchmark across other languages and domains, enhancing LLMs' ability to generate reliable, contextually appropriate content across various applications.

Authors (11)
  1. Xun Liang (12 papers)
  2. Shichao Song (19 papers)
  3. Simin Niu (15 papers)
  4. Zhiyu Li (69 papers)
  5. Feiyu Xiong (53 papers)
  6. Bo Tang (111 papers)
  7. Dawei He (13 papers)
  8. Peng Cheng (229 papers)
  9. Zhonghao Wang (20 papers)
  10. Haiying Deng (5 papers)
  11. Yezhaohui Wang (6 papers)
Citations (15)