CritiqueLLM: Towards an Informative Critique Generation Model for Evaluation of Large Language Model Generation (2311.18702v2)
Abstract: Since the NLP community started to make LLMs act as critics to evaluate the quality of generated texts, most existing works train a critique generation model on evaluation data labeled via direct prompting of GPT-4. We observe that these models lack the ability to generate informative critiques in both pointwise grading and pairwise comparison, especially without references. As a result, their generated critiques cannot distinguish generated texts at a fine-grained level, leading to unsatisfactory evaluation performance. In this paper, we propose a simple yet effective method called Eval-Instruct, which first acquires pointwise grading critiques with pseudo references and then revises these critiques via multi-path prompting to obtain informative evaluation data across tasks and settings, including pointwise grading and pairwise comparison with/without references. After fine-tuning on these data, the resulting model CritiqueLLM is empirically shown to outperform ChatGPT and all open-source baselines, and even achieves evaluation performance comparable to GPT-4 in system-level correlations of pointwise grading. We also demonstrate that our generated critiques can act as scalable feedback to further improve the generation quality of strong LLMs like ChatGPT.
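To make the evaluation settings mentioned in the abstract concrete, the sketch below illustrates what pointwise grading and pairwise comparison with or without references typically look like for an LLM-as-critic setup. This is a minimal, hypothetical illustration: the prompt wording, function names, and the `generate` callable are assumptions for exposition, not the paper's actual Eval-Instruct prompts.

```python
# Hypothetical sketch of the two critique settings (pointwise / pairwise),
# each usable with or without a reference answer. Not the paper's prompts.
from typing import Callable, Optional


def pointwise_prompt(question: str, answer: str,
                     reference: Optional[str] = None) -> str:
    """Pointwise grading: the critic writes a critique, then a 1-10 score."""
    ref = f"\nReference answer:\n{reference}\n" if reference else "\n"
    return (
        "You are an impartial judge. Evaluate the response to the question below."
        f"\nQuestion:\n{question}\n{ref}"
        f"Response:\n{answer}\n\n"
        "Write a detailed critique of the response, then end with a line "
        "'Rating: <score>' where <score> is an integer from 1 to 10."
    )


def pairwise_prompt(question: str, answer_a: str, answer_b: str,
                    reference: Optional[str] = None) -> str:
    """Pairwise comparison: the critic contrasts two responses and picks a winner."""
    ref = f"\nReference answer:\n{reference}\n" if reference else "\n"
    return (
        "You are an impartial judge. Compare the two responses to the question below."
        f"\nQuestion:\n{question}\n{ref}"
        f"Response A:\n{answer_a}\n\nResponse B:\n{answer_b}\n\n"
        "Write a critique contrasting the two responses, then end with a line "
        "'Verdict: A', 'Verdict: B', or 'Verdict: Tie'."
    )


def critique(generate: Callable[[str], str], prompt: str) -> str:
    """Run any text-generation backend (e.g. a fine-tuned critique model)."""
    return generate(prompt)


if __name__ == "__main__":
    # Dummy backend just to show the call pattern; replace with a real model.
    dummy = lambda p: "This is where the model's critique would appear."
    print(critique(dummy, pointwise_prompt("What causes tides?",
                                           "The Moon's gravity.")))
```

The reference-free variants simply omit the reference block, which is the harder setting the abstract highlights as lacking informative critiques in prior models.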
- Pei Ke
- Bosi Wen
- Zhuoer Feng
- Xiao Liu
- Xuanyu Lei
- Jiale Cheng
- Shengyuan Wang
- Aohan Zeng
- Yuxiao Dong
- Hongning Wang
- Jie Tang
- Minlie Huang