Are LLM-based Evaluators Confusing NLG Quality Criteria? (2402.12055v2)

Published 19 Feb 2024 in cs.CL

Abstract: Some prior work has shown that LLMs perform well in NLG evaluation across different tasks. However, we discover that LLMs appear to confuse different evaluation criteria, which reduces their reliability. To verify this, we first address the inconsistent conceptualization and vague expression found in existing NLG quality criteria themselves: we summarize a clear hierarchical classification system covering 11 common aspects, along with the corresponding criteria used in previous studies. Inspired by behavioral testing, we carefully design 18 types of aspect-targeted perturbation attacks for fine-grained analysis of the evaluation behaviors of different LLMs. We also conduct human annotations, guided by the classification system, to validate the impact of the perturbations. Our experimental results reveal confusion issues inherent in LLMs, as well as other noteworthy phenomena, and motivate further research on and improvements to LLM-based evaluation.
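The paper's 18 perturbation types are not enumerated in this abstract. As a hypothetical illustration of what an aspect-targeted perturbation might look like, the sketch below degrades exactly one quality aspect while leaving others intact: shuffling sentence order harms coherence but not per-sentence fluency, so a reliable evaluator should lower only its coherence score. The function names and the sentence-splitting heuristic are illustrative assumptions, not the paper's implementation.

```python
import random

def perturb_coherence(text: str, seed: int = 0) -> str:
    """Shuffle sentence order: coherence degrades, but each
    individual sentence stays fluent (hypothetical illustration)."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rng = random.Random(seed)
    shuffled = sentences[:]
    # Re-shuffle until the order actually changes (possible when >1 sentence).
    while shuffled == sentences and len(sentences) > 1:
        rng.shuffle(shuffled)
    return ". ".join(shuffled) + "."

def perturb_fluency(text: str, seed: int = 0) -> str:
    """Shuffle word order within each sentence: fluency degrades,
    but the content words, and hence much of the meaning, remain
    (hypothetical illustration)."""
    rng = random.Random(seed)
    out = []
    for sent in text.split("."):
        words = sent.split()
        if len(words) > 1:
            shuffled = words[:]
            while shuffled == words:
                rng.shuffle(shuffled)
            out.append(" ".join(shuffled))
        elif words:
            out.append(words[0])
    return ". ".join(out) + "."
```

In a behavioral test of this kind, each perturbed text is scored by the LLM evaluator on all aspects; a score drop on an untargeted aspect (e.g., a lower fluency score after only sentence order changed) is evidence that the evaluator confuses criteria.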

Authors (7)
  1. Xinyu Hu (32 papers)
  2. Mingqi Gao (29 papers)
  3. Sen Hu (32 papers)
  4. Yang Zhang (1129 papers)
  5. Yicheng Chen (24 papers)
  6. Teng Xu (21 papers)
  7. Xiaojun Wan (99 papers)
Citations (8)