Large Language Models are Inconsistent and Biased Evaluators (2405.01724v1)

Published 2 May 2024 in cs.CL and cs.AI

Abstract: The zero-shot capability of LLMs has enabled highly flexible, reference-free metrics for various tasks, making LLM evaluators common tools in NLP. However, the robustness of these LLM evaluators remains relatively understudied; existing work mainly pursued optimal performance in terms of correlating LLM scores with human expert scores. In this paper, we conduct a series of analyses using the SummEval dataset and confirm that LLMs are biased evaluators as they: (1) exhibit familiarity bias-a preference for text with lower perplexity, (2) show skewed and biased distributions of ratings, and (3) experience anchoring effects for multi-attribute judgments. We also found that LLMs are inconsistent evaluators, showing low "inter-sample" agreement and sensitivity to prompt differences that are insignificant to human understanding of text quality. Furthermore, we share recipes for configuring LLM evaluators to mitigate these limitations. Experimental results on the RoSE dataset demonstrate improvements over the state-of-the-art LLM evaluators.

This paper, "Large Language Models are Inconsistent and Biased Evaluators" (Stureborg et al., 2 May 2024), investigates the robustness of LLMs when used as automatic evaluators, often referred to as "LLM-as-a-judge". While LLM evaluators offer flexibility and reference-free assessment, the paper highlights significant biases and inconsistencies that are often overlooked in favor of focusing solely on correlation with human scores. The authors conduct extensive analyses with GPT-3.5 and GPT-4 on the SummEval dataset, followed by a case study applying the proposed mitigation strategies to the RoSE dataset.

The core problem addressed is the lack of understanding and mitigation of intrinsic flaws in LLM-based evaluation, which can lead to unreliable judgments, especially in sensitive applications. The paper identifies several specific biases and inconsistencies:

  1. Familiarity Bias: LLMs show a stronger preference for text with lower perplexity (more predictable/familiar) than human experts. This is demonstrated by analyzing the average perplexity of summaries assigned different scores by LLMs and humans, which shows a more pronounced negative correlation for LLMs (Figure 2, Table 1); a code sketch of this kind of perplexity analysis appears after this list.
  2. Scoring Granularity and Score Biases:
    • When instructed to use a wide scoring range (e.g., 1-100), LLMs fail to utilize the full range, clustering scores within a narrow band (e.g., 70-100 in Figure 3).
    • LLMs exhibit a "round number bias," disproportionately assigning scores that are multiples of 5 or 10 (e.g., 90, 95 are much more frequent than 92, 93 in Figure 3).
    • Experiments with different scoring scales (1-5, 1-10, 1-100, with modifiers or sample averaging) show varying performance. The 1-10 integer scale achieved the best average correlation with human judgments on SummEval (Table 2), suggesting diminishing returns or even harm from attempts to force higher granularity through complex scales or averaging multiple samples at high temperature.
  3. Anchoring Effect in Multiple Judgments: When evaluating multiple attributes simultaneously (e.g., Coherence, Consistency, Fluency, Relevance), the score assigned to one attribute significantly biases the scores assigned to subsequent attributes in the same output. The order of evaluation matters, and later attributes show degraded correlation with human judgments (Figure 4, Table 3). This is attributed to the auto-regressive nature of LLMs.
  4. Self-Inconsistency: LLMs demonstrate lower "inter-sample" agreement (consistency across multiple generations for the same input and prompt) compared to human inter-annotator agreement (Table 4, Table 7). Their judgments can be sensitive to minor prompt variations or temperature settings.
  5. Sensitivity to Temperature and Chain-of-Thought (CoT): Standard prompt engineering advice (low temperature for deterministic output, CoT for better reasoning) doesn't always translate well to LLM evaluation. Non-CoT prompts perform better at low temperatures (close to 0), while CoT prompts benefit from moderate temperatures (around 0.5), although performance drops off at very high temperatures (Figure 6, Figure 11). Notably, a single generation at temperature 0 with a non-CoT prompt proved effective in their experiments.
  6. Sensitivity to Source Document: The presence of the source document impacts ratings even for attributes like Fluency that should theoretically only depend on the summary text itself (Table 5). This suggests the model might be relying on spurious correlations with the source rather than a pure assessment of the summary's quality on that dimension.
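
As a concrete illustration of finding (1), below is a minimal sketch of a perplexity-by-score analysis in Python. It assumes GPT-2 as the perplexity model and placeholder `summaries`/`scores` lists; the paper does not prescribe this exact setup.

```python
# Sketch of the kind of analysis behind the familiarity-bias finding:
# average perplexity of summaries grouped by the score they received.
# Assumes GPT-2 as the perplexity model (an illustrative choice).
from collections import defaultdict
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    """Token-level perplexity of `text` under GPT-2."""
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    with torch.no_grad():
        loss = model(**enc, labels=enc["input_ids"]).loss  # mean token NLL
    return torch.exp(loss).item()

def mean_perplexity_by_score(summaries, scores):
    """Average perplexity of summaries, bucketed by the assigned score."""
    buckets = defaultdict(list)
    for summ, score in zip(summaries, scores):
        buckets[score].append(perplexity(summ))
    return {s: sum(v) / len(v) for s, v in sorted(buckets.items())}

# A steeper decline in mean perplexity as LLM scores increase, relative to
# the same curve for human scores, is the familiarity-bias signature.
```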

Based on these findings, the paper proposes a set of practical recipes to mitigate these issues when using LLMs for evaluation (Table 6), with a code sketch combining them shown after the list:

  • Scoring Scale: Use a 1-10 integer scale rather than 1-5, 1-100, or scales with complex modifiers.
  • Temperature and CoT: Use non-CoT prompting and set the temperature to 0 for deterministic single-sample generation.
  • Source Document: Always include the source document, even for attributes that might seem independent of it.
  • Attribute Judgment: Predict only one attribute (e.g., Coherence, Consistency, Fluency, Relevance) per generation call to avoid anchoring effects.
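
A minimal sketch of how these recipes might be combined into a single evaluator call, using the OpenAI Python SDK, is shown below. The prompt wording, the `gpt-4-turbo` model name, and the `rate_attribute`/`ATTRIBUTE_DEFINITIONS` names are illustrative assumptions, not the paper's exact implementation.

```python
# Sketch of an evaluator call following the paper's recipes: 1-10 integer
# scale, non-CoT prompt, temperature 0, one attribute per call, and the
# source document included. Prompt text is illustrative only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rate_attribute(source: str, summary: str, attribute: str, definition: str) -> str:
    system = (
        f"You will be given a source document and a summary. "
        f"Rate the summary's {attribute} on an integer scale from 1 to 10.\n"
        f"{attribute}: {definition}"
    )
    user = (
        f"Source Document:\n{source}\n\n"
        f"Summary:\n{summary}\n\n"
        f"Evaluation Form (Scores ONLY):\n{attribute}:"
    )
    resp = client.chat.completions.create(
        model="gpt-4-turbo",   # model name may differ in your deployment
        temperature=0,         # deterministic single sample, no averaging
        max_tokens=4,          # stop early: we only want the score, no CoT
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

# One call per attribute to avoid anchoring across judgments, e.g.:
# for attr, defn in ATTRIBUTE_DEFINITIONS.items():   # hypothetical dict
#     raw = rate_attribute(doc, summ, attr, defn)
```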

The authors conducted a case study on the RoSE dataset to evaluate the effectiveness of these recipes. They implemented their approach using GPT-4-Turbo with the proposed recipes (1-10 scale, non-CoT, temperature 0, single generation, source document included, one attribute per call, evaluation steps/definition included) and compared it against re-implementations of two previous LLM evaluation methods, G-Eval (Liu et al., 2023) and the approach of Chiang and Lee (2023). The results on RoSE (Figure 7) show that their method statistically significantly outperformed the state-of-the-art method of Chiang and Lee (2023) on the in-domain CNNDM partition and G-Eval on both the CNNDM and SAMSum partitions. This empirical study verifies that applying these simple mitigation strategies can improve correlation with human judgments.
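
The comparison above ultimately reduces to correlating evaluator scores with human judgments. The sketch below shows a simple summary-level Kendall/Spearman correlation over hypothetical parallel score arrays; the paper's exact aggregation and significance test are not reproduced here.

```python
# Sketch of the basic correlation check used to compare evaluators.
# `llm_scores` and `human_scores` are hypothetical parallel arrays.
from scipy.stats import kendalltau, spearmanr

llm_scores = [7, 9, 4, 8, 6]                 # placeholder evaluator ratings (1-10)
human_scores = [0.6, 0.9, 0.2, 0.7, 0.5]     # placeholder human judgments (e.g., RoSE ACU scores)

tau, _ = kendalltau(llm_scores, human_scores)
rho, _ = spearmanr(llm_scores, human_scores)
print(f"Kendall tau={tau:.3f}, Spearman rho={rho:.3f}")
```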

Implementation Considerations:

  • Prompting: The prompt structure described (a system prompt defining the task and scale, and a user prompt providing the document/summary and requesting the score in a specific format, e.g., "Evaluation Form (Scores ONLY): {metric}:") is crucial. Parsing the exact numerical score from the LLM's output requires careful handling, potentially with a regular expression that extracts the first digit(s); the authors mention stopping generation early and parsing the output to find the score. A parsing sketch follows this list.
  • API Calls: Implementing single-attribute evaluation per generation means making multiple API calls per summary if evaluating multiple dimensions. This increases cost and latency compared to predicting all attributes in one call, but is necessary to avoid anchoring bias.
  • Model Choice: The analysis primarily uses GPT models. While the findings and recipes are based on these models, the extent to which they generalize to other open-source or proprietary LLMs requires further investigation. The paper notes this as a limitation.
  • Dataset Dependency: The analysis relies heavily on SummEval. While the case study on RoSE confirms some findings, the specific optimal settings or the severity of biases might vary across tasks and datasets.
  • Cost: Generating a large volume of evaluations, especially with multi-sample averaging or multiple API calls per summary, can incur substantial costs when using paid APIs like OpenAI's. The choice of temperature 0 and single generation helps manage cost.
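
Tying the parsing and per-attribute notes together, the sketch below shows one plausible way to extract an integer score with a regular expression and to loop over attributes with one call each. The helper names and the clamping behavior are assumptions rather than the authors' code; `rate_fn` stands in for a call like the `rate_attribute` sketch above.

```python
# Sketch of score parsing and per-attribute evaluation loops.
import re
from typing import Callable, Dict, Optional

ATTRIBUTES = ["Coherence", "Consistency", "Fluency", "Relevance"]

def parse_score(raw: str, lo: int = 1, hi: int = 10) -> Optional[int]:
    """Extract the first integer from the model output and clamp it to the scale."""
    match = re.search(r"\d+", raw)
    if match is None:
        return None                      # caller decides how to handle a parse miss
    return min(max(int(match.group()), lo), hi)

def evaluate_summary(rate_fn: Callable[[str, str, str, str], str],
                     source: str, summary: str,
                     definitions: Dict[str, str]) -> Dict[str, Optional[int]]:
    """One API call per attribute: costlier and slower, but avoids anchoring."""
    return {attr: parse_score(rate_fn(source, summary, attr, definitions[attr]))
            for attr in ATTRIBUTES}

# Usage with the rate_attribute sketch above (hypothetical definitions dict):
# scores = evaluate_summary(rate_attribute, doc, summ, ATTRIBUTE_DEFINITIONS)
```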

In summary, the paper provides a critical look at the reliability of LLMs as evaluators, detailing specific biases and inconsistencies. It offers actionable implementation recipes, backed by experimental results, showing how to configure LLM evaluators to achieve better alignment with human judgments, particularly advocating for a 1-10 scale, low-temperature non-CoT prompting, and evaluating attributes separately.

References (52)
  1. Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, Ann Arbor, Michigan. Association for Computational Linguistics.
  2. Re-evaluating evaluation in text summarization. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9347–9359, Online. Association for Computational Linguistics.
  3. Chateval: Towards better llm-based evaluators through multi-agent debate.
  4. Exploring the use of large language models for reference-free text quality evaluation: A preliminary empirical study.
  5. Polyie: A dataset of information extraction from polymer material scientific literature.
  6. Cheng-Han Chiang and Hung-yi Lee. 2023. A closer look into using large language models for automatic evaluation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 8928–8942, Singapore. Association for Computational Linguistics.
  7. Nikolas Coupland. 2011. How frequent are numbers? Language & Communication, 31(1):27–37.
  8. On the limitations of reference-free evaluations of generated text. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10960–10977, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  9. SummEval: Re-evaluating summarization evaluation. Transactions of the Association for Computational Linguistics, 9:391–409.
  10. Findings of the WMT 2019 shared tasks on quality estimation. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), pages 1–10, Florence, Italy. Association for Computational Linguistics.
  11. Gptscore: Evaluate as you desire.
  12. Human-like summarization evaluation with chatgpt.
  13. SAMSum corpus: A human-annotated dialogue dataset for abstractive summarization. In Proceedings of the 2nd Workshop on New Frontiers in Summarization. Association for Computational Linguistics.
  14. On the round number bias and wisdom of crowds in different response formats for numerical estimation. Scientific Reports, 12(1):8167.
  15. An empirical study of llm-as-a-judge for llm evaluation: Fine-tuned judge models are task-specific classifiers.
  16. Zdeněk Kasner and Ondřej Dušek. 2024. Beyond reference-based metrics: Analyzing behaviors of open llms on data-to-text generation.
  17. Tom Kocmi and Christian Federmann. 2023. Large language models are state-of-the-art evaluators of translation quality.
  18. Benchmarking cognitive biases in large language models as evaluators.
  19. Leveraging large language models for nlg evaluation: A survey.
  20. Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
  21. Yen-Ting Lin and Yun-Nung Chen. 2023. Llm-eval: Unified multi-dimensional automatic evaluation for open-domain conversations with large language models.
  22. G-eval: Nlg evaluation using gpt-4 with better human alignment.
  23. LLMs as narcissistic evaluators: When ego inflates evaluation scores.
  24. Revisiting the gold standard: Grounding summarization evaluation with robust human evaluation. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4140–4170, Toronto, Canada. Association for Computational Linguistics.
  25. Annie Louis and Ani Nenkova. 2013. Automatically assessing machine summary content without a gold standard. Computational Linguistics, 39(2):267–300.
  26. Chatgpt as a factual inconsistency evaluator for text summarization.
  27. Abstractive text summarization using sequence-to-sequence rnns and beyond.
  28. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization.
  29. Tomoko Nemoto and David Beglar. 2014. Likert-scale questionnaires. In JALT 2013 conference proceedings, pages 1–8.
  30. Likelihood-based mitigation of evaluation bias in large language models.
  31. Llm evaluators recognize and favor their own generations.
  32. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
  33. Are references really needed? unbabel-IST 2021 submission for the metrics shared task. In Proceedings of the Sixth Conference on Machine Translation, pages 1030–1040, Online. Association for Computational Linguistics.
  34. QuestEval: Summarization asks for fact-based evaluation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6594–6604, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  35. Answers unite! unsupervised metrics for reinforced summarization models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3246–3256, Hong Kong, China. Association for Computational Linguistics.
  36. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online. Association for Computational Linguistics.
  37. Large language models are not yet human-level evaluators for abstractive summarization. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4215–4233, Singapore. Association for Computational Linguistics.
  38. Characterizing the confidence of large language model-based automatic evaluation metrics. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), pages 76–89, St. Julian’s, Malta. Association for Computational Linguistics.
  39. Manoj Thomas and Vicki Morwitz. 2009. Heuristics in numerical cognition: Implications for pricing. In Handbook of pricing research in marketing, pages 132–149. Edward Elgar Publishing.
  40. Amos Tversky and Daniel Kahneman. 1974. Judgment under uncertainty: Heuristics and biases. Science, 185(4157):1124–1131.
  41. Fill in the BLANC: Human-free quality estimation of document summaries. In Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems, pages 11–20, Online. Association for Computational Linguistics.
  42. Is chatgpt a good nlg evaluator? a preliminary study.
  43. Large language models are not fair evaluators.
  44. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc.
  45. Minghao Wu and Alham Fikri Aji. 2023. Style over substance: Evaluation biases for large language models.
  46. Less is more for long document summary evaluation by llms.
  47. Robert B Zajonc. 1968. Attitudinal effects of mere exposure. Journal of personality and social psychology, 9(2p2):1.
  48. BertScore: Evaluating text generation with bert.
  49. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 563–578, Hong Kong, China. Association for Computational Linguistics.
  50. Large language models are not robust multiple choice selectors.
  51. Judging llm-as-a-judge with mt-bench and chatbot arena.
  52. Hierarchical multi-label classification of online vaccine concerns.
Authors (3)
  1. Rickard Stureborg
  2. Dimitris Alikaniotis
  3. Yoshi Suhara
Citations (19)