Likelihood-based Mitigation of Evaluation Bias in Large Language Models (2402.15987v3)

Published 25 Feb 2024 in cs.CL and cs.AI

Abstract: LLMs are widely used to evaluate natural language generation tasks as automated metrics. However, the likelihood, a measure of LLM's plausibility for a sentence, can vary due to superficial differences in sentences, such as word order and sentence structure. It is therefore possible that there might be a likelihood bias if LLMs are used for evaluation: they might overrate sentences with higher likelihoods while underrating those with lower likelihoods. In this paper, we investigate the presence and impact of likelihood bias in LLM-based evaluators. We also propose a method to mitigate the likelihood bias. Our method utilizes highly biased instances as few-shot examples for in-context learning. Our experiments in evaluating the data-to-text and grammatical error correction tasks reveal that several LLMs we test display a likelihood bias. Furthermore, our proposed method successfully mitigates this bias, also improving evaluation performance (in terms of correlation of models with human scores) significantly.

Likelihood Bias in LLMs: Measurement and Mitigation

Introduction to Likelihood Bias

LLMs, with their advanced language comprehension and generation capabilities, are increasingly employed as evaluators in natural language generation tasks, aligning with human judgment better than traditional automatic metrics. However, because an LLM's judgment is shaped by how plausible it finds a text, its evaluations may inadvertently favor texts the model assigns higher likelihood over those that are less likely but equally valid. This phenomenon, known as likelihood bias, can create discrepancies between LLM evaluations and human judgment, undermining the reliability of LLMs as automated evaluators.
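To make the notion of likelihood concrete, the sketch below computes the (length-normalized) log-probability a causal language model assigns to a sentence using Hugging Face transformers. The model name and the normalization choice are illustrative assumptions, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # assumption: any causal LM serves for illustration
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def sentence_log_likelihood(text: str, per_token: bool = True) -> float:
    """Log-probability the model assigns to `text`, optionally length-normalized."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # `out.loss` is the mean negative log-likelihood over the predicted tokens
    # (all tokens except the first, since labels are shifted internally).
    n_predicted = enc["input_ids"].shape[1] - 1
    total_logprob = -out.loss.item() * n_predicted
    return total_logprob / n_predicted if per_token else total_logprob

# Two sentences with the same meaning can receive quite different likelihoods;
# this superficial variation is exactly what drives the bias discussed above.
print(sentence_log_likelihood("The cat sat on the mat."))
print(sentence_log_likelihood("On the mat, the cat sat."))
```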

Identifying Likelihood Bias

The impact of likelihood bias was analyzed through extensive experiments on evaluation tasks that involve multiple criteria, such as fluency and relevance: data-to-text generation and Grammatical Error Correction (GEC). The findings confirm the presence of likelihood bias across several LLMs, with the bias more pronounced in criteria less intrinsically related to likelihood (e.g., relevance) than in those closely tied to it (e.g., fluency). The paper also lays out a procedure for quantifying the bias, measuring how strongly the divergence between LLM-assigned scores and human evaluations correlates with the computed likelihood of each text.
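The following sketch shows one way such a measurement could be set up: correlate each text's likelihood with the gap between the LLM-assigned score and the human score. The toy data and the use of Spearman correlation are illustrative assumptions; the paper's exact bias-score definition may differ.

```python
from scipy.stats import spearmanr

# (likelihood, llm_score, human_score) for a handful of hypothetical outputs
records = [
    (-1.2, 4.8, 4.0),   # high-likelihood text, LLM overrates
    (-1.5, 4.5, 4.2),
    (-2.3, 3.0, 3.1),
    (-3.1, 2.2, 3.0),   # low-likelihood text, LLM underrates
    (-3.8, 1.9, 3.2),
]

likelihoods = [lik for lik, _, _ in records]
score_gaps = [llm - human for _, llm, human in records]

bias, _ = spearmanr(likelihoods, score_gaps)
print(f"rank correlation of score gap with likelihood: {bias:.2f}")
# A value near +1 indicates the evaluator systematically overrates
# high-likelihood texts and underrates low-likelihood ones.
```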

Mitigation Strategy

A novel mitigation strategy is proposed and shown to reduce likelihood bias while simultaneously improving the correlation of LLM evaluations with human judgment. The strategy uses highly biased instances as few-shot examples for in-context learning, recalibrating the model's evaluative behavior. Its efficacy is validated empirically: bias scores drop and evaluation performance improves after mitigation.
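A minimal sketch of this idea follows: rank evaluated instances by how far the LLM's score strays from the human score, and prepend the most biased ones, annotated with their human scores, to the evaluation prompt. The selection rule and prompt format are simplified assumptions rather than the paper's exact recipe.

```python
def select_biased_examples(records, k=2):
    """records: dicts with 'text', 'likelihood', 'llm_score', 'human_score'."""
    # Rank by the absolute gap between LLM and human scores as a proxy for bias.
    ranked = sorted(records,
                    key=lambda r: abs(r["llm_score"] - r["human_score"]),
                    reverse=True)
    return ranked[:k]

def build_evaluation_prompt(few_shot, candidate_text):
    lines = ["Score the following text from 1 (worst) to 5 (best)."]
    for ex in few_shot:
        # Showing the human score for a biased instance nudges the evaluator
        # toward human judgment rather than its own likelihood preference.
        lines.append(f"Text: {ex['text']}\nScore: {ex['human_score']}")
    lines.append(f"Text: {candidate_text}\nScore:")
    return "\n\n".join(lines)

records = [
    {"text": "the cat sat on the mat", "likelihood": -1.2, "llm_score": 4.8, "human_score": 4.0},
    {"text": "on the mat the cat sat", "likelihood": -3.8, "llm_score": 1.9, "human_score": 3.2},
    {"text": "a cat is on a mat", "likelihood": -2.3, "llm_score": 3.0, "human_score": 3.1},
]

few_shot = select_biased_examples(records, k=2)
print(build_evaluation_prompt(few_shot, "a mat has a cat upon it"))
```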

Practical Implications and Future Directions

The revelation of a measurable and mitigable likelihood bias in LLM-based evaluators has several significant implications. Practically, it offers a pathway to refine automated evaluation tasks, making these assessments more reliable and aligned with human judgment. Theoretically, it sheds light on the underlying mechanisms of bias within LLMs, prompting a reevaluation of how these models understand and generate language. Looking forward, the research opens avenues for further exploration into mitigating other forms of bias in LLMs, potentially enhancing their applicability across a broader spectrum of tasks.

Conclusions

This paper provides a comprehensive examination of likelihood bias in LLM-based evaluation, presenting a tangible solution to the problem. By introducing a method to quantify the bias and a practical approach for reducing it, it marks a significant step toward more equitable and accurate automated language evaluation. The implications extend beyond the immediate tasks studied, advancing our understanding and use of LLMs for less biased natural language processing.

Ethical Considerations and Limitations

The research addresses ethical considerations and limitations inherent to its methodology. The authors acknowledge that their approach, centered on in-context learning, might not be universally applicable across all tasks because of token-length constraints and increased computational cost. The paper also underscores the importance of future work on mitigating socially sensitive biases in LLM evaluations, pointing to the broader ethical implications of bias in AI systems.

References (17)
  1. Evaluating gender bias of pre-trained language models in natural language inference by considering all labels. arXiv preprint arXiv:2309.09697.
  2. Palm 2 technical report.
  3. The 2020 bilingual, bi-directional WebNLG+ shared task: Overview and evaluation results (WebNLG+ 2020). In Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+), pages 55–76, Dublin, Ireland (Virtual). Association for Computational Linguistics.
  4. Cheng-Han Chiang and Hung-yi Lee. 2023. Can large language models be an alternative to human evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15607–15631, Toronto, Canada. Association for Computational Linguistics.
  5. Experts, errors, and context: A large-scale study of human evaluation for machine translation. Transactions of the Association for Computational Linguistics, 9:1460–1474.
  6. OpenMEVA: A benchmark for evaluating open-ended story generation metrics. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6394–6407, Online. Association for Computational Linguistics.
  7. Evaluating open-domain question answering in the era of large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5591–5606, Toronto, Canada. Association for Computational Linguistics.
  8. The impact of debiasing on the performance of language models in downstream tasks is underestimated. arXiv preprint arXiv:2309.09092.
  9. Tom Kocmi and Christian Federmann. 2023. Large language models are state-of-the-art evaluators of translation quality. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, pages 193–203, Tampere, Finland. European Association for Machine Translation.
  10. Language models as an alternative evaluator of word order hypotheses: A case study in Japanese. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 488–504, Online. Association for Computational Linguistics.
  11. Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
  12. G-eval: NLG evaluation using gpt-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore. Association for Computational Linguistics.
  13. In-contextual bias suppression for large language models. arXiv preprint arXiv:2309.07251.
  14. OpenAI. 2023. Gpt-4 technical report.
  15. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
  16. Llama 2: Open foundation and fine-tuned chat models.
  17. SOME: Reference-less sub-metrics optimized for manual evaluations of grammatical error correction. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6516–6522, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Authors (5)
  1. Masanari Ohi (9 papers)
  2. Masahiro Kaneko (46 papers)
  3. Ryuto Koike (6 papers)
  4. Mengsay Loem (8 papers)
  5. Naoaki Okazaki (70 papers)