Likelihood-based Mitigation of Evaluation Bias in Large Language Models (2402.15987v3)

Published 25 Feb 2024 in cs.CL and cs.AI

Abstract: LLMs are widely used to evaluate natural language generation tasks as automated metrics. However, the likelihood, a measure of LLM's plausibility for a sentence, can vary due to superficial differences in sentences, such as word order and sentence structure. It is therefore possible that there might be a likelihood bias if LLMs are used for evaluation: they might overrate sentences with higher likelihoods while underrating those with lower likelihoods. In this paper, we investigate the presence and impact of likelihood bias in LLM-based evaluators. We also propose a method to mitigate the likelihood bias. Our method utilizes highly biased instances as few-shot examples for in-context learning. Our experiments in evaluating the data-to-text and grammatical error correction tasks reveal that several LLMs we test display a likelihood bias. Furthermore, our proposed method successfully mitigates this bias, also improving evaluation performance (in terms of correlation of models with human scores) significantly.

Likelihood Bias in LLMs: Measurement and Mitigation

Introduction to Likelihood Bias

LLMs, with their advanced language comprehension and generation capabilities, are increasingly employed as evaluators in natural language generation tasks, aligning with human judgment better than traditional automatic metrics. However, because an LLM's judgment is shaped by how plausible it finds a text, its evaluations may inadvertently favor texts the model assigns higher likelihood over those that are less likely but equally valid. This phenomenon, known as likelihood bias, can create discrepancies between LLM evaluations and human judgment, undermining the reliability of LLMs as automated evaluators.
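To make the notion of likelihood concrete, the sketch below computes the (length-normalized) log-probability a causal language model assigns to a sentence using Hugging Face transformers. The model name and the normalization choice are illustrative assumptions, not the paper's exact setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # assumption: any causal LM serves for illustration
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def sentence_log_likelihood(text: str, per_token: bool = True) -> float:
    """Log-probability the model assigns to `text`, optionally length-normalized."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    # `out.loss` is the mean negative log-likelihood over the predicted tokens
    # (all tokens except the first, since labels are shifted internally).
    n_predicted = enc["input_ids"].shape[1] - 1
    total_logprob = -out.loss.item() * n_predicted
    return total_logprob / n_predicted if per_token else total_logprob

# Two sentences with the same meaning can receive quite different likelihoods;
# this superficial variation is exactly what drives the bias discussed above.
print(sentence_log_likelihood("The cat sat on the mat."))
print(sentence_log_likelihood("On the mat, the cat sat."))
```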

Identifying Likelihood Bias

The impact of likelihood bias was analyzed through extensive experiments on evaluation tasks that involve multiple criteria, such as fluency and relevance: data-to-text generation and Grammatical Error Correction (GEC). The findings confirm the presence of likelihood bias across several LLMs, with the bias more pronounced in criteria less intrinsically related to likelihood (e.g., relevance) than in those closely tied to it (e.g., fluency). The paper also lays out a procedure for quantifying the bias, measuring how strongly the divergence between LLM-assigned scores and human evaluations correlates with the computed likelihood of each text.
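The following sketch shows one way such a measurement could be set up: correlate each text's likelihood with the gap between the LLM-assigned score and the human score. The toy data and the use of Spearman correlation are illustrative assumptions; the paper's exact bias-score definition may differ.

```python
from scipy.stats import spearmanr

# (likelihood, llm_score, human_score) for a handful of hypothetical outputs
records = [
    (-1.2, 4.8, 4.0),   # high-likelihood text, LLM overrates
    (-1.5, 4.5, 4.2),
    (-2.3, 3.0, 3.1),
    (-3.1, 2.2, 3.0),   # low-likelihood text, LLM underrates
    (-3.8, 1.9, 3.2),
]

likelihoods = [lik for lik, _, _ in records]
score_gaps = [llm - human for _, llm, human in records]

bias, _ = spearmanr(likelihoods, score_gaps)
print(f"rank correlation of score gap with likelihood: {bias:.2f}")
# A value near +1 indicates the evaluator systematically overrates
# high-likelihood texts and underrates low-likelihood ones.
```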

Mitigation Strategy

A novel mitigation strategy is proposed and shown to reduce likelihood bias while simultaneously improving the correlation of LLM evaluations with human judgment. The strategy uses highly biased instances as few-shot examples for in-context learning, recalibrating the model's evaluative behavior. Its efficacy is validated empirically: bias scores drop and evaluation performance improves after mitigation.
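A minimal sketch of this idea follows: rank evaluated instances by how far the LLM's score strays from the human score, and prepend the most biased ones, annotated with their human scores, to the evaluation prompt. The selection rule and prompt format are simplified assumptions rather than the paper's exact recipe.

```python
def select_biased_examples(records, k=2):
    """records: dicts with 'text', 'likelihood', 'llm_score', 'human_score'."""
    # Rank by the absolute gap between LLM and human scores as a proxy for bias.
    ranked = sorted(records,
                    key=lambda r: abs(r["llm_score"] - r["human_score"]),
                    reverse=True)
    return ranked[:k]

def build_evaluation_prompt(few_shot, candidate_text):
    lines = ["Score the following text from 1 (worst) to 5 (best)."]
    for ex in few_shot:
        # Showing the human score for a biased instance nudges the evaluator
        # toward human judgment rather than its own likelihood preference.
        lines.append(f"Text: {ex['text']}\nScore: {ex['human_score']}")
    lines.append(f"Text: {candidate_text}\nScore:")
    return "\n\n".join(lines)

records = [
    {"text": "the cat sat on the mat", "likelihood": -1.2, "llm_score": 4.8, "human_score": 4.0},
    {"text": "on the mat the cat sat", "likelihood": -3.8, "llm_score": 1.9, "human_score": 3.2},
    {"text": "a cat is on a mat", "likelihood": -2.3, "llm_score": 3.0, "human_score": 3.1},
]

few_shot = select_biased_examples(records, k=2)
print(build_evaluation_prompt(few_shot, "a mat has a cat upon it"))
```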

Practical Implications and Future Directions

The revelation of a measurable and mitigable likelihood bias in LLM-based evaluators has several significant implications. Practically, it offers a pathway to refine automated evaluation tasks, making these assessments more reliable and aligned with human judgment. Theoretically, it sheds light on the underlying mechanisms of bias within LLMs, prompting a reevaluation of how these models understand and generate language. Looking forward, the research opens avenues for further exploration into mitigating other forms of bias in LLMs, potentially enhancing their applicability across a broader spectrum of tasks.

Conclusions

This paper provides a comprehensive examination of likelihood bias in LLM-based evaluation, presenting a tangible solution to the problem. By introducing a method to quantify the bias and a practical approach for reducing it, it marks a significant step toward more equitable and accurate automated language evaluation. The implications extend beyond the immediate tasks studied, advancing our understanding and use of LLMs for less biased natural language processing.

Ethical Considerations and Limitations

The research addresses ethical considerations and limitations inherent to its methodology. The authors acknowledge that their approach, centered on in-context learning, might not be universally applicable across all tasks because of token-length constraints and increased computational cost. The paper also underscores the importance of future work on mitigating socially sensitive biases in LLM evaluations, pointing to the broader ethical implications of bias in AI systems.

References (17)
  1. Evaluating gender bias of pre-trained language models in natural language inference by considering all labels. arXiv preprint arXiv:2309.09697.
  2. Palm 2 technical report.
  3. The 2020 bilingual, bi-directional WebNLG+ shared task: Overview and evaluation results (WebNLG+ 2020). In Proceedings of the 3rd International Workshop on Natural Language Generation from the Semantic Web (WebNLG+), pages 55–76, Dublin, Ireland (Virtual). Association for Computational Linguistics.
  4. Cheng-Han Chiang and Hung-yi Lee. 2023. Can large language models be an alternative to human evaluations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15607–15631, Toronto, Canada. Association for Computational Linguistics.
  5. Experts, errors, and context: A large-scale study of human evaluation for machine translation. Transactions of the Association for Computational Linguistics, 9:1460–1474.
  6. OpenMEVA: A benchmark for evaluating open-ended story generation metrics. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6394–6407, Online. Association for Computational Linguistics.
  7. Evaluating open-domain question answering in the era of large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5591–5606, Toronto, Canada. Association for Computational Linguistics.
  8. The impact of debiasing on the performance of language models in downstream tasks is underestimated. arXiv preprint arXiv:2309.09092.
  9. Tom Kocmi and Christian Federmann. 2023. Large language models are state-of-the-art evaluators of translation quality. In Proceedings of the 24th Annual Conference of the European Association for Machine Translation, pages 193–203, Tampere, Finland. European Association for Machine Translation.
  10. Language models as an alternative evaluator of word order hypotheses: A case study in Japanese. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 488–504, Online. Association for Computational Linguistics.
  11. Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
  12. G-eval: NLG evaluation using gpt-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore. Association for Computational Linguistics.
  13. In-contextual bias suppression for large language models. arXiv preprint arXiv:2309.07251.
  14. OpenAI. 2023. Gpt-4 technical report.
  15. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.
  16. Llama 2: Open foundation and fine-tuned chat models.
  17. SOME: Reference-less sub-metrics optimized for manual evaluations of grammatical error correction. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6516–6522, Barcelona, Spain (Online). International Committee on Computational Linguistics.
Authors (5)
  1. Masanari Ohi (9 papers)
  2. Masahiro Kaneko (46 papers)
  3. Ryuto Koike (6 papers)
  4. Mengsay Loem (8 papers)
  5. Naoaki Okazaki (70 papers)