Evaluation Biases in LLMs: An Analytical Approach
The paper "Style Over Substance: Evaluation Biases for LLMs" addresses the growing difficulty of evaluating LLM performance amid their rapid advancement. As demand for effective assessment of LLM capabilities intensifies, traditional methodologies often fall short, primarily because they collapse evaluation into a single dimension. The authors propose a novel framework for evaluating LLM outputs: a Multi-Elo Rating System (MERS).
Central Thesis and Methodology
The paper's primary contention is that evaluation biases arise when multiple assessment criteria are conflated into a single score, which often undermines critical components such as factual accuracy. To test this hypothesis, the researchers curated a dataset of intentionally flawed answers generated by GPT-4, incorporating deliberate grammatical, syntactic, and factual errors alongside other linguistic variations.
The paper systematically employed several judge profiles: crowd-sourced annotators, expert annotators, GPT-4, and Claude-1. This setup exposes biases in human judgments and contrasts them with LLM evaluations. The proposed MERS rates answers along multiple dimensions (Accuracy, Helpfulness, and Language) to give a more complete picture of output quality. The paper's empirical findings show that MERS substantially improves GPT-4-based evaluations, particularly with respect to factual accuracy.
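To make the mechanics of MERS concrete, the sketch below shows one way per-dimension Elo ratings could be maintained from pairwise judgments. The dimension names follow the paper, but the judgment format, K-factor, and function names are illustrative assumptions rather than the authors' implementation.

```python
from collections import defaultdict

# Dimensions rated separately under the Multi-Elo Rating System.
DIMENSIONS = ["Accuracy", "Helpfulness", "Language"]

def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expectation that the first model beats the second."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(ratings, winner: str, loser: str, k: float = 32.0) -> None:
    """Update one dimension's ratings in place after a single pairwise judgment."""
    exp_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += k * (1.0 - exp_win)
    ratings[loser] -= k * (1.0 - exp_win)

def multi_elo(judgments, base: float = 1000.0):
    """judgments: iterable of dicts like
    {"dimension": "Accuracy", "winner": "model_a", "loser": "model_b"}
    (an assumed record format). Returns one Elo table per dimension
    instead of a single aggregate score."""
    tables = {dim: defaultdict(lambda: base) for dim in DIMENSIONS}
    for judgment in judgments:
        update_elo(tables[judgment["dimension"]], judgment["winner"], judgment["loser"])
    return {dim: dict(table) for dim, table in tables.items()}
```

Keeping a separate rating table per dimension is what allows a verbose but inaccurate answer to rank highly on Language while ranking poorly on Accuracy, a distinction that a single aggregate score would obscure.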
Key Findings
- Bias Towards Longer Texts: One salient observation is a consistent preference for longer texts across all judge profiles, human and machine alike. LLM judges also express greater certainty in their verdicts than human judges do, and they favor detailed responses even when those responses contain factual inaccuracies.
- Human Indecisiveness: Human judges were frequently indecisive and rarely fact-checked unless errors were overt. This challenges the assumption that human evaluation is an inherently superior "gold standard," especially when factual accuracy is traded away for verbosity or grammatical polish.
- Answer Order Effect: The order in which answers are presented to judges significantly influences their preferences (a simple order-swapping check is sketched after this list).
- Inadequate Fact-checking by Crowd Annotators: Crowd-sourced annotators showed little diligence in fact-checking, increasing the risk that erroneous information in LLM outputs is accepted or misread.
- Moderate LLM Consensus: LLM judges such as GPT-4 and Claude-1 agree with each other only moderately, while human annotators diverge even more widely, suggesting that individual judges prioritize different criteria.
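To illustrate the answer order effect noted above, the sketch below queries a judge with both orderings of an answer pair and accepts a winner only when the verdict survives the swap. The judge_pair interface is a placeholder assumption, not something the paper specifies.

```python
def judge_pair(question: str, first: str, second: str) -> str:
    """Placeholder for a human or LLM judge; returns "first", "second", or "tie".
    (Assumed interface for illustration; not defined in the paper.)"""
    raise NotImplementedError

def position_debiased_verdict(question: str, answer_a: str, answer_b: str) -> str:
    """Ask the judge twice, swapping the presentation order, and accept a
    winner only when the verdict is consistent across both orderings."""
    first_pass = judge_pair(question, answer_a, answer_b)    # A shown first
    second_pass = judge_pair(question, answer_b, answer_a)   # B shown first
    if first_pass == "first" and second_pass == "second":
        return "A"
    if first_pass == "second" and second_pass == "first":
        return "B"
    return "tie"  # disagreement across orderings is treated as position bias
```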
Implications and Future Directions
Decomposing evaluation into separate dimensions offers a sharper mechanism for scrutinizing the varied aspects of LLM performance, and the MERS framework provides a more transparent and robust alternative to traditional single-score evaluations. The paper also raises an important caution for practitioners: reliance on crowd-sourced annotations may inadvertently endorse suboptimal outputs because of annotator indecisiveness and a lack of rigorous fact-checking.
Moving forward, integrating evaluation principles from machine translation, such as Multidimensional Quality Metrics (MQM), into LLM evaluation frameworks appears promising. This cross-disciplinary approach can refine the assessment process and ensure that pivotal attributes, especially factual accuracy, are weighted according to the requirements of each task.
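As a minimal sketch of such task-dependent prioritization, assuming hypothetical task names and weights that do not come from the paper, per-dimension ratings could be combined as follows while remaining individually inspectable.

```python
# Hypothetical task profiles and weights for illustration only; the paper does
# not prescribe these values. Accuracy is weighted most heavily for factual tasks.
TASK_WEIGHTS = {
    "factual_qa": {"Accuracy": 0.6, "Helpfulness": 0.3, "Language": 0.1},
    "creative_writing": {"Accuracy": 0.2, "Helpfulness": 0.4, "Language": 0.4},
}

def task_weighted_score(elo_by_dimension: dict, task: str) -> float:
    """Collapse per-dimension ratings into a single task-specific summary while
    keeping the underlying dimension scores available for inspection."""
    weights = TASK_WEIGHTS[task]
    return sum(weights[dim] * elo_by_dimension[dim] for dim in weights)
```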
Conclusion
In summary, this paper presents a critical analysis of evaluation biases in the LLM domain and proposes the Multi-Elo Rating System as a response to these challenges. By improving the transparency and accuracy of evaluation, the framework offers a valuable tool for both academic and practical work in AI development. The paper also invites further inquiry into expanding the set of evaluation dimensions so that output quality is captured more completely. In doing so, the research sets an important precedent for how LLM performance can be assessed methodically and precisely.