- The paper demonstrates that reliance on human feedback preference scores introduces biases that obscure factuality errors and inconsistencies.
- It applies regression analysis across error categories, uncovering confounding effects such as the impact of assertiveness in model outputs.
- The findings suggest integrating more objective metrics with human judgments to enhance the reliability of LLM evaluations.
Evaluating Human Feedback as a Metric in LLMs
The paper "Human Feedback is not Gold Standard" provides a critical analysis of the prevalent reliance on human feedback for both training and evaluating LLMs. The authors, Tom Hosking, Phil Blunsom, and Max Bartolo, hypothesize that while human preference scores offer a simplistic metric for content evaluation, they may introduce biases and often misrepresent specific error criteria such as factuality and inconsistency. They investigate whether human feedback scores, affected by confounders like assertiveness and complexity, should remain a predominant metric in training LLMs.
Human feedback has become a widely accepted method for evaluating LLM outputs, typically collapsed into a single preference score. This simplification, however, risks flattening distinct dimensions of output quality into one number. The paper categorizes errors into several key types: harmfulness, fluency, scope, repetition, refusal, formatting, relevance, factuality, inconsistency, and contradiction. Using this taxonomy, the authors have crowdworkers annotate a diverse set of model outputs, recording both an overall preference score and a judgement for each individual error type.
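As a rough illustration of this annotation setup, each crowdworker judgement could be represented as a record carrying the single overall score alongside a flag for each error category. This is only a sketch: the field names and score scale below are assumptions, not the paper's actual data schema.

```python
from dataclasses import dataclass, field

# Error taxonomy as described in the paper; the representation is illustrative.
ERROR_CATEGORIES = [
    "harmful", "fluency", "scope", "repetition", "refusal",
    "formatting", "relevance", "factuality", "inconsistency", "contradiction",
]

@dataclass
class Annotation:
    """One crowdworker judgement of a single model output (hypothetical schema)."""
    output_id: str
    overall_score: float  # single preference score, e.g. on a Likert scale
    errors: dict = field(
        default_factory=lambda: {c: False for c in ERROR_CATEGORIES}
    )

# Example: an output rated highly overall despite a flagged factual error.
ann = Annotation(output_id="ex-01", overall_score=4.0)
ann.errors["factuality"] = True
```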
Through experiments across different datasets and models, the paper shows that preference scores do not adequately reflect factuality and inconsistency in LLM outputs. While efficient to collect, a single preference score is therefore not a fully reliable metric for the detailed error types that matter to users. The authors run regression analyses of the per-category error markings against the overall scores, as sketched below, and find that crowdworkers under-weight factuality errors while leaning on superficial attributes such as the assertiveness of the output.
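A minimal sketch of this kind of regression, assuming the annotations have been gathered into a pandas DataFrame with one row per judged output and hypothetical column names, might look like this:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical data: one row per annotated output, with 0/1 error flags and
# the overall preference score. Column names are assumptions, not the paper's.
df = pd.read_csv("annotations.csv")  # placeholder path

error_cols = ["factuality", "inconsistency", "relevance", "fluency", "repetition"]
X = sm.add_constant(df[error_cols].astype(float))
y = df["overall_score"].astype(float)

model = sm.OLS(y, X).fit()

# A coefficient on "factuality" that is small relative to, say, "fluency"
# would mean factual errors barely move the overall preference score --
# the misalignment the paper reports.
print(model.summary())
```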
To probe biases in human evaluation, the paper assesses the confounding effects of assertiveness and complexity in model outputs. Annotators rate assertive outputs more favorably regardless of their factual accuracy. Consequently, LLMs trained primarily on human preference scores are pushed toward more assertive outputs, prioritizing a confident style over factual accuracy. A simple version of this confounder check is illustrated below.
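Under the same hypothetical schema as above, extended with a per-output assertiveness rating, the confounding effect can be examined by adding assertiveness as a covariate and comparing coefficients. This is a sketch of the idea, not the paper's actual analysis code:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical columns: overall_score, factuality (0/1 error flag),
# assertiveness (e.g. a 1-5 rating of how confident the output sounds).
df = pd.read_csv("annotations.csv")  # placeholder path

# Baseline: does a factuality error predict a lower overall score?
base = smf.ols("overall_score ~ factuality", data=df).fit()

# Controlled: add assertiveness. A large positive assertiveness coefficient
# alongside a small factuality coefficient would indicate confident-sounding
# outputs earn preference independently of being correct.
ctrl = smf.ols("overall_score ~ factuality + assertiveness", data=df).fit()

print(base.params, ctrl.params, sep="\n")
```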
The implications of these findings extend to models trained with RLHF (Reinforcement Learning from Human Feedback). The authors observe that RLHF-tuned models such as Llama 2 tend to exhibit greater assertiveness for a given level of perceived quality than non-RLHF counterparts. This points to a misalignment between what the training objective optimizes and what users actually want, rooted in the biases of the human annotations themselves.
In summary, the paper argues for a reevaluation of human preference as the gold standard for LLM assessment, urging the research community to treat human feedback as a useful but partial proxy that can leave critical dimensions such as factuality under-represented. Future work could diversify training objectives beyond human preferences, for instance by integrating more objective metrics that directly target the error types preference scores miss. A hybrid approach that combines algorithmic evaluation with human judgement may make LLM assessments more robust and better aligned with the demands of practical applications.