- The paper demonstrates that reliance on human feedback preference scores introduces biases that obscure factuality errors and inconsistencies.
- It applies regression analysis across error categories, uncovering confounding effects such as the impact of assertiveness in model outputs.
- The findings suggest integrating more objective metrics with human judgments to enhance the reliability of LLM evaluations.
Evaluating Human Feedback as a Metric in LLMs
The paper "Human Feedback is not Gold Standard" provides a critical analysis of the prevalent reliance on human feedback for both training and evaluating LLMs. The authors, Tom Hosking, Phil Blunsom, and Max Bartolo, hypothesize that while human preference scores offer a simplistic metric for content evaluation, they may introduce biases and often misrepresent specific error criteria such as factuality and inconsistency. They investigate whether human feedback scores, affected by confounders like assertiveness and complexity, should remain a predominant metric in training LLMs.
Human feedback has become a widely accepted method for evaluating LLM outputs, typically collapsed into a single preference score. This simplification, however, risks flattening distinct dimensions of output quality into one number. The paper categorizes errors into several key types: harmfulness, fluency, scope, repetition, refusal, formatting, relevance, factuality, inconsistency, and contradiction. Using this taxonomy, the authors have crowdworkers annotate a diverse set of model outputs, recording both an overall preference score and a judgement for each individual error type.
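As a rough illustration of this annotation setup, each crowdworker judgement could be represented as a record carrying the single overall score alongside a flag for each error category. This is only a sketch: the field names and score scale below are assumptions, not the paper's actual data schema.

```python
from dataclasses import dataclass, field

# Error taxonomy as described in the paper; the representation is illustrative.
ERROR_CATEGORIES = [
    "harmful", "fluency", "scope", "repetition", "refusal",
    "formatting", "relevance", "factuality", "inconsistency", "contradiction",
]

@dataclass
class Annotation:
    """One crowdworker judgement of a single model output (hypothetical schema)."""
    output_id: str
    overall_score: float  # single preference score, e.g. on a Likert scale
    errors: dict = field(
        default_factory=lambda: {c: False for c in ERROR_CATEGORIES}
    )

# Example: an output rated highly overall despite a flagged factual error.
ann = Annotation(output_id="ex-01", overall_score=4.0)
ann.errors["factuality"] = True
```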
Through experiments across different datasets and models, the paper shows that preference scores do not adequately reflect factuality and inconsistency in LLM outputs. While efficient to collect, a single preference score is therefore not a fully reliable metric for the detailed error types that matter to users. The authors run regression analyses of the per-category error markings against the overall scores, as sketched below, and find that crowdworkers under-weight factuality errors while leaning on superficial attributes such as the assertiveness of the output.
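A minimal sketch of this kind of regression, assuming the annotations have been gathered into a pandas DataFrame with one row per judged output and hypothetical column names, might look like this:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical data: one row per annotated output, with 0/1 error flags and
# the overall preference score. Column names are assumptions, not the paper's.
df = pd.read_csv("annotations.csv")  # placeholder path

error_cols = ["factuality", "inconsistency", "relevance", "fluency", "repetition"]
X = sm.add_constant(df[error_cols].astype(float))
y = df["overall_score"].astype(float)

model = sm.OLS(y, X).fit()

# A coefficient on "factuality" that is small relative to, say, "fluency"
# would mean factual errors barely move the overall preference score --
# the misalignment the paper reports.
print(model.summary())
```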
To probe biases in human evaluation, the paper assesses the confounding effects of assertiveness and complexity in model outputs. Annotators rate assertive outputs more favorably regardless of their factual accuracy. Consequently, LLMs trained primarily on human preference scores are pushed toward more assertive outputs, prioritizing a confident style over factual accuracy. A simple version of this confounder check is illustrated below.
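Under the same hypothetical schema as above, extended with a per-output assertiveness rating, the confounding effect can be examined by adding assertiveness as a covariate and comparing coefficients. This is a sketch of the idea, not the paper's actual analysis code:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical columns: overall_score, factuality (0/1 error flag),
# assertiveness (e.g. a 1-5 rating of how confident the output sounds).
df = pd.read_csv("annotations.csv")  # placeholder path

# Baseline: does a factuality error predict a lower overall score?
base = smf.ols("overall_score ~ factuality", data=df).fit()

# Controlled: add assertiveness. A large positive assertiveness coefficient
# alongside a small factuality coefficient would indicate confident-sounding
# outputs earn preference independently of being correct.
ctrl = smf.ols("overall_score ~ factuality + assertiveness", data=df).fit()

print(base.params, ctrl.params, sep="\n")
```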
The implications of these findings extend to models trained with RLHF (Reinforcement Learning from Human Feedback). The authors observe that RLHF-tuned models such as Llama 2 tend to exhibit greater assertiveness for a given level of perceived quality than non-RLHF counterparts. This points to a misalignment between what the training objective optimizes and what users actually want, rooted in the biases of the human annotations themselves.
In summary, the paper argues for a reevaluation of human preference as the gold standard for LLM assessment, urging the research community to treat human feedback as a useful but partial proxy that can leave critical dimensions such as factuality under-represented. Future work could diversify training objectives beyond human preferences, for instance by integrating more objective metrics that directly target the error types preference scores miss. A hybrid approach that combines algorithmic evaluation with human judgement may make LLM assessments more robust and better aligned with the demands of practical applications.