Validate correlation between GPT-4 ratings and human judgments for chatbot evaluation

Determine whether GPT-4-based evaluation ratings reliably correlate with human judgments when assessing chatbot performance across tasks and datasets, and quantify the strength and conditions of such correlations.

Background

Although the paper conducts both GPT-4-based and human evaluations and reports moderate agreement at the system level, the authors explicitly note that GPT-4 ratings have yet to be proven to correlate reliably with human judgments when assessing chatbot performance.

This highlights a broader, ongoing question about the validity and robustness of model-based evaluators as proxies for human assessments.
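One concrete way to quantify the system-level agreement the question asks about is a rank correlation between the two sets of scores. The sketch below computes Spearman's rho from scratch (Pearson correlation of ranks, with average ranks for ties); the score lists are purely hypothetical illustrations, not values from the paper.

```python
from statistics import mean

def rank(values):
    # Assign ranks (1 = smallest); tied values share their average rank.
    sorted_vals = sorted(values)
    return [mean(i + 1 for i, v in enumerate(sorted_vals) if v == x)
            for x in values]

def spearman(a, b):
    # Spearman's rho = Pearson correlation computed on the ranks.
    ra, rb = rank(a), rank(b)
    ma, mb = mean(ra), mean(rb)
    num = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    den = (sum((x - ma) ** 2 for x in ra)
           * sum((y - mb) ** 2 for y in rb)) ** 0.5
    return num / den

# Hypothetical per-system mean scores (illustrative only):
gpt4_scores  = [9.1, 8.4, 7.2, 6.5, 5.0]
human_scores = [8.8, 7.5, 7.9, 6.0, 5.2]
print(round(spearman(gpt4_scores, human_scores), 3))  # → 0.9
```

A rho near 1 would indicate that GPT-4 ranks systems much as humans do; answering the open question would require reporting such correlations across multiple tasks and datasets, not a single benchmark.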

References

While recent work indicates generative models can be effectively employed for system evaluations, the reliability of GPT-4 ratings to assess chatbot performance is, to our knowledge, yet to be proven to correlate with human judgments.

QLoRA: Efficient Finetuning of Quantized LLMs (2305.14314 - Dettmers et al., 2023) in Subsection "Human Evaluation"