Validate correlation between GPT-4 ratings and human judgments for chatbot evaluation
Determine whether GPT-4-based evaluation ratings reliably correlate with human judgments when assessing chatbot performance across tasks and datasets, and quantify the strength of any such correlation and the conditions under which it holds.
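A minimal sketch of how the correlation strength could be quantified, assuming paired ratings on a shared scale; the arrays below are hypothetical placeholders for the actual GPT-4 and human scores collected per chatbot response.

```python
# Sketch (hypothetical data): quantify agreement between GPT-4 ratings
# and human judgments for the same set of chatbot responses.
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

# Assumed inputs: one rating per response, on the same scale (e.g., 1-10),
# aligned by index. Replace with the actual evaluation data.
gpt4_ratings = np.array([8, 6, 9, 4, 7, 5, 9, 3])
human_ratings = np.array([7, 6, 8, 5, 7, 4, 9, 4])

pearson_r, pearson_p = pearsonr(gpt4_ratings, human_ratings)     # linear agreement
spearman_r, spearman_p = spearmanr(gpt4_ratings, human_ratings)  # rank agreement
tau, tau_p = kendalltau(gpt4_ratings, human_ratings)             # pairwise ordering

print(f"Pearson r = {pearson_r:.3f} (p = {pearson_p:.3f})")
print(f"Spearman rho = {spearman_r:.3f} (p = {spearman_p:.3f})")
print(f"Kendall tau = {tau:.3f} (p = {tau_p:.3f})")
```

Rank-based measures (Spearman, Kendall) are typically the more robust choice here, since GPT-4 and human raters may use the rating scale differently even when they agree on the ordering of responses.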
References
While recent work indicates that generative models can be effectively employed for system evaluations, the reliability of GPT-4 ratings for assessing chatbot performance has, to our knowledge, not yet been shown to correlate with human judgments.
— QLoRA: Efficient Finetuning of Quantized LLMs (Dettmers et al., 2023, arXiv:2305.14314), Subsection "Human Evaluation"