Calibrate absolute scoring scales in GPT-4-based chatbot evaluation

Establish how an absolute score on a 10-point rating scale (e.g., an 8) should be interpreted and compared across evaluation scenarios when GPT-4 judges chatbot responses, so that assessments are consistent and grounded.

Background

The authors evaluate chatbots with GPT-4 as the judge and observe wide confidence intervals and ordering effects. They hypothesize that the uncertainty stems from the lack of a clearly specified absolute scale for GPT-4’s 10-point ratings.

They explicitly note that the meaning of specific scores (e.g., an 8/10) across different scenarios is unclear, highlighting a need to calibrate or ground such scales to improve reliability and interpretability of automated evaluations.
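To make the ambiguity concrete, here is a minimal sketch of one possible grounding: expressing each raw score relative to the other scores given in the same scenario. This is not the authors' method. The function name `standardize_by_scenario`, the data layout, and all scores are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def standardize_by_scenario(scores):
    """Map raw 10-point scores to per-scenario z-scores.

    `scores` maps a scenario name to the list of raw judge scores
    observed in that scenario (a hypothetical data layout).
    """
    calibrated = {}
    for scenario, raw in scores.items():
        mu = mean(raw)
        sigma = pstdev(raw)
        # Guard against a scenario where every score is identical.
        calibrated[scenario] = [
            0.0 if sigma == 0 else (s - mu) / sigma for s in raw
        ]
    return calibrated


# Made-up scores: the judge is lenient on "coding" and strict on "math".
example = {
    "coding": [8, 9, 9, 10],
    "math": [4, 5, 6, 8],
}
calibrated = standardize_by_scenario(example)

# The same raw 8 lands below average in one scenario and above in the other.
print(calibrated["coding"][0], calibrated["math"][3])
```

With these made-up numbers, the raw 8 is the lowest score in the "coding" scenario and the highest in "math", so the same absolute value carries opposite meanings. Standardization only makes scores relative within a scenario. It does not by itself define what an 8 means in absolute terms, which is the open question stated here.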

References

We hypothesize that this uncertainty comes from the lack of clear specification of scale, e.g., it is unclear what 8 on a 10 point scale means across different scenarios.

— QLoRA: Efficient Finetuning of Quantized LLMs (2305.14314 - Dettmers et al., 2023) in Subsection "Guanaco: QLoRA trained on OASST1 is a State-of-the-art Chatbot"
