Calibrate absolute scoring scales in GPT-4-based chatbot evaluation
Establish how absolute scores on a 10-point rating scale (e.g., a score of 8) should be interpreted and compared across evaluation scenarios when GPT-4 judges chatbot responses, so that assessments are consistent and grounded.
References
We hypothesize that this uncertainty comes from the lack of clear specification of scale, e.g., it is unclear what 8 on a 10 point scale means across different scenarios.
— QLoRA: Efficient Finetuning of Quantized LLMs
(2305.14314 - Dettmers et al., 2023) in Subsection "Guanaco: QLoRA trained on OASST1 is a State-of-the-art Chatbot"