The Reasonable Effectiveness of Diverse Evaluation Data (2301.09406v1)
Published 23 Jan 2023 in cs.HC
Abstract: In this paper, we present findings from a semi-experimental exploration of rater diversity and its influence on safety annotations of conversations generated by humans talking to a generative AI chatbot. We find significant differences in judgments produced by raters from different geographic regions and annotation platforms, and correlate these perspectives with demographic sub-groups. Our work helps define best practices in model development, specifically human evaluation of generative models, against the backdrop of growing work on sociotechnical AI evaluations.
- Lora Aroyo (35 papers)
- Christopher Homan (9 papers)
- Vinodkumar Prabhakaran (48 papers)
- Alex Taylor (9 papers)
- Ding Wang (71 papers)
- Mark Diaz (10 papers)