Papers
Topics
Authors
Recent
2000 character limit reached

The Reasonable Effectiveness of Diverse Evaluation Data

Published 23 Jan 2023 in cs.HC | (2301.09406v1)

Abstract: In this paper, we present findings from an semi-experimental exploration of rater diversity and its influence on safety annotations of conversations generated by humans talking to a generative AI-chat bot. We find significant differences in judgments produced by raters from different geographic regions and annotation platforms, and correlate these perspectives with demographic sub-groups. Our work helps define best practices in model development -- specifically human evaluation of generative models -- on the backdrop of growing work on sociotechnical AI evaluations.

Citations (8)

Summary

Paper to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Collections

Sign up for free to add this paper to one or more collections.