
The Reasonable Effectiveness of Diverse Evaluation Data (2301.09406v1)

Published 23 Jan 2023 in cs.HC

Abstract: In this paper, we present findings from a semi-experimental exploration of rater diversity and its influence on safety annotations of conversations generated by humans talking to a generative AI chatbot. We find significant differences in judgments produced by raters from different geographic regions and annotation platforms, and correlate these perspectives with demographic sub-groups. Our work helps define best practices in model development, specifically human evaluation of generative models, against the backdrop of growing work on sociotechnical AI evaluations.

Authors (6)
  1. Lora Aroyo (35 papers)
  2. Christopher Homan (9 papers)
  3. Vinodkumar Prabhakaran (48 papers)
  4. Alex Taylor (9 papers)
  5. Ding Wang (71 papers)
  6. Mark Diaz (10 papers)
Citations (8)