
Bringing the People Back In: Contesting Benchmark Machine Learning Datasets (2007.07399v1)

Published 14 Jul 2020 in cs.CY

Abstract: In response to algorithmic unfairness embedded in sociotechnical systems, significant attention has been focused on the contents of machine learning datasets, which have revealed biases towards white, cisgender, male, and Western data subjects. In contrast, comparatively less attention has been paid to the histories, values, and norms embedded in such datasets. In this work, we outline a research program - a genealogy of machine learning data - for investigating how and why these datasets have been created, what and whose values influence the choices of data to collect, and the contextual and contingent conditions of their creation. We describe the ways in which benchmark datasets in machine learning operate as infrastructure and pose four research questions for these datasets. This interrogation forces us to "bring the people back in" by aiding us in understanding the labor embedded in dataset construction, and thereby presenting new avenues of contestation for other researchers encountering the data.

Authors (6)
  1. Alex Hanna (11 papers)
  2. Razvan Amironesei (3 papers)
  3. Andrew Smart (20 papers)
  4. Hilary Nicole (1 paper)
  5. Morgan Klaus Scheuerman (4 papers)
  6. Remi Denton (10 papers)
Citations (90)