
A Taxonomy of Challenges to Curating Fair Datasets (2406.06407v2)

Published 10 Jun 2024 in cs.LG and cs.CY

Abstract: Despite extensive efforts to create fairer ML datasets, there remains a limited understanding of the practical aspects of dataset curation. Drawing from interviews with 30 ML dataset curators, we present a comprehensive taxonomy of the challenges and trade-offs encountered throughout the dataset curation lifecycle. Our findings underscore overarching issues within the broader fairness landscape that impact data curation. We conclude with recommendations aimed at fostering systemic changes to better facilitate fair dataset curation practices.

A Taxonomy of Challenges to Curating Fair Datasets

The paper "A Taxonomy of Challenges to Curating Fair Datasets" by Dora Zhao et al. methodically explores the practical difficulties encountered in the curation lifecycle of ML datasets aimed at achieving fairness. Through qualitative interviews with 30 dataset curators from both academia and industry, the authors present an intricate taxonomy of the challenges that permeate the process of fair dataset curation, providing a solid empirical foundation that complements theoretical guidelines on the subject. The paper's contributions not only highlight the nuanced trade-offs and obstacles faced by curators but also offer targeted recommendations for fostering systemic change to promote fairer data practices in ML.

Dataset Lifecycle Challenges

The paper structures its exploration into the phases of the dataset lifecycle: requirements, design, implementation, evaluation, and maintenance. Each phase reveals distinct challenges that affect the pursuit of fairness from different angles.

Requirements

In the requirements phase, challenges include defining the scope of datasets and determining appropriate fairness definitions. Curators struggle with balancing dataset utility and fairness, often constrained by practical limitations such as data availability or cost. The paper emphasizes that fairness is highly contextual, influenced by factors like domain, task, and cultural context. The multiplicity of fairness definitions in existing literature adds complexity, forcing curators to make trade-offs that may compromise certain dimensions of fairness.
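
To make this multiplicity concrete, the sketch below (not from the paper; the labels, predictions, and group assignments are hypothetical toy values) computes two common group-fairness metrics, demographic parity and the true-positive-rate gap used in equalized odds, on the same predictions and shows that they can disagree.

```python
# Minimal sketch: two group-fairness definitions evaluated on the same toy data.
# All values are hypothetical and for illustration only.
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])   # ground-truth labels
y_pred = np.array([1, 0, 1, 0, 1, 1, 0, 0])   # model predictions
group  = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # protected-group membership

def demographic_parity_gap(y_pred, group):
    """Absolute difference in positive-prediction rates between the two groups."""
    rates = [y_pred[group == g].mean() for g in (0, 1)]
    return abs(rates[0] - rates[1])

def tpr_gap(y_true, y_pred, group):
    """Absolute difference in true-positive rates (one component of equalized odds)."""
    tprs = []
    for g in (0, 1):
        mask = (group == g) & (y_true == 1)
        tprs.append(y_pred[mask].mean())
    return abs(tprs[0] - tprs[1])

print("demographic parity gap:", demographic_parity_gap(y_pred, group))  # 0.0
print("TPR gap (equalized odds):", tpr_gap(y_true, y_pred, group))       # ~0.33
```

In this toy example the positive-prediction rates match across groups while the true-positive rates do not, so a curator who optimizes for one definition can still violate another.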

Design

Challenges in the design phase focus on creating fair taxonomies and operationalizing dataset requirements. The inherent unfairness of categorization and the inadequacies of existing domain taxonomies pose significant hurdles. Data availability constraints further complicate taxonomy design, compelling curators to fall back on coarser, less ideal categories.

Implementation

The implementation phase covers data collection and annotation, each presenting unique challenges. Limited availability of data from underrepresented groups is a common issue, and finding data collectors and annotators who have the requisite expertise and come from diverse backgrounds is equally difficult. The authors also discuss the intricacies of fair labor practices, emphasizing the need for equitable compensation and transparent treatment of data workers employed through third-party vendors.

Evaluation

In the evaluation phase, issues with traditional paradigms such as majority voting and annotator agreement metrics arise, especially when these metrics conflict with the goal of capturing diverse perspectives. The lack of comparable benchmarking datasets and the challenge of evaluating fairness as an immeasurable construct are highlighted, complicating efforts to validate datasets.
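
As a concrete illustration of the aggregation problem (this example is not from the paper; the annotations are hypothetical), the sketch below contrasts majority voting, which collapses each item to a single label, with keeping the full label distribution, which preserves dissenting annotator judgments.

```python
# Minimal sketch: majority voting vs. preserving the annotation distribution.
# The annotations below are hypothetical toy values.
from collections import Counter

# Three annotators label whether a comment is "toxic" (1) or "not toxic" (0).
annotations = {
    "comment_1": [1, 1, 0],
    "comment_2": [0, 0, 1],   # one annotator flags harm that the other two miss
}

for item, labels in annotations.items():
    majority = Counter(labels).most_common(1)[0][0]
    distribution = {lbl: cnt / len(labels) for lbl, cnt in Counter(labels).items()}
    # Majority voting keeps a single label; the soft distribution keeps the dissent visible.
    print(item, "majority:", majority, "distribution:", distribution)
```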

Maintenance

The maintenance phase deals with ensuring the ongoing utility of datasets and managing their release. Challenges include the instability of digital data sources and inadequate traceability mechanisms, which make it difficult to monitor dataset usage and prevent misuse. The authors advocate for standardized methods to check data availability and protocols for replacing deprecated instances while maintaining dataset composition.
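
One way such an availability check might look in practice is sketched below. This is an illustrative example rather than a method from the paper, and the manifest file, its `url` column, and the helper names are hypothetical.

```python
# Minimal sketch: auditing link rot in a dataset whose instances are referenced by URL.
import csv
import urllib.error
import urllib.request

def check_availability(url, timeout=10):
    """Return True if the URL still resolves to a non-error response, False otherwise."""
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except (urllib.error.URLError, ValueError):
        return False

def audit(manifest_path):
    """Count reachable vs. missing instances listed in a CSV manifest with a 'url' column."""
    total, missing = 0, 0
    with open(manifest_path, newline="") as f:
        for row in csv.DictReader(f):
            total += 1
            if not check_availability(row["url"]):
                missing += 1
    print(f"{missing}/{total} instances unreachable")

# audit("dataset_manifest.csv")  # hypothetical manifest path
```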

Broader Landscape of Fairness Challenges

Beyond the dataset lifecycle, the paper identifies overarching challenges that span multiple levels: individual, discipline, organization, regulatory, and socio-political.

Individual Level

Individual biases and positionality are inherent in dataset curation, with each contributor's unique perspective inevitably influencing the dataset. The authors recommend reflective practices and pre-registration systems to mitigate such biases.

Discipline Level

The undervaluation of dataset work in ML and the lack of disciplinary incentives for fair data practices are significant barriers. The authors call for more recognition of dataset curation efforts and the inclusion of fairness-oriented tracks in major ML conferences.

Organization Level

Resource constraints, both monetary and time-related, are a major organizational challenge. The authors criticize organizations that promote fairness superficially without substantively integrating it into their practices, and suggest that watchdog organizations could help hold companies accountable.

Regulatory Level

Navigating different legal frameworks and maintaining regulatory compliance are notable challenges. The authors recommend that ML venues include ethical review processes for legal implications and that clear guidelines be established for legal compliance in dataset collection.

Socio-Political Level

The evolving and contested nature of fairness presents a persistent challenge. The authors underscore the need for ongoing revisions to datasets and transparency in documenting the rationale behind fairness definitions. The disparity in power among various institutions further complicates the pursuit of fair datasets.

Implications and Future Directions

The paper's recommendations underscore the importance of systemic changes to facilitate fair dataset curation. These include creating more flexible and robust data practices, incorporating community feedback, establishing transparent and equitable labor practices, and implementing effective traceability mechanisms. The authors highlight the necessity for interdisciplinary communication and the adoption of participatory approaches to better reflect the diverse needs and experiences of impacted communities.

In conclusion, this paper methodically dissects the multifaceted challenges that dataset curators face in the pursuit of fairness, providing a nuanced understanding that bridges theoretical and practical aspects. The detailed taxonomy and actionable recommendations offer a comprehensive framework for advancing fair dataset curation practices, potentially influencing future developments in AI by promoting more equitable and inclusive data-driven models.

Authors (8)
  1. Dora Zhao (17 papers)
  2. Morgan Klaus Scheuerman (4 papers)
  3. Pooja Chitre (1 paper)
  4. Jerone T. A. Andrews (11 papers)
  5. Georgia Panagiotidou (4 papers)
  6. Shawn Walker (5 papers)
  7. Kathleen H. Pine (1 paper)
  8. Alice Xiang (28 papers)