Data and its (dis)contents: A survey of dataset development and use in machine learning research

Published 9 Dec 2020 in cs.LG | (2012.05345v1)

Abstract: Datasets have played a foundational role in the advancement of machine learning research. They form the basis for the models we design and deploy, as well as our primary medium for benchmarking and evaluation. Furthermore, the ways in which we collect, construct and share these datasets inform the kinds of problems the field pursues and the methods explored in algorithm development. However, recent work from a breadth of perspectives has revealed the limitations of predominant practices in dataset collection and use. In this paper, we survey the many concerns raised about the way we collect and use data in machine learning and advocate that a more cautious and thorough understanding of data is necessary to address several of the practical and ethical issues of the field.

Summary

The paper rigorously surveys dataset practices in ML and exposes significant representational biases and ethical concerns.
The paper highlights how spurious data artifacts lead models to rely on superficial correlations rather than genuine reasoning.
The paper critiques inadequate documentation and ethical oversight, urging a shift toward more responsible, context-aware dataset development.

An Analysis of Dataset Practices in Machine Learning

The paper "Data and its (dis)contents: A survey of dataset development and use in machine learning research" by Amandalynne Paullada and colleagues provides a critical examination of datasets, which are pivotal in ML research. The authors meticulously dissect both the strengths and pitfalls of current dataset practices in ML, emphasizing the scientific and ethical implications of data usage.

Key Points

The paper discusses the foundational role of datasets, highlighting their significance in both model development and evaluation. However, it reveals several concerns regarding dataset collection and use:

Representational Bias: A major issue identified is the under-representation of diverse sociodemographic groups. This is evident in datasets that skew towards Western-centric data or under-represent certain racial and gender groups.
Spurious Data Artifacts: The research points to the reliance of ML models on artificial cues rather than genuine task-related reasoning. This is often due to unintentional correlations in datasets that models exploit.
Legitimization of Unjust Tasks: The paper argues that some datasets erroneously legitimize pseudoscientific tasks, such as predicting personal traits from facial images, which are not only ethically dubious but also scientifically unsound.
Annotation and Documentation: The paper identifies a lack of rigorous annotation and documentation standards, which leaves datasets poorly matched to the real-world tasks they are meant to represent and makes results harder to replicate.
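
To make the representational-bias concern concrete, here is a minimal sketch of a representation audit: it compares each group's share of a dataset against a reference distribution and flags groups that fall well short. This is not a method from the paper. The function names, the `region` field, the toy records, the reference shares, and the 0.5 tolerance are all assumptions chosen for illustration.

```python
from collections import Counter


def group_shares(records, key):
    """Return each group's fraction of the records, keyed by group label."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def underrepresented(shares, reference, tolerance=0.5):
    """List groups whose dataset share is below tolerance * reference share.

    Groups that are absent from the dataset count as a share of 0.0.
    """
    return sorted(
        group
        for group, ref_share in reference.items()
        if shares.get(group, 0.0) < tolerance * ref_share
    )


# Hypothetical toy dataset: 10 records tagged with a made-up "region" field.
records = (
    [{"region": "north_america"}] * 6
    + [{"region": "europe"}] * 3
    + [{"region": "africa"}] * 1
)

# Hypothetical reference distribution (illustrative numbers, not real statistics).
reference = {
    "north_america": 0.2,
    "europe": 0.2,
    "africa": 0.3,
    "asia": 0.3,
}

shares = group_shares(records, "region")
flagged = underrepresented(shares, reference)
print(flagged)
```

A count-based check like this only surfaces who is missing from the data. It does not address the contextual and ethical questions the paper raises, and choosing the reference distribution is itself a judgment call.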

Challenges in Current Approaches

Efforts to mitigate dataset biases and enhance model robustness, while well-intentioned, often fail to address the foundational issues. The researchers critique approaches that rely on adversarial datasets or bias mitigation techniques as insufficient, since they may address only superficial dataset flaws without tackling underlying ethical or contextual shortcomings.

Dataset Culture

The research also scrutinizes the broader ML culture, including:

Benchmarking Practices: While benchmarks are crucial for gauging progress, a hyper-focus on benchmark-driven metrics can stifle innovative approaches and overlook context-specific applications.
Data Management: There is a noted gap in secure data storage and ethical dissemination, especially concerning privacy and informed consent, which highlights the need for comprehensive data governance.
Legal and Ethical Concerns: The proliferation of datasets scraped from the internet raises issues of copyright, privacy, and misuse, necessitating a re-evaluation of legal frameworks.

Implications and Future Directions

The paper advocates for a paradigm shift toward a dataset culture that prioritizes ethical considerations, contextual relevance, and collaborative development practices. Future work in AI could benefit from smaller, more carefully curated datasets that focus on genuine task representation and equitable deployment.

Such a shift may inspire new machine learning methods beyond scale-dependent approaches, fostering a field more attuned to human impacts and ethical obligations. The analytical lens offered by this paper underscores the need for interdisciplinary collaboration and a more conscientious approach to dataset design and use.

By addressing these multifaceted issues, researchers and practitioners can better align their work with societal values and mitigate potential harms associated with ML technologies.