An Examination of Disciplinary Practices and Values in Computer Vision Dataset Development
The paper "Do Datasets Have Politics? Disciplinary Values in Computer Vision Dataset Development" by Scheuerman, Denton, and Hanna explores the underlying values and practices prevalent in the generation of datasets used within computer vision research. As a crucial element of machine learning, data serves not only as a medium for training models but also reflects the values that influence the broader field of computer vision. This paper analyzes how dataset documentation and construction reveal implicit disciplinary values and identifies areas where these values conflict with social computing ideals.
Key Themes and Findings
The authors conduct a content analysis of 114 dataset publications drawn from a pool of approximately 500 computer vision datasets. The paper highlights several themes related to data curation, annotation, and dissemination in computer vision, drawing particular attention to the values prioritized in dataset documentation:
- Efficiency vs. Care: Dataset developers prioritize efficiency, seeking ways to collect, annotate, and release data quickly and cheaply. This preference often overshadows the ethical treatment of both annotators and data subjects, including questions of consent and compensation. The devaluation of data labor and ethics reflects a disregard for care in favor of expediency.
- Universality vs. Contextuality: The paper observes a widespread drive toward universal applicability and comprehensive coverage in datasets. The authors argue this often comes at the expense of the contextual nuance needed for robust and fair deployment of computer vision technologies, producing a mismatch between a dataset's intended use and the ecological validity required for trustworthy AI applications.
- Impartiality vs. Positionality: In striving for objectivity, dataset developers aim for impartial data collection and annotation, frequently at the cost of ignoring how their own positionality shapes the dataset. The reluctance to document the subjective judgments and biases involved in curation diminishes the opportunity for nuanced and fair representation, which could otherwise strengthen both the utility and the ethical standing of computer vision datasets.
- Model Work vs. Data Work: Although datasets fuel the development of machine learning models, the field rewards publishing novel algorithms over engaging with the intricacies of data. This bias leads to scant attention to documenting, maintaining, and ethically stewarding datasets, increasing technical debt and hindering reproducibility.
Practical Implications and Future Directions
The findings suggest that to improve the ethical, reliable, and applicable use of datasets, there must be a concerted effort to rebalance these values:
- Emphasizing dataset transparency and documentation, much as the field emphasizes model benchmarking, can ensure that datasets themselves become robust, replicable foundations for AI systems (a hypothetical documentation sketch follows this list).
- Encouraging the use of data licenses and persistent deposit mechanisms, such as DOIs and institutional repositories, can mitigate data accessibility issues and align dataset release more closely with open science practices.
- Developing disciplines within computer vision that focus explicitly on data cultivation and stewardship, akin to software engineering practices, could better legitimize data work and encourage ethical curation.
- Embracing methods such as reflexivity and positionality statements can help researchers critically engage with their own biases, foster greater representation in datasets, and align datasets meaningfully with cultural and geographic contexts.
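To make the documentation and positionality recommendations concrete, the sketch below shows one hypothetical way a team might record datasheet-style metadata alongside a dataset. The `DatasetCard` class and its field names are illustrative assumptions rather than a schema proposed by the authors; they simply gather the kinds of information (consent, annotator compensation, known limitations, positionality) that the paper finds under-documented.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class DatasetCard:
    """Hypothetical datasheet-style record for values the paper finds
    under-documented: provenance, consent, labor, and positionality."""
    name: str
    version: str
    intended_uses: list[str]
    out_of_scope_uses: list[str]
    collection_process: str            # how and where source data was gathered
    consent_obtained: bool             # were data subjects asked for consent?
    annotator_compensation: str        # e.g. hourly wage or per-task rate
    known_limitations: list[str] = field(default_factory=list)
    positionality_statement: str = ""  # who built this and from what perspective
    license: str = "unspecified"
    doi: str = ""                      # persistent identifier, if deposited

    def to_json(self, path: str) -> None:
        """Write the card next to the dataset so documentation ships with the data."""
        with open(path, "w", encoding="utf-8") as f:
            json.dump(asdict(self), f, indent=2)


# Example usage with placeholder values.
card = DatasetCard(
    name="example-street-scenes",
    version="0.1.0",
    intended_uses=["research on pedestrian detection"],
    out_of_scope_uses=["surveillance of individuals"],
    collection_process="frames sampled from publicly released dashcam footage",
    consent_obtained=False,
    annotator_compensation="contracted annotators paid an $18/hour rate",
    known_limitations=["images collected in only two cities"],
    positionality_statement="curated by a vision research team based in the US",
    license="CC BY-NC 4.0",
)
card.to_json("example-street-scenes.card.json")
```

Storing such a record as plain JSON alongside the data keeps the documentation versioned and machine-readable without prescribing any particular tooling.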
The authors advocate for such nuanced data practices to ensure that the development of computer vision models does not merely innovate technically but also reflects a deeper commitment to fairness and ethical soundness in AI systems. The paper thus contributes to the ongoing discourse on integrating sociotechnical insights into machine learning practice, particularly within high-impact domains like computer vision.