An Examination of Disciplinary Practices and Values in Computer Vision Dataset Development
The paper "Do Datasets Have Politics? Disciplinary Values in Computer Vision Dataset Development" by Scheuerman, Denton, and Hanna explores the underlying values and practices prevalent in the generation of datasets used within computer vision research. As a crucial element of machine learning, data serves not only as a medium for training models but also reflects the values that influence the broader field of computer vision. This paper analyzes how dataset documentation and construction reveal implicit disciplinary values and identifies areas where these values conflict with social computing ideals.
Key Themes and Findings
The authors conduct a content analysis of 114 dataset publications drawn from a pool of approximately 500 computer vision datasets. The paper highlights several themes related to data curation, annotation, and dissemination in computer vision, drawing particular attention to the values prioritized in dataset documentation:
- Efficiency vs. Care: Dataset developers prioritize efficiency, seeking ways to collect, annotate, and release data quickly and cheaply. This preference often overshadows the ethical treatment of both annotators and data subjects, including questions of consent and compensation. The devaluation of data labor and ethics reflects a disregard for care in favor of expediency.
- Universality vs. Contextuality: The paper observes a widespread drive toward universal applicability and comprehensive coverage in datasets. The authors argue this often comes at the expense of the contextual nuance needed for robust and fair deployment of computer vision technologies, producing a mismatch between a dataset's intended use and the ecological validity required for trustworthy AI applications.
- Impartiality vs. Positionality: In striving for objectivity, dataset developers aim for impartial data collection and annotation, frequently at the cost of ignoring how their own positionality shapes the dataset. The reluctance to document the subjective judgments and biases involved in curation diminishes the opportunity for nuanced and fair representation, which could otherwise strengthen both the utility and the ethical standing of computer vision datasets.
- Model Work vs. Data Work: Although datasets fuel the development of machine learning models, the field rewards publishing novel algorithms over engaging with the intricacies of data. This bias leads to scant attention to documenting, maintaining, and ethically stewarding datasets, increasing technical debt and hindering reproducibility.
Practical Implications and Future Directions
The findings suggest that to improve the ethical, reliable, and applicable use of datasets, there must be a concerted effort to rebalance these values:
- Emphasizing dataset transparency and documentation, much as the field emphasizes model benchmarking, can ensure that datasets themselves become robust, replicable foundations for AI systems (a hypothetical documentation sketch follows this list).
- Encouraging the use of data licenses and persistent deposit mechanisms, such as DOIs and institutional repositories, can mitigate data accessibility issues and align dataset release more closely with open science practices.
- Developing disciplines within computer vision that focus explicitly on data cultivation and stewardship, akin to software engineering practices, could better legitimize data work and encourage ethical curation.
- Embracing methods such as reflexivity and positionality statements can help researchers critically engage with their own biases, foster greater representation in datasets, and align datasets meaningfully with cultural and geographic contexts.
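To make the documentation and positionality recommendations concrete, the sketch below shows one hypothetical way a team might record datasheet-style metadata alongside a dataset. The `DatasetCard` class and its field names are illustrative assumptions rather than a schema proposed by the authors; they simply gather the kinds of information (consent, annotator compensation, known limitations, positionality) that the paper finds under-documented.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class DatasetCard:
    """Hypothetical datasheet-style record for values the paper finds
    under-documented: provenance, consent, labor, and positionality."""
    name: str
    version: str
    intended_uses: list[str]
    out_of_scope_uses: list[str]
    collection_process: str            # how and where source data was gathered
    consent_obtained: bool             # were data subjects asked for consent?
    annotator_compensation: str        # e.g. hourly wage or per-task rate
    known_limitations: list[str] = field(default_factory=list)
    positionality_statement: str = ""  # who built this and from what perspective
    license: str = "unspecified"
    doi: str = ""                      # persistent identifier, if deposited

    def to_json(self, path: str) -> None:
        """Write the card next to the dataset so documentation ships with the data."""
        with open(path, "w", encoding="utf-8") as f:
            json.dump(asdict(self), f, indent=2)


# Example usage with placeholder values.
card = DatasetCard(
    name="example-street-scenes",
    version="0.1.0",
    intended_uses=["research on pedestrian detection"],
    out_of_scope_uses=["surveillance of individuals"],
    collection_process="frames sampled from publicly released dashcam footage",
    consent_obtained=False,
    annotator_compensation="contracted annotators paid an $18/hour rate",
    known_limitations=["images collected in only two cities"],
    positionality_statement="curated by a vision research team based in the US",
    license="CC BY-NC 4.0",
)
card.to_json("example-street-scenes.card.json")
```

Storing such a record as plain JSON alongside the data keeps the documentation versioned and machine-readable without prescribing any particular tooling.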
The authors advocate for such nuanced data practices to ensure that the development of computer vision models does not merely innovate technically but also reflects a deeper commitment to fairness and ethical soundness in AI systems. The paper thus contributes to the ongoing discourse on integrating sociotechnical insights into machine learning practice, particularly within high-impact domains like computer vision.