
Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning (1912.10389v1)

Published 22 Dec 2019 in cs.LG, cs.AI, and cs.CY

Abstract: A growing body of work shows that many problems in fairness, accountability, transparency, and ethics in machine learning systems are rooted in decisions surrounding the data collection and annotation process. In spite of its fundamental nature however, data collection remains an overlooked part of the ML pipeline. In this paper, we argue that a new specialization should be formed within ML that is focused on methodologies for data collection and annotation: efforts that require institutional frameworks and procedures. Specifically for sociocultural data, parallels can be drawn from archives and libraries. Archives are the longest standing communal effort to gather human information and archive scholars have already developed the language and procedures to address and discuss many challenges pertaining to data collection such as consent, power, inclusivity, transparency, and ethics & privacy. We discuss these five key approaches in document collection practices in archives that can inform data collection in sociocultural ML. By showing data collection practices from another field, we encourage ML research to be more cognizant and systematic in data collection and draw from interdisciplinary expertise.

Authors (2)
  1. Eun Seo Jo (2 papers)
  2. Timnit Gebru (15 papers)
Citations (284)

Summary

An Analysis of "Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning"

The paper "Lessons from Archives: Strategies for Collecting Sociocultural Data in Machine Learning" examines what data collection in ML can learn from archival science. It argues for a dedicated subfield within ML focused on rigorous methodologies for data collection and annotation, particularly for sociocultural datasets. The authors, Eun Seo Jo and Timnit Gebru, posit that many problems of fairness, accountability, transparency, and ethics (FATE) in ML systems are rooted in decisions made during data collection and annotation. They propose that the well-established practices of archival studies offer valuable insights that could reshape data collection in ML.

Key Contributions and Insights

The authors emphasize several parallels between archival science and ML data practices, suggesting that some of the robust frameworks established in archives can address existing gaps in ML data collection. They draw on five areas of archival practice that can be integrated into the ML pipeline: inclusivity, consent, power dynamics, transparency, and ethics and privacy. Specifically, they illustrate how these principles are embedded in archival practice through mission statements, community archives, data consortia, appraisal records, and ethical codes of conduct, all of which are currently underdeveloped in the ML domain.

  1. Inclusivity and Mission Statements: Archives employ mission statements to guide the gathering of inclusive collections. ML dataset creators can adopt a similar approach to move beyond the constraints of convenience and availability, actively seeking diversity in the data they collect. This can improve the representation of minority and underrepresented groups in ML datasets.
  2. Consent and Community Archives: Drawing analogies with participatory and community archives, the authors highlight the importance of enabling data subjects to contribute willingly and influence how they are represented. This is particularly important in the context of sensitive sociocultural data, where misrepresentation can perpetuate harmful stereotypes.
  3. Power: Data Consortia: The establishment of data consortia, akin to library consortia, could alleviate the burdens of data collection. Shared resources and standardized practices across institutions may democratize data access, allowing smaller players to partake in the benefits of large-scale data projects.
  4. Transparency: Appraisal Records and Committees: Archives maintain detailed records of their collection decisions, which inform future users and ensure accountability. ML can benefit from adopting such documentation standards, enhancing understanding and oversight of data provenance and transformation.
  5. Ethics and Privacy Standards: A professional framework endorsing ethical data collection could protect data providers and maintain public trust in ML technologies. The process involves adopting comprehensive codes of conduct similar to those in archival communities, potentially supported by independent oversight bodies.
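
As a rough illustration of how the mission-statement and appraisal-record ideas above might look in an ML workflow, the following is a minimal sketch of a dataset "dossier" that logs each inclusion or exclusion decision alongside its rationale. It is not taken from the paper: the class names `AppraisalRecord` and `DatasetDossier`, every field, and the example values are hypothetical and chosen only for illustration.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date
from typing import List


@dataclass
class AppraisalRecord:
    """One documented decision about a candidate data source (hypothetical schema)."""
    source: str
    decision: str            # either "include" or "exclude"
    rationale: str
    decided_by: str
    decided_on: date
    consent_obtained: bool


@dataclass
class DatasetDossier:
    """A dataset's mission statement plus its running log of appraisal decisions."""
    name: str
    mission_statement: str
    records: List[AppraisalRecord] = field(default_factory=list)

    def log(self, record: AppraisalRecord) -> None:
        # Reject anything other than the two decisions this sketch supports.
        if record.decision not in ("include", "exclude"):
            raise ValueError(f"unknown decision: {record.decision!r}")
        self.records.append(record)

    def included_sources(self) -> List[str]:
        return [r.source for r in self.records if r.decision == "include"]

    def to_json(self) -> str:
        # default=str turns date objects into ISO-formatted strings.
        return json.dumps(asdict(self), default=str, indent=2)


# Illustrative values only.
dossier = DatasetDossier(
    name="community-oral-histories",
    mission_statement="Represent speakers across regions and age groups.",
)
dossier.log(AppraisalRecord(
    source="partner-archive-interviews",
    decision="include",
    rationale="Contributors opted in and reviewed their transcripts.",
    decided_by="appraisal committee",
    decided_on=date(2020, 1, 15),
    consent_obtained=True,
))
dossier.log(AppraisalRecord(
    source="scraped-forum-posts",
    decision="exclude",
    rationale="No consent; sampling skews toward a single demographic.",
    decided_by="appraisal committee",
    decided_on=date(2020, 1, 15),
    consent_obtained=False,
))
print(dossier.included_sources())
```

Recording exclusions as well as inclusions is the point of the sketch: it keeps the reasoning behind a dataset's composition available to later users, which is the role the summary attributes to appraisal records in archives.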

Implications and Future Directions

The paper underscores the importance of interdisciplinary collaboration in refining the ML data pipeline. By integrating archival methodologies, ML can better address and mitigate ethical issues surrounding data use. Creating a specialized field within ML dedicated to data collection and annotation would institutionalize these practices. However, the paper also acknowledges inherent challenges, such as scaling and resource allocation, that require careful consideration.

The implications of such integration are broad. For ML practitioners, the proposed frameworks can guide the construction of datasets that are not only robust in performance but also just and equitable in impact. For researchers, the interdisciplinary approach encourages engagement with diverse fields, enriching the methodological toolkit available for data collection in ML.

In conclusion, this paper advocates a balanced approach wherein ML can draw on established archival practices to build more ethically sound systems. It initiates a conversation that can lead to the development of more sustainable and fair ML ecosystems, ensuring technology serves a wider and more inclusive audience. This discourse paves the way for future research to explore not only how these archival lessons can be practically implemented in ML but also how they might evolve to address emerging challenges in AI.