AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters (2401.06408v3)

Published 12 Jan 2024 in cs.CL

Abstract: Large language models' (LLMs) abilities are drawn from their pretraining data, and model development begins with data curation. However, decisions around what data is retained or removed during this initial stage are under-scrutinized. In our work, we ground web text, which is a popular pretraining data source, to its social and geographic contexts. We create a new dataset of 10.3 million self-descriptions of website creators, and extract information about who they are and where they are from: their topical interests, social roles, and geographic affiliations. Then, we conduct the first study investigating how ten "quality" and English language identification (langID) filters affect webpages that vary along these social dimensions. Our experiments illuminate a range of implicit preferences in data curation: we show that some quality classifiers act like topical domain filters, and langID can overlook English content from some regions of the world. Overall, we hope that our work will encourage a new line of research on pretraining data curation practices and their social implications.


Summary

  • The paper demonstrates how pretraining filters inadvertently bias LLM datasets by disproportionately excluding diverse social and regional content.
  • Researchers curated the AboutMe dataset of 10.3M self-descriptions to assess the socio-demographic impact of standard quality and language filters.
  • Findings reveal that these filters favor individual pages over organizational ones, highlighting the need for more equitable and conscious data curation practices.

Overview of the Study

The development of LLMs begins with data curation, the stage at which decisions are made about which data to retain or exclude. This paper presents an in-depth examination of how commonly applied "quality" and English language identification (langID) filters influence which webpages are represented in LLM pretraining datasets. The researchers built a large dataset, named AboutMe, consisting of 10.3 million self-descriptions from website creators, capturing their topical interests, social roles, and geographic affiliations. The paper sheds light on how these web text filters may shape the social and geographic diversity reflected in LLMs.

Analyzing Social Dimensions

Understanding the impact of LLM data curation calls for a sociolinguistic approach, and the AboutMe dataset is instrumental in this analysis. Derived from Common Crawl archives, the dataset enables investigation of the social dimensions of web language at a socio-demographic granularity rarely achieved before. By homing in on the self-descriptions found on "about" pages, the researchers categorized web creators as individuals or organizations and tagged them with social roles, occupational segments, and geographic affiliations. The dataset serves both as the foundation for this paper's analysis and as a potential resource for broader research on self-presentation and language variation.
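
To make the pipeline concrete, here is a minimal, hypothetical sketch of how "about" pages might be identified and heuristically labeled. The URL suffixes, the pronoun heuristic, and the `ROLE_PATTERN` regex below are illustrative assumptions for exposition, not the paper's exact rules.

```python
import re

# Illustrative URL suffixes that often host self-descriptions;
# the paper's actual page-selection rules may differ.
ABOUT_SUFFIXES = ("/about", "/about-me", "/aboutme", "/about-us", "/aboutus", "/bio")

def is_about_page(url: str) -> bool:
    """Heuristically flag URLs that look like 'about' pages."""
    path = url.split("?", 1)[0].rstrip("/").lower()
    return path.endswith(ABOUT_SUFFIXES)

# First-person pronoun counts as a crude individual-vs-organization signal.
FIRST_SINGULAR = re.compile(r"\b(i|me|my|mine)\b", re.IGNORECASE)
FIRST_PLURAL = re.compile(r"\b(we|us|our|ours)\b", re.IGNORECASE)

def creator_type(text: str) -> str:
    """Label a self-description as written by an individual or an organization."""
    singular = len(FIRST_SINGULAR.findall(text))
    plural = len(FIRST_PLURAL.findall(text))
    if singular == 0 and plural == 0:
        return "unknown"
    return "individual" if singular >= plural else "organization"

# A simple "I am a/an X" pattern as a stand-in for role tagging.
ROLE_PATTERN = re.compile(r"\bI\s+am\s+an?\s+([A-Za-z]+)")

def extract_roles(text: str) -> list[str]:
    """Pull candidate social roles out of a self-description."""
    return [role.lower() for role in ROLE_PATTERN.findall(text)]

print(is_about_page("https://example.com/about-me"))        # True
print(creator_type("I am a photographer based in Lagos."))  # individual
print(extract_roles("I am a photographer and I am a dad.")) # ['photographer', 'dad']
```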

The Impact of Pretraining Filters

The paper explores the consequences of applying ten different "quality" and language filters to the AboutMe dataset. Intriguingly, the research reveals that seemingly generic quality filters can inadvertently function as content filters, favoring certain topical domains while disregarding others. Moreover, English langID filters tend to overlook English content from specific world regions, suggesting biases that tie into broader linguistic and geopolitical structures. This is particularly consequential given the global reach of LLMs, which serve users from varied linguistic and cultural backgrounds.
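
As a concrete illustration of the kind of filter under audit, below is a minimal sketch of a threshold-based English langID filter, assuming fastText's publicly released `lid.176.bin` model and the `fasttext` Python package. The paper evaluates several such filters (e.g., CLD2/CLD3, langdetect, fastText), and its exact models and thresholds may differ.

```python
import fasttext

# Assumes fastText's public language-identification model has been
# downloaded locally as lid.176.bin.
model = fasttext.load_model("lid.176.bin")

def keep_as_english(page_text: str, threshold: float = 0.5) -> bool:
    """Retain a page only if the top predicted language is English
    with confidence at or above the threshold."""
    # fastText's predict() rejects newlines, so flatten the text first.
    labels, probs = model.predict(page_text.replace("\n", " "), k=1)
    return labels[0] == "__label__en" and probs[0] >= threshold

# Pages written in regional varieties of English can score below the
# threshold and be silently dropped, which is the failure mode the
# paper documents.
print(keep_as_english("Welcome to my page! I write about food and travel."))
```

Quality classifiers operate analogously: roughly speaking, a binary classifier scores each page by its similarity to a reference corpus such as Wikipedia, and pages below a score threshold are dropped, which is one route by which topical preferences creep into "quality" filtering.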

Insights and Implications

The findings illuminate subtle biases within LLM pretraining data curation. For instance, web content from Asia, as well as content associated with certain social roles and occupations, is removed at higher rates by quality and language filters. Surprisingly, pages created by individuals are more likely to be retained than those created by organizations. The paper emphasizes that AI developers and users should recognize the silent yet substantial preferences embedded within data filtering practices. With this recognition, the AI community can work toward more conscious, equitable, and representative LLM development that honors the linguistic and social diversity of the global population. The authors have released the code, data, and resources used in the paper to catalyze ongoing research on pretraining data curation methods and their societal effects.
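
The disparity measurement behind findings like these is straightforward to reproduce. Here is a minimal sketch with hypothetical toy data, where each row stands in for an AboutMe page annotated with an inferred region and a per-filter removal flag; the column names and values are illustrative, not the paper's.

```python
import pandas as pd

# Hypothetical toy data: one row per AboutMe page, with the creator's
# inferred region and whether a given quality filter removed the page.
pages = pd.DataFrame({
    "region": ["Asia", "Asia", "Europe", "Africa", "N. America", "Europe"],
    "removed_by_quality": [True, True, False, True, False, False],
})

# Per-region removal rate: the per-group statistic used to surface
# disparate filtering effects.
removal_rates = pages.groupby("region")["removed_by_quality"].mean()
print(removal_rates.sort_values(ascending=False))
```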