AboutMe: Using Self-Descriptions in Webpages to Document the Effects of English Pretraining Data Filters (2401.06408v3)

Published 12 Jan 2024 in cs.CL

Abstract: Large language models' (LLMs) abilities are drawn from their pretraining data, and model development begins with data curation. However, decisions around what data is retained or removed during this initial stage are under-scrutinized. In our work, we ground web text, which is a popular pretraining data source, to its social and geographic contexts. We create a new dataset of 10.3 million self-descriptions of website creators, and extract information about who they are and where they are from: their topical interests, social roles, and geographic affiliations. Then, we conduct the first study investigating how ten "quality" and English language identification (langID) filters affect webpages that vary along these social dimensions. Our experiments illuminate a range of implicit preferences in data curation: we show that some quality classifiers act like topical domain filters, and langID can overlook English content from some regions of the world. Overall, we hope that our work will encourage a new line of research on pretraining data curation practices and their social implications.


Summary

  • The paper demonstrates how pretraining filters inadvertently bias LLM datasets by disproportionately excluding diverse social and regional content.
  • Researchers curated the AboutMe dataset of 10.3M self-descriptions to assess the socio-demographic impact of standard quality and language filters.
  • Findings reveal that these filters favor individual pages over organizational ones, highlighting the need for more equitable and conscious data curation practices.

Overview of the Study

The development of LLMs begins with data curation, the stage at which decisions are made about which data to retain or exclude. This paper presents an in-depth examination of how commonly applied "quality" and English language identification (langID) filters influence which webpages are represented in LLM pretraining datasets. The researchers built a large dataset, named AboutMe, consisting of 10.3 million self-descriptions from website creators, capturing their topical interests, social roles, and geographic affiliations. The paper sheds light on how these web text filters may shape the social and geographic diversity reflected in LLMs.

Analyzing Social Dimensions

Understanding the impact of LLM data curation calls for a sociolinguistic approach, and the AboutMe dataset is instrumental in this analysis. Derived from Common Crawl archives, the dataset enables investigation of the social dimensions of web language at a socio-demographic granularity rarely achieved before. By homing in on the self-descriptions found on "about" pages, the researchers categorized web creators as individuals or organizations and tagged them with social roles, occupational segments, and geographic affiliations. The dataset serves both as the foundation for this paper's analysis and as a potential resource for broader research on self-presentation and language variation.
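
To make the pipeline concrete, here is a minimal, hypothetical sketch of how "about" pages might be identified and heuristically labeled. The URL suffixes, the pronoun heuristic, and the `ROLE_PATTERN` regex below are illustrative assumptions for exposition, not the paper's exact rules.

```python
import re

# Illustrative URL suffixes that often host self-descriptions;
# the paper's actual page-selection rules may differ.
ABOUT_SUFFIXES = ("/about", "/about-me", "/aboutme", "/about-us", "/aboutus", "/bio")

def is_about_page(url: str) -> bool:
    """Heuristically flag URLs that look like 'about' pages."""
    path = url.split("?", 1)[0].rstrip("/").lower()
    return path.endswith(ABOUT_SUFFIXES)

# First-person pronoun counts as a crude individual-vs-organization signal.
FIRST_SINGULAR = re.compile(r"\b(i|me|my|mine)\b", re.IGNORECASE)
FIRST_PLURAL = re.compile(r"\b(we|us|our|ours)\b", re.IGNORECASE)

def creator_type(text: str) -> str:
    """Label a self-description as written by an individual or an organization."""
    singular = len(FIRST_SINGULAR.findall(text))
    plural = len(FIRST_PLURAL.findall(text))
    if singular == 0 and plural == 0:
        return "unknown"
    return "individual" if singular >= plural else "organization"

# A simple "I am a/an X" pattern as a stand-in for role tagging.
ROLE_PATTERN = re.compile(r"\bI\s+am\s+an?\s+([A-Za-z]+)")

def extract_roles(text: str) -> list[str]:
    """Pull candidate social roles out of a self-description."""
    return [role.lower() for role in ROLE_PATTERN.findall(text)]

print(is_about_page("https://example.com/about-me"))        # True
print(creator_type("I am a photographer based in Lagos."))  # individual
print(extract_roles("I am a photographer and I am a dad.")) # ['photographer', 'dad']
```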

The Impact of Pretraining Filters

The paper explores the consequences of applying ten different "quality" and language filters to the AboutMe dataset. Intriguingly, the research reveals that seemingly generic quality filters can inadvertently function as content filters, favoring certain topical domains while disregarding others. Moreover, English langID filters tend to overlook English content from specific world regions, suggesting biases that tie into broader linguistic and geopolitical structures. This is particularly consequential given the global reach of LLMs, which serve users from varied linguistic and cultural backgrounds.
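
As a concrete illustration of the kind of filter under audit, below is a minimal sketch of a threshold-based English langID filter, assuming fastText's publicly released `lid.176.bin` model and the `fasttext` Python package. The paper evaluates several such filters (e.g., CLD2/CLD3, langdetect, fastText), and its exact models and thresholds may differ.

```python
import fasttext

# Assumes fastText's public language-identification model has been
# downloaded locally as lid.176.bin.
model = fasttext.load_model("lid.176.bin")

def keep_as_english(page_text: str, threshold: float = 0.5) -> bool:
    """Retain a page only if the top predicted language is English
    with confidence at or above the threshold."""
    # fastText's predict() rejects newlines, so flatten the text first.
    labels, probs = model.predict(page_text.replace("\n", " "), k=1)
    return labels[0] == "__label__en" and probs[0] >= threshold

# Pages written in regional varieties of English can score below the
# threshold and be silently dropped, which is the failure mode the
# paper documents.
print(keep_as_english("Welcome to my page! I write about food and travel."))
```

Quality classifiers operate analogously: roughly speaking, a binary classifier scores each page by its similarity to a reference corpus such as Wikipedia, and pages below a score threshold are dropped, which is one route by which topical preferences creep into "quality" filtering.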

Insights and Implications

The findings illuminate subtle biases within LLM pretraining data curation. For instance, web content from Asia, as well as content associated with certain social roles and occupations, is removed at higher rates by quality and language filters. Surprisingly, pages created by individuals are more likely to be retained than those created by organizations. The paper emphasizes that AI developers and users should recognize the silent yet substantial preferences embedded within data filtering practices. With this recognition, the AI community can work toward more conscious, equitable, and representative LLM development that honors the linguistic and social diversity of the global population. The authors have released the code, data, and resources used in the paper to catalyze ongoing research on pretraining data curation methods and their societal effects.
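
The disparity measurement behind findings like these is straightforward to reproduce. Here is a minimal sketch with hypothetical toy data, where each row stands in for an AboutMe page annotated with an inferred region and a per-filter removal flag; the column names and values are illustrative, not the paper's.

```python
import pandas as pd

# Hypothetical toy data: one row per AboutMe page, with the creator's
# inferred region and whether a given quality filter removed the page.
pages = pd.DataFrame({
    "region": ["Asia", "Asia", "Europe", "Africa", "N. America", "Europe"],
    "removed_by_quality": [True, True, False, True, False, False],
})

# Per-region removal rate: the per-group statistic used to surface
# disparate filtering effects.
removal_rates = pages.groupby("region")["removed_by_quality"].mean()
print(removal_rates.sort_values(ascending=False))
```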