
Is Machine Learning Speaking my Language? A Critical Look at the NLP-Pipeline Across 8 Human Languages (2007.05872v1)

Published 11 Jul 2020 in cs.CL and cs.LG

Abstract: NLP is increasingly used as a key ingredient in critical decision-making systems such as resume parsers used in sorting a list of job candidates. NLP systems often ingest large corpora of human text, attempting to learn from past human behavior and decisions in order to produce systems that will make recommendations about our future world. Over 7000 human languages are spoken today, and the typical NLP pipeline underrepresents speakers of most of them while amplifying the voices of speakers of other languages. In this paper, a team including speakers of 8 languages - English, Chinese, Urdu, Farsi, Arabic, French, Spanish, and Wolof - takes a critical look at the typical NLP pipeline and at how, even when a language is technically supported, substantial caveats remain that prevent full participation. Despite huge and admirable investments in multilingual support in many tools and resources, we are still making NLP-guided decisions that systematically and dramatically underrepresent the voices of much of the world.

Authors (7)
  1. Esma Wali (2 papers)
  2. Yan Chen (272 papers)
  3. Christopher Mahoney (1 paper)
  4. Thomas Middleton (1 paper)
  5. Marzieh Babaeianjelodar (6 papers)
  6. Mariama Njie (2 papers)
  7. Jeanna Neefe Matthews (1 paper)
Citations (9)