Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
80 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
43 tokens/sec
o3 Pro
7 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Natural Language Processing for Dialects of a Language: A Survey (2401.05632v4)

Published 11 Jan 2024 in cs.CL

Abstract: State-of-the-art NLP models are trained on massive training corpora, and report a superlative performance on evaluation datasets. This survey delves into an important attribute of these datasets: the dialect of a language. Motivated by the performance degradation of NLP models for dialectal datasets and its implications for the equity of language technologies, we survey past research in NLP for dialects in terms of datasets, and approaches. We describe a wide range of NLP tasks in terms of two categories: natural language understanding (NLU) (for tasks such as dialect classification, sentiment analysis, parsing, and NLU benchmarks) and natural language generation (NLG) (for summarisation, machine translation, and dialogue systems). The survey is also broad in its coverage of languages which include English, Arabic, German, among others. We observe that past work in NLP concerning dialects goes deeper than mere dialect classification, and extends to several NLU and NLG tasks. For these tasks, we describe classical machine learning using statistical models, along with the recent deep learning-based approaches based on pre-trained LLMs. We expect that this survey will be useful to NLP researchers interested in building equitable language technologies by rethinking LLM benchmarks and model architectures.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (7)
  1. Aditya Joshi (43 papers)
  2. Raj Dabre (65 papers)
  3. Diptesh Kanojia (58 papers)
  4. Zhuang Li (69 papers)
  5. Haolan Zhan (27 papers)
  6. Gholamreza Haffari (141 papers)
  7. Doris Dippold (2 papers)
Citations (16)