NusaX: Multilingual Parallel Sentiment Dataset for 10 Indonesian Local Languages (2205.15960v2)

Published 31 May 2022 in cs.CL

Abstract: Natural language processing (NLP) has a significant impact on society via technologies such as machine translation and search engines. Despite its success, NLP technology is only widely available for high-resource languages such as English and Chinese, while it remains inaccessible to many languages due to the unavailability of data resources and benchmarks. In this work, we focus on developing resources for languages in Indonesia. Although Indonesia is the second most linguistically diverse country, most of its languages are categorized as endangered and some are even extinct. We develop the first-ever parallel resource for 10 low-resource languages in Indonesia. Our resource includes datasets, a multi-task benchmark, and lexicons, as well as a parallel Indonesian-English dataset. We provide extensive analyses and describe the challenges when creating such resources. We hope that our work can spark NLP research on Indonesian and other underrepresented languages.

Authors (14)
  1. Genta Indra Winata (94 papers)
  2. Alham Fikri Aji (94 papers)
  3. Samuel Cahyawijaya (75 papers)
  4. Rahmad Mahendra (14 papers)
  5. Fajri Koto (47 papers)
  6. Ade Romadhony (4 papers)
  7. Kemal Kurniawan (13 papers)
  8. David Moeljadi (5 papers)
  9. Radityo Eko Prasojo (13 papers)
  10. Pascale Fung (151 papers)
  11. Timothy Baldwin (125 papers)
  12. Jey Han Lau (67 papers)
  13. Rico Sennrich (88 papers)
  14. Sebastian Ruder (93 papers)
Citations (69)