NusaCrowd: Open Source Initiative for Indonesian NLP Resources (2212.09648v4)

Published 19 Dec 2022 in cs.CL and cs.AI

Abstract: We present NusaCrowd, a collaborative initiative to collect and unify existing resources for Indonesian languages, including opening access to previously non-public resources. Through this initiative, we have brought together 137 datasets and 118 standardized data loaders. The quality of the datasets has been assessed manually and automatically, and their value is demonstrated through multiple experiments. NusaCrowd's data collection enables the creation of the first zero-shot benchmarks for natural language understanding and generation in Indonesian and the local languages of Indonesia. It also enables the creation of the first multilingual automatic speech recognition benchmark in Indonesian and the local languages of Indonesia. Our work strives to advance NLP research for languages that are under-represented despite being widely spoken.

Authors (47)
  1. Samuel Cahyawijaya (75 papers)
  2. Holy Lovenia (30 papers)
  3. Alham Fikri Aji (94 papers)
  4. Genta Indra Winata (94 papers)
  5. Bryan Wilie (24 papers)
  6. Rahmad Mahendra (14 papers)
  7. Christian Wibisono (1 paper)
  8. Ade Romadhony (4 papers)
  9. Karissa Vincentio (5 papers)
  10. Fajri Koto (47 papers)
  11. Jennifer Santoso (2 papers)
  12. David Moeljadi (5 papers)
  13. Cahya Wirawan (2 papers)
  14. Frederikus Hudi (6 papers)
  15. Ivan Halim Parmonangan (2 papers)
  16. Ika Alfina (2 papers)
  17. Muhammad Satrio Wicaksono (1 paper)
  18. Ilham Firdausi Putra (3 papers)
  19. Samsul Rahmadani (1 paper)
  20. Yulianti Oenang (1 paper)
Citations (39)
