NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural (2403.01817v1)

Published 4 Mar 2024 in cs.CL

Abstract: Indonesia's linguistic landscape is remarkably diverse, encompassing over 700 languages and dialects, making it one of the world's most linguistically rich nations. This diversity, coupled with the widespread practice of code-switching and the presence of low-resource regional languages, presents unique challenges for modern pre-trained language models. In response to these challenges, we developed NusaBERT, building upon IndoBERT by incorporating vocabulary expansion and leveraging a diverse multilingual corpus that includes regional languages and dialects. Through rigorous evaluation across a range of benchmarks, NusaBERT demonstrates state-of-the-art performance in tasks involving multiple languages of Indonesia, paving the way for future natural language understanding research for under-represented languages.
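The core adaptation step described in the abstract, expanding IndoBERT's vocabulary before continued pre-training on a multilingual corpus, can be illustrated with the Hugging Face transformers API. The sketch below is illustrative only: the checkpoint name indobenchmark/indobert-base-p1 and the added wordpieces are assumptions for demonstration, not NusaBERT's actual vocabulary or exact procedure.

```python
# Minimal sketch of vocabulary expansion ahead of continued pre-training,
# assuming the indobenchmark/indobert-base-p1 checkpoint on the Hub.
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-base-p1")
model = AutoModelForMaskedLM.from_pretrained("indobenchmark/indobert-base-p1")

# Hypothetical wordpieces mined from regional-language corpora;
# NusaBERT's actual added tokens are derived from the paper's corpus.
new_tokens = ["matur", "nuwun", "wenten", "##nipun"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix to cover the new tokens. The new rows are
# freshly initialized and must then be learned during continued
# masked-language-model pre-training on the multilingual corpus.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```

A common refinement at this step is to initialize the new embedding rows from the average of existing subword embeddings rather than randomly, which tends to speed up adaptation during continued pre-training.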
