NusaBERT: Teaching IndoBERT to be Multilingual and Multicultural (2403.01817v1)

Published 4 Mar 2024 in cs.CL

Abstract: Indonesia's linguistic landscape is remarkably diverse, encompassing over 700 languages and dialects, making it one of the world's most linguistically rich nations. This diversity, coupled with the widespread practice of code-switching and the presence of low-resource regional languages, presents unique challenges for modern pre-trained language models. In response to these challenges, we developed NusaBERT, building upon IndoBERT by incorporating vocabulary expansion and leveraging a diverse multilingual corpus that includes regional languages and dialects. Through rigorous evaluation across a range of benchmarks, NusaBERT demonstrates state-of-the-art performance in tasks involving multiple languages of Indonesia, paving the way for future natural language understanding research for under-represented languages.
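The core adaptation step described in the abstract, expanding IndoBERT's vocabulary before continued pre-training on a multilingual corpus, can be illustrated with the Hugging Face transformers API. The sketch below is illustrative only: the checkpoint name indobenchmark/indobert-base-p1 and the added wordpieces are assumptions for demonstration, not NusaBERT's actual vocabulary or exact procedure.

```python
# Minimal sketch of vocabulary expansion ahead of continued pre-training,
# assuming the indobenchmark/indobert-base-p1 checkpoint on the Hub.
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("indobenchmark/indobert-base-p1")
model = AutoModelForMaskedLM.from_pretrained("indobenchmark/indobert-base-p1")

# Hypothetical wordpieces mined from regional-language corpora;
# NusaBERT's actual added tokens are derived from the paper's corpus.
new_tokens = ["matur", "nuwun", "wenten", "##nipun"]
num_added = tokenizer.add_tokens(new_tokens)

# Grow the embedding matrix to cover the new tokens. The new rows are
# freshly initialized and must then be learned during continued
# masked-language-model pre-training on the multilingual corpus.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```

A common refinement at this step is to initialize the new embedding rows from the average of existing subword embeddings rather than randomly, which tends to speed up adaptation during continued pre-training.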
