IndoNLU: Establishing Benchmarks for Indonesian Natural Language Understanding
Progress in NLP has been remarkable for English and other high-resource languages, yet languages such as Indonesian remain underrepresented due to limited datasets and computational resources. To address this gap, the paper introduces IndoNLU, a comprehensive benchmark for evaluating Indonesian natural language understanding (NLU). IndoNLU comprises twelve distinct tasks spanning diverse domains and styles, enabling a balanced evaluation of models across a variety of Indonesian NLP problems.
Key Contributions
- Task Diversity and Dataset Collection: IndoNLU compiles datasets for twelve tasks, including emotion classification, sentiment analysis, aspect-based sentiment analysis, textual entailment, part-of-speech tagging, named entity recognition, and span extraction. The tasks span domains and styles ranging from formal news articles to colloquial tweets. Importantly, the paper notes that many existing datasets lack standardized splits, and it resplits them to promote reproducibility (a minimal resplitting sketch follows this list).
- Indonesian Pre-trained Models: The authors introduce IndoBERT and a lighter variant, IndoBERT-lite, pre-trained on a newly assembled corpus called Indo4B: a large, cleaned collection of publicly available Indonesian text, including news articles, social media posts, and blogs, comprising roughly four billion words.
- Baseline Models and Evaluation: The paper reports baseline performance for models ranging from pre-trained contextual language models to models trained from scratch and classifiers built on existing fastText embeddings. Notably, IndoBERT and IndoBERT-lite outperform multilingual models such as mBERT and XLM-R, particularly on classification tasks, illustrating the benefits of language-specific pre-training (see the fine-tuning sketch after this list).
- Benchmark Framework and Leaderboard: To foster community participation and enable continuous benchmarking, the authors provide a framework for evaluating models across all tasks, backed by a publicly accessible leaderboard that encourages sharing of results within the NLP community (a toy scoring sketch is also included below).
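To make the resplitting point concrete, the sketch below shows one way to derive deterministic, stratified train/validation/test splits from a dataset that ships without standard splits. The seed, split ratios, and use of scikit-learn here are illustrative assumptions, not IndoNLU's actual procedure.

```python
# Minimal sketch: deterministic, stratified resplitting for reproducibility.
# Seed and ratios are assumptions, not the splits IndoNLU actually uses.
from sklearn.model_selection import train_test_split

SEED = 42  # fixed seed so anyone can regenerate identical splits

def resplit(texts, labels, test_size=0.1, valid_size=0.1, seed=SEED):
    """Split into train/valid/test with label stratification and a fixed seed."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, stratify=labels, random_state=seed
    )
    # carve the validation set out of the remaining training pool
    rel_valid = valid_size / (1.0 - test_size)
    x_train, x_valid, y_train, y_valid = train_test_split(
        x_train, y_train, test_size=rel_valid, stratify=y_train, random_state=seed
    )
    return (x_train, y_train), (x_valid, y_valid), (x_test, y_test)

# toy usage standing in for one of the IndoNLU task datasets
texts = ["bagus", "buruk", "biasa saja", "hebat", "jelek", "lumayan"] * 5
labels = [1, 0, 1, 1, 0, 1] * 5
train, valid, test = resplit(texts, labels)
```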
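Likewise, here is a minimal fine-tuning sketch for one of the classification tasks, assuming the released checkpoint is available on the Hugging Face Hub as indobenchmark/indobert-base-p1; the checkpoint name, label set, and hyperparameters are assumptions rather than the paper's exact setup.

```python
# Minimal sketch: fine-tune IndoBERT for text classification with transformers.
# Checkpoint name, num_labels, and learning rate are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "indobenchmark/indobert-base-p1"  # assumed Hub checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=2)

# toy batch standing in for a real IndoNLU task loader
texts = ["Filmnya bagus sekali!", "Pelayanannya mengecewakan."]
labels = torch.tensor([0, 1])  # e.g., 0 = positive, 1 = negative

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
loss = model(**inputs, labels=labels).loss  # cross-entropy over the label set
loss.backward()
optimizer.step()
```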
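Finally, a toy illustration of leaderboard-style aggregation: compute one metric per task, then average into a single benchmark score. The task names, labels, and choice of macro-F1 are illustrative assumptions; the benchmark defines its own scoring rules.

```python
# Hypothetical sketch: aggregate per-task scores into one leaderboard number.
from statistics import mean
from sklearn.metrics import f1_score

predictions = {
    "emotion":   ([0, 1, 2, 1], [0, 1, 1, 1]),  # (gold, predicted) per task
    "sentiment": ([1, 0, 1],    [1, 0, 0]),
}

per_task = {task: f1_score(gold, pred, average="macro")
            for task, (gold, pred) in predictions.items()}
print(per_task, "overall:", mean(per_task.values()))
```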
Results and Analysis
The IndoNLU benchmark offers useful evidence on monolingual versus multilingual pre-training. IndoBERT outperforms existing multilingual models on a majority of tasks, suggesting that a focused, language-specific model captures the semantics of the language better than broader multilingual ones. On tasks that depend heavily on recognizing entity names shared across languages (e.g., NER), however, multilingual models retain a slight edge, likely owing to their broader cross-lingual coverage.
Theoretical and Practical Implications
From a theoretical perspective, this work underscores both the necessity and the viability of building language-specific benchmarks and resources, which can substantially improve contextual understanding in underrepresented languages. Practically, IndoNLU has become a cornerstone for Indonesian NLP, providing a structured, reliable foundation for researchers aiming to advance computational linguistics beyond English.
Future Developments
The establishment of IndoNLU paves the way for specialized models for Indonesian and other languages facing similar resource constraints. Future work may expand the benchmark beyond purely textual data, for instance by integrating multimodal datasets, and explore cross-lingual transfer learning that leverages knowledge from high-resource languages to strengthen models for low-resource ones.
In conclusion, the IndoNLU benchmark represents a significant advance for Indonesian NLP, serving as both a resource and a catalyst for further research in the field. IndoBERT and IndoBERT-lite give practitioners a strong toolset for Indonesian language understanding, narrowing the gap toward more equal representation in NLP.