Overview of "IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP"
The paper addresses the under-representation of the Indonesian language in NLP, which it attributes to a lack of annotated datasets, sparse language resources, and insufficient resource standardization. The authors introduce IndoLEM, a benchmark spanning seven NLP tasks that cover morpho-syntactic, semantic, and discourse competencies for Indonesian. They also introduce IndoBERT, a pre-trained language model for Indonesian that achieves state-of-the-art performance on most IndoLEM tasks.
IndoLEM: A Comprehensive Resource
IndoLEM groups its seven tasks into three linguistic dimensions, designed to capture a wide array of competencies:
- Morpho-syntactic/Sequence Labeling Tasks: Part-of-Speech (POS) tagging, Named Entity Recognition (NER), and dependency parsing. The datasets for these tasks draw on publicly available corpora and previous work, with standardized metrics and splits to ensure reproducibility (a token-classification fine-tuning sketch follows this list).
- Semantic Tasks: sentiment analysis and summarization. The sentiment analysis dataset is a low-resource binary classification task with data sourced from social media and reviews, while the summarization task is single-document extractive summarization over the IndoSum dataset (a sentiment-classification sketch also follows this list).
- Discourse Coherence Tasks: two novel tasks that measure discourse coherence in Twitter threads, next tweet prediction and tweet ordering. Both are built from authentic Twitter threads, yielding challenging prediction and ordering problems (a pairwise-scoring sketch for next tweet prediction closes out the examples below).
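To make the sequence-labeling setup concrete, here is a minimal sketch of fine-tuning a BERT-style checkpoint for Indonesian NER framed as token classification. The checkpoint identifier, example sentence, and tag set are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch: NER as token classification with a BERT-style encoder.
# Checkpoint name and label set are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

labels = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]
tokenizer = AutoTokenizer.from_pretrained("indolem/indobert-base-uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "indolem/indobert-base-uncased", num_labels=len(labels)
)

sentence = ["Joko", "Widodo", "lahir", "di", "Surakarta"]
encoding = tokenizer(sentence, is_split_into_words=True, return_tensors="pt")

# Align word-level labels to subword tokens: label the first subword of each
# word, ignore the rest (-100 is skipped by the cross-entropy loss).
word_labels = [1, 2, 0, 0, 3]  # B-PER I-PER O O B-LOC
aligned, prev = [], None
for word_id in encoding.word_ids():
    aligned.append(-100 if word_id is None or word_id == prev
                   else word_labels[word_id])
    prev = word_id

outputs = model(**encoding, labels=torch.tensor([aligned]))
outputs.loss.backward()  # a real training loop would step an optimizer here
```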
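Sentiment analysis uses the same encoder with a sequence-classification head instead; a minimal sketch under the same checkpoint assumption, with made-up example reviews:

```python
# Minimal sketch: binary sentiment classification by fine-tuning the encoder
# with a sequence-classification head. Checkpoint and examples are assumed.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("indolem/indobert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "indolem/indobert-base-uncased", num_labels=2  # 0 = negative, 1 = positive
)

batch = tokenizer(
    ["pelayanannya sangat memuaskan", "produk ini mengecewakan"],
    padding=True, truncation=True, return_tensors="pt",
)
labels = torch.tensor([1, 0])
outputs = model(**batch, labels=labels)
print(outputs.loss.item(), outputs.logits.argmax(dim=-1))
```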
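Next tweet prediction can be cast as pairwise scoring: encode the thread-so-far together with each candidate tweet and pick the highest-scoring pair. This is one plausible formulation, sketched under the same assumptions, not necessarily the paper's exact setup.

```python
# Minimal sketch: next tweet prediction as pairwise (context, candidate)
# scoring with a binary relevance head. Formulation and data are assumed.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("indolem/indobert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "indolem/indobert-base-uncased", num_labels=2
)

context = "Hari ini kami mengumumkan kebijakan baru. Rinciannya sebagai berikut."
candidates = [
    "Pertama, tarif dasar listrik tidak naik.",
    "Cuaca di Jakarta cerah sepanjang hari.",
]

# Encode each (context, candidate) pair; the tokenizer inserts [SEP] between them.
batch = tokenizer([context] * len(candidates), candidates,
                  padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    logits = model(**batch).logits
best = logits[:, 1].argmax().item()  # index of the predicted next tweet
print(candidates[best])
```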
IndoBERT: A Pre-trained Language Model
IndoBERT is a monolingual BERT-style model trained with a masked language modeling objective on Indonesian-specific corpora. Its architecture mirrors BERT-Base (12 layers, hidden size 768), with training data drawn from Indonesian Wikipedia, news articles, and the Indonesian Web Corpus, totaling over 220 million words. This monolingual pre-training underpins its strong performance in the IndoLEM evaluations.
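To illustrate the masked language modeling objective, the sketch below masks one token and reads off the model's top reconstructions. The checkpoint identifier is assumed to be the one published by the IndoLEM authors on the Hugging Face hub.

```python
# Minimal sketch of masked language modeling: hide a token and let the
# model reconstruct it. Checkpoint name is an assumption.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("indolem/indobert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("indolem/indobert-base-uncased")

text = f"Ibu kota Indonesia adalah {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Read off the model's top-5 predictions at the masked position.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
top5 = logits[0, mask_pos].topk(5).indices
print(tokenizer.convert_ids_to_tokens(top5.tolist()))
```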
Comparative Evaluation and Results
The experiments compare IndoBERT against multilingual BERT (mBERT) and MalayBERT. IndoBERT emerges as the strongest performer on the majority of tasks:
- Morpho-syntactic Tasks: IndoBERT achieves competitive accuracy in POS tagging and notably higher F1 scores in NER tasks compared to mBERT and MalayBERT.
- Dependency Parsing: A biaffine parser built on IndoBERT representations achieves strong results, outperforming the other models on UD-Indo-GSD, whereas mBERT performs best on UD-Indo-PUD, plausibly because that treebank originates from translated parallel data that favors a multilingual model (a sketch of biaffine arc scoring follows this list).
- Semantic and Discourse Tasks: IndoBERT significantly improves sentiment analysis results and yields measurable gains in extractive summarization. For discourse coherence, IndoBERT delivers the strongest results on both next tweet prediction and tweet ordering, though the remaining headroom on these tasks shows that modeling discourse coherence is still challenging.
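The biaffine parser referenced above scores every dependent-head pair with a bilinear transformation over contextual token representations (Dozat and Manning, 2017). Below is a minimal sketch of the arc scorer; the MLP dimensions and wiring are assumptions, not the paper's exact configuration.

```python
# Minimal sketch of biaffine arc scoring (after Dozat & Manning, 2017).
# Encoder states (e.g. from IndoBERT) are projected into head/dependent
# spaces, then every (dependent, head) pair is scored bilinearly.
import torch
import torch.nn as nn

class BiaffineArcScorer(nn.Module):
    def __init__(self, enc_dim=768, arc_dim=512):  # dimensions are assumptions
        super().__init__()
        self.head_mlp = nn.Sequential(nn.Linear(enc_dim, arc_dim), nn.ReLU())
        self.dep_mlp = nn.Sequential(nn.Linear(enc_dim, arc_dim), nn.ReLU())
        self.U = nn.Parameter(torch.zeros(arc_dim, arc_dim))  # bilinear term
        self.b = nn.Parameter(torch.zeros(arc_dim))           # per-head bias

    def forward(self, enc):               # enc: (batch, seq, enc_dim)
        head = self.head_mlp(enc)         # (batch, seq, arc_dim)
        dep = self.dep_mlp(enc)           # (batch, seq, arc_dim)
        # score[b, i, j] = dep_i^T U head_j + b . head_j
        scores = dep @ self.U @ head.transpose(1, 2)
        scores = scores + (head @ self.b).unsqueeze(1)
        return scores                     # (batch, seq, seq) arc scores

scorer = BiaffineArcScorer()
arc_scores = scorer(torch.randn(2, 10, 768))
pred_heads = arc_scores.argmax(dim=-1)    # predicted head index per token
```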
Implications and Future Directions
IndoBERT and IndoLEM set a new benchmark for Indonesian NLP and provide a foundation for future research. IndoLEM offers a standardized platform for evaluating NLP models in the Indonesian context, making advances in this domain easier to compare. IndoBERT can serve as a base model for task-specific fine-tuning, charting a path for improving linguistic processing in lesser-studied languages. Future work may leverage IndoLEM to expand beyond the current benchmarks and explore more diverse language modeling challenges, mirroring advances already achieved for high-resource languages.
Together, IndoLEM and IndoBERT have substantial potential to enrich NLP capabilities for Indonesian and, more broadly, to strengthen computational approaches for Southeast Asian languages.