NUSAAKSARA: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts
This paper introduces NUSAAKSARA, a multimodal and multilingual benchmark designed to address the lack of support for traditional Indonesian writing systems in NLP. The benchmark covers eight scripts across seven languages, including low-resource languages that NLP research has historically neglected. Notably, it includes the Lampung script, which Unicode does not currently support, showing that preserving indigenous scripts involves obstacles beyond linguistic ones, such as missing encoding standards.
Overview of Key Contributions
The benchmark defines several tasks across text and image modalities. These tasks include image segmentation, optical character recognition (OCR), transliteration, translation, and language identification (LID). The dataset comprises scanned documents and text annotations developed and validated by native speakers and linguistic experts, ensuring high-quality data for NLP model evaluation.
Benchmarking Results:
- The paper evaluates a range of models spanning LLMs, vision-LLMs (VLMs), and task-specific systems. These models include GPT-4o, Gemini Flash, Cendol, Sailor-7B, and others.
- Performance analysis reveals that most NLP and vision models struggle with Indonesian indigenous scripts, attaining near-zero accuracy in many cases. This contrasts starkly with their relatively strong results on romanized text.
- Most current NLP systems, including those marketed as multilingual, are inaccurate on these underrepresented scripts.
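To make "near-zero accuracy" on tasks such as OCR and transliteration more concrete, the sketch below computes a character error rate (CER) from Levenshtein edit distance. This summary does not state which metrics the paper reports, so CER is used here only as an assumption: it is a common choice for character-level tasks. The function names `edit_distance` and `cer` are illustrative and not taken from the benchmark.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from ref
                curr[j - 1] + 1,     # insertion into ref
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        # Convention chosen for this sketch: an empty reference scores 0.0
        # only when the hypothesis is empty as well.
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    print(cer("abcd", "abxd"))  # one substitution over four characters
```

Note that Python strings iterate over Unicode code points. In scripts that attach vowel signs and other marks to a base letter, one visible character can span several code points, so a grapheme-level CER may be a better fit; that choice is left open here.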
Implications and Future Directions
NUSAAKSARA offers important insight into how NLP treats underrepresented languages and scripts. The dataset exposes significant gaps in existing models, underscoring the need for broader and more inclusive LLM and dataset development. Future work could focus on improving OCR and transcription technologies for indigenous scripts, enriching language representation in LLMs, and integrating indigenous scripts into multilingual datasets.
Practical Implications:
- The benchmark may catalyze technological interventions aimed at digitally preserving these scripts, thereby maintaining cultural heritage and linguistic diversity.
- The dataset could guide the development of educational tools that promote literacy in indigenous scripts among younger generations, contributing to language preservation efforts.
Theoretical Implications:
- The findings provide insights into typological challenges facing NLP in diverse language contexts, stimulating further research in cross-linguistic NLP adaptations.
- There is potential to develop novel linguistic encoding methods inspired by the unique syntactic and morphological features of these languages.
Challenges:
- The absence of Unicode support for scripts like Lampung illustrates intrinsic challenges that require both technological and policy-oriented solutions for standardized script usage.
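The practical effect of missing Unicode support can be illustrated with Python's standard `unicodedata` module. The sketch below looks up the script word at the start of a character's Unicode name. The Javanese and Balinese sample characters are illustrative picks of encoded Indonesian scripts and are not a claim about which eight scripts the benchmark covers; `script_prefix` is a hypothetical helper name.

```python
import unicodedata
from typing import Optional


def script_prefix(ch: str) -> Optional[str]:
    """Return the first word of a character's Unicode name, or None if it has no name."""
    # unicodedata.name() returns the supplied default (None here) for
    # characters that have no assigned name, such as private-use code points.
    name = unicodedata.name(ch, None)
    return name.split()[0] if name else None


if __name__ == "__main__":
    samples = {
        "Javanese sample (U+A9B2)": "\uA9B2",
        "Balinese sample (U+1B13)": "\u1B13",
        "Private Use Area (U+E000)": "\uE000",
    }
    for label, ch in samples.items():
        print(label, "->", script_prefix(ch))
```

A script with no Unicode block has no code points of its own, so its text cannot be stored as standard characters at all. Workarounds such as Private Use Area code points carry no script identity, which the `None` result for U+E000 illustrates, and this is one reason such scripts are more naturally handled as images than as text.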
In conclusion, NUSAAKSARA is a resource aimed at revitalizing traditional Indonesian scripts through NLP research and application, and at encouraging work that advances linguistic equity in AI. It calls on the broader research community to engage with these challenges through innovation in both model architecture and dataset development, so that the lack of digital support does not accelerate the extinction of these scripts.