NusaAksara: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts

Published 25 Feb 2025 in cs.CL | (2502.18148v1)

Abstract: Indonesia is rich in languages and scripts. However, most NLP progress has been made using romanized text. In this paper, we present NusaAksara, a novel public benchmark for Indonesian languages that includes their original scripts. Our benchmark covers both text and image modalities and encompasses diverse tasks such as image segmentation, OCR, transliteration, translation, and language identification. Our data is constructed by human experts through rigorous steps. NusaAksara covers 8 scripts across 7 languages, including low-resource languages not commonly seen in NLP benchmarks. Although unsupported by Unicode, the Lampung script is included in this dataset. We benchmark our data across several models, from LLMs and VLMs such as GPT-4o, Llama 3.2, and Aya 23 to task-specific systems such as PP-OCR and LangID, and show that most NLP technologies cannot handle Indonesia's local scripts, with many achieving near-zero performance.


Authors (6)

Summary

The paper introduces NusaAksara, a comprehensive multimodal, multilingual benchmark for eight Indonesian indigenous scripts, addressing their historical lack of NLP support.
Benchmarking results showed that current NLP and vision models, including LLMs and VLMs, often achieve near-zero accuracy on indigenous scripts despite performing well on romanized text.
The benchmark highlights significant model limitations for indigenous scripts, underscoring the urgent need for inclusive dataset and model development to support digital preservation.

NusaAksara: A Multimodal and Multilingual Benchmark for Preserving Indonesian Indigenous Scripts

This paper introduces NusaAksara, a comprehensive multimodal and multilingual benchmark designed to address the lack of support for traditional Indonesian writing systems in NLP. The benchmark covers eight scripts across seven languages, including low-resource languages that NLP research has historically neglected. Notably, it includes the Lampung script, which Unicode does not currently support, showing that the obstacles to preserving indigenous scripts are not only linguistic but also a matter of missing encoding standards.

Overview of Key Contributions

The benchmark defines several tasks across text and image modalities. These tasks include image segmentation, optical character recognition (OCR), transliteration, translation, and language identification (LID). The dataset comprises scanned documents and text annotations developed and validated by native speakers and linguistic experts, ensuring high-quality data for NLP model evaluation.

Benchmarking Results:

The paper evaluates a range of models spanning LLMs, vision-language models (VLMs), and task-specific systems. These models include GPT-4o, Gemini Flash, Cendol, Sailor-7B, and others.
Most NLP and vision models struggle with Indonesia's indigenous scripts, attaining near-zero accuracy in many cases. This contrasts starkly with their relatively strong results on romanized text.
Most current NLP systems, including those marketed as multilingual, fail to handle these underrepresented scripts accurately.
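To make "near-zero performance" concrete for the OCR and transliteration tasks, the sketch below computes character error rate (CER), a common metric for such outputs. This summary does not state which metrics the paper actually uses, so CER here is an assumption chosen for illustration, and the function names `levenshtein` and `cer` are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings, counted in Unicode code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


# Illustrative reference: five letters from the Unicode Javanese block.
reference = "\ua9b2\ua9a4\ua995\ua9ab\ua98f"
# A system that answers in Latin script shares no characters with it.
romanized_guess = "hanacaraka"

print(cer(reference, reference))        # exact match
print(cer(reference, romanized_guess))  # every character is wrong
```

Because CER is normalized by the reference length, a system that emits romanized text in place of the original script can score above 1.0, which is one way a model that does well on romanized input can still fail completely on the native script.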

Implications and Future Directions

NusaAksara offers critical insight into how NLP treats underrepresented languages and scripts. The benchmark exposes significant gaps in existing models, underscoring the need for broader and more inclusive model and dataset development. Future work could focus on improving OCR and transcription technologies for indigenous scripts, enriching language representation in LLMs, and integrating indigenous scripts into multilingual datasets.

Practical Implications:

The benchmark may catalyze technological interventions aimed at digitally preserving these scripts, thereby maintaining cultural heritage and linguistic diversity.
The dataset could guide the development of educational tools that promote literacy in indigenous scripts among younger generations, contributing to language preservation efforts.

Theoretical Implications:

The findings provide insights into typological challenges facing NLP in diverse language contexts, stimulating further research in cross-linguistic NLP adaptations.
There is potential to develop novel linguistic encoding methods inspired by the unique syntactic and morphological features of these languages.

Challenges:

The absence of Unicode support for scripts like Lampung illustrates intrinsic challenges that require both technological and policy-oriented solutions for standardized script usage.
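The practical effect of missing Unicode support can be sketched with a small script detector based on code point ranges. The block ranges below come from the Unicode Standard and are picked only as examples of Indonesian scripts that do have assigned blocks; they are not a statement of which scripts the benchmark covers. The names `UNICODE_BLOCKS`, `script_of`, and `dominant_script` are hypothetical.

```python
from collections import Counter
from typing import Optional

# Unicode blocks for some Indonesian scripts (illustrative selection).
# Lampung has no block, so no range can be listed for it.
UNICODE_BLOCKS = {
    "Buginese": (0x1A00, 0x1A1F),
    "Balinese": (0x1B00, 0x1B7F),
    "Sundanese": (0x1B80, 0x1BBF),
    "Batak": (0x1BC0, 0x1BFF),
    "Rejang": (0xA930, 0xA95F),
    "Javanese": (0xA980, 0xA9DF),
}


def script_of(ch: str) -> Optional[str]:
    """Return the listed block containing this character, or None."""
    cp = ord(ch)
    for name, (lo, hi) in UNICODE_BLOCKS.items():
        if lo <= cp <= hi:
            return name
    return None


def dominant_script(text: str) -> Optional[str]:
    """Return the most frequent listed script in the text, or None."""
    counts = Counter(s for s in map(script_of, text) if s is not None)
    return counts.most_common(1)[0][0] if counts else None


print(dominant_script("\ua984\ua9a4 abc"))  # Javanese-block characters plus Latin
print(dominant_script("hello"))             # no listed script present
```

A detector like this cannot recognize Lampung text at all, because there are no code points to test against. Any text pipeline for such a script has to fall back on images, custom fonts, or an ad hoc mapping; how the paper itself represents Lampung text is not described in this summary.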

In conclusion, NusaAksara is a pivotal resource for revitalizing Indonesian traditional scripts through NLP research and application, and for encouraging future work on linguistic equity in AI. It calls on the broader research community to address these challenges through innovation in both model architecture and dataset development, so that the absence of digital support does not accelerate the extinction of these scripts.
