Papers
Topics
Authors
Recent
Search
2000 character limit reached

Tokenizations for Austronesian Language Models: study on languages in Indonesia Archipelago

Published 28 Jan 2026 in cs.CY | (2602.06998v1)

Abstract: Tokenization constitutes a fundamental stage in LLM processing; however, subword-based tokenization methods optimized on English-dominant corpora may produce token fragmentation misaligned with the linguistic structures of Austronesian languages. This study aimed to develop a syllable-based tokenization framework adopting principles from traditional Indonesian scripts (aksara) for regional languages of Indonesia. A syllabic segmentation procedure was constructed based on the logic of abugida writing systems and implemented with a vocabulary of 2,843 tokens extracted from the Indonesian dictionary (KBBI). Evaluation was conducted on the NusaX dataset comprising 1,000 parallel translation samples across 10 regional languages, Indonesian, and English. Analysis employed Token per Character (TPC) ratio and sequence alignment using the Smith-Waterman algorithm. Results demonstrated that syllable-based tokenization yielded consistent TPC values across all regional languages, whereas GPT-2 exhibited an inverse pattern with the lowest TPC for English. Syllable-based tokenization consistently produced higher token sequence similarity scores, with an average increase of approximately 21% compared to GPT-2. These findings confirm that the syllable-based approach more effectively preserves phonological and morphological patterns across related Austronesian languages, offering a linguistically principled foundation for multilingual LLM development.

Summary

No one has generated a summary of this paper yet.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Collections

Sign up for free to add this paper to one or more collections.