BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k_base

Published 28 May 2026 in cs.CL and cs.LG | (2605.29379v1)

Abstract: We present BrahmicTokenizer-131K, a 131,072-vocabulary byte-level BPE tokenizer that closes the Brahmic compression gap at the 131K-vocabulary class while preserving the English, EU-language, and code compression of OpenAI's o200k_base. We construct it through a two-stage retrofit: (1) a script-prune crop that reduces 200,019 tokens to 131,072 by removing nine out-of-scope writing systems, and (2) a surgical retrofit of 2,372 corpus-dead vocabulary slots determined by linear-programming allocation across nine Brahmic Unicode blocks. The pre-tokenizer, decoder, and inherited merge rules are unchanged from o200k_base, making BrahmicTokenizer-131K a drop-in replacement at the tokenizer interface. On 27 million documents of public Indic pretraining text (2.84 billion words, 46.21 GB), BrahmicTokenizer-131K produces 26.7% fewer tokens than Mistral-Nemo Tekken / Sarvam-m at the same vocabulary budget, with per-language savings of 15.79% (Tamil) to 76.79% (Odia, a 4.31x compression ratio). The Odia advantage is mechanistically explained by Tekken/Sarvam-m containing zero Oriya-block tokens; our surgery added 725. On non-Indic content, BrahmicTokenizer-131K matches o200k_base's English fertility (1.235 vs 1.232 tokens/word) and beats Tekken/Sarvam-m by 4.0-14.2% on HumanEval, MBPP, and GSM8K. Across our 14-tokenizer benchmark, it is the only tokenizer simultaneously competitive on Brahmic, English, EU, code, and math at the 131K budget. Specialist tokenizers at other vocab classes (Sarvam-30B, Sarvam-1, MUTANT-Indic) achieve better Indic compression at the cost of non-Indic performance: Sarvam-1's English fertility is 15.9% worse and its code/math compression 26-33% worse than ours. We release the artifact under Apache 2.0 at https://huggingface.co/theschoolofai/BrahmicTokenizer-131K.

Authors (1)

Summary

  • The paper demonstrates a two-stage retrofit method that prunes non-target tokens and surgically refines vocabulary to optimize Indic-script compression.
  • The paper shows a 26.7% overall token reduction for Indic corpora, achieving up to 4.31× compression for Odia using corpus audits and linear programming.
  • The paper maintains drop-in compatibility with o200k_base by preserving BPE merge rules and ensuring balanced performance across English, EU languages, code, and math.

BrahmicTokenizer-131K: Indic-Script Compression for General-Purpose Multilingual Tokenization

Construction Methodology

BrahmicTokenizer-131K is a 131,072-vocabulary byte-level BPE tokenizer intended as a drop-in replacement for OpenAI's o200k_base for LLM training pipelines, engineered to close the Brahmic script compression gap observed in Indian languages. The construction employs a two-stage retrofit strategy:

  1. Script-Prune Crop: The original o200k_base vocabulary (200,019 tokens) is pruned to remove 38,345 tokens corresponding to nine non-target writing systems (CJK, Hangul, Arabic, Cyrillic, Thai, Greek, Hebrew, Sinhala, Japanese). The resulting 131,072-token intermediate ("o200k_cropped") retains English, EU languages, code, math, and Brahmic-script coverage.
  2. Surgical Brahmic Retrofit: Utilizing a corpus audit methodology, 2,372 dead slots (tokens with zero activation in a 1.045-billion-token Indic audit corpus) are identified and replaced with high-frequency Brahmic-script vocabulary. Allocation to scripts leverages linear programming over saturation curves to maximize tokenization efficiency. Merge rules, pre-tokenizer, and decoder are inherited unchanged, ensuring drop-in compatibility at the tokenizer interface level. The embedding matrix resizes from 200,019 to 131,072 rows, a standard operation in LLM workflows.
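
The dead-slot audit in stage 2 can be illustrated with a minimal Python sketch. It assumes the audit corpus is already tokenized into lists of token ids. The names `find_dead_slots` and `retrofit` are hypothetical and do not come from the released artifact. The sketch also ignores how new entries interact with the merge table, which this summary does not describe.

```python
from collections import Counter
from typing import Dict, Iterable, List


def find_dead_slots(vocab_size: int, tokenized_docs: Iterable[List[int]]) -> List[int]:
    """Return the token ids that never occur in the audit corpus."""
    counts: Counter = Counter()
    for ids in tokenized_docs:
        counts.update(ids)
    return [tid for tid in range(vocab_size) if counts[tid] == 0]


def retrofit(vocab: Dict[int, bytes], dead: List[int],
             candidates: List[bytes]) -> Dict[int, bytes]:
    """Overwrite dead slots with candidate tokens (assumed sorted by frequency).

    Candidates already present in the vocabulary are skipped, so the
    vocabulary size stays fixed and no byte string appears twice.
    """
    existing = set(vocab.values())
    fresh = [c for c in dict.fromkeys(candidates) if c not in existing]
    new_vocab = dict(vocab)
    for tid, token in zip(dead, fresh):
        new_vocab[tid] = token
    return new_vocab


# Toy example: slot 2 holds a CJK token that the Indic audit corpus never uses.
toy_vocab = {0: b"a", 1: b"b", 2: "中".encode("utf-8"), 3: b"c"}
toy_docs = [[0, 1, 0], [3, 1]]
dead_slots = find_dead_slots(len(toy_vocab), toy_docs)
patched = retrofit(toy_vocab, dead_slots, ["ଓଡ଼ିଆ".encode("utf-8")])
```

Reusing existing ids, rather than appending new ones, is what keeps the vocabulary at a fixed size in this sketch.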

Empirical Evaluation

The evaluation measures token counts and compression on both in-distribution and out-of-distribution corpora, spanning Indic scripts, English, EU languages, code, and mathematics.

  • Indic Compression: On 27 million documents (2.84B words, 46.21 GB) of Indic text, BrahmicTokenizer-131K produces 26.7% fewer tokens than Mistral-Nemo Tekken/Sarvam-m at identical vocabulary budget, with savings per language ranging from 15.79% (Tamil) to 76.79% (Odia; 4.31× compression ratio). The Odia result is directly attributed to Tekken/Sarvam-m containing zero Oriya-block tokens, whereas the retrofit introduces 725 new tokens for this block.
  • English, EU, and Code/Math Compression: On non-Indic content, compression matches o200k_base on English (1.235 vs 1.232 tokens/word) and beats Tekken/Sarvam-m by 4.0-14.2% on HumanEval, MBPP, and GSM8K. EU language compression is competitive: within 3% of best for French, German, and Spanish on FLORES-200.
  • General-Purpose Capability: Among 14 evaluated tokenizers (48K-262K vocab), BrahmicTokenizer-131K is uniquely competitive across all content classes at the 131K vocab budget. Specialist tokenizers (e.g., Sarvam-1 at 68K, Sarvam-30B at 262K) surpass it in Indic compression but compromise English/code/math coverage. Sarvam-1's English fertility is 15.9% worse and code/math compression is 26-33% worse compared to BrahmicTokenizer-131K.
  • Structural Diagnostics: The design enforces script purity and token byte-length constraints relevant for byte-level embedding architectures (max token length of 32 UTF-8 bytes, no cross-script merges). Only BrahmicTokenizer-131K and o200k_cropped among the benchmark tokenizers satisfy both criteria.

Implications and Comparative Analysis

The empirical evidence underscores significant reduction in token count for Indic-language corpora, directly translating to lower compute cost per training/inference step and increased effective context window for LLMs deployed in South Asian linguistic environments. Preservation of English, EU, code, and math efficiency is achieved by maintaining o200k_base's BPE merge structure and pre-tokenizer, ensuring model performance on high-resource languages and computational tasks is not compromised.
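
The reported savings figures convert to compression ratios with simple arithmetic, sketched below. The sketch assumes "compression ratio" means baseline token count divided by the new token count, which is consistent with the Odia pair of 76.79% and 4.31× quoted above. The function names are illustrative only.

```python
def compression_ratio(savings_pct: float) -> float:
    """Baseline tokens divided by new tokens, given percent token savings."""
    return 1.0 / (1.0 - savings_pct / 100.0)


def relative_gap_pct(ours: float, reference: float) -> float:
    """Percent by which `ours` exceeds `reference` (e.g. for fertility)."""
    return (ours - reference) / reference * 100.0


odia_ratio = compression_ratio(76.79)           # Odia savings vs Tekken/Sarvam-m
overall_ratio = compression_ratio(26.7)         # aggregate Indic savings
english_gap = relative_gap_pct(1.235, 1.232)    # English fertility vs o200k_base

print(round(odia_ratio, 2))     # 4.31
print(round(overall_ratio, 2))  # 1.36
print(round(english_gap, 2))    # 0.24
```

Under the same assumption, a fixed context window holds roughly 1.36 times as much Indic text on aggregate, and the English fertility difference against o200k_base is about a quarter of one percent.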

Two findings that pull in opposite directions are both substantiated. First, frequency-based BPE training underrepresents Brahmic scripts in existing multilingual tokenizers, systematically shortchanging rare content classes; surgical retrofitting via corpus audit addresses this deficit without the compute overhead of retraining BPE from scratch and avoids the risk of omitting rare but critical vocabulary. Second, specialist tokenizers achieve greater Indic compression but materially degrade compression on the major non-Indic axes, which reinforces the trade-off inherent in vocabulary allocation.

The methodology (retrofit by dead-slot replacement and merge rule preservation) is extensible to domain-specific or additional language coverage scenarios (e.g., expansion for Arabic, CJK, low-resource scripts) where high-quality source tokenizers exist.

Theoretical and Practical Extensions

The approach adopted by BrahmicTokenizer-131K constitutes a scalable technique for addressing tokenization inefficiencies in low-resource languages within general-purpose LLM architectures. The practice of auditing and dead-slot replacement can be leveraged for continual vocabulary curation as corpus distributions evolve. Additionally, the LP-based allocation policy enables principled balancing among coverage constraints when expanding to further scripts. The preservation of structural properties facilitates compatibility with future embedding and downstream model innovations.
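
The paper's LP formulation is not reproduced in this summary, so the following sketch substitutes a simpler technique to show the idea of budgeted allocation over saturation curves. It is a greedy marginal-gain allocator, which is optimal when each curve is concave and the only constraint is a single slot budget. The exponential curve shape, the scale and `tau` parameters, and the two-script example are all hypothetical. Only the budget of 2,372 slots comes from the text above.

```python
import heapq
import math
from typing import Callable, Dict

Curve = Callable[[int], float]


def saturating(scale: float, tau: float) -> Curve:
    """Hypothetical concave saturation curve: token savings after k new slots."""
    return lambda k: scale * (1.0 - math.exp(-k / tau))


def allocate(budget: int, curves: Dict[str, Curve]) -> Dict[str, int]:
    """Hand out slots one at a time to the script with the largest marginal gain."""
    alloc = {name: 0 for name in curves}
    # Max-heap via negated gains; the entry is the gain of the *next* slot.
    heap = [(-(f(1) - f(0)), name) for name, f in curves.items()]
    heapq.heapify(heap)
    for _ in range(budget):
        _, name = heapq.heappop(heap)
        alloc[name] += 1
        k, f = alloc[name], curves[name]
        heapq.heappush(heap, (-(f(k + 1) - f(k)), name))
    return alloc


# Made-up curves: the script with no existing coverage has more to gain.
example_curves = {
    "Oriya": saturating(scale=5.0, tau=300.0),
    "Tamil": saturating(scale=1.0, tau=300.0),
}
example_alloc = allocate(2372, example_curves)
```

A real allocation would fit one curve per script from corpus statistics and might add per-script minimum or maximum constraints, which is where an LP solver becomes the more natural tool.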

Future research can further optimize per-language allocation, refine whole-word token entries, and improve code/math/EU language trade-offs. Public release of comparable artifacts (e.g., MUTANT-Indic (Rana et al., 5 Nov 2025)) will enable expanded benchmarking. The applicability of this methodology supports both practical deployment and theoretical study, in particular analysis of how token-level granularity influences context utilization, gradient propagation, and cross-lingual performance in LLMs.

Conclusion

BrahmicTokenizer-131K reduces the tokenization tax for Brahmic-script Indian languages at the 131K vocabulary budget, achieving substantial compression improvements over comparable open-source multilingual tokenizers while staying close to o200k_base on English and remaining competitive on EU languages, code, and math. The structural, empirical, and methodological claims are validated across multiple benchmarks and detailed corpus audits. The artifact is released under Apache 2.0 and facilitates seamless substitution in LLM pipelines reliant on o200k_base, contributing a compute-efficient solution to equitable multilingual model training and serving as a reference point for future tokenizer design strategies (2605.29379).
