- The paper demonstrates a two-stage retrofit method that prunes non-target tokens and surgically refines vocabulary to optimize Indic-script compression.
- The paper shows a 26.7% overall token reduction for Indic corpora, achieving up to 4.31× compression for Odia using corpus audits and linear programming.
- The paper maintains drop-in compatibility with o200k_base by preserving BPE merge rules and ensuring balanced performance across English, EU languages, code, and math.
BrahmicTokenizer-131K: Indic-Script Compression for General-Purpose Multilingual Tokenization
Construction Methodology
BrahmicTokenizer-131K is a 131,072-vocabulary byte-level BPE tokenizer intended as a drop-in replacement for OpenAI's o200k_base for LLM training pipelines, engineered to close the Brahmic script compression gap observed in Indian languages. The construction employs a two-stage retrofit strategy:
- Script-Prune Crop: The original o200k_base vocabulary (200,019 tokens) is pruned to remove 38,345 tokens corresponding to nine non-target writing systems (CJK, Hangul, Arabic, Cyrillic, Thai, Greek, Hebrew, Sinhala, Japanese). The resulting 131,072-token intermediate ("o200k_cropped") retains English, EU languages, code, math, and Brahmic-script coverage.
- Surgical Brahmic Retrofit: Using a corpus audit, 2,372 dead slots (tokens with zero activation in a 1.045-billion-token Indic audit corpus) are identified and replaced with high-frequency Brahmic-script vocabulary. Allocation across scripts uses linear programming over saturation curves to maximize tokenization efficiency. Merge rules, pre-tokenizer, and decoder are inherited unchanged, ensuring drop-in compatibility at the tokenizer interface level. The embedding matrix resizes from 200,019 to 131,072 rows, a standard operation in LLM workflows.
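The dead-slot audit described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names `find_dead_slots` and `retrofit`, the toy vocabulary, and the toy encoded documents are all hypothetical, and the sketch ignores the merge-rule bookkeeping that a real byte-level BPE retrofit has to preserve.

```python
from collections import Counter
from typing import Iterable, List, Sequence


def find_dead_slots(vocab_size: int, encoded_docs: Iterable[Sequence[int]]) -> List[int]:
    """Return token ids that never occur in the audit corpus (zero activation)."""
    counts: Counter = Counter()
    for ids in encoded_docs:
        counts.update(ids)
    return [tid for tid in range(vocab_size) if counts[tid] == 0]


def retrofit(vocab: List[bytes], dead_slots: Sequence[int],
             candidates: Sequence[bytes]) -> List[bytes]:
    """Overwrite dead slots with new candidates; every other id keeps its token."""
    new_vocab = list(vocab)
    existing = set(vocab)
    fresh = [c for c in candidates if c not in existing]  # skip tokens already present
    for slot, token in zip(dead_slots, fresh):
        new_vocab[slot] = token
    return new_vocab


# Toy data (hypothetical): a five-entry vocabulary and two already-encoded documents.
vocab = [b"a", b"b", "中".encode("utf-8"), b"c", "п".encode("utf-8")]
encoded_docs = [[0, 1, 0], [3, 1]]

dead = find_dead_slots(len(vocab), encoded_docs)
new_vocab = retrofit(vocab, dead, ["ଓ".encode("utf-8"), "କ".encode("utf-8")])
print(dead, new_vocab)
```

Because slots are overwritten in place, the vocabulary size and the ids of all surviving tokens are unchanged, which is the property that keeps the retrofit compatible with an existing embedding matrix layout.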
Empirical Evaluation
Comprehensive evaluation encompasses token volume and compression metrics on both in-distribution and out-of-distribution corpora, spanning Indic scripts, English, EU languages, code, and mathematics.
- Indic Compression: On 27 million documents (2.84B words, 46.21GB) of Indic text, BrahmicTokenizer-131K produces 26.7% fewer tokens than Mistral-Nemo Tekken/Sarvam-m at identical vocabulary budget, with savings per language ranging from 15.79% (Tamil) to 76.79% (Odia; 4.31× compression ratio). The Odia result is directly attributed to Tekken/Sarvam-m containing zero Oriya-block tokens, whereas the retrofit introduces 725 new tokens for this block.
- English, EU, and Code/Math Compression: On non-Indic content, compression matches o200k_base on English (1.235 vs 1.232 tokens/word) and beats Tekken/Sarvam-m by 4.0–14.2% on HumanEval, MBPP, and GSM8K. EU language compression is competitive: within 3% of best for French, German, and Spanish on FLORES-200.
- General-Purpose Capability: Among 14 evaluated tokenizers (48K–262K vocab), BrahmicTokenizer-131K is uniquely competitive across all content classes at the 131K vocab budget. Specialist tokenizers (e.g., Sarvam-1 at 68K, Sarvam-30B at 262K) surpass it in Indic compression but compromise English/code/math coverage. Sarvam-1's English fertility is 15.9% worse and its code/math compression is 26–33% worse than BrahmicTokenizer-131K's.
- Structural Diagnostics: The design enforces script purity and token byte-length constraints relevant for byte-level embedding architectures (max token length of 32 UTF-8 bytes, no cross-script merges). Only BrahmicTokenizer-131K and o200k_cropped among the benchmark tokenizers satisfy both criteria.
Implications and Comparative Analysis
The empirical evidence underscores significant reduction in token count for Indic-language corpora, directly translating to lower compute cost per training/inference step and increased effective context window for LLMs deployed in South Asian linguistic environments. Preservation of English, EU, code, and math efficiency is achieved by maintaining o200k_base's BPE merge structure and pre-tokenizer, ensuring model performance on high-resource languages and computational tasks is not compromised.
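The link between token savings and effective context can be made concrete with a little arithmetic: a fractional saving s against a baseline tokenizer corresponds to a compression ratio of 1 / (1 - s). The sketch below applies this to the savings figures reported above; the 8,192-token window is a hypothetical value chosen only for illustration, and the function names are not from the paper.

```python
def compression_ratio(savings: float) -> float:
    """Convert fractional token savings versus a baseline into a compression ratio."""
    if not 0.0 <= savings < 1.0:
        raise ValueError("savings must be in [0, 1)")
    return 1.0 / (1.0 - savings)


def baseline_equivalent_tokens(window: int, savings: float) -> int:
    """Baseline-tokenizer tokens whose text fits in `window` tokens of the new tokenizer."""
    return int(window * compression_ratio(savings))


# Savings versus Tekken/Sarvam-m as reported in the evaluation section.
reported = {"overall": 0.267, "Tamil": 0.1579, "Odia": 0.7679}
ratios = {name: round(compression_ratio(s), 2) for name, s in reported.items()}
print(ratios)

# Hypothetical 8,192-token window, used only to illustrate the effect.
print(baseline_equivalent_tokens(8192, reported["overall"]))
```

The Odia entry reproduces the 4.31× figure quoted in the evaluation, which confirms that the reported ratio and the reported 76.79% saving describe the same measurement.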
Two claims that may seem to pull in opposite directions are both supported by the results. First, frequency-based BPE training underrepresents Brahmic scripts in existing multilingual tokenizers, which systematically shortchanges rare content classes; surgical retrofitting via corpus audit addresses this deficit without the compute overhead of retraining BPE from scratch and avoids the risk of omitting rare but critical vocabulary. Second, specialist tokenizers achieve greater Indic compression but materially degrade compression on the major non-Indic axes, which reinforces the trade-off inherent in vocabulary allocation.
The methodology (retrofit by dead-slot replacement with merge-rule preservation) is extensible to domain-specific or additional language coverage scenarios (e.g., expansion for Arabic, CJK, low-resource scripts) where high-quality source tokenizers exist.
Theoretical and Practical Extensions
The approach adopted by BrahmicTokenizer-131K constitutes a scalable technique for addressing tokenization inefficiencies in low-resource languages within general-purpose LLM architectures. The practice of auditing and dead-slot replacement can be reused for continual vocabulary curation as corpus distributions evolve. Additionally, the LP-based allocation policy enables principled balancing among coverage constraints when expanding to further scripts. The preservation of structural properties facilitates compatibility with future embedding and downstream model innovations.
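To give a feel for allocating a fixed slot budget across scripts from saturation curves, the sketch below uses a greedy marginal-gain rule as a stand-in for the linear program. It is not the paper's formulation: the greedy rule is only optimal when every curve is concave and the sole constraint is the total budget, and the exponential curve shape, its parameters, and all names here are hypothetical. Only the budget of 2,372 slots comes from the text.

```python
import heapq
import math
from typing import Callable, Dict


def saturating(ceiling: float, scale: float) -> Callable[[int], float]:
    """Hypothetical concave saturation curve: benefit of giving k slots to a script."""
    return lambda k: ceiling * (1.0 - math.exp(-k / scale))


def allocate_slots(curves: Dict[str, Callable[[int], float]], budget: int) -> Dict[str, int]:
    """Hand out slots one at a time to the script with the largest marginal gain."""
    alloc = {name: 0 for name in curves}
    # Max-heap via negated gains: (negative marginal gain of the next slot, script).
    heap = [(-(f(1) - f(0)), name) for name, f in curves.items()]
    heapq.heapify(heap)
    for _ in range(budget):
        _, name = heapq.heappop(heap)
        alloc[name] += 1
        k, f = alloc[name], curves[name]
        heapq.heappush(heap, (-(f(k + 1) - f(k)), name))
    return alloc


# Made-up curve parameters, chosen only so the example runs.
curves = {
    "Oriya": saturating(9.0, 400.0),
    "Tamil": saturating(3.0, 300.0),
    "Devanagari": saturating(2.0, 200.0),
}
alloc = allocate_slots(curves, budget=2372)
print(alloc)
```

A script with no existing coverage has a steep early curve and so attracts the most slots under this rule. Additional constraints, such as per-script minimums, would call for an actual LP solver rather than the greedy loop.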
Future research can further optimize per-language allocation, refine whole-word token entries, and improve code/math/EU language trade-offs. Public release of comparable artifacts (e.g., MUTANT-Indic (Rana et al., 5 Nov 2025)) will enable expanded benchmarking. The applicability of this methodology supports both practical deployment and theoretical study, in particular analysis of how token-level granularity influences context utilization, gradient propagation, and cross-lingual performance in LLMs.
Conclusion
BrahmicTokenizer-131K reduces the tokenization tax for Brahmic-script Indian languages at the 131K vocabulary budget, achieving substantial compression improvements over comparable open-source multilingual tokenizers while keeping English, EU, code, and math compression on par with o200k_base. The structural, empirical, and methodological claims are validated across multiple benchmarks and detailed corpus audits. The artifact is released under Apache 2.0 and supports direct substitution in LLM pipelines that rely on o200k_base, contributing a compute-efficient solution to equitable multilingual model training and serving as a reference point for future tokenizer design strategies (2605.29379).