IST: Efficient Tokenization for Indic Languages
- IndicSuperTokenizer (IST) is a hybrid tokenizer that combines subword and superword tokenization to significantly reduce token fragmentation and improve fertility scores.
- It employs a four-step process including Unicode normalization, script-agnostic pre-tokenization, two-stage BPE merging, and byte fallback to handle the complexity of Indic scripts.
- Evaluations demonstrate that IST reduces sequence lengths and boosts inference throughput by up to 44% while maintaining competitive downstream performance.
IndicSuperTokenizer (IST) is an optimized tokenizer for Indic multilingual LLMs designed to address the unique challenges posed by the linguistic diversity, complex scripts, and morphological richness of Indian languages. IST combines subword and multi-word ("superword") tokenization with script-agnostic pre-tokenization, resulting in more linguistically aligned tokens, reduced sequence lengths, and significant improvements in inference efficiency and fairness for Indic language users. It is evaluated across English, 22 Indian languages, and code data, achieving a new state-of-the-art in fertility score and inference throughput while maintaining parity on downstream task performance with established English-centric tokenizers (Rana et al., 5 Nov 2025).
1. Motivation and Design Principles
Tokenization is pivotal for the training efficiency and inference cost of LLMs, particularly in multilingual contexts characterized by significant variation in script and word formation. Indic scripts such as Devanagari, Bengali, Gurmukhi, Tamil, Telugu, Kannada, Malayalam, Meitei-Mayek, Ol Chiki, and Arabic-derived scripts (for Urdu/Sindhi) exhibit substantial morphological complexity—agglutination, compounding, and inflection—that amplifies fragmentation when using standard subword algorithms such as Byte Pair Encoding (BPE).
A key challenge addressed by IST is the extreme fertility imbalance found in prior systems. Fertility, the mean token-to-word ratio, is highly unfavorable for many Indic languages under typical BPE; for instance, LLaMA-4’s default tokenizer produces a fertility of 10.5 for Oriya compared to ~1.3 for English. High fertility increases sequence length, inflates memory requirements and latency, and imposes an unfair computational burden on Indic-language users. The hybrid approach of IST—combining subword learning for morphological granularity and multi-word ("superword") merging for phrase-level compactness—is motivated by the need to reduce fragmentation and better align tokens with semantic units.
2. Tokenization Pipeline and Algorithm
IST’s architecture comprises a structured four-step process: Unicode normalization, script-agnostic pre-tokenization, a two-stage BPE-based token learning procedure, and open-vocabulary byte fallback.
2.1 Unicode Normalization
Normalization uses NFKC (Normalization Form KC: compatibility decomposition followed by canonical composition), which maps canonically equivalent and compatibility-equivalent codepoint sequences to a single form. The reported analysis finds negligible differences among NFC, NFD, and NFKC in practice, so NFKC is applied to all inputs.
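As a minimal illustration of what NFKC does, the sketch below uses Python's standard `unicodedata` module. The helper name `normalize` and the sample strings are illustrative assumptions, not taken from the IST codebase.

```python
import unicodedata

def normalize(text: str) -> str:
    """Apply NFKC normalization (hypothetical helper, not IST's own code)."""
    return unicodedata.normalize("NFKC", text)

# Compatibility characters collapse to their plain equivalents.
ligature = normalize("\ufb01")             # LATIN SMALL LIGATURE FI
fullwidth = normalize("\uff11\uff12\uff13")  # fullwidth digits 1, 2, 3

# A precomposed Devanagari letter and its base + nukta spelling
# normalize to the same codepoint sequence.
same_form = normalize("\u0958") == normalize("\u0915\u093c")

print(ligature, fullwidth, same_form)
```

Running normalization before vocabulary learning means that two spellings of the same character never compete for separate vocabulary entries.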
2.2 Script-Agnostic Pre-tokenization
Pre-tokenization moves beyond GPT-2’s regular expressions by adopting LLaMA-4’s regex rules for whitespace/punctuation splitting, numeric grouping, and generalized script separation. Sentence delimiters (e.g., period, exclamation, question mark, emoji) are explicitly marked to prevent BPE merges across sentences. Although morphology-aware splitting for Devanagari (root/affix segregation) was explored, it was not incorporated into the production version due to latency overhead.
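The idea of script-agnostic splitting can be sketched without any script-specific rules by classifying characters through their Unicode general category. This is a simplified stand-in, not the LLaMA-4 regex itself: the function names, the three-digit grouping limit, and the delimiter set are assumptions made for illustration.

```python
import unicodedata

# Assumed delimiter set for illustration (ASCII marks plus the Devanagari danda).
SENTENCE_END = {".", "!", "?", "\u0964"}

def char_class(ch: str) -> str:
    """Classify a character by Unicode category, independent of script."""
    cat = unicodedata.category(ch)
    if cat[0] in "LM":   # letters and combining marks (e.g., vowel signs)
        return "word"
    if cat[0] == "N":    # digits and other numerals
        return "num"
    if ch.isspace():
        return "space"
    return "punct"

def pre_tokenize(text: str, max_digits: int = 3) -> list[str]:
    """Split text into pieces: words (with one leading space), digit groups, punctuation."""
    pieces, i, n = [], 0, len(text)
    while i < n:
        start = i
        # Attach a single leading space to the word that follows it.
        if text[i] == " " and i + 1 < n and char_class(text[i + 1]) == "word":
            i += 1
        cls = char_class(text[i])
        limit = max_digits if cls == "num" else n
        j = i
        while j < n and char_class(text[j]) == cls and j - i < limit:
            j += 1
        pieces.append(text[start:j])
        i = j
    return pieces

def split_sentences(pieces: list[str]) -> list[list[str]]:
    """Group pieces into sentences so later merges cannot cross a delimiter."""
    sentences, current = [], []
    for piece in pieces:
        current.append(piece)
        if piece[-1] in SENTENCE_END:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences

text = "मैं घर गया। 12345"
pieces = pre_tokenize(text)
sentences = split_sentences(pieces)
```

Treating combining marks as part of a word matters for Indic scripts, where vowel signs and the virama are separate codepoints that must stay attached to their base consonant.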
2.3 Two-Stage Subword–Superword BPE
Let $V$ denote the final vocabulary size and $t$ the transition size, with $t < V$.
- Stage 1 (subword learning): BPE merges are restricted to within pre-tokenized words, so no merge crosses a whitespace boundary. At each step the most frequent adjacent pair within words is merged. This continues until the vocabulary size reaches $t$.
- Stage 2 (superword learning): The whitespace constraint is lifted, enabling BPE merges across adjacent tokens within the same sentence (but not across sentence boundaries) until the vocabulary size reaches $V$.
During each merge, the pair with the highest frequency is selected: $(a^*, b^*) = \arg\max_{(a, b)} \mathrm{freq}(a, b)$.
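The two-stage procedure can be sketched in a few dozen lines. This is a toy version under stated assumptions: it works on characters rather than bytes, recounts pairs from scratch at every step, and breaks frequency ties lexicographically. The names `learn_merges`, `train_two_stage`, `transition_size`, and `vocab_size` are hypothetical and do not come from the SuperBPE engine.

```python
from collections import Counter

def count_pairs(sequences):
    """Count adjacent token pairs inside each sequence (never across sequences)."""
    pairs = Counter()
    for seq in sequences:
        pairs.update(zip(seq, seq[1:]))
    return pairs

def apply_merge(seq, pair):
    """Replace every non-overlapping occurrence of `pair` with the merged token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged

def learn_merges(sequences, vocab, target_size):
    """Greedy BPE: merge the most frequent pair until `vocab` reaches `target_size`."""
    merges = []
    while len(vocab) < target_size:
        pairs = count_pairs(sequences)
        if not pairs:
            break  # nothing left to merge
        # Highest frequency wins; ties are broken deterministically by the pair itself.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        vocab.add(best[0] + best[1])
        sequences = [apply_merge(seq, best) for seq in sequences]
    return sequences, merges

def train_two_stage(sentences, transition_size, vocab_size):
    """`sentences` is a list of sentences, each a list of pre-tokenized words."""
    # Stage 1: each word is its own sequence, so merges stay inside words.
    words = [list(word) for sent in sentences for word in sent]
    vocab = {ch for word in words for ch in word}
    words, sub_merges = learn_merges(words, vocab, transition_size)

    # Stage 2: each sentence is one sequence, so merges may span words
    # but can never cross a sentence boundary.
    word_iter = iter(words)
    sent_seqs = []
    for sent in sentences:
        seq = []
        for _ in sent:
            seq.extend(next(word_iter))
        sent_seqs.append(seq)
    _, super_merges = learn_merges(sent_seqs, vocab, vocab_size)
    return vocab, sub_merges, super_merges

toy_corpus = [[" of", " the"]] * 3
vocab, sub_merges, super_merges = train_two_stage(toy_corpus, transition_size=11, vocab_size=12)
```

In this toy corpus, Stage 1 builds the whole-word tokens and Stage 2 then fuses them into one multi-word token. Production implementations typically update pair counts incrementally instead of recounting after every merge.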
2.4 Byte Fallback
IST is open-vocabulary: any unknown token is decomposed into UTF-8 bytes, ensuring no out-of-vocabulary (OOV) tokens.
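A minimal sketch of byte fallback follows. The `<0xNN>` token spelling is an assumption borrowed from the SentencePiece convention, and `encode_piece` and `decode_tokens` are hypothetical helper names; the source does not specify how IST spells its byte tokens.

```python
def encode_piece(piece: str, vocab: set[str]) -> list[str]:
    """Return the piece itself if known, otherwise one token per UTF-8 byte."""
    if piece in vocab:
        return [piece]
    return [f"<0x{b:02X}>" for b in piece.encode("utf-8")]

def decode_tokens(tokens: list[str]) -> str:
    """Invert encode_piece: byte tokens become raw bytes, others are UTF-8 encoded."""
    out = bytearray()
    for tok in tokens:
        if len(tok) == 6 and tok.startswith("<0x") and tok.endswith(">"):
            out.append(int(tok[3:5], 16))
        else:
            out.extend(tok.encode("utf-8"))
    return out.decode("utf-8")

toy_vocab = {"a", "b"}
rare = "\u1c5a"  # an Ol Chiki letter, absent from the toy vocabulary
byte_tokens = encode_piece(rare, toy_vocab)
round_trip = decode_tokens(byte_tokens)
```

Because the 256 byte tokens cover every possible UTF-8 sequence, encoding is lossless even for characters never seen during vocabulary training, at the cost of several tokens per character.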
3. Metrics: Fertility, Compactness, and Token Efficiency
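Fertility ($F$) quantifies token fragmentation: $F = \frac{1}{|L|} \sum_{\ell \in L} \frac{T(\ell)}{W(\ell)}$, where $L$ is the set of input lines and, for each $\ell \in L$, $W(\ell)$ is the count of whitespace-separated words and $T(\ell)$ is the number of output tokens. Ideal fertility is $1$.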
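Fertility is straightforward to measure for any tokenizer. The sketch below assumes the per-line token-to-word ratio is averaged over lines; a corpus-level variant (total tokens divided by total words) is also common and gives slightly different numbers. The `fertility` function and the toy character-level tokenizer are illustrative assumptions.

```python
from typing import Callable, Iterable

def fertility(lines: Iterable[str], tokenize: Callable[[str], list]) -> float:
    """Mean tokens-per-word ratio over non-empty lines."""
    ratios = []
    for line in lines:
        words = line.split()          # whitespace-separated words
        if not words:
            continue                  # skip blank lines
        ratios.append(len(tokenize(line)) / len(words))
    if not ratios:
        raise ValueError("no non-empty lines to evaluate")
    return sum(ratios) / len(ratios)

# Toy tokenizer: one token per non-space character (worst-case fragmentation).
def char_tokenize(line: str) -> list:
    return [ch for ch in line if not ch.isspace()]

score = fertility(["ab cd", "abc"], char_tokenize)
```

Any callable that maps a string to a token list can be passed as `tokenize`, which makes it easy to compare tokenizers on the same held-out lines.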
Comparative Snapshots (average fertility across 24 languages):
| Tokenizer | Average Fertility |
|---|---|
| Gemma-3 | ≈2.3 |
| Sutra | ≈2.1 |
| LLaMA-4 | ≈3.1 |
| IST | ≈1.8 |
For Assamese (“as”): LLaMA-4 = 4.40, Sutra = 2.12, IST = 1.85. For English (“eng”): LLaMA-4 = 1.34, Sutra = 1.17, IST = 1.12.
IST achieves 20–40% lower fertility than Sutra or LLaMA-4, corresponding to substantially more compact tokenization and enhanced throughput (Rana et al., 5 Nov 2025).
4. Data Sources, Preprocessing, and Vocabulary
IST is trained on ~10 GB of mixed web, literature, and code data: OLMo web filter data, CommonCrawl, Wikipedia, curated books/PDFs, Sangraha Verified Indic corpus, and code snippets from StackV2. The corpus spans 22 scheduled Indian languages, English, and source code as a pseudo-language.
Preprocessing entails NFKC normalization, script-agnostic pre-tokenization, and sentence boundary marking. The shared multilingual vocabulary comprises approximately 200K tokens, with vocabulary allocation across scripts proportional to their corpus representation (e.g., 38% Devanagari, 32% Latin). No explicit minimum frequency filter is applied; BPE’s inherent pruning suffices, and byte fallback covers rare cases.
5. Ablation Studies and Robustness
Extensive ablations evaluate the impact of various design choices:
| Design Axis | Key Finding |
|---|---|
| Tokenizer data size | Fertility plateaus beyond ~10 GB |
| Vocabulary size | Minimal gains beyond 200 K tokens |
| Pre-tokenization strategy | LLaMA-4 regex: 38–40% lower fertility vs GPT-2 regex |
| Merge strategy (1-stage vs 2-stage) | Two-stage IST outperforms 1-stage "BoundlessBPE" |
| Transition point (Stage 1 to Stage 2) | About 90% of the final vocabulary size is optimal |
For each axis, metrics reported include average fertility, normalized sequence length (NSL), and inference throughput. Two-stage merging consistently yields superior fertility and compactness over single-stage approaches (Rana et al., 5 Nov 2025).
6. Empirical Evaluation
6.1 Intrinsic Metrics
IST attains the best fertility in 20 of the 24 evaluated languages, the best NSL in 23, and the best bytes-per-token in 22. Additional metrics such as Rényi entropy also indicate superior token efficiency.
6.2 Inference Throughput
In a head-to-head evaluation, two identical 1B-parameter LLaMA-3.2 models that differ only in their tokenizer are compared on mixed-language generation, using 8×H100 GPUs and 200 prompts:
- Time-to-First-Token: IST = 18.98 ms, LLaMA-4 = 19.17 ms
- Output Throughput: IST = 169.42 tokens/s, LLaMA-4 = 117.99 tokens/s (IST achieves +44% relative gain)
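Throughput is computed as $\text{Throughput} = \frac{\text{number of output tokens}}{\text{generation time (s)}}$. Relative gain: $\frac{169.42 - 117.99}{117.99} \approx 0.436$, i.e., roughly +44%.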
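The arithmetic behind the reported gain can be checked directly from the two throughput figures above. The function names are illustrative, and throughput is assumed here to mean output tokens divided by wall-clock generation time.

```python
def output_throughput(num_output_tokens: int, generation_seconds: float) -> float:
    """Output tokens generated per second of wall-clock time."""
    return num_output_tokens / generation_seconds

def relative_gain(new: float, baseline: float) -> float:
    """Fractional improvement of `new` over `baseline`."""
    return (new - baseline) / baseline

# Reported output throughputs in tokens/s.
gain = relative_gain(169.42, 117.99)
print(f"{gain:.1%}")  # 43.6%
```

The time-to-first-token figures are nearly identical for the two tokenizers, so the gain comes almost entirely from the decoding phase rather than prompt processing.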
6.3 Downstream Task Performance
LLMs utilizing IST maintain downstream performance on standardized English and Indic benchmarks:
| Benchmark Type | LLaMA-4 Accuracy | IST Accuracy |
|---|---|---|
| English (avg) | 0.279 | 0.279 |
| Indic (avg) | 0.388 | 0.394 |
Evaluations include HellaSwag, CommonsenseQA, and MMLU for English, and IndicCOPA, IndicSentiment, and IndicXNLI for Indic languages, among others.
7. Implementation and Integration Guidance
IST is implemented using the open-source SuperBPE engine (“PythonNut/superbpe”) and HuggingFace tokenizers. Hyperparameters include a 200,008-entry vocabulary, a transition point at 90% of the vocabulary size, and deterministic BPE merging (no learning rate or gradient). The per-merge computational cost depends on the corpus size and the average line length. Stage 2 (superword merging) is more memory-intensive (~2× Stage 1), but is a one-time cost.
Continual pretraining, in line with ReTok approaches, allows plug-in replacement by freezing most model layers and retraining embeddings and output heads, with new token embeddings initialized as averages over prior subword representations.
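The embedding initialization described above can be sketched in plain Python. This is an illustrative assumption of how the averaging might look, using dictionaries of lists instead of framework tensors; `init_new_embeddings`, `old_tokenize`, and `old_embeddings` are hypothetical names.

```python
from typing import Callable

def init_new_embeddings(
    new_tokens: list[str],
    old_tokenize: Callable[[str], list[str]],
    old_embeddings: dict[str, list[float]],
) -> dict[str, list[float]]:
    """Initialize each new token as the mean of its old-tokenizer piece embeddings."""
    new_embeddings = {}
    for token in new_tokens:
        pieces = old_tokenize(token)                  # how the old tokenizer splits it
        vectors = [old_embeddings[p] for p in pieces]
        dim = len(vectors[0])
        new_embeddings[token] = [
            sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)
        ]
    return new_embeddings

# Toy setup: the old tokenizer splits into characters with 2-dimensional embeddings.
old_embeddings = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
new_embeddings = init_new_embeddings(["ab", "a"], list, old_embeddings)
```

Starting each new token near the region of embedding space its pieces already occupied gives continual pretraining a warmer start than random initialization.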
IST is distributed as a HuggingFace-compatible tokenizer: IndicSuperTokenizer.from_pretrained(...). For retrofitting existing models, continual pretraining on mixed English/Indic/code datasets is recommended.
Limitations: Morphology-aware splitting is not used in production due to latency; sentence-level merge constraints prevent cross-sentence superword merges without impeding idiomatic within-sentence multi-word encoding. Outlier languages with extremely small corpora (<50 MB) may show elevated fragmentation; tuning the regex per script is suggested. IST’s code vocab primarily covers mainstream languages—specialized DSLs may require extending the vocabulary.
IST provides a drop-in solution for new or existing LLM pipelines serving Indic multilingual workloads, addressing fragmentation and fairness with demonstrated improvements in sequence compactness and throughput, and robust performance across numerous ablations (Rana et al., 5 Nov 2025).