IST: Efficient Tokenization for Indic Languages

Updated 16 May 2026

IndicSuperTokenizer (IST) is a hybrid tokenizer that combines subword and superword tokenization to significantly reduce token fragmentation and improve fertility scores.
It employs a four-step process including Unicode normalization, script-agnostic pre-tokenization, two-stage BPE merging, and byte fallback to handle the complexity of Indic scripts.
Evaluations demonstrate that IST reduces sequence lengths and boosts inference throughput by up to 44% while maintaining competitive downstream performance.

IndicSuperTokenizer (IST) is an optimized tokenizer for Indic multilingual LLMs designed to address the unique challenges posed by the linguistic diversity, complex scripts, and morphological richness of Indian languages. IST combines subword and multi-word ("superword") tokenization with script-agnostic pre-tokenization, resulting in more linguistically aligned tokens, reduced sequence lengths, and significant improvements in inference efficiency and fairness for Indic language users. It is evaluated across English, 22 Indian languages, and code data, achieving a new state-of-the-art in fertility score and inference throughput while maintaining parity on downstream task performance with established English-centric tokenizers (Rana et al., 5 Nov 2025).

1. Motivation and Design Principles

Tokenization is pivotal for the training efficiency and inference cost of LLMs, particularly in multilingual contexts characterized by significant variation in script and word formation. Indic scripts such as Devanagari, Bengali, Gurmukhi, Tamil, Telugu, Kannada, Malayalam, Meitei-Mayek, Ol Chiki, and Arabic-derived scripts (for Urdu/Sindhi) exhibit substantial morphological complexity—agglutination, compounding, and inflection—that amplifies fragmentation when using standard subword algorithms such as Byte Pair Encoding (BPE).

A key challenge addressed by IST is the extreme fertility imbalance found in prior systems. Fertility, the mean token-to-word ratio, is highly unfavorable for many Indic languages under typical BPE; for instance, LLaMA-4’s default tokenizer produces a fertility of 10.5 for Oriya compared to ~1.3 for English. High fertility increases sequence length, inflates memory requirements and latency, and imposes an unfair computational burden on Indic-language users. The hybrid approach of IST—combining subword learning for morphological granularity and multi-word ("superword") merging for phrase-level compactness—is motivated by the need to reduce fragmentation and better align tokens with semantic units.

2. Tokenization Pipeline and Algorithm

IST’s architecture comprises a structured four-step process: Unicode normalization, script-agnostic pre-tokenization, a two-stage BPE-based token learning procedure, and open-vocabulary byte fallback.

2.1 Unicode Normalization

Inputs are normalized with NFKC (Normalization Form Compatibility Composition), which maps canonically equivalent and compatibility-equivalent code point sequences to a single composed form. Analysis indicates negligible differences among NFC, NFD, and NFKC in practice, so NFKC is applied to all inputs.
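A minimal sketch of this step using Python's standard `unicodedata` module. The helper name `normalize_text` is illustrative and is not taken from the IST code base.

```python
import unicodedata

def normalize_text(text: str) -> str:
    """Apply NFKC normalization to a string."""
    return unicodedata.normalize("NFKC", text)

# Devanagari QA can be encoded as one code point (U+0958)
# or as KA followed by NUKTA (U+0915 U+093C).
single = "\u0958"
sequence = "\u0915\u093C"

# After normalization both spellings are the same string,
# so the tokenizer sees one form instead of two.
print(normalize_text(single) == normalize_text(sequence))

# Compatibility characters are also folded, e.g. the "fi" ligature.
print(normalize_text("\ufb01"))
```

Without this step, the two spellings would produce different byte sequences and could end up as separate vocabulary entries.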

2.2 Script-Agnostic Pre-tokenization

Pre-tokenization moves beyond GPT-2’s regular expressions by adopting LLaMA-4’s regex rules for whitespace/punctuation splitting, numeric grouping, and generalized script separation. Sentence delimiters (e.g., period, exclamation, question mark, emoji) are explicitly marked to prevent BPE merges across sentences. Although morphology-aware splitting for Devanagari (root/affix segregation) was explored, it was not incorporated into the production version due to latency overhead.
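The idea of splitting text into pre-tokens while marking sentence boundaries can be sketched as follows. The pattern is a deliberately simplified stand-in written for this illustration; it is not the LLaMA-4 regex. The delimiter set (including the Devanagari danda) and the function name `pretokenize` are likewise assumptions.

```python
import re

# Illustrative sentence delimiters; the danda (U+0964) is an assumed addition.
SENT_DELIMS = ".!?\u0964"
DELIM_SET = set(SENT_DELIMS)

# Simplified stand-in pattern (NOT the LLaMA-4 regex):
#   optional leading space + run of non-space, non-digit, non-delimiter chars
#   | digit groups of at most 3 | a single delimiter | a whitespace run
PATTERN = re.compile(
    r" ?[^\s\d" + re.escape(SENT_DELIMS) + r"]+"
    r"|\d{1,3}"
    r"|[" + re.escape(SENT_DELIMS) + r"]"
    r"|\s+"
)

def pretokenize(text):
    """Return a list of sentences, each a list of pre-tokens."""
    sentences, current = [], []
    for piece in PATTERN.findall(text):
        current.append(piece)
        if piece in DELIM_SET:      # close the sentence at a delimiter
            sentences.append(current)
            current = []
    if current:                     # trailing text without a delimiter
        sentences.append(current)
    return sentences

for sentence in pretokenize("नमस्ते दुनिया। Hello world 12345!"):
    print(sentence)
```

Keeping sentences as separate lists makes the later constraint easy to enforce: merges are only ever counted inside one list.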

2.3 Two-Stage Subword–Superword BPE

Let $V_{\text{target}} = 200,000$ denote final vocabulary size, and $t = 0.9 \cdot V_{\text{target}}$ the transition size.

Stage 1: Subword Learning—BPE merges are restricted within pre-tokenized words (whitespace boundary). Merges are selected by maximizing frequency of adjacent pairs $(x,y)$ within words. This continues until $|V| = t$ .
Stage 2: Superword Learning—Whitespace constraints are lifted, enabling BPE merges across adjacent tokens within the same sentence (but not crossing sentence boundaries) until the vocabulary reaches $|V| = V_{\text{target}}$ .

At each merge step, in either stage, the adjacent pair with the highest frequency is selected and merged: $(x^*, y^*) = \arg\max_{(x,y)} \mathrm{freq}(x,y)$.
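The two stages can be illustrated with a toy trainer. This is a sketch under simplifying assumptions and not the SuperBPE implementation: it works on characters rather than bytes, recounts all pairs at every step, and breaks ties by first occurrence. All function names are hypothetical.

```python
from collections import Counter

def merge_pair(seq, pair):
    """Replace each non-overlapping occurrence of `pair` in `seq` with the joined token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_merges(units, vocab, limit):
    """Greedy BPE over `units` (lists of tokens); pairs never span two units."""
    merges = []
    while len(vocab) < limit:
        counts = Counter(p for u in units for p in zip(u, u[1:]))
        if not counts:
            break
        pair = max(counts, key=counts.get)  # argmax freq(x, y); ties: first seen
        units = [merge_pair(u, pair) for u in units]
        vocab.add(pair[0] + pair[1])
        merges.append(pair)
    return units, merges

def train_two_stage(sentences, v_target, ratio=0.9):
    """sentences: list of sentences, each a list of pre-tokenized words."""
    t = int(ratio * v_target)  # transition size
    vocab = {ch for s in sentences for w in s for ch in w}
    # Stage 1: every word is its own unit, so merges stay inside words.
    words = [list(w) for s in sentences for w in s]
    words, sub_merges = learn_merges(words, vocab, t)
    # Stage 2: every sentence is one unit, so merges may span words
    # but cannot cross a sentence boundary.
    units, k = [], 0
    for n in (len(s) for s in sentences):
        units.append([tok for w in words[k:k + n] for tok in w])
        k += n
    units, super_merges = learn_merges(units, vocab, v_target)
    return vocab, sub_merges, super_merges

corpus = [[" of", " the"]] * 3 + [[" the", " cat"]]
vocab, sub, sup = train_two_stage(corpus, v_target=20, ratio=0.5)
print(sub)
print(sup)
```

On this toy corpus the stage-1 merges stay inside words, while stage 2 goes on to learn the multi-word token " of the".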

2.4 Byte Fallback

IST is open-vocabulary: any unknown token is decomposed into UTF-8 bytes, ensuring no out-of-vocabulary (OOV) tokens.
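A small sketch of byte fallback. The `<0xNN>` token naming follows a convention common in SentencePiece-style vocabularies and is assumed here for illustration; the source does not state how IST names its byte tokens.

```python
def encode_with_byte_fallback(pieces, vocab):
    """Keep pieces found in `vocab`; split all others into one token per UTF-8 byte."""
    out = []
    for piece in pieces:
        if piece in vocab:
            out.append(piece)
        else:
            out.extend(f"<0x{b:02X}>" for b in piece.encode("utf-8"))
    return out

def decode(tokens):
    """Invert the encoding by reassembling the byte stream."""
    data = bytearray()
    for tok in tokens:
        if len(tok) == 6 and tok.startswith("<0x") and tok.endswith(">"):
            data.append(int(tok[3:5], 16))
        else:
            data.extend(tok.encode("utf-8"))
    return data.decode("utf-8")

vocab = {"ab"}
tokens = encode_with_byte_fallback(["ab", "\u1c5a"], vocab)  # U+1C5A: an Ol Chiki letter
print(tokens)          # ['ab', '<0xE1>', '<0xB1>', '<0x9A>']
print(decode(tokens) == "ab\u1c5a")
```

Because every string has a UTF-8 byte representation, the encoder never needs an unknown-token symbol, and decoding is lossless.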

3. Metrics: Fertility, Compactness, and Token Efficiency

Fertility ($F$) quantifies token fragmentation: $F = \frac{1}{N} \sum_{i=1}^{N} \frac{T_i}{W_i}$, where $S = \{s_1, \ldots, s_N\}$ are the input lines and, for each $s_i$, $W_i$ is the count of whitespace-separated words and $T_i$ is the number of output tokens. Ideal fertility is $\approx 1$.
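The fertility definition above, the mean per-line ratio of tokens to whitespace-separated words, translates directly into code. In this sketch, `tokenize` is any callable returning a list of tokens; skipping lines with no words is an assumption made to avoid division by zero.

```python
def fertility(lines, tokenize):
    """Mean over lines of (number of tokens) / (number of whitespace-separated words)."""
    ratios = []
    for line in lines:
        words = line.split()
        if not words:          # assumption: lines without words are skipped
            continue
        ratios.append(len(tokenize(line)) / len(words))
    if not ratios:
        raise ValueError("no non-empty lines to score")
    return sum(ratios) / len(ratios)

# Toy tokenizers for illustration only.
char_level = lambda s: [c for c in s if not c.isspace()]
word_level = lambda s: s.split()

lines = ["ab cd", "abcd"]
print(fertility(lines, char_level))  # 3.0  -> (4/2 + 4/1) / 2
print(fertility(lines, word_level))  # 1.0
```

Note that this is a per-line average, so short lines carry the same weight as long ones.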

Comparative Snapshots (average fertility across 24 languages):

Tokenizer	Average Fertility
Gemma-3	≈2.3
Sutra	≈2.1
LLaMA-4	≈3.1
IST	≈1.8

For Assamese (“as”): LLaMA-4 = 4.40, Sutra = 2.12, IST = 1.85. For English (“eng”): LLaMA-4 = 1.34, Sutra = 1.17, IST = 1.12.

IST achieves 20–40% lower fertility than Sutra or LLaMA-4, corresponding to substantially more compact tokenization and enhanced throughput (Rana et al., 5 Nov 2025).

4. Data Sources, Preprocessing, and Vocabulary

IST is trained on ~10 GB of mixed web, literature, and code data: OLMo web filter data, CommonCrawl, Wikipedia, curated books/PDFs, Sangraha Verified Indic corpus, and code snippets from StackV2. The corpus spans 22 scheduled Indian languages, English, and source code as a pseudo-language.

Preprocessing entails NFKC normalization, script-agnostic pre-tokenization, and sentence boundary marking. The shared multilingual vocabulary comprises roughly 200 K tokens, with vocabulary allocation across scripts proportional to their corpus representation (e.g., 38% Devanagari, 32% Latin). No explicit minimum frequency filter is applied; BPE’s inherent pruning suffices, and byte fallback covers rare cases.
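One rough way to inspect how a vocabulary is distributed across scripts is to label each token by the Unicode name prefix of its letters. This is a heuristic sketch for illustration, not the procedure used by the authors, and the function names are hypothetical.

```python
import unicodedata
from collections import Counter

def token_script(token):
    """Label a token by the most common Unicode-name prefix among its letters."""
    prefixes = Counter(
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in token
        if ch.isalpha()
    )
    # Tokens without letters (digits, punctuation, spaces) fall into "OTHER".
    return prefixes.most_common(1)[0][0] if prefixes else "OTHER"

def script_shares(vocab):
    """Fraction of vocabulary entries assigned to each script label."""
    counts = Counter(token_script(tok) for tok in vocab)
    total = sum(counts.values())
    return {script: n / total for script, n in counts.items()}

toy_vocab = ["the", " of", "नम", "स्ते", "12"]
print(script_shares(toy_vocab))
```

On a real vocabulary, the resulting shares can be compared against the share of each script in the training corpus.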

5. Ablation Studies and Robustness

Extensive ablations evaluate the impact of various design choices:

Design Axis	Key Finding
Tokenizer data size	Fertility plateaus beyond ~10 GB
Vocabulary size	Minimal gains beyond 200 K tokens
Pre-tokenization strategy	LLaMA-4 regex: 38–40% lower fertility vs GPT-2 regex
Merge strategy (1-stage vs 2-stage)	Two-stage IST outperforms 1-stage "BoundlessBPE"
Transition point $t$	$t = 0.9 \cdot V_{\text{target}}$ (90%) optimal

For each axis, metrics reported include average fertility, normalized sequence length (NSL), and inference throughput. Two-stage merging consistently yields superior fertility and compactness over single-stage approaches (Rana et al., 5 Nov 2025).

6. Empirical Evaluation

6.1 Intrinsic Metrics

IST attains the best fertility in 20 of the 24 evaluated languages, the best NSL in 23 of 24, and the best bytes-per-token in 22 of 24. Additional metrics such as Rényi entropy reflect superior token efficiency.

6.2 Inference Throughput

In a head-to-head evaluation, two identical 1B-parameter LLaMA-3.2 models, differing only in tokenizer, are compared on mixed-language generation. Testing is run on 8×H100 GPUs with 200 prompts:

Time-to-First-Token: IST = 18.98 ms, LLaMA-4 = 19.17 ms
Output Throughput: IST = 169.42 tokens/s, LLaMA-4 = 117.99 tokens/s (IST achieves +44% relative gain)

Throughput is computed as $\text{Throughput} = \frac{\text{number of output tokens}}{\text{generation time (s)}}$. Relative gain: $\frac{169.42 - 117.99}{117.99} \approx 0.44$.

6.3 Downstream Task Performance

LLMs utilizing IST maintain downstream performance on standardized English and Indic benchmarks:

Benchmark Type	LLaMA-4 Accuracy	IST Accuracy
English (avg)	0.279	0.279
Indic (avg)	0.388	0.394

Evaluations use HellaSwag, CommonsenseQA, MMLU (English), and IndicCOPA, IndicSentiment, IndicXNLI, among others (Indic).

7. Implementation and Integration Guidance

IST is implemented using the open-source SuperBPE engine (“PythonNut/superbpe”) and HuggingFace tokenizers. Hyperparameters include a 200,008-entry vocabulary, a transition point $t$ at 90% of $V_{\text{target}}$, and deterministic BPE merging (no learning rate or gradient). The computational cost per merge depends on the corpus size and the average line length. Stage 2 (superword merging) is more memory-intensive (~2× Stage 1), but is a one-time cost.

Continual pretraining, in line with ReTok approaches, allows plug-in replacement by freezing most model layers and retraining embeddings and output heads, with new token embeddings initialized as averages over prior subword representations.
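The averaging initialization can be sketched in a few lines. The example uses plain Python lists instead of a tensor library to stay self-contained; `old_tokenize`, `old_embeddings`, and the function name are hypothetical stand-ins for the original model's tokenizer and embedding table.

```python
def init_new_embeddings(new_tokens, old_tokenize, old_embeddings):
    """Initialize each new token as the mean of the old embeddings of its pieces."""
    new_embeddings = {}
    for tok in new_tokens:
        pieces = old_tokenize(tok)                   # split with the OLD tokenizer
        vectors = [old_embeddings[p] for p in pieces]
        dim = len(vectors[0])
        new_embeddings[tok] = [
            sum(v[d] for v in vectors) / len(vectors) for d in range(dim)
        ]
    return new_embeddings

# Toy setup: the old tokenizer splits " of the" into two known pieces.
old_embeddings = {" of": [1.0, 0.0], " the": [0.0, 1.0]}
old_splits = {" of the": [" of", " the"]}
old_tokenize = lambda s: old_splits.get(s, [s])

print(init_new_embeddings([" of the"], old_tokenize, old_embeddings))
# {' of the': [0.5, 0.5]}
```

Starting a superword token near the centroid of its former pieces gives the retrained embedding and output layers a reasonable initial point instead of a random one.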

IST is distributed as a HuggingFace-compatible tokenizer: IndicSuperTokenizer.from_pretrained(...). For retrofitting existing models, continual-pretraining is recommended on mixed English/Indic/code datasets with most layers frozen and focused training of embedding and language modeling head weights.

Limitations: Morphology-aware splitting is not used in production due to latency; sentence-level merge constraints prevent cross-sentence superword merges without impeding idiomatic within-sentence multi-word encoding. Outlier languages with extremely small corpora (<50 MB) may show elevated fragmentation; tuning the regex per script is suggested. IST’s code vocab primarily covers mainstream languages—specialized DSLs may require extending the vocabulary.

IST provides a drop-in solution for new or existing LLM pipelines serving Indic multilingual workloads, addressing fragmentation and fairness with demonstrated improvements in sequence compactness and throughput, and robust performance across numerous ablations (Rana et al., 5 Nov 2025).

References (1)

IndicSuperTokenizer: An Optimized Tokenizer for Indic Multilingual LLMs (2025)
