
AraToken: Arabic Tokenization & Adaptation

Updated 27 December 2025
  • AraToken is an Arabic-optimized tokenization system that combines specialized normalization and a probabilistic SentencePiece Unigram model to effectively handle Arabic morphology.
  • It employs a multi-stage normalization process—including Unicode NFKC, Alif unification, numeral and punctuation standardization—to improve token sequence compression and finetuning performance.
  • The Language Extension Pipeline (LEP) integrates new vocabulary into existing LLMs by merging tokens and selectively updating embeddings, eliminating the need for full model retraining.

AraToken is an Arabic-optimized tokenization and language adaptation system designed to address the limitations of generic tokenizers used in LLMs when applied to morphologically rich languages such as Arabic. By combining a specialized normalization pipeline with a probabilistic SentencePiece Unigram model, AraToken demonstrates substantial improvements in token sequence compression and subsequent model finetuning performance. A central architectural contribution is the Language Extension Pipeline (LEP), which efficiently integrates the improved tokenizer into existing LLMs such as Qwen3-0.6B through targeted vocabulary and embedding adaptation, eliminating the requirement for full model retraining (Kashirskiy et al., 20 Dec 2025).

1. Normalization Pipeline

AraToken employs a multi-stage normalization process prior to subword tokenization, implemented via the HuggingFace Tokenizers library. The objective is to reduce orthographic diversity in Arabic text, which in turn improves vocabulary efficiency and downstream model performance. The stages are:

  1. Unicode NFKC Normalization: All input is normalized using Unicode NFKC (Normalization Form KC), decomposing compatibility characters and recomposing them into canonical forms.
  2. Alif Variant Unification: Four commonly occurring "Alif" variants are mapped to bare U+0627 (ALIF). Specifically, U+0623 (Hamza-above), U+0625 (Hamza-below), U+0622 (Madda), and U+0671 (Wasla) are all replaced by U+0627.
  3. Numeral Standardization: Arabic-Indic digits (U+0660–U+0669) are mapped to the Western numerals '0'–'9'. The Arabic decimal separator U+066B and thousands separator U+066C are mapped to '.' and ',', respectively.
  4. Punctuation Normalization: Arabic-specific punctuation, including U+061F (question mark), U+061B (semicolon), and U+060C (comma), are converted to their Western equivalents.
  5. Tatweel Removal: The kashida character (U+0640) is removed entirely.
  6. Diacritic Handling: Two configurations are supported: (A) drop all diacritics (U+0610–U+061A, U+064B–U+065F); (B) keep diacritics.

The cumulative normalization process can be formalized:

\mathrm{Norm}(c) = \begin{cases} \text{ALIF} & \text{if } c \in \{\text{Hamza-above}, \ldots\} \\ \mathrm{digit}(c) & \text{if } c \in \{\text{Arabic-Indic numerals}\} \\ \epsilon & \text{if } c = \text{Tatweel} \\ c & \text{otherwise} \end{cases}

This systematic normalization is essential for maximizing subword overlap and reducing superficial vocabulary bloat arising from orthographic variation (Kashirskiy et al., 20 Dec 2025).
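A minimal sketch of this normalization in plain Python follows; the released pipeline expresses the same steps as HuggingFace Tokenizers normalizers, so the function below is illustrative rather than the project's implementation.

```python
# Illustrative character-level normalization following the steps listed above
# (NFKC, Alif unification, numeral/punctuation mapping, Tatweel removal,
# optional diacritic dropping). Not the released Tokenizers configuration.
import re
import unicodedata

ALIF_VARIANTS = {"\u0623", "\u0625", "\u0622", "\u0671"}            # Hamza-above/below, Madda, Wasla
ARABIC_INDIC_DIGITS = {chr(0x0660 + d): str(d) for d in range(10)}  # U+0660-U+0669 -> '0'-'9'
PUNCT_MAP = {"\u061F": "?", "\u061B": ";", "\u060C": ",",           # Arabic ?, ;, ,
             "\u066B": ".", "\u066C": ","}                          # decimal / thousands separators
DIACRITICS = re.compile(r"[\u0610-\u061A\u064B-\u065F]")

def normalize(text: str, drop_diacritics: bool = True) -> str:
    text = unicodedata.normalize("NFKC", text)                      # 1. Unicode NFKC
    chars = []
    for c in text:
        if c in ALIF_VARIANTS:
            chars.append("\u0627")                                  # 2. unify Alif variants to bare ALIF
        elif c in ARABIC_INDIC_DIGITS:
            chars.append(ARABIC_INDIC_DIGITS[c])                    # 3. Arabic-Indic digits -> Western
        elif c in PUNCT_MAP:
            chars.append(PUNCT_MAP[c])                              # 3-4. separators and punctuation
        elif c == "\u0640":
            continue                                                # 5. remove Tatweel (kashida)
        else:
            chars.append(c)
    out = "".join(chars)
    if drop_diacritics:
        out = DIACRITICS.sub("", out)                               # 6A. drop diacritics
    return out
```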

2. Tokenization Algorithm Selection and Performance

The AraToken study compares three prominent subword tokenization algorithms: Byte-Pair Encoding (BPE), WordPiece, and the SentencePiece Unigram model, each evaluated with and without normalization. The rationale for selecting SentencePiece Unigram is grounded in its probabilistic segmentation, which is highly effective for Arabic’s agglutinative morphology.

  • BPE and WordPiece rely on greedy, deterministic merge operations that can result in inefficiencies and are less adaptable to rare or inflected words.
  • Unigram employs a probabilistic model, maximizing the marginal likelihood over all possible segmentations:

\mathcal{L}(V) = \sum_{x \in X} \log\Bigl(\sum_{s \in \mathrm{Seg}(x)} P(s)\Bigr)

where $P(s) = \prod_{t \in s} p(t)$ and $p(t)$ is the probability of subtoken $t$.
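The toy sketch below illustrates this objective for a single word: it enumerates every segmentation permitted by a small hand-specified vocabulary and sums their probabilities. The vocabulary and subtoken probabilities are made up for illustration; a real Unigram model learns them with EM over a large corpus.

```python
# Toy illustration of the Unigram objective: P(s) = prod_{t in s} p(t), summed
# over all segmentations of a word. The probabilities below are hypothetical.
import math

p = {"al": 0.05, "kitab": 0.01, "k": 0.02, "itab": 0.004, "alkitab": 0.0008}

def segmentations(word: str):
    """Yield every split of `word` into in-vocabulary subtokens."""
    if not word:
        yield []
        return
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in p:
            for tail in segmentations(word[i:]):
                yield [head] + tail

def marginal_log_likelihood(word: str) -> float:
    """log of the sum over segmentations s of prod_{t in s} p(t)."""
    total = sum(math.prod(p[t] for t in s) for s in segmentations(word))
    return math.log(total) if total > 0 else float("-inf")

print(marginal_log_likelihood("alkitab"))  # sums 'alkitab', 'al+kitab', 'al+k+itab'
```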

Performance is measured with the fertility metric $F = T/W$, i.e., the average number of subwords per word, given $T$ subwords and $W$ words in held-out data. The table below reports fertility, compression, and out-of-vocabulary (OOV) rates.
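A short sketch of how these metrics can be computed for any trained SentencePiece model is shown below; the model and held-out file paths are placeholders.

```python
# Compute fertility (F = T/W) and compression (chars per token) for a trained
# SentencePiece model on held-out text. File paths are placeholders.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="aratoken_unigram.model")

n_tokens = n_words = n_chars = 0
with open("heldout_ar.txt", encoding="utf-8") as f:
    for line in f:
        words = line.split()
        pieces = sp.encode(line, out_type=str)
        n_words += len(words)
        n_tokens += len(pieces)
        n_chars += sum(len(w) for w in words)

print(f"fertility   F = {n_tokens / n_words:.3f}")
print(f"compression   = {n_chars / n_tokens:.2f} chars/token")
```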

Algorithm        Fertility (F)   Compression (chars/token)   OOV (%)
bpe_drop_norm    1.243           4.85                         0.0
wp_drop_norm     1.244           4.85                         0.0
sp_drop_norm     1.199           5.03                         0.10
bpe_drop (raw)   ~1.350          –                            –

With normalization and diacritic removal, Unigram (sp_drop_norm) achieves a fertility of 1.199, an 8.5% improvement over the raw Unigram baseline (1.311), and an 18% improvement over the unnormalized BPE baseline (1.350). Vocabulary pruning experiments show that 99% coverage (76K tokens) presents an effective trade-off between fertility and compression (Kashirskiy et al., 20 Dec 2025).
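Coverage-based pruning of this kind can be sketched as follows, assuming token frequencies have already been counted by encoding the training corpus with the full 150K vocabulary; the helper below is illustrative, not the project's script.

```python
# Keep the most frequent tokens until they account for the target share
# (e.g. 99%) of all token occurrences; the remainder is pruned.
from collections import Counter

def prune_by_coverage(token_counts: Counter, coverage: float = 0.99) -> list[str]:
    total = sum(token_counts.values())
    kept, cumulative = [], 0
    for token, count in token_counts.most_common():
        kept.append(token)
        cumulative += count
        if cumulative / total >= coverage:
            break
    return kept

# Example (counts are hypothetical):
# pruned_vocab = prune_by_coverage(corpus_token_counts, coverage=0.99)
```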

3. Language Extension Pipeline (LEP) for Model Adaptation

The Language Extension Pipeline (LEP) is a systematic approach for retrofitting Arabic tokenization and vocabulary into Qwen3-0.6B without exhaustive retraining.

  • Vocabulary Extension: The new SentencePiece vocabulary $V_{ar}$ is merged with the original Qwen3 vocabulary $V_{qwen}$. Newly introduced tokens $\Delta V = V_{ar} \setminus V_{qwen}$ are filtered to exclude non-Arabic elements.
  • Mean Subtoken Embedding Initialization: Each new token $t \in \Delta V$ is encoded via the old tokenizer as a sequence of subtokens $S = \{ s_1, \ldots, s_k \}$, and its embedding is initialized as the arithmetic mean (see the sketch after this list):

\mathbf{e}_t = \frac{1}{|S|} \sum_{i=1}^{|S|} \mathbf{e}_{s_i}

  • Gradient Masking: During adaptation, gradients for the original embeddings are masked ($\nabla_{\mathbf{e}_i}\mathcal{L} = 0$ for $i < |V_{qwen}|$), freezing their values while permitting updates for the new tokens; this is also illustrated in the sketch after this list.
  • Selective Transformer Layer Unfreezing: Only the upper 4 layers (24–27 of 28 total) are unfrozen. This restricts adaptation to the model's higher-level representations, stabilizing its core multilingual knowledge.
  • Fine-tuning Protocol: Training uses 100K Modern Standard Arabic samples and 2K validation samples. Key settings include sequence length 256, batch size 16 (gradient accumulation 6), the AdamW optimizer, learning rate $2 \times 10^{-4}$, standard autoregressive cross-entropy loss, and 800 maximum steps.
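A condensed sketch of these LEP steps is given below, assuming the HuggingFace transformers API and the standard Qwen3 decoder layout (model.model.layers with 28 layers; exact attribute names may differ). The Arabic token list, output path, and dataset handling are placeholders rather than the released scripts.

```python
# Sketch of LEP adaptation: vocabulary extension, mean-subtoken embedding
# initialization, gradient masking, selective unfreezing, and the reported
# fine-tuning settings. Token list and paths are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

model_name = "Qwen/Qwen3-0.6B"
old_tok = AutoTokenizer.from_pretrained(model_name)   # original tokenizer over V_qwen
new_tok = AutoTokenizer.from_pretrained(model_name)   # copy that will receive Delta V
model = AutoModelForCausalLM.from_pretrained(model_name)
orig_vocab_size = len(old_tok)                        # |V_qwen|

# Delta V: Arabic-only tokens from the AraToken Unigram vocabulary absent from V_qwen.
delta_v = ["<placeholder_arabic_token>"]

# 1. Vocabulary extension.
new_tok.add_tokens(delta_v)
model.resize_token_embeddings(len(new_tok))

# 2. Mean subtoken embedding initialization: e_t = (1/|S|) * sum_i e_{s_i},
#    where S is the decomposition of t under the *original* tokenizer.
emb = model.get_input_embeddings().weight
with torch.no_grad():
    for token in delta_v:
        sub_ids = old_tok.encode(token, add_special_tokens=False)
        emb[new_tok.convert_tokens_to_ids(token)] = emb[sub_ids].mean(dim=0)

# 3. Selective unfreezing: freeze everything, then unfreeze the embedding matrix
#    (original rows are protected by the gradient mask below) and layers 24-27.
for p in model.parameters():
    p.requires_grad = False
emb.requires_grad = True
for layer in model.model.layers[24:28]:
    for p in layer.parameters():
        p.requires_grad = True

# 4. Gradient masking: zero gradients for rows i < |V_qwen| so the original
#    embeddings stay fixed while the new Arabic rows are trained.
def mask_original_rows(grad: torch.Tensor) -> torch.Tensor:
    grad = grad.clone()
    grad[:orig_vocab_size] = 0.0
    return grad

emb.register_hook(mask_original_rows)

# 5. Fine-tuning settings reported for LEP (dataset loading and the
#    sequence-length-256 tokenization step are omitted here).
args = TrainingArguments(
    output_dir="lep-qwen3-ar",               # placeholder path
    per_device_train_batch_size=16,
    gradient_accumulation_steps=6,
    learning_rate=2e-4,
    max_steps=800,
    optim="adamw_torch",
)
```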

4. Empirical Evaluation

Comprehensive experimentation validates the effectiveness of AraToken and LEP:

  • Intrinsic Tokenizer Metrics: SentencePiece Unigram plus normalization (sp_drop_norm) consistently yields the lowest fertility (1.199), superior compression rates, and near-zero OOV.
  • Vocabulary Pruning: At 99% coverage (76K tokens), fertility is 1.229 and compression is 4.91 chars/token. Further pruning (down to 95% coverage) increases fertility to 1.293.
  • Model Adaptation Outcomes: Fine-tuning with LEP rapidly reduces evaluation loss from 8.28 (pre-adaptation) to 2.43 after 800 steps.
  • Ablation Study: Key factors impacting performance include Alif normalization, learning rate, and the number of unfrozen layers:
    • Unified Alif normalization, LR = $2 \times 10^{-4}$, 4 layers unfrozen: Eval Loss = 3.03 (–63% vs. baseline)
    • Alif4 variant, LR = $2 \times 10^{-4}$, 4 layers unfrozen: Eval Loss = 2.43 (–71%)
    • Alif4 variant, LR = $2 \times 10^{-4}$, all layers frozen: Eval Loss = 4.02 (–51%)

No statistical significance tests are reported, but the relative magnitude of improvements is substantial.

5. Implementation Artifacts and Reproducibility

AraToken provides a complete, open-source release:

  • Tokenizer code (normalization pipeline and SentencePiece Unigram training scripts)
  • LEP fine-tuning scripts (vocabulary extension, embedding initialization, gradient masking, selective transformer unfreezing)
  • Pretrained and adapted Qwen3-0.6B model checkpoints
  • Configuration files for optimizer, scheduler, and hyperparameters
  • A comprehensive README detailing the reproduction pipeline:

    1. Normalize and prepare the Arabic corpus.
    2. Train the SentencePiece Unigram tokenizer (vocab=150K; prune to 99% coverage).
    3. Merge vocabularies and initialize new embeddings with the LEP scripts.
    4. Conduct fine-tuning for 800 steps.
    5. Evaluate adaptation performance on the validation set.

All associated artifacts and instructions are distributed via the project’s public repository, enabling exact reproduction of reported results and facilitating further Arabic NLP research (Kashirskiy et al., 20 Dec 2025).
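For step 2 of the pipeline above, the corresponding SentencePiece training call (Python API) would look roughly like the following; the corpus filename and model prefix are placeholders, and coverage-based pruning to 76K tokens happens afterwards, as described in Section 2.

```python
# Illustrative SentencePiece Unigram training call for step 2; input/prefix
# names are placeholders. Pruning to 99% token coverage is a separate step.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="normalized_ar_corpus.txt",   # output of step 1 (normalization)
    model_prefix="aratoken_unigram",
    model_type="unigram",
    vocab_size=150_000,
)
```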
