Persian-Phi: Compact LLM for Persian

Updated 15 December 2025
  • The paper introduces Persian-Phi, a compact 3.8B parameter transformer adapted for Persian using a structured curriculum learning pipeline and parameter-efficient fine-tuning.
  • It introduces a Persian-specific BPE tokenizer and applies low-rank adaptation to align token embeddings, significantly reducing input sequence lengths and computational cost.
  • The approach demonstrates effective cross-lingual adaptation with modest hardware, offering a scalable framework for developing high-quality LLMs for low-resource languages.

Persian-Phi is a compact 3.8-billion-parameter transformer-based LLM adapted from Microsoft Phi-3 Mini for the Persian language. Developed by extending a monolingual English architecture, Persian-Phi introduces a novel curriculum learning pipeline coupled with parameter-efficient fine-tuning to address the challenge of building high-quality LLMs for low-resource languages with modest compute. The approach demonstrates that robust cross-lingual adaptation is achievable without reliance on massive multilingual baselines or prohibitive hardware resources (Akhlaghi et al., 8 Dec 2025).

1. Base Architecture and Persian-Specific Modifications

Microsoft Phi-3 Mini, a 3.8B-parameter transformer employing LongRoPE attention and a 4K context window, serves as the initialization point for Persian-Phi. Its original LLaMA-2 tokenizer, limited to English and only a handful of Persian characters, cannot represent Persian text efficiently. Persian-Phi addresses this by training a new Byte Pair Encoding (BPE) tokenizer (vocabulary size 5,000) on normalized Persian Wikipedia data. This process yielded 4,921 newly introduced tokens, which were merged with the base vocabulary, effectively halving average input sequence lengths for Persian content (Table 2 in the paper). The model’s embedding matrix and output head were resized to accommodate the extended token set, with the new parameters randomly initialized pending later alignment.
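
A minimal sketch of this tokenizer-extension and embedding-resize step, assuming a Hugging Face tokenizers/transformers workflow; the corpus path is illustrative, and appending the new tokens via `add_tokens` is a simplification of the BPE vocabulary merge described in the paper:

```python
from hazm import Normalizer                       # Persian text normalization
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import AutoModelForCausalLM, AutoTokenizer

# 1) Normalize raw Persian Wikipedia text (file path is illustrative).
normalizer = Normalizer()
with open("fa_wiki.txt", encoding="utf-8") as f:
    corpus = [normalizer.normalize(line) for line in f]

# 2) Train a small Persian-only BPE tokenizer (target vocabulary ~5,000).
fa_bpe = Tokenizer(models.BPE(unk_token="<unk>"))
fa_bpe.pre_tokenizer = pre_tokenizers.Whitespace()
fa_bpe.train_from_iterator(corpus, trainers.BpeTrainer(vocab_size=5000, special_tokens=["<unk>"]))

# 3) Keep only tokens absent from the Phi-3 Mini vocabulary (~4,921 in the paper)
#    and append them to the base tokenizer.
base_tok = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
base_vocab = set(base_tok.get_vocab())
new_tokens = [t for t in fa_bpe.get_vocab() if t not in base_vocab]
num_added = base_tok.add_tokens(new_tokens)

# 4) Resize the embedding matrix and LM head; the new rows are randomly
#    initialized and aligned later during the warm-up stage.
model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-3-mini-4k-instruct")
model.resize_token_embeddings(len(base_tok))
print(f"added {num_added} Persian tokens; vocabulary size is now {len(base_tok)}")
```

The reported halving of Persian sequence lengths comes from these added tokens covering frequent Persian subwords that the base tokenizer would otherwise break into much smaller pieces.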

2. Curriculum Learning Pipeline

The adaptation follows a structured four-stage pipeline:

  1. Tokenizer Preparation: A Persian-only BPE tokenizer is built (with Hazm and DadmaTools used for preprocessing) and merged with the base tokenizer, adding 4,921 Persian-specific tokens.
  2. New Parameters Warm-up: Persian embeddings are aligned with their English counterparts via next-token prediction on a parallel bilingual corpus of synthetically generated “Tiny Stories” (≈500M tokens, English-to-Persian translations via Google Translate). Training uses Low-Rank Adaptation (LoRA) with rank 4 (≈6M parameters) together with full fine-tuning of the new embeddings and output head. Bilingual validation perplexity drops to 2.45, and embedding alignment is confirmed by diagonal dominance in the cross-lingual cosine similarity matrix:

$$\text{Sim}(e_{\mathrm{en}}, e_{\mathrm{fa}}) = \frac{e_{\mathrm{en}} \cdot e_{\mathrm{fa}}}{\|e_{\mathrm{en}}\|\,\|e_{\mathrm{fa}}\|}$$

  3. Continual Pre-training: The model is further trained on curated monolingual Persian data, namely the Targoman Large Persian Corpus (TLPC, 2.1M chunks), Persian Wikipedia (182K chunks, included twice in the mix), and the translated Tiny Stories (32K chunks), totaling ≈4.74B tokens. The pipeline incorporates rigorous filtering (FastText language ID, YJC news), normalization, and deduplication (MinHash). Training employs PEFT via LoRA (rank 64, α=32) over attention and feed-forward layers (≈200M parameters) plus full fine-tuning of the new token embeddings and output head (≈120M parameters), so roughly 8% of model weights are trainable in total, using an 8-bit AdamW optimizer and bfloat16+TF32 precision on 2 RTX 3090 GPUs at ≈5,000 tok/s throughput over 12 days.
  4. Supervised Fine-Tuning (SFT/Instruct Tuning): To restore and enhance instruction-following capabilities, the model is tuned on 143K Persian and English instruction pairs (Bactrian-X, Aya, TED2020), with PEFT via LoRA (rank 32) on attention, feed-forward, and new embedding parameters. The SFT loss is computed only over “assistant” tokens (a minimal masking sketch follows this list):

$$\mathcal{L}_{\mathrm{SFT}} = -\sum_{t \in \mathcal{A}} \log P_{\theta}(a_t \mid c_{<t})$$

where $\mathcal{A}$ denotes the set of assistant token positions.
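
A minimal sketch of this assistant-only masking, assuming the common Hugging Face/PyTorch convention of setting ignored label positions to -100 so the cross-entropy loss skips them; the token IDs and role boundaries below are illustrative:

```python
import torch

IGNORE_INDEX = -100  # PyTorch cross-entropy ignores positions with this label


def mask_non_assistant_labels(input_ids: torch.Tensor, assistant_mask: torch.Tensor) -> torch.Tensor:
    """Keep labels only at assistant-token positions, so the loss reduces to
    the sum over t in A shown above."""
    labels = input_ids.clone()
    labels[~assistant_mask] = IGNORE_INDEX
    return labels


# Toy example: 5 system/user prompt tokens followed by 4 assistant tokens.
input_ids = torch.tensor([[101, 7, 8, 9, 102, 501, 502, 503, 504]])
assistant_mask = torch.tensor([[False] * 5 + [True] * 4])
labels = mask_non_assistant_labels(input_ids, assistant_mask)

# The standard causal-LM loss then implements L_SFT:
#   loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1),
#                          ignore_index=IGNORE_INDEX)
```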

3. Training Objectives and Optimization

Persian-Phi employs standard next-token cross-entropy loss for language modeling during both warm-up and pre-training stages:

$$\mathcal{L}_{\mathrm{LM}} = -\frac{1}{N}\sum_{i=1}^{N} \sum_{t=1}^{T_i} \log P_{\theta}(w_{i,t} \mid w_{i,<t})$$

Instruction-following is optimized via assistant-token masking within the SFT loss. Embedding alignment, key to cross-lingual transfer, is validated by high cosine similarity between Persian and English token embeddings after warm-up. Optimization uses 8-bit AdamW ($\mathrm{lr}=1\mathrm{e}{-4}$, $\mathrm{wd}=1\mathrm{e}{-4}$, $\beta_1=0.9$, $\beta_2=0.95$) with a cosine decay schedule and an initial 250-step warm-up.
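
A sketch of this optimizer and schedule configuration, assuming bitsandbytes' 8-bit AdamW and the cosine schedule helper from transformers; the total step count is an illustrative placeholder, and `model` refers to the resized model from the tokenizer sketch above:

```python
import bitsandbytes as bnb
from transformers import get_cosine_schedule_with_warmup

# 8-bit AdamW with the hyperparameters reported above.
optimizer = bnb.optim.AdamW8bit(
    model.parameters(),      # `model` from the earlier tokenizer/embedding sketch
    lr=1e-4,
    weight_decay=1e-4,
    betas=(0.9, 0.95),
)

# Cosine decay with an initial 250-step warm-up; the total number of
# training steps is illustrative.
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=250,
    num_training_steps=50_000,
)

# Inside the training loop: optimizer.step(); scheduler.step(); optimizer.zero_grad()
```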

4. Benchmarking and Performance

The table below compares Persian-Phi against other open-source Persian LLMs on the Open Persian LLM Leaderboard (Table 5 in the paper):

| Model | #Params (B) | Part MC | ARC Easy | ARC Challenge | MMLU-Pro | AUT MC |
|---|---|---|---|---|---|---|
| PartAI Dorna2-8B | 8.03 | 35.52 | 75.28 | 53.52 | 24.10 | 53.45 |
| Meta-LLaMA3.1-8B | 8.03 | 36.68 | 78.40 | 60.40 | 21.00 | 54.24 |
| Gemma-2-2B-it | 2.61 | 31.12 | 71.26 | 57.72 | 16.23 | 49.90 |
| Persian-Phi | 3.85 | 30.56 | 64.65 | 51.00 | 17.18 | 43.98 |
| PersianMind-v1.0 | 6.82 | 29.27 | 58.91 | 48.32 | 15.51 | 45.36 |
| Maral-7B-alpha-1 | 7.24 | 26.67 | 44.54 | 32.88 | 15.99 | 36.09 |
| Phi-3-mini-4k-instruct | 3.82 | 27.37 | 36.78 | 36.78 | 17.89 | 35.10 |

Persian-Phi attains approximately 80% of the aggregate score of the larger (8B) Dorna2, outperforming similarly sized or larger fine-tuned models such as PersianMind-v1.0 and Maral-7B-alpha-1. This attests to the effectiveness of the curriculum learning and PEFT strategy given the model's compactness.

5. Computational Efficiency and Deployment Implications

The full adaptation process ran on two consumer-grade RTX 3090 GPUs over roughly 12 days, with PEFT restricting parameter updates to only about 8% of the model's weights and thereby yielding substantial savings in memory and compute compared to full-model fine-tuning. Training data requirements are modest relative to conventional LLM pre-training: 500M tokens (“Tiny Stories”) for warm-up, 4.74B tokens (TLPC, Wikipedia, Tiny Stories) for continual pre-training, and 143K instruction pairs for SFT.
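
A sketch of the PEFT setup implied by these figures, assuming the Hugging Face peft library; the target module names follow Phi-3's layer naming and, together with the exact module set, are assumptions rather than details confirmed by the paper:

```python
from peft import LoraConfig, get_peft_model

# LoRA adapters over attention and feed-forward projections (rank 64, alpha 32,
# as in continual pre-training), with the resized embeddings and LM head
# trained in full via modules_to_save.
lora_config = LoraConfig(
    r=64,
    lora_alpha=32,
    target_modules=["qkv_proj", "o_proj", "gate_up_proj", "down_proj"],  # Phi-3 naming (assumed)
    modules_to_save=["embed_tokens", "lm_head"],  # full updates for the new-token parameters
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(model, lora_config)   # `model` from the earlier tokenizer sketch
peft_model.print_trainable_parameters()           # should land near the ~8% figure cited above
```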

A plausible implication is that the pipeline enables resource-constrained researchers to extend compact monolingual English LLMs to other low-resource languages by replicating the structured curriculum and PEFT methodology. The process stabilizes cross-lingual adaptation and mitigates catastrophic forgetting, offering a scalable framework for democratizing LLM development beyond widely represented languages.

6. Generalization Potential and Significance

Persian-Phi demonstrates that curriculum learning—sequencing warm-up, continual pre-training, and supervised fine-tuning—enables effective cross-lingual adaptation to underrepresented languages. The proposed pipeline is validated by competitive leaderboard results despite a smaller parameter count and reduced hardware requirements. This suggests that similar strategies are viable for broadening the reach of high-quality LLMs without relying on large multilingual base models, contributing to inclusive AI research and lowering barriers for language technology in resource-scarce contexts (Akhlaghi et al., 8 Dec 2025).
