Typhoon-S: Sovereign LLM Post-Training
- Typhoon-S is a minimal, open post-training pipeline that transforms base LLMs into efficient general-purpose and domain-specific assistants.
- It uses a three-stage process—supervised fine-tuning, on-policy distillation, and reinforcement fine-tuning—to ensure efficiency and transparency.
- The approach achieves competitive performance on multilingual benchmarks, excelling in low-resource, sovereign applications like Thai legal reasoning.
Typhoon-S is a minimal, open post-training recipe for transforming base LLMs into both general-purpose assistants and specialists for high-stakes, region- or domain-specific applications under academic-scale compute and data constraints. It is explicitly designed to address the dual challenges of sovereign settings: the need to retain local control and transparency over model weights, training data, and deployment; and the requirement to operate on limited computing resources with fully documented recipes. This stands in stark contrast to frontier LLM pipelines, which rely on massive instruction corpora, multi-stage reinforcement learning from human feedback, and industrial-scale GPU clusters (Pipatanakul et al., 26 Jan 2026).
1. Motivation and Core Definitions
The frontier of LLM development is characterized by centralized resource gatekeeping: most models are developed by a small number of organizations with privileged access to compute and data, predominantly in high-resource languages such as English and Chinese. Sovereign LLM efforts—undertaken by regional or national institutions—encounter two critical barriers:
- Compute/Data Constraints: Academic and public-sector entities generally lack access to large GPU clusters (e.g., 1,000+ H100s) and cannot amass instruction datasets on the order of hundreds of millions of examples.
- Transparency Requirements: Sovereign or public deployments demand open, auditable pipelines, including full visibility into all data and model updates.
Typhoon-S articulates two complementary objectives for post-training under these constraints:
- Adoptability: The rapid and efficient conversion of a base model (open-weight or sovereign-adapted) into a general-purpose instruction-following assistant capable of chat, math, code, tool use, and robust multilingual handling, using a data-efficient regimen and modest compute (≤8 GPUs, a few hundred thousand examples, ≈2 days).
- Sovereign Capability: The additional specialization of the above assistant to solve domain- or region-specific tasks (e.g., Thai legal reasoning) through the targeted injection of in-domain knowledge and agentic abilities, using small-scale reinforcement learning (RL) and auxiliary learning stages.
2. Typhoon-S Post-Training Workflow
Typhoon-S comprises a three-stage post-training pipeline, each designed to maximize efficiency and transparency:
A. Supervised Fine-Tuning (SFT)
This initial stage starts from a base model and minimizes the standard cross-entropy loss over a curated instruction–response corpus $\mathcal{D}_{\text{SFT}}$:

$$\mathcal{L}_{\text{SFT}}(\theta) = -\,\mathbb{E}_{(x,y)\sim\mathcal{D}_{\text{SFT}}}\left[\sum_{t=1}^{|y|} \log \pi_\theta\big(y_t \mid x, y_{<t}\big)\right]$$
The SFT set (≈340 K examples) balances:
- 200 K general-purpose instructions (Tulu 3)
- 100 K tool-use/agentic examples (Toucan Tool)
- 40 K Thai AutoIF-generated instruction–response pairs
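The SFT objective above can be sketched as a toy token-level negative log-likelihood (plain Python, no framework; the per-step logit vectors and tiny vocabulary are illustrative):

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a vocabulary of logits."""
    m = max(logits)
    z = math.log(sum(math.exp(v - m) for v in logits))
    return [v - m - z for v in logits]

def sft_loss(logits_per_step, target_ids):
    """Mean negative log-likelihood of the gold tokens y_t under the
    model's per-step distributions pi_theta(. | x, y_<t) -- the SFT
    cross-entropy from the recipe."""
    nll = 0.0
    for step_logits, y_t in zip(logits_per_step, target_ids):
        nll -= log_softmax(step_logits)[y_t]
    return nll / len(target_ids)

# A model that is uniform over a 4-token vocab pays log(4) nats per token.
loss = sft_loss([[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]], [1, 3])
```

In a real pipeline the per-step logits come from a forward pass of the base model and the loss is averaged over the batch; the structure of the objective is the same.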
B. On-Policy Distillation (OPD)
To bridge the train–inference distribution gap of conventional (offline) distillation, Typhoon-S uses Generalized Knowledge Distillation (GKD) on student model rollouts. At each step, with mixing probability $\lambda$:
- Either sample a response from the fixed SFT corpus (offline)
- Or generate a rollout $y \sim \pi_\theta(\cdot \mid x)$ from the student, then query the teacher $\pi_T$ for logits on that trajectory (on-policy)

The student then minimizes the per-token divergence from the teacher along the chosen trajectory:

$$\mathcal{L}_{\text{OPD}}(\theta) = \mathbb{E}_{y}\left[\frac{1}{|y|}\sum_{t=1}^{|y|} D_{\mathrm{KL}}\big(\pi_T(\cdot \mid x, y_{<t}) \,\big\|\, \pi_\theta(\cdot \mid x, y_{<t})\big)\right]$$
Full-logits distillation is preferred for robustness on code-switching and complex tasks. A top-k token approach is possible at lower compute cost but is less robust, particularly for language mixing.
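A minimal sketch of one GKD step follows, assuming forward KL as the divergence (the exact divergence and its weighting may differ in the paper); `student_generate` and the logit callables are stand-ins for model forward passes:

```python
import math
import random

def log_softmax(logits):
    """Numerically stable log-softmax over a vocabulary of logits."""
    m = max(logits)
    z = math.log(sum(math.exp(v - m) for v in logits))
    return [v - m - z for v in logits]

def forward_kl(teacher_logits, student_logits):
    """KL(teacher || student) over the full vocabulary at one step
    (the full-logits variant described above)."""
    t_log = log_softmax(teacher_logits)
    s_log = log_softmax(student_logits)
    return sum(math.exp(t) * (t - s) for t, s in zip(t_log, s_log))

def gkd_step(prompt, offline_response, student_generate,
             teacher_logits, student_logits, lam=0.25, rng=random):
    """One GKD loss evaluation: with probability lam distill on a fresh
    student rollout (on-policy), otherwise on the fixed SFT response."""
    y = student_generate(prompt) if rng.random() < lam else offline_response
    per_step = [forward_kl(teacher_logits(prompt, y[:t]),
                           student_logits(prompt, y[:t]))
                for t in range(len(y))]
    return sum(per_step) / len(per_step)
```

When teacher and student agree exactly, the loss is zero; any disagreement on the full vocabulary contributes positive KL, which is what makes the full-logits variant sensitive to subtle distribution shifts like language mixing.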
C. Small-scale Reinforcement Fine-Tuning (RFT) with InK-GRPO
To inject factual and procedural knowledge not present in pretraining, Typhoon-S invokes compact RL fine-tuning using the GRPO (Group Relative Policy Optimization, a PPO variant) objective. This is augmented with an auxiliary cross-entropy term on an in-domain corpus $\mathcal{D}_{\text{domain}}$ (e.g., legal statutes):

$$\mathcal{L}_{\text{InK-GRPO}}(\theta) = \mathcal{L}_{\text{GRPO}}(\theta) + \lambda\,\mathcal{L}_{\text{CE}}\big(\theta;\mathcal{D}_{\text{domain}}\big)$$

Here $\rho$ controls how often in-domain cross-entropy batches are injected ($\rho = 0.6$) and $\lambda$ weights the auxiliary term ($\lambda = 0.1$). In agentic settings, the model interacts with “search” and “read” tools over a FAISS-indexed corpus, optimizing end-to-end for final-answer accuracy.
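The two ingredients can be sketched in isolation: GRPO's group-relative advantage (its defining trick, replacing a learned value function), and an assumed composition in which an in-domain batch is mixed in at rate ρ with weight λ, matching the recipe's hyperparameter table:

```python
import random
import statistics

def group_relative_advantages(rewards):
    """GRPO's core step: normalize each rollout's reward against its
    group's mean and standard deviation, avoiding a learned critic."""
    mu = statistics.fmean(rewards)
    sd = statistics.pstdev(rewards)
    return [(r - mu) / (sd if sd > 0 else 1.0) for r in rewards]

def ink_grpo_loss(grpo_policy_loss, ce_loss_fn, in_domain_batch,
                  rho=0.6, lam=0.1, rng=random):
    """Assumed InK-GRPO composition: with probability rho, an in-domain
    batch (e.g. statutes) adds a cross-entropy term weighted by lam."""
    loss = grpo_policy_loss
    if rng.random() < rho:
        loss += lam * ce_loss_fn(in_domain_batch)
    return loss
```

With a group of rollout rewards `[1, 0, 1, 0]`, correct answers get advantage +1 and incorrect ones −1, so the policy gradient pushes toward the verified final answers without any value model.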
3. Thai Language Case Study
The Thai case exemplifies Typhoon-S in a live, low-resource, high-stakes context, leveraging both multilingual and domain-specific alignment:
Datasets
- English: Tulu 3 (200 K), Toucan Tool (100 K)
- Thai Instructions: Sourced from translated WildChat, WangchanThaiInstruct, Han, Typhoon Instruct
- Response Generation: Thai AutoIF, using large-scale teacher LLMs with code-verifiable criteria
- Augmentation: Constraint translation (EN ↔ TH), variation in prompt placement
- Domain Supervision: NitiBench-CCL and MIRAGE-Bench (Thai legal, used for RL and CE data)
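The “code-verifiable criteria” behind AutoIF-style response generation can be illustrated with a toy verifier; the constraint names below are hypothetical stand-ins, not the paper's actual criteria:

```python
def verify_response(response, constraints):
    """Toy AutoIF-style filter: keep a generated response only if every
    constraint passes a programmatic check. The constraint vocabulary
    here is illustrative, not from the Typhoon-S paper."""
    checks = {
        "max_words": lambda r, n: len(r.split()) <= n,
        "must_contain": lambda r, s: s in r,
        "num_bullets": lambda r, n: sum(
            line.lstrip().startswith("- ") for line in r.splitlines()) == n,
    }
    return all(checks[name](response, arg) for name, arg in constraints)
```

Because each criterion is checked by code rather than by a judge model, accepted instruction–response pairs are verifiable by construction, which is what makes the approach viable for a low-resource language like Thai.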
Compute Footprint & Model Sizes
- Adoptability (8B): 8× H100, ≈2 days (SFT+OPD)
- Sovereign agent (4B): 4× H100, ≈1 day (agentic InK-GRPO)
| Stage | Data Size | Compute | Core Hyperparameters |
|---|---|---|---|
| SFT | 340K | 8×H100, 2 days | AdamW, lr=2e-5, batch=32 |
| OPD | 160K | 8×H100, fused w/ SFT | lr=1e-6, λ=0.25, 1 epoch |
| InK-GRPO (RFT) | 160K | 4×H100, 1 day | lr=1e-6, ρ=0.6, λ=0.1 |
4. Evaluation Protocols and Empirical Outcomes
Typhoon-S employs a full sweep of multilingual, agentic, and sovereign task benchmarks:
General Capabilities (EN+TH)
- MT-Bench EN/TH: LLM-as-judge, helpfulness/correctness
- IFEval EN/TH: Instruction following with programmatically verifiable constraints
- Code-switching robustness: Realistic TH–EN mixing
- Knowledge & Reasoning: GPQA (EN), MMLU Pro X (TH), OpenThaiEval (TH-native)
- Math: MATH500 EN/TH
- Code reasoning: LiveCodeBench
- Tool/agentic: BFCL (tool use), HotpotQA EN/TH (RAG)
Sovereign Benchmarks
- NitiBench (Thai legal QA accuracy)
- MIRAGE-Bench (TH legal domain)
Key Results
- SFT alone results in performance deficits and brittleness (avg 37.45, code-switching 65.4, agentic 0) compared to strong baselines (avg 48.07, code-switching 96.2).
- Addition of OPD yields significant improvements (avg +6.5 pts to 43.94; code-switching to 93.4), recovers robust agentic behavior, and maintains base knowledge.
- Full-logits OPD dramatically outperforms top-k distillation (code-switching: 93.4 vs 69.8).
- Thai-specific data is vital: removing it results in ≈4 pt drop in SFT and impacts sovereign task performance after OPD.
- When applied to a sovereign-adapted base (ThaiLLM-8B), Typhoon-S (SFT+OPD) achieves superior performance (Thai avg 71.20, Qwen3-8B 66.66) and competitive overall scores (49.99 vs 54.02).
- InK-GRPO improves sovereign task accuracy on NitiBench (19.30% vs 15.82%) and MIRAGE (22.63% vs 20.99%), as well as agentic settings (NitiBench 78.02% vs GRPO 73.73% and GPT-5+Agent 75.34%).
- General-purpose performance remains stable across RFT variants (avg ≈48–49 pts), indicating no catastrophic forgetting.
5. Trade-offs, Limitations, and Implications
Typhoon-S attains competitive region-specific and general performance using a fraction of the data and compute of mainstream LLM regimes:
- Data/Compute Efficiency: 340K SFT + 160K OPD instructions (vs millions commonly used); two days on 8 GPUs (adoptability), one day on 4 GPUs (sovereign agent).
- Performance: Comparable or superior to state-of-the-art open-weight baselines on Thai-centric and general benchmarks.
- Transparency: All stages are open and auditable at the token and gradient level, supporting stringent sovereign oversight.
Limitations
- No exploration of pre-training or mid-training under resource constraints.
- Fixed hyperparameters (e.g., the distillation mix $\lambda$ and injection rate $\rho$)—extensive ablations and tuning deferred.
- The Thai focus reflects available data and expertise; extension to additional low-resource environments remains an open direction.
- Long-term effects of repeated InK-GRPO (e.g., knowledge saturation, model drift) require further investigation.
A plausible implication is that Typhoon-S enables credible sovereign alternatives to closed LLMs without the need for massive training budgets, provided strong local data curation and staged post-training are feasible.
6. Prospects for Sovereign LLM Workflows
Typhoon-S provides a reproducible and minimal blueprint for regions or domains seeking to democratize advanced LLM capabilities under real-world resource limits. By decomposing LLM post-training into SFT, on-policy distillation, and compact agentic RFT/knowledge injection, it sidesteps reliance on proprietary data, closed tools, and large-scale infrastructure. Its demonstrated stability and efficiency suggest a practical path for the broader adoption of sovereign LLMs in diverse settings—pending future validation beyond the initial Thai case and further methodological refinements (Pipatanakul et al., 26 Jan 2026).