Typhoon-S: Sovereign LLM Post-Training

Updated 2 February 2026
  • Typhoon-S is a minimal, open post-training pipeline that transforms base LLMs into efficient general-purpose and domain-specific assistants.
  • It uses a three-stage process—supervised fine-tuning, on-policy distillation, and reinforcement fine-tuning—to ensure efficiency and transparency.
  • The approach achieves competitive performance on multilingual benchmarks, excelling in low-resource, sovereign applications like Thai legal reasoning.

Typhoon-S is a minimal, open post-training recipe for transforming LLMs into both general-purpose assistants and specialists for high-stakes, region- or domain-specific applications under academic-scale compute and data constraints. It is explicitly designed to address the dual challenges faced in sovereign settings: the need to retain local control and transparency over model weights, training data, and deployment; and the requirement to operate under limited computing resources and transparent recipes, in stark contrast to frontier LLM pipelines reliant on massive instruction corpora, multi-stage reinforcement learning with human feedback, and industrial-scale GPU clusters (Pipatanakul et al., 26 Jan 2026).

1. Motivation and Core Definitions

The frontier of LLM development is characterized by centralized resource gatekeeping: most models are developed by a small number of organizations with privileged access to compute and data, predominantly in high-resource languages such as English and Chinese. Sovereign LLM efforts—undertaken by regional or national institutions—encounter two critical barriers:

  • Compute/Data Constraints: Academic and public-sector entities generally lack access to large GPU clusters (e.g., 1,000+ H100s) and cannot amass instruction datasets on the order of hundreds of millions of examples.
  • Transparency Requirements: Sovereign or public deployments demand open, auditable pipelines, including full visibility into all data and model updates.

Typhoon-S articulates two complementary objectives for post-training under these constraints:

  • Adoptability: The rapid and efficient conversion of a base model (open-weight or sovereign-adapted) into a general-purpose instruction-following assistant capable of chat, math, code, tool use, and robust multilingual handling, using a data-efficient regimen and modest compute (≤8 GPUs, a few hundred thousand examples, ≈2 days).
  • Sovereign Capability: The additional specialization of the above assistant to solve domain- or region-specific tasks (e.g., Thai legal reasoning) through the targeted injection of in-domain knowledge and agentic abilities, using small-scale reinforcement learning (RL) and auxiliary learning stages.

2. Typhoon-S Post-Training Workflow

Typhoon-S comprises a three-stage post-training pipeline, each designed to maximize efficiency and transparency:

A. Supervised Fine-Tuning (SFT)

This initial stage starts from a base model $p_\theta$ and minimizes the standard cross-entropy loss over a curated instruction–response corpus $\mathcal{D}_{\text{SFT}}$:

$$\mathcal{L}_{\text{SFT}} = -\,\mathbb{E}_{(x,y) \sim \mathcal{D}_{\text{SFT}}} \sum_{t=1}^{T} \log p_\theta(y_t \mid x, y_{<t})$$

The SFT set (≈340 K examples) balances:

  • 200 K general-purpose instructions (Tulu 3)
  • 100 K tool-use/agentic examples (Toucan Tool)
  • 40 K Thai AutoIF-generated instruction–response pairs
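The SFT objective can be illustrated numerically. The following is a minimal numpy sketch of the token-level cross-entropy (a toy stand-in for the framework-level loss used in actual training; the logits, vocabulary size, and helper name are illustrative, not from the paper):

```python
import numpy as np

def sft_loss(logits, target_ids):
    """Token-level cross-entropy: -sum_t log p_theta(y_t | x, y_<t).

    logits:     (T, V) next-token logits over the response positions.
    target_ids: (T,) ground-truth token ids y_1..y_T.
    """
    # numerically stable log-softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # gather log p(y_t | ...) at each position and negate the sum
    return -log_probs[np.arange(len(target_ids)), target_ids].sum()

# toy example: 3 response tokens over a 5-token vocabulary
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 5))
targets = np.array([1, 4, 2])
loss = sft_loss(logits, targets)
```

With uniform logits the loss reduces to $T \log V$, a useful sanity check when wiring up a training loop.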

B. On-Policy Distillation (OPD)

To bridge the train–inference distribution gap of conventional (offline) distillation, Typhoon-S uses Generalized Knowledge Distillation (GKD) on student-model rollouts. At each step, trajectories are drawn from one of two sources, mixed with ratio $\lambda = 0.25$:

  • Either sample $(x, y)$ from $\mathcal{D}_{\text{SFT}}$ (offline)
  • Or generate $y \sim p_S(\cdot \mid x)$ from the student, then query the teacher $p_T$ for logits on that trajectory (on-policy)

The student then minimizes:

$$\mathcal{L}_{\text{KD}} = \mathbb{E}\left[\sum_{t=1}^{T} D_{\text{KL}}\bigl(p_T(\cdot \mid x, y_{<t}) \,\|\, p_S(\cdot \mid x, y_{<t})\bigr)\right]$$

Full-logits distillation is preferred for robustness on code-switching and complex tasks. A top-$K$ token approach is possible for lower compute but is less robust, particularly for language mixing.
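The two ingredients above — the per-token forward KL and the $\lambda$-gated mix of offline and on-policy trajectories — can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation; which branch the $\lambda$ coin selects follows the common GKD convention (on-policy with probability $\lambda$) and is an assumption here, as are all function names:

```python
import numpy as np

def log_softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def gkd_kl_loss(teacher_logits, student_logits):
    """Sequence loss: sum_t D_KL(p_T(.|x, y_<t) || p_S(.|x, y_<t)).

    Both inputs are (T, V) full-vocabulary logits along one trajectory.
    """
    log_p_t = log_softmax(teacher_logits)
    log_p_s = log_softmax(student_logits)
    p_t = np.exp(log_p_t)
    return (p_t * (log_p_t - log_p_s)).sum()

def sample_trajectory(rng, lam, offline_pairs, student_rollout):
    """Mix offline D_SFT pairs with on-policy student rollouts.

    With probability lam the student generates the trajectory itself
    (on-policy); otherwise an offline (x, y) pair is reused.
    """
    if rng.random() < lam:
        return student_rollout()  # y ~ p_S(. | x), teacher scored afterwards
    return offline_pairs[rng.integers(len(offline_pairs))]
```

The full-logits variant corresponds to computing the KL over the entire vocabulary axis, as above; a top-$K$ variant would truncate `teacher_logits` to the teacher's $K$ largest entries before the KL, trading compute for the robustness loss noted above.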

C. Small-scale Reinforcement Fine-Tuning (RFT) with InK-GRPO

To inject factual and procedural knowledge not present in pretraining, Typhoon-S invokes compact RL fine-tuning, using the GRPO (Group Relative Policy Optimization, a PPO variant) objective. This is augmented with an auxiliary cross-entropy term on an in-domain corpus (e.g., legal statutes):

$$\mathcal{L} = \mathcal{L}_{\text{GRPO}} + \lambda\, b\, \mathcal{L}_{\text{CE}}$$

where $b \sim \mathrm{Bernoulli}(\rho)$ stochastically gates the auxiliary cross-entropy term ($\rho = 0.6$, $\lambda = 0.1$). In agentic settings, the model interacts with “search” and “read” tools over a FAISS-indexed corpus, optimizing end-to-end for final-answer accuracy.
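The combined objective is straightforward to sketch: a GRPO-style group-relative advantage plus the Bernoulli-gated auxiliary term. This is a minimal numpy illustration under stated assumptions (scalar stand-ins for the two loss terms, standard mean/std group normalization for GRPO advantages); it is not the paper's training code:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each rollout's reward within its group.

    rewards: rewards for a group of rollouts sampled from the same prompt.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def ink_grpo_loss(grpo_loss, ce_loss, rng, rho=0.6, lam=0.1):
    """L = L_GRPO + lam * b * L_CE, with b ~ Bernoulli(rho).

    The gate b mixes the auxiliary in-domain cross-entropy term into a
    random fraction rho of update steps, injecting corpus knowledge
    without applying it on every step.
    """
    b = float(rng.random() < rho)
    return grpo_loss + lam * b * ce_loss
```

Because the advantages are normalized within each rollout group, prompts with uniformly high or low rewards contribute no gradient signal, which is the usual motivation for the group-relative formulation.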

3. Thai Language Case Study

The Thai case exemplifies Typhoon-S in a live, low-resource, high-stakes context, leveraging both multilingual and domain-specific alignment:

Datasets

  • English: Tulu 3 (200 K), Toucan Tool (100 K)
  • Thai Instructions: Sourced from translated WildChat, WangchanThaiInstruct, Han, Typhoon Instruct
  • Response Generation: Thai AutoIF, using large-scale teacher LLMs with code-verifiable criteria
  • Augmentation: Constraint translation (EN ↔ TH), variation in prompt placement
  • Domain Supervision: NitiBench-CCL and MIRAGE-Bench (Thai legal, used for RL and CE data)

Compute Footprint & Model Sizes

  • Adoptability (8B): 8× H100, ≈2 days (SFT+OPD)
  • Sovereign agent (4B): 4× H100, ≈1 day (agentic InK-GRPO)
| Stage | Data Size | Compute | Core Hyperparameters |
| --- | --- | --- | --- |
| SFT | 340K | 8×H100, 2 days | AdamW, lr=2e-5, batch=32 |
| OPD | 160K | 8×H100, fused w/ SFT | lr=1e-6, λ=0.25, 1 epoch |
| InK-GRPO (RFT) | 160K | 4×H100, 1 day | lr=1e-6, ρ=0.6, λ=0.1 |

4. Evaluation Protocols and Empirical Outcomes

Typhoon-S employs a full sweep of multilingual, agentic, and sovereign task benchmarks:

General Capabilities (EN+TH)

Sovereign Benchmarks

  • NitiBench (Thai legal QA accuracy)
  • MIRAGE-Bench (TH legal domain)

Key Results

  • SFT alone results in performance deficits and brittleness (avg 37.45, code-switching 65.4, agentic 0) compared to strong baselines (48.07, 96.2).
  • Addition of OPD yields significant improvements (avg +6.5 pts to 43.94; code-switching to 93.4), recovers robust agentic behavior, and maintains base knowledge.
  • Full-logits OPD dramatically outperforms top-$K$ (code-switching: 93.4 vs 69.8).
  • Thai-specific data is vital: removing it results in ≈4 pt drop in SFT and impacts sovereign task performance after OPD.
  • When applied to a sovereign-adapted base (ThaiLLM-8B), Typhoon-S (SFT+OPD) achieves superior performance (Thai avg 71.20, Qwen3-8B 66.66) and competitive overall scores (49.99 vs 54.02).
  • InK-GRPO improves sovereign task accuracy on NitiBench (19.30% vs 15.82%) and MIRAGE (22.63% vs 20.99%), as well as agentic settings (NitiBench 78.02% vs GRPO 73.73% and GPT-5+Agent 75.34%).
  • General-purpose performance remains stable across RFT variants (avg ≈48–49 pts), indicating no catastrophic forgetting.

5. Trade-offs, Limitations, and Implications

Typhoon-S attains competitive region-specific and general performance using a fraction of the data and compute of mainstream LLM regimes:

  • Data/Compute Efficiency: 340K SFT + 160K OPD instructions (vs millions commonly used); two days on 8 GPUs (adoptability), one day on 4 GPUs (sovereign agent).
  • Performance: Comparable or superior to state-of-the-art open-weight baselines on Thai-centric and general benchmarks.
  • Transparency: All stages are open and auditable at the token and gradient level, supporting stringent sovereign oversight.

Limitations

  • No exploration of pre-training or mid-training under resource constraints.
  • Fixed hyperparameters (e.g., $\rho$, $\lambda$)—extensive ablations and tuning deferred.
  • The Thai focus reflects available data and expertise; extension to additional low-resource environments remains an open direction.
  • Long-term effects of repeated InK-GRPO (e.g., knowledge saturation, model drift) require further investigation.

A plausible implication is that Typhoon-S enables credible sovereign alternatives to closed LLMs without the need for massive training budgets, provided strong local data curation and staged post-training are feasible.

6. Prospects for Sovereign LLM Workflows

Typhoon-S provides a reproducible and minimal blueprint for regions or domains seeking to democratize advanced LLM capabilities under real-world resource limits. By decomposing LLM post-training into SFT, on-policy distillation, and compact agentic RFT/knowledge injection, it sidesteps reliance on proprietary data, closed tools, and large-scale infrastructure. Its demonstrated stability and efficiency suggest a practical path for the broader adoption of sovereign LLMs in diverse settings—pending future validation beyond the initial Thai case and further methodological refinements (Pipatanakul et al., 26 Jan 2026).
