Typhoon-S: Sovereign LLM Post-Training

Updated 2 February 2026
  • Typhoon-S is a minimal, open post-training pipeline that transforms base LLMs into efficient general-purpose and domain-specific assistants.
  • It uses a three-stage process—supervised fine-tuning, on-policy distillation, and reinforcement fine-tuning—to ensure efficiency and transparency.
  • The approach achieves competitive performance on multilingual benchmarks, excelling in low-resource, sovereign applications like Thai legal reasoning.

Typhoon-S is a minimal, open post-training recipe for transforming LLMs into both general-purpose assistants and specialists for high-stakes, region- or domain-specific applications under academic-scale compute and data constraints. It is explicitly designed to address the dual challenges faced in sovereign settings: the need to retain local control and transparency over model weights, training data, and deployment; and the requirement to operate under limited computing resources and transparent recipes, in stark contrast to frontier LLM pipelines reliant on massive instruction corpora, multi-stage reinforcement learning with human feedback, and industrial-scale GPU clusters (Pipatanakul et al., 26 Jan 2026).

1. Motivation and Core Definitions

The frontier of LLM development is characterized by centralized resource gatekeeping: most models are developed by a small number of organizations with privileged access to compute and data, predominantly in high-resource languages such as English and Chinese. Sovereign LLM efforts—undertaken by regional or national institutions—encounter two critical barriers:

  • Compute/Data Constraints: Academic and public-sector entities generally lack access to large GPU clusters (e.g., 1,000+ H100s) and cannot amass instruction datasets on the order of hundreds of millions of examples.
  • Transparency Requirements: Sovereign or public deployments demand open, auditable pipelines, including full visibility into all data and model updates.

Typhoon-S articulates two complementary objectives for post-training under these constraints:

  • Adoptability: The rapid and efficient conversion of a base model (open-weight or sovereign-adapted) into a general-purpose instruction-following assistant capable of chat, math, code, tool use, and robust multilingual handling, using a data-efficient regimen and modest compute (≤8 GPUs, a few hundred thousand examples, ≈2 days).
  • Sovereign Capability: The additional specialization of the above assistant to solve domain- or region-specific tasks (e.g., Thai legal reasoning) through the targeted injection of in-domain knowledge and agentic abilities, using small-scale reinforcement learning (RL) and auxiliary learning stages.

2. Typhoon-S Post-Training Workflow

Typhoon-S comprises a three-stage post-training pipeline, each designed to maximize efficiency and transparency:

A. Supervised Fine-Tuning (SFT)

This initial stage starts from a base model $p_\theta$ and minimizes the standard cross-entropy loss over a curated instruction–response corpus $\mathcal{D}_{\text{SFT}}$:

$$\mathcal{L}_{\text{SFT}} = -\,\mathbb{E}_{(x,y) \sim \mathcal{D}_{\text{SFT}}} \sum_{t=1}^{T} \log p_\theta(y_t \mid x, y_{<t})$$

The SFT set (≈340 K examples) balances:

  • 200 K general-purpose instructions (Tulu 3)
  • 100 K tool-use/agentic examples (Toucan Tool)
  • 40 K Thai AutoIF-generated instruction–response pairs
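The SFT objective can be illustrated numerically. The following is a minimal numpy sketch of the token-level cross-entropy (a toy stand-in for the framework-level loss used in actual training; the logits, vocabulary size, and helper name are illustrative, not from the paper):

```python
import numpy as np

def sft_loss(logits, target_ids):
    """Token-level cross-entropy: -sum_t log p_theta(y_t | x, y_<t).

    logits:     (T, V) next-token logits over the response positions.
    target_ids: (T,) ground-truth token ids y_1..y_T.
    """
    # numerically stable log-softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    # gather log p(y_t | ...) at each position and negate the sum
    return -log_probs[np.arange(len(target_ids)), target_ids].sum()

# toy example: 3 response tokens over a 5-token vocabulary
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 5))
targets = np.array([1, 4, 2])
loss = sft_loss(logits, targets)
```

With uniform logits the loss reduces to $T \log V$, a useful sanity check when wiring up a training loop.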

B. On-Policy Distillation (OPD)

To bridge the train–inference distribution gap of conventional (offline) distillation, Typhoon-S uses Generalized Knowledge Distillation (GKD) on student-model rollouts. At each step, trajectories are drawn from one of two sources, mixed with ratio $\lambda = 0.25$:

  • Either sample $(x, y)$ from $\mathcal{D}_{\text{SFT}}$ (offline)
  • Or generate $y \sim p_S(\cdot \mid x)$ from the student, then query the teacher $p_T$ for logits on that trajectory (on-policy)

The student then minimizes:

$$\mathcal{L}_{\text{KD}} = \mathbb{E}\left[\sum_{t=1}^{T} D_{\text{KL}}\bigl(p_T(\cdot \mid x, y_{<t}) \,\|\, p_S(\cdot \mid x, y_{<t})\bigr)\right]$$

Full-logits distillation is preferred for robustness on code-switching and complex tasks. A top-$K$ token approach is possible for lower compute but is less robust, particularly for language mixing.
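The two ingredients above — the per-token forward KL and the $\lambda$-gated mix of offline and on-policy trajectories — can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation; which branch the $\lambda$ coin selects follows the common GKD convention (on-policy with probability $\lambda$) and is an assumption here, as are all function names:

```python
import numpy as np

def log_softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def gkd_kl_loss(teacher_logits, student_logits):
    """Sequence loss: sum_t D_KL(p_T(.|x, y_<t) || p_S(.|x, y_<t)).

    Both inputs are (T, V) full-vocabulary logits along one trajectory.
    """
    log_p_t = log_softmax(teacher_logits)
    log_p_s = log_softmax(student_logits)
    p_t = np.exp(log_p_t)
    return (p_t * (log_p_t - log_p_s)).sum()

def sample_trajectory(rng, lam, offline_pairs, student_rollout):
    """Mix offline D_SFT pairs with on-policy student rollouts.

    With probability lam the student generates the trajectory itself
    (on-policy); otherwise an offline (x, y) pair is reused.
    """
    if rng.random() < lam:
        return student_rollout()  # y ~ p_S(. | x), teacher scored afterwards
    return offline_pairs[rng.integers(len(offline_pairs))]
```

The full-logits variant corresponds to computing the KL over the entire vocabulary axis, as above; a top-$K$ variant would truncate `teacher_logits` to the teacher's $K$ largest entries before the KL, trading compute for the robustness loss noted above.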

C. Small-scale Reinforcement Fine-Tuning (RFT) with InK-GRPO

To inject factual and procedural knowledge not present in pretraining, Typhoon-S invokes compact RL fine-tuning, using the GRPO (Group Relative Policy Optimization, a PPO variant) objective. This is augmented with an auxiliary cross-entropy term on an in-domain corpus (e.g., legal statutes):

$$\mathcal{L} = \mathcal{L}_{\text{GRPO}} + \lambda\, b\, \mathcal{L}_{\text{CE}}$$

where $b \sim \mathrm{Bernoulli}(\rho)$ stochastically gates the auxiliary cross-entropy term ($\rho = 0.6$, $\lambda = 0.1$). In agentic settings, the model interacts with “search” and “read” tools over a FAISS-indexed corpus, optimizing end-to-end for final-answer accuracy.
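The combined objective is straightforward to sketch: a GRPO-style group-relative advantage plus the Bernoulli-gated auxiliary term. This is a minimal numpy illustration under stated assumptions (scalar stand-ins for the two loss terms, standard mean/std group normalization for GRPO advantages); it is not the paper's training code:

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each rollout's reward within its group.

    rewards: rewards for a group of rollouts sampled from the same prompt.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

def ink_grpo_loss(grpo_loss, ce_loss, rng, rho=0.6, lam=0.1):
    """L = L_GRPO + lam * b * L_CE, with b ~ Bernoulli(rho).

    The gate b mixes the auxiliary in-domain cross-entropy term into a
    random fraction rho of update steps, injecting corpus knowledge
    without applying it on every step.
    """
    b = float(rng.random() < rho)
    return grpo_loss + lam * b * ce_loss
```

Because the advantages are normalized within each rollout group, prompts with uniformly high or low rewards contribute no gradient signal, which is the usual motivation for the group-relative formulation.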

3. Thai Language Case Study

The Thai case exemplifies Typhoon-S in a live, low-resource, high-stakes context, leveraging both multilingual and domain-specific alignment:

Datasets

  • English: Tulu 3 (200 K), Toucan Tool (100 K)
  • Thai Instructions: Sourced from translated WildChat, WangchanThaiInstruct, Han, Typhoon Instruct
  • Response Generation: Thai AutoIF, using large-scale teacher LLMs with code-verifiable criteria
  • Augmentation: Constraint translation (EN ↔ TH), variation in prompt placement
  • Domain Supervision: NitiBench-CCL and MIRAGE-Bench (Thai legal, used for RL and CE data)

Compute Footprint & Model Sizes

  • Adoptability (8B): 8× H100, ≈2 days (SFT+OPD)
  • Sovereign agent (4B): 4× H100, ≈1 day (agentic InK-GRPO)
| Stage | Data Size | Compute | Core Hyperparameters |
| --- | --- | --- | --- |
| SFT | 340K | 8×H100, 2 days | AdamW, lr=2e-5, batch=32 |
| OPD | 160K | 8×H100, fused w/ SFT | lr=1e-6, λ=0.25, 1 epoch |
| InK-GRPO (RFT) | 160K | 4×H100, 1 day | lr=1e-6, ρ=0.6, λ=0.1 |

4. Evaluation Protocols and Empirical Outcomes

Typhoon-S employs a full sweep of multilingual, agentic, and sovereign task benchmarks:

General Capabilities (EN+TH)

Sovereign Benchmarks

  • NitiBench (Thai legal QA accuracy)
  • MIRAGE-Bench (TH legal domain)

Key Results

  • SFT alone results in performance deficits and brittleness (avg 37.45, code-switching 65.4, agentic 0) compared to strong baselines (48.07, 96.2).
  • Addition of OPD yields significant improvements (avg +6.5 pts to 43.94; code-switching to 93.4), recovers robust agentic behavior, and maintains base knowledge.
  • Full-logits OPD dramatically outperforms top-$K$ (code-switching: 93.4 vs 69.8).
  • Thai-specific data is vital: removing it results in ≈4 pt drop in SFT and impacts sovereign task performance after OPD.
  • When applied to a sovereign-adapted base (ThaiLLM-8B), Typhoon-S (SFT+OPD) achieves superior performance (Thai avg 71.20, Qwen3-8B 66.66) and competitive overall scores (49.99 vs 54.02).
  • InK-GRPO improves sovereign task accuracy on NitiBench (19.30% vs 15.82%) and MIRAGE (22.63% vs 20.99%), as well as agentic settings (NitiBench 78.02% vs GRPO 73.73% and GPT-5+Agent 75.34%).
  • General-purpose performance remains stable across RFT variants (avg ≈48–49 pts), indicating no catastrophic forgetting.

5. Trade-offs, Limitations, and Implications

Typhoon-S attains competitive region-specific and general performance using a fraction of the data and compute of mainstream LLM regimes:

  • Data/Compute Efficiency: 340K SFT + 160K OPD instructions (vs millions commonly used); two days on 8 GPUs (adoptability), one day on 4 GPUs (sovereign agent).
  • Performance: Comparable or superior to state-of-the-art open-weight baselines on Thai-centric and general benchmarks.
  • Transparency: All stages are open and auditable at the token and gradient level, supporting stringent sovereign oversight.

Limitations

  • No exploration of pre-training or mid-training under resource constraints.
  • Fixed hyperparameters (e.g., $\rho$, $\lambda$)—extensive ablations and tuning deferred.
  • The Thai focus reflects available data and expertise; extension to additional low-resource environments remains an open direction.
  • Long-term effects of repeated InK-GRPO (e.g., knowledge saturation, model drift) require further investigation.

A plausible implication is that Typhoon-S enables credible sovereign alternatives to closed LLMs without the need for massive training budgets, provided strong local data curation and staged post-training are feasible.

6. Prospects for Sovereign LLM Workflows

Typhoon-S provides a reproducible and minimal blueprint for regions or domains seeking to democratize advanced LLM capabilities under real-world resource limits. By decomposing LLM post-training into SFT, on-policy distillation, and compact agentic RFT/knowledge injection, it sidesteps reliance on proprietary data, closed tools, and large-scale infrastructure. Its demonstrated stability and efficiency suggest a practical path for the broader adoption of sovereign LLMs in diverse settings—pending future validation beyond the initial Thai case and further methodological refinements (Pipatanakul et al., 26 Jan 2026).
