
SEA-LION v3 70B: Southeast Asian LLM

Updated 13 August 2025
  • SEA-LION v3 70B is a Southeast Asian-specific large language model that leverages transformer architecture and an expanded regional vocabulary to capture linguistic nuances of 11 languages.
  • It employs multi-stage instruction fine-tuning and continued pre-training on 200 billion tokens to enhance performance in classical NLP and culturally nuanced tasks.
  • Empirical evaluations reveal robust classical NLP and cultural knowledge capabilities, while highlighting limitations in reading comprehension and generative tasks.

SEA-LION v3 70B is a Southeast Asian–specific LLM in the SEA-LION family, optimized for high-fidelity representation and processing of 11 regional languages, including Filipino, Tagalog, and Cebuano. Developed as part of a broader initiative to address disparities in multilingual LLM capabilities, SEA-LION v3 70B employs an advanced transformer-based architecture, extensive continued pre-training, and multi-stage instruction fine-tuning. While it demonstrates notable strengths in classical NLP and cultural knowledge tasks, empirical analysis reveals limitations in reading comprehension and generative capabilities for Filipino-centric benchmark tasks, indicating specific challenges and directions for further research.

1. Model Architecture and Linguistic Representation

SEA-LION v3 70B is built on a transformer-based architecture derived from leading open-source backbones, with strategic augmentation for Southeast Asian linguistic features. Its lineage follows the SEA-LION v3 family, which includes Llama-SEA-LION-v3-8B-IT and Gemma-SEA-LION-v3-9B-IT variants (Ng et al., 8 Apr 2025). The architecture integrates an expanded vocabulary tailored for diverse token distributions across 11 SEA languages, with mechanisms designed to support both non-Latin scripts and under-resourced languages. This extended vocabulary ensures robust tokenization and embedding, capturing nuanced linguistic phenomena present in languages such as Khmer, Lao, Burmese, and especially Filipino.
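
A minimal sketch of how an expanded vocabulary can be grafted onto an open-source backbone with Hugging Face transformers; the checkpoint name and token list below are illustrative placeholders, not SEA-LION's actual backbone or vocabulary.

```python
# Sketch: extending a backbone tokenizer with region-specific tokens and
# resizing the embedding matrix to match. The checkpoint name and token
# list are illustrative placeholders, not SEA-LION's actual vocabulary.
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "some-org/llama-70b-base"  # placeholder backbone checkpoint
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical subword units covering Khmer, Lao, Burmese, and Filipino text.
new_tokens = ["ក្រុង", "ຂອບໃຈ", "မြန်မာ", "kamusta", "salamat"]
added = tokenizer.add_tokens(new_tokens)

# Newly added rows of the embedding (and output) matrix start randomly
# initialized and are learned during continued pre-training.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {added} tokens; vocabulary size is now {len(tokenizer)}")
```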

Core transformer components are retained, but continued pre-training and fine-tuning phases employ regularization techniques (notably BPE-Dropout) to enhance model robustness. Parallel data streams and model merging approaches post-training further strengthen cross-lingual capabilities and mitigate catastrophic forgetting.
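
BPE-Dropout exposes the model to multiple segmentations of the same text by randomly perturbing subword merges during training. A minimal sketch using SentencePiece's sampling interface; the tokenizer model path and sampling hyperparameters are illustrative assumptions, not the project's actual settings.

```python
# Sketch: subword regularization in the spirit of BPE-Dropout via
# SentencePiece's sampling interface. The .model path is a placeholder.
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="sea_tokenizer.model")

text = "Magandang umaga sa inyong lahat"
for _ in range(3):
    # enable_sampling draws a different segmentation on each call;
    # alpha controls how strongly samples deviate from the best split.
    pieces = sp.encode(text, out_type=str, enable_sampling=True,
                       alpha=0.1, nbest_size=-1)
    print(pieces)
```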

2. Training Protocols and Data Composition

SEA-LION v3 70B is trained using a multi-stage process that emphasizes both large-scale data coverage and task-specific refinement. Its continued pre-training (CPT) regime uses 200 billion tokens with the following proportional allocation: 55% SEA languages, 25% English, and 20% code (Ng et al., 8 Apr 2025). Data sources include SEA-LION-Pile, SEA-LION-Pile-v2, CommonCrawl data filtered for SEA languages with fastText classifiers, and curated Wikipedia texts.
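
A minimal sketch of this kind of fastText-based language filtering, assuming the public lid.176.bin language-identification model; the confidence threshold and language-code set are illustrative, not the project's actual filtering configuration.

```python
# Sketch: keeping only documents identified as a target SEA language,
# in the spirit of filtering CommonCrawl with fastText classifiers.
# The threshold and language codes are illustrative choices.
import fasttext

lid = fasttext.load_model("lid.176.bin")  # public fastText language-ID model
SEA_LANGS = {"tl", "ceb", "id", "ms", "vi", "th", "km", "lo", "my", "ta"}

def keep_document(text: str, threshold: float = 0.65) -> bool:
    # fastText predicts on a single line, so strip newlines first.
    labels, probs = lid.predict(text.replace("\n", " "))
    lang = labels[0].replace("__label__", "")
    return lang in SEA_LANGS and probs[0] >= threshold

print(keep_document("Kumusta ka? Salamat sa iyong tulong."))
```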

Optimization follows the Warmup-Stable-Decay (WSD) scheduler:

$$
\text{LR}(t) = \begin{cases}
\dfrac{t}{T_{\text{warmup}}} \cdot \text{LR}_{\text{max}}, & t \leq T_{\text{warmup}} \\[6pt]
\text{LR}_{\text{max}} \times \left(\dfrac{T-t}{T-T_{\text{warmup}}}\right), & T_{\text{warmup}} < t \leq T
\end{cases}
$$

with the AdamW optimizer ($\epsilon = 1\times10^{-15}$) and distributed training across eight AWS p5.48xlarge nodes using frameworks such as Composer, LLM Foundry, and FSDP.
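
The schedule can be written directly as a function of the training step; a minimal sketch of the piecewise formula above, where the peak learning rate and step counts in the example are illustrative values, not SEA-LION's actual hyperparameters.

```python
# Sketch: learning-rate schedule matching the piecewise formula above
# (linear warmup to LR_max, then linear decay toward zero at step T).
def wsd_lr(t: int, T: int, T_warmup: int, lr_max: float) -> float:
    if t <= T_warmup:
        return (t / T_warmup) * lr_max
    return lr_max * (T - t) / (T - T_warmup)

# Example: illustrative peak LR of 1e-5 over 10,000 steps with 500 warmup steps.
for step in (100, 500, 5_000, 10_000):
    print(step, wsd_lr(step, T=10_000, T_warmup=500, lr_max=1e-5))
```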

Instruction Fine-Tuning (IFT) and Alignment stages leverage datasets including Infinity-Instruct (Foundation/Chat), OpenMath-Instruct 2, and SEA-Instruct. The two-phase IFT workflow initially focuses on English tasks (reasoning, math, code), followed by multilingual, SEA-centric instructions. Model merging (DARE TIES, Consensus TA, DELLA-Linear) and preference alignment (SimPO) further optimize cross-lingual and culturally appropriate response behavior, while memory-efficiency improvements via the Liger Kernel make training feasible on reduced compute.
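
As a concrete illustration of the preference-alignment objective, below is a minimal sketch of the SimPO loss, which compares length-normalized sequence log-probabilities of chosen and rejected responses; the beta and gamma values and the toy inputs are illustrative, not the values used for SEA-LION.

```python
# Sketch: SimPO preference loss on length-normalized sequence log-probs.
# Inputs are summed token log-probabilities and token counts for the
# chosen and rejected responses; beta and gamma are illustrative.
import torch
import torch.nn.functional as F

def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta: float = 2.0, gamma: float = 0.5):
    # Length-normalized implicit rewards (no reference model needed).
    r_chosen = beta * logp_chosen / len_chosen
    r_rejected = beta * logp_rejected / len_rejected
    # Push the chosen reward above the rejected one by at least margin gamma.
    return -F.logsigmoid(r_chosen - r_rejected - gamma).mean()

# Toy example with summed log-probabilities for a batch of two pairs.
loss = simpo_loss(torch.tensor([-42.0, -30.0]), torch.tensor([20.0, 15.0]),
                  torch.tensor([-55.0, -48.0]), torch.tensor([22.0, 18.0]))
print(loss.item())
```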

3. Benchmarked Performance: FilBench Analysis

SEA-LION v3 70B’s performance is evaluated on FilBench (Miranda et al., 5 Aug 2025), a benchmark centered on Filipino, Tagalog, and Cebuano across four principal task categories: Classical NLP (CN), Cultural Knowledge (CK), Reading Comprehension (RC), and Generation (GN). The overall score is computed by:

$$
\text{FilBench Score} = 100 \times \frac{\sum_{i \in \{\text{CN},\, \text{CK},\, \text{RC},\, \text{GN}\}} n_i S_i}{\sum_{i \in \{\text{CN},\, \text{CK},\, \text{RC},\, \text{GN}\}} n_i}
$$

where $n_i$ denotes the number of examples and $S_i$ the score for each category.
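
In code, the aggregation is an example-count-weighted average of the category scores; a minimal sketch, where the per-category example counts are placeholders since FilBench's actual category sizes are not reproduced here.

```python
# Sketch: FilBench overall score as an example-count-weighted average of
# per-category scores (fractions in [0, 1]). Category counts are placeholders.
def filbench_score(scores: dict[str, float], counts: dict[str, int]) -> float:
    total = sum(counts[c] for c in scores)
    return 100 * sum(counts[c] * scores[c] for c in scores) / total

scores = {"CN": 0.8999, "CK": 0.7678, "RC": 0.5356, "GN": 0.2395}
counts = {"CN": 1000, "CK": 1000, "RC": 1000, "GN": 1000}  # placeholder sizes
print(round(filbench_score(scores, counts), 2))
```

With equal placeholder counts this happens to reproduce the reported 61.07 overall, though the benchmark's true category sizes may weight the categories differently.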

Observed scores for SEA-LION v3 70B:

Category                     Score (%)   Standard Deviation
Classical NLP (CN)           89.99       ±0.23
Cultural Knowledge (CK)      76.78       ±3.02
Reading Comprehension (RC)   53.56       ±0.19
Generation (GN)              23.95       ±0.34
Overall                      61.07       ±0.95

This profile indicates robust performance in CN and CK, moderate capacity in RC, and substantial limitations in GN. SEA-LION v3 70B outperforms other SEA-specific models but remains below broader multilingual models—GPT-4o achieves approximately 72.73% overall in aggregated FilBench scoring.

4. Strengths, Limitations, and Error Analysis

SEA-LION v3 70B is distinguished by proficiency in Classical NLP, exemplified by its near 90% accuracy in text classification, NER, and sentiment analysis. Its strong CK score reflects capacity for recalling Filipino cultural facts and handling culturally nuanced tasks, suggestive of effective region-specific entity and value learning.

However, performance on Generation and Reading Comprehension tasks reveals several challenges:

  • Poor Instruction-Following: The model occasionally misinterprets translation directives, outputting responses in incorrect target languages or producing verbose, unrelated text.
  • Overgeneration: Outputs tend to be excessively lengthy, which correlates with reduced coherence and increased risk of deviation from input intent.
  • Hallucination: Some responses include details not present in the source, introducing spurious or irrelevant information.
  • Comprehension Gaps: Modest RC scores suggest difficulties in parsing and contextually responding to extended Filipino passages, a limitation plausibly linked to low-resource dataset complexity and insufficient fine-tuning.

Relative underperformance compared to models like GPT-4o and Llama-4 Maverick highlights the challenge of balancing high specificity for regional languages with general generative and comprehension competency.

5. Multilingual Capacity and Southeast Asian Language Representation

SEA-LION v3 models, including the 70B variant, are characterized by deep support for 11 SEA languages. Their vocabulary and pre-training data allocation (55% SEA language tokens) enable effective modeling for both high-resource (English, Chinese) and under-represented languages (Khmer, Lao, Burmese) (Ng et al., 8 Apr 2025). The targeted post-training and alignment methodology further reinforces cross-lingual and culturally sensitive capabilities.

On regionally tailored benchmarks (SEA-HELM, SEA-MTBench), SEA-LION v3 models consistently outperform open-source multilingual counterparts such as BLOOM, Llama, and Qwen for SEA language tasks. This suggests that carefully curated data mixes and culturally aligned training protocols yield superior results for low-resource languages.

6. Open-Source Licensing and Community Impact

SEA-LION v3 70B is released under an MIT license, ensuring unrestricted commercial and research access (Ng et al., 8 Apr 2025). The open methodology (detailed model cards, training documentation, and reproducibility protocols) facilitates broader adoption and adaptation across the SEA research community. This release addresses historical gaps in resource availability for Southeast Asian NLP, supporting efforts to enhance local digital ecosystems and preserve linguistic diversity.

The open-source nature fosters inclusive participation, enabling customization for local needs and cross-institution collaboration. Community-driven improvement cycles are anticipated to benefit regional NLP applications and heritage conservation.

7. Directions for Improvement and Future Research

Identified challenges in generation and comprehension prompt several recommendations (Miranda et al., 5 Aug 2025):

  • Targeted Data Augmentation: Incorporating high-quality, Filipino-centric datasets—especially those focusing on generative and comprehension tasks—can assist in overcoming current model limitations.
  • Fine-Tuning: Additional rounds of fine-tuning for translation, reading comprehension, and multilingual generation are suggested to mitigate instruction-following and overgeneration errors.
  • Prompt Engineering: Few-shot prompting with explicit examples has been shown to reduce hallucinations and improve adherence to language-specific instructions.
  • Decoding Strategy Optimization: Imposing stricter decoding constraints (e.g., max token limits, length penalties) may control output length and coherence, directly addressing observed overgeneration; a sketch follows this list.
  • Cultural and Linguistic Differentiation: Enhanced training focus on distinguishing closely related languages (e.g., Filipino vs. Cebuano) can improve task-specific model consistency.
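
For the decoding-strategy point above, a minimal sketch using Hugging Face transformers' generation parameters; the model id, prompt, and parameter values are placeholders rather than a recommended configuration.

```python
# Sketch: reining in overgeneration with decoding constraints in Hugging Face
# transformers. The model id and parameter values below are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "some-org/sea-lion-v3-70b-instruct"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, device_map="auto")

prompt = "Isalin sa Cebuano: 'Maraming salamat sa inyong tulong.'"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=64,        # hard cap on output length
    repetition_penalty=1.1,   # discourage verbose, repetitive continuations
    num_beams=4,
    length_penalty=0.8,       # with beam search, prefer shorter hypotheses
    early_stopping=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```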

A plausible implication is that comprehensive post-training and robust evaluation on culturally representative and linguistically diverse benchmarks will be essential for future SEA-specific LLM advancement.


SEA-LION v3 70B represents a critical step toward equitable natural language understanding and generation for Southeast Asian languages, providing strong benchmarks in classical NLP and cultural domains while defining new challenges in generation and comprehension for low-resource settings. Its open-source release and empirical underpinnings provide a robust foundation for ongoing academic inquiry and practical NLP innovation throughout the region.

References (2)
  • Ng et al., 8 Apr 2025.
  • Miranda et al., 5 Aug 2025.
