SEA-LION: Southeast Asian Languages in One Network (2504.05747v2)
Abstract: Recently, LLMs have dominated much of the artificial intelligence scene with their ability to process and generate natural language. However, the majority of LLM research and development remains English-centric, leaving low-resource languages such as those in the Southeast Asian (SEA) region under-represented. To address this representation gap, we introduce Llama-SEA-LION-v3-8B-IT and Gemma-SEA-LION-v3-9B-IT, two cutting-edge multilingual LLMs designed for SEA languages. The SEA-LION family of LLMs supports 11 SEA languages, namely English, Chinese, Indonesian, Vietnamese, Malay, Thai, Burmese, Lao, Filipino, Tamil, and Khmer. Our work leverages large-scale multilingual continued pre-training with a comprehensive post-training regime involving multiple stages of instruction fine-tuning, alignment, and model merging. Evaluation results on multilingual benchmarks indicate that our models achieve state-of-the-art performance among LLMs supporting SEA languages. We open-source the models to benefit the wider SEA community.
- Raymond Ng
- Thanh Ngan Nguyen
- Yuli Huang
- Ngee Chia Tai
- Wai Yi Leong
- Wei Qi Leong
- Xianbin Yong
- Jian Gang Ngui
- Yosephine Susanto
- Nicholas Cheng
- Hamsawardhini Rengarajan
- Peerat Limkonchotiwat
- Adithya Venkatadri Hulagadri
- Kok Wai Teng
- Yeo Yeow Tong
- Bryan Siow
- Wei Yi Teo
- Wayne Lau
- Choon Meng Tan
- Brandon Ong