Sailor: Open Language Models for South-East Asia (2404.03608v1)

Published 4 Apr 2024 in cs.CL and cs.AI

Abstract: We present Sailor, a family of open LLMs ranging from 0.5B to 7B parameters, tailored for South-East Asian (SEA) languages. These models are continually pre-trained from Qwen1.5, a great LLM for multilingual use cases. From Qwen1.5, Sailor models accept 200B to 400B tokens, primarily covering the languages of English, Chinese, Vietnamese, Thai, Indonesian, Malay, and Lao. The training leverages several techniques, including BPE dropout for improving the model robustness, aggressive data cleaning and deduplication, and small proxy models to optimize data mixture. Experimental results on four typical tasks indicate that Sailor models demonstrate strong performance across different benchmarks, including commonsense reasoning, question answering, reading comprehension and examination. Embracing the open-source spirit, we share our insights through this report to spark a wider interest in developing LLMs for multilingual use cases.

An Analysis of "Sailor: Open Language Models for South-East Asia"

The paper "Sailor: Open LLMs for South-East Asia" introduces the Sailor series of open LLMs, ranging from 0.5 billion to 7 billion parameters, specifically crafted for the South-East Asian (SEA) linguistic landscape. These models extend the Qwen1.5 architecture, incorporating a corpus covering English, Chinese, Vietnamese, Thai, Indonesian, Malay, and Lao languages. The focus of the research lies in developing robust multilingual models capable of enhanced performance across multiple SEA languages through continual pre-training.

The researchers tackle several challenges in multilingual model development. They highlight the "curse of multilinguality", whereby the predominance of English data in existing models leads to weaker capabilities in non-English languages. Sailor addresses this with several strategic techniques: Byte Pair Encoding (BPE) dropout for improved robustness, aggressive data cleaning and deduplication, and small proxy models used to simulate training runs and optimize the data mixture.
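
To make the BPE-dropout idea concrete, the hedged sketch below uses the HuggingFace tokenizers library on a toy corpus; the 0.1 dropout rate and the training texts are illustrative and are not the Sailor tokenizer or its actual setting.

```python
# Illustrative sketch of BPE dropout with the HuggingFace `tokenizers` library.
# The toy corpus and the 0.1 dropout rate are placeholders, not Sailor's setup.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

corpus = [
    "Sailor adalah keluarga model bahasa untuk Asia Tenggara.",
    "Mo hinh ngon ngu mo cho khu vuc Dong Nam A.",
] * 50

# dropout=0.1 means each BPE merge is randomly skipped with probability 0.1
# at encoding time, producing varied subword segmentations of the same text.
tokenizer = Tokenizer(models.BPE(dropout=0.1, unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=500, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer)

# The same sentence can now tokenize differently on each call, which
# regularizes the model against small spelling or segmentation variations.
for _ in range(3):
    print(tokenizer.encode("keluarga model bahasa").tokens)
```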

Experimental Approach

The experimental validation spans key benchmarks covering commonsense reasoning, question answering, reading comprehension, and examination-style settings. The results show that Sailor models deliver strong and consistent improvements over baseline models such as Qwen1.5, indicating their efficacy on the multilingual tasks prevalent in SEA contexts.

A notable dimension of their approach is the focus on data composition and refinement. Extensive normalization, cleaning, and deduplication were employed to ensure high-quality input data. The preprocessing pipeline accounts for language-specific nuances and removed 31.11% and 11.16% of the data during the cleaning and deduplication stages, respectively. This meticulous curation yielded the SailCraft dataset, which was instrumental in the continual pre-training of the Sailor models.
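
The released SailCraft pipeline is considerably richer than anything shown here; the self-contained sketch below only illustrates the general shape of a cleaning-then-deduplication pass, with every rule and threshold chosen arbitrarily for illustration.

```python
# Self-contained sketch of a cleaning-then-deduplication pass. The real
# SailCraft pipeline is considerably richer (language-aware rules, model-based
# filtering, near-deduplication); the rules and thresholds here are arbitrary.
import hashlib
import re
import unicodedata

def clean(doc):
    """Normalize a document and drop it if it fails simple quality heuristics."""
    doc = unicodedata.normalize("NFKC", doc)
    doc = re.sub(r"\s+", " ", doc).strip()
    if len(doc) < 200:                     # too short to be useful training text
        return None
    letters = sum(ch.isalpha() for ch in doc)
    if letters / len(doc) < 0.5:           # mostly symbols or numbers: likely boilerplate
        return None
    return doc

def dedup(docs):
    """Keep only the first occurrence of each exactly repeated (normalized) document."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.md5(doc.lower().encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

def run_pipeline(raw_docs):
    cleaned = [c for d in raw_docs if (c := clean(d)) is not None]
    unique = dedup(cleaned)
    print(f"cleaning removed {1 - len(cleaned) / len(raw_docs):.2%} of documents, "
          f"deduplication removed {1 - len(unique) / max(len(cleaned), 1):.2%} of the rest")
    return unique
```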

Analytical Insights

The insights drawn from the development process are particularly informative. BPE dropout was fundamental in enhancing model robustness, mitigating vulnerability to minor input variations, an aspect often overlooked in model development. In addition, ablation studies with smaller proxy models provided empirical evidence for the efficacy of specific strategies such as data mixture optimization and careful hyperparameter tuning.
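
The exact proxy-model procedure is not reproduced here; the schematic below only conveys the general idea of scoring candidate data mixtures with small proxy models, where train_proxy_model, evaluate_loss, and the candidate mixtures are hypothetical placeholders rather than the authors' actual implementation.

```python
# Schematic of data-mixture selection with small proxy models.
# `train_proxy_model` and `evaluate_loss` are hypothetical callables and the
# candidate mixtures are illustrative; this shows the general shape of the
# idea, not the authors' actual procedure.

def select_mixture(candidates, train_proxy_model, evaluate_loss, val_sets):
    """Train one small proxy per candidate mixture; keep the lowest mean validation loss."""
    best_mix, best_loss = None, float("inf")
    for mix in candidates:
        proxy = train_proxy_model(mix)          # e.g. a tiny model on a few billion tokens
        loss = sum(evaluate_loss(proxy, v) for v in val_sets) / len(val_sets)
        if loss < best_loss:
            best_mix, best_loss = mix, loss
    return best_mix, best_loss

# Example: a coarse sweep over how much weight SEA-language data receives
# relative to English and Chinese.
candidates = [
    {"en": 0.4, "zh": 0.2, "sea": 0.4},
    {"en": 0.3, "zh": 0.2, "sea": 0.5},
    {"en": 0.2, "zh": 0.2, "sea": 0.6},
]
```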

The research also shows that combining document-level and word-level code-switching can bolster the model's ability to handle mixed-language content, a common attribute of SEA linguistic environments. However, the authors note that word-level code-switching alone yielded only marginal benefits, underscoring the nuanced nature of these interventions.
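
As a toy, hypothetical illustration of word-level code-switching augmentation, one can randomly substitute words with bilingual-dictionary translations; the tiny English-to-Indonesian lexicon and the 15% substitution rate below are invented for illustration and do not come from the paper. Document-level code-switching, in contrast, operates at the granularity of whole documents rather than individual words.

```python
# Toy illustration of word-level code-switching augmentation.
# The English-to-Indonesian lexicon and the 15% substitution rate are
# invented for illustration; the paper's actual procedure may differ.
import random

EN_TO_ID = {  # tiny hypothetical bilingual lexicon
    "language": "bahasa",
    "family": "keluarga",
    "open": "terbuka",
    "models": "model",
}

def word_level_code_switch(text, rate=0.15, seed=0):
    """Randomly replace known words with their translation to mix languages."""
    rng = random.Random(seed)
    out = []
    for word in text.split():
        key = word.lower().strip(".,!?")
        if key in EN_TO_ID and rng.random() < rate:
            out.append(EN_TO_ID[key])
        else:
            out.append(word)
    return " ".join(out)

print(word_level_code_switch("Sailor is an open family of language models."))
```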

Implications and Future Directions

This paper underlines the importance of tailored LLMs for the increasingly digital communications ecosystem across SEA, a region marked by linguistic diversity. Practically, Sailor models offer a substantial uplift in accessibility and usability of AI-driven language technologies in this part of the world.

Looking forward, the researchers point out several compelling avenues: improving document-friendly deduplication, fostering cross-lingual instruction capabilities, and refining methodologies to cater to code-switching scenarios in language generation tasks. Additionally, increasing the linguistic coverage to incorporate more SEA languages would amplify the impact of such modeling efforts.

In conclusion, "Sailor: Open Language Models for South-East Asia" contributes significantly to the state of the art in multilingual LLM development. The meticulous attention to data quality, combined with innovative training techniques, underpins its advancements. This work represents a meaningful step toward democratizing AI capabilities globally, underscored by its commitment to open-source principles and regional linguistic inclusivity.

Authors (7)
  1. Longxu Dou
  2. Qian Liu
  3. Guangtao Zeng
  4. Jia Guo
  5. Jiahui Zhou
  6. Wei Lu
  7. Min Lin