Sailor: Open Language Models for South-East Asia (2404.03608v1)
Abstract: We present Sailor, a family of open LLMs ranging from 0.5B to 7B parameters, tailored for South-East Asian (SEA) languages. The models are continually pre-trained from Qwen1.5, a language model with strong multilingual capabilities. Starting from Qwen1.5, Sailor models are trained on a further 200B to 400B tokens, primarily covering English, Chinese, Vietnamese, Thai, Indonesian, Malay, and Lao. The training leverages several techniques, including BPE dropout for improving model robustness, aggressive data cleaning and deduplication, and small proxy models for optimizing the data mixture. Experimental results on four typical tasks show that Sailor models perform strongly across benchmarks covering commonsense reasoning, question answering, reading comprehension, and examinations. Embracing the open-source spirit, we share our insights in this report to spark wider interest in developing LLMs for multilingual use cases.
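To make the first of these techniques concrete, here is a minimal sketch of BPE dropout (Provilkov et al., 2020) using the Hugging Face `tokenizers` library. This is not the authors' training code: the toy vocabulary, merge table, and dropout rate of 0.1 are illustrative assumptions, and a recent `tokenizers` release (merges passed as tuples) is assumed.

```python
# Minimal sketch of BPE dropout with the Hugging Face `tokenizers` library.
# The vocabulary, merges, and dropout rate below are toy values for
# illustration only, not Sailor's actual tokenizer configuration.
from tokenizers import Tokenizer
from tokenizers.models import BPE

vocab = {"[UNK]": 0, "s": 1, "a": 2, "i": 3, "l": 4, "o": 5, "r": 6,
         "sa": 7, "il": 8, "or": 9, "sail": 10, "sailor": 11}
merges = [("s", "a"), ("i", "l"), ("o", "r"), ("sa", "il"), ("sail", "or")]

# dropout=0.1: at encoding time each applicable merge is skipped with
# probability 0.1, so repeated encodings of the same word can differ.
tokenizer = Tokenizer(BPE(vocab=vocab, merges=merges, dropout=0.1, unk_token="[UNK]"))

for _ in range(3):
    # Possible outputs: ['sailor'], ['sail', 'o', 'r'], ['s', 'a', 'il', 'or'], ...
    print(tokenizer.encode("sailor").tokens)
```

In a pre-training pipeline, the dropout would typically be enabled only when tokenizing training data and disabled at evaluation and inference time, so the model sees varied subword segmentations during training but deterministic ones afterwards; this segmentation noise is the source of the robustness gain referred to above.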