PolyLM: An Open Source Polyglot Large Language Model (2307.06018v1)
Abstract: Large language models (LLMs) demonstrate a remarkable ability to comprehend, reason, and generate text following natural language instructions. However, the development of LLMs has been primarily focused on high-resource languages, such as English, thereby limiting their applicability and research in other languages. Consequently, we present PolyLM, a multilingual LLM trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B. To enhance its multilingual capabilities, we 1) integrate bilingual data into the training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage of pre-training. Further, we propose a multilingual self-instruct method that automatically generates 132.7K diverse multilingual instructions for model fine-tuning. To assess the model's performance, we collect several existing multilingual tasks, covering multilingual understanding, question answering, generation, and translation. Extensive experiments show that PolyLM surpasses other open-source models such as LLaMA and BLOOM on multilingual tasks while maintaining comparable performance in English. Our models, along with the instruction data and multilingual benchmark, are available at: \url{https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation}.
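The abstract's curriculum strategy boils down to a stage-dependent data mix. Below is a minimal, purely illustrative sketch (not the authors' released code) of how such a schedule could steer sampling from a 30% to a 60% share of non-English data; the document pools, the single halfway-point stage boundary, and the function names are assumptions introduced only for illustration.

```python
# Illustrative sketch (not the authors' code): a two-stage data-mixing
# curriculum that raises the share of non-English data from 30% to 60%,
# as described in the abstract. Pools and the stage boundary are hypothetical.
import random

# Hypothetical document pools keyed by language group.
english_docs = ["en doc 1", "en doc 2", "en doc 3"]
non_english_docs = ["zh doc 1", "es doc 1", "ar doc 1"]

def non_english_ratio(tokens_seen: int, total_tokens: int) -> float:
    """Target share of non-English data for the current stage.

    First stage: 30% non-English; final stage: 60% non-English.
    Treating the halfway point as the stage boundary is an assumption.
    """
    return 0.30 if tokens_seen < total_tokens // 2 else 0.60

def sample_document(tokens_seen: int, total_tokens: int) -> str:
    """Sample one training document according to the curriculum schedule."""
    ratio = non_english_ratio(tokens_seen, total_tokens)
    pool = non_english_docs if random.random() < ratio else english_docs
    return random.choice(pool)

if __name__ == "__main__":
    total = 640  # stand-in for the 640B-token budget
    for seen in (0, 200, 400, 600):
        print(seen, non_english_ratio(seen, total), sample_document(seen, total))
```

In practice the ratio would be applied when building per-stage training shards rather than per-sample at runtime, but the schedule itself is the same idea.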
- Xiangpeng Wei (15 papers)
- Haoran Wei (55 papers)
- Huan Lin (55 papers)
- Tianhao Li (35 papers)
- Pei Zhang (119 papers)
- Xingzhang Ren (13 papers)
- Mei Li (41 papers)
- Yu Wan (18 papers)
- Zhiwei Cao (13 papers)
- Binbin Xie (5 papers)
- Tianxiang Hu (13 papers)
- Shangjie Li (2 papers)
- Binyuan Hui (57 papers)
- Bowen Yu (89 papers)
- Dayiheng Liu (75 papers)
- Baosong Yang (57 papers)
- Fei Huang (408 papers)
- Jun Xie (66 papers)