
PolyLM: An Open Source Polyglot Large Language Model (2307.06018v1)

Published 12 Jul 2023 in cs.CL

Abstract: Large language models (LLMs) demonstrate a remarkable ability to comprehend, reason, and generate text following natural language instructions. However, the development of LLMs has focused primarily on high-resource languages such as English, which limits their applicability and research in other languages. Consequently, we present PolyLM, a multilingual LLM trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B. To enhance its multilingual capabilities, we 1) integrate bilingual data into the training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage of pre-training. Further, we propose a multilingual self-instruct method that automatically generates 132.7K diverse multilingual instructions for model fine-tuning. To assess the model's performance, we collect several existing multilingual tasks, including multilingual understanding, question answering, generation, and translation. Extensive experiments show that PolyLM surpasses other open-source models such as LLaMA and BLOOM on multilingual tasks while maintaining comparable performance in English. Our models, along with the instruction data and multilingual benchmark, are available at: https://modelscope.cn/models/damo/nlp_polylm_13b_text_generation.
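
The curriculum strategy in the abstract can be illustrated with a small sketch. This is not the authors' implementation: only the two proportions (30% and 60% non-English) come from the abstract, while the stage names, the function `sample_batch`, and the per-example sampling scheme are assumptions made for illustration.

```python
import random

# Non-English share per pre-training stage. The two values are from the
# abstract; the stage names "first" and "final" are hypothetical labels.
NON_ENGLISH_SHARE = {"first": 0.30, "final": 0.60}


def sample_batch(english_docs, non_english_docs, stage, batch_size, rng):
    """Draw a batch whose expected non-English share matches the stage.

    Each example is drawn independently: with probability equal to the
    stage's share it comes from the non-English pool, otherwise from the
    English pool. Returns a list of (language_tag, document) pairs.
    """
    share = NON_ENGLISH_SHARE[stage]
    batch = []
    for _ in range(batch_size):
        if rng.random() < share:
            batch.append(("non_en", rng.choice(non_english_docs)))
        else:
            batch.append(("en", rng.choice(english_docs)))
    return batch


def non_english_fraction(batch):
    """Fraction of examples in the batch tagged as non-English."""
    return sum(1 for tag, _ in batch if tag == "non_en") / len(batch)


if __name__ == "__main__":
    rng = random.Random(0)
    english = ["an English document"]
    other = ["un documento", "ein Dokument"]
    for stage in ("first", "final"):
        batch = sample_batch(english, other, stage, 10_000, rng)
        print(stage, round(non_english_fraction(batch), 3))
```

With a large batch, the observed non-English fraction should sit close to the configured share for each stage. A real pipeline would more likely mix at the token or shard level; per-example sampling is used here only to keep the sketch short.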

Authors (18)
  1. Xiangpeng Wei (15 papers)
  2. Haoran Wei (55 papers)
  3. Huan Lin (55 papers)
  4. Tianhao Li (35 papers)
  5. Pei Zhang (119 papers)
  6. Xingzhang Ren (13 papers)
  7. Mei Li (41 papers)
  8. Yu Wan (18 papers)
  9. Zhiwei Cao (13 papers)
  10. Binbin Xie (5 papers)
  11. Tianxiang Hu (13 papers)
  12. Shangjie Li (2 papers)
  13. Binyuan Hui (57 papers)
  14. Bowen Yu (89 papers)
  15. Dayiheng Liu (75 papers)
  16. Baosong Yang (57 papers)
  17. Fei Huang (408 papers)
  18. Jun Xie (66 papers)
Citations (50)