
Skywork: A More Open Bilingual Foundation Model (2310.19341v1)

Published 30 Oct 2023 in cs.CL and cs.AI

Abstract: In this technical report, we present Skywork-13B, a family of large language models (LLMs) trained on a corpus of over 3.2 trillion tokens drawn from both English and Chinese texts. This bilingual foundation model is the most extensively trained and openly published LLM of comparable size to date. We introduce a two-stage training methodology using a segmented corpus, targeting general-purpose training and then domain-specific enhancement training, respectively. We show that our model not only excels on popular benchmarks, but also achieves state-of-the-art performance in Chinese language modeling on diverse domains. Furthermore, we propose a novel leakage detection method, demonstrating that test data contamination is a pressing issue warranting further investigation by the LLM community. To spur future research, we release Skywork-13B along with checkpoints obtained during intermediate stages of the training process. We are also releasing part of our SkyPile corpus, a collection of over 150 billion tokens of web text, which is the largest high-quality open Chinese pre-training corpus to date. We hope Skywork-13B and our open corpus will serve as a valuable open-source resource to democratize access to high-quality LLMs.

Authors (30)
  1. Tianwen Wei (20 papers)
  2. Liang Zhao (353 papers)
  3. Lichang Zhang (1 paper)
  4. Bo Zhu (83 papers)
  5. Lijie Wang (23 papers)
  6. Haihua Yang (5 papers)
  7. Biye Li (6 papers)
  8. Cheng Cheng (188 papers)
  9. Weiwei Lü (2 papers)
  10. Rui Hu (96 papers)
  11. Chenxia Li (12 papers)
  12. Liu Yang (194 papers)
  13. Xilin Luo (2 papers)
  14. Xuejie Wu (3 papers)
  15. Lunan Liu (2 papers)
  16. Wenjun Cheng (2 papers)
  17. Peng Cheng (229 papers)
  18. Jianhao Zhang (31 papers)
  19. Xiaoyu Zhang (144 papers)
  20. Lei Lin (42 papers)
Citations (80)