LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models (2406.00605v1)
Abstract: We introduce LongSkywork, a long-context LLM capable of processing up to 200,000 tokens. We provide a training recipe for efficiently extending the context length of LLMs. We identify that the critical element in enhancing long-context processing capability is to incorporate a long-context SFT stage following the standard SFT stage. A mere 200 iterations can convert the standard SFT model into a long-context model. To reduce the effort of collecting and annotating data for long-context language modeling, we develop two novel methods for creating synthetic data. These methods are applied during both the continual pretraining phase and the Supervised Fine-Tuning (SFT) phase, greatly improving the training efficiency of our long-context LLMs. Our findings suggest that synthetic long-context SFT data can, to some extent, surpass the performance of human-curated data. LongSkywork achieves outstanding performance on a variety of long-context benchmarks. In the Needle test, a benchmark for long-context information retrieval, our models achieve perfect accuracy across multiple context spans. Moreover, in realistic application scenarios, LongSkywork-13B performs on par with Claude2.1, the leading long-context model, underscoring the effectiveness of our proposed methods.
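The abstract does not spell out the two synthetic-data methods, so the following is only a minimal sketch of one plausible construction in the spirit of the Needle test: short key-value "facts" are hidden at random depths inside long filler text, and the model is asked to retrieve one of them. All names here (`make_sample`, `FILLER`, the token-budget heuristic) are illustrative assumptions, not the paper's actual implementation.

```python
import random

# Illustrative assumption: filler sentence used to pad the context to a target length.
FILLER = "The sky was clear and the market was quiet that day. "


def make_sample(num_facts: int = 3, target_tokens: int = 8000, seed: int = 0) -> dict:
    """Build one synthetic long-context retrieval sample (prompt + gold answer)."""
    rng = random.Random(seed)
    facts = {f"code-{i}": str(rng.randint(100000, 999999)) for i in range(num_facts)}

    # Rough token budget via a crude words-per-token heuristic (~0.75 words/token).
    filler_words = FILLER.split()
    words_needed = int(target_tokens * 0.75)
    body = (filler_words * (words_needed // len(filler_words) + 1))[:words_needed]

    # Insert each fact sentence at a random depth in the filler text.
    for key, value in facts.items():
        pos = rng.randint(0, len(body))
        body.insert(pos, f"Remember: the secret for {key} is {value}.")

    context = " ".join(body)
    key, value = rng.choice(list(facts.items()))
    return {
        "prompt": f"{context}\n\nQuestion: What is the secret for {key}?",
        "answer": value,
    }


if __name__ == "__main__":
    sample = make_sample()
    print(sample["prompt"][:200], "...")
    print("gold answer:", sample["answer"])
```

Samples of this kind can be generated at arbitrary context lengths without human annotation, which is one reason synthetic long-context SFT data can be cheaper to scale than human-curated data.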
- Liang Zhao
- Tianwen Wei
- Liang Zeng
- Cheng Cheng
- Liu Yang
- Peng Cheng
- Lijie Wang
- Chenxia Li
- Xuejie Wu
- Bo Zhu
- Yimeng Gan
- Rui Hu
- Shuicheng Yan
- Han Fang
- Yahui Zhou