Source Prompt: Coordinated Pre-training of Language Models on Diverse Corpora from Multiple Sources (2311.09732v1)

Published 16 Nov 2023 in cs.CL and cs.AI

Abstract: Pre-trained LLMs (PLMs) have established the new paradigm in the field of NLP. One of the most popular and successful ways to obtain more powerful PLMs is to continuously scale up the sizes of the models and the pre-training corpora. These large corpora are generally obtained by merging smaller ones from multiple sources and are thus growing increasingly diverse. However, the side effects of these colossal merged corpora remain understudied. In this paper, we identify the disadvantages of heterogeneous corpora from multiple sources for pre-training PLMs. Towards coordinated pre-training on diverse corpora, we further propose source prompts (SP), which explicitly prompt the model with the data source at both the pre-training and fine-tuning stages. Results of extensive experiments demonstrate that PLMs pre-trained with SP on diverse corpora gain significant improvements on various downstream tasks.
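
The abstract describes source prompts as explicit markers of each example's data source during pre-training and fine-tuning. Below is a minimal, hypothetical sketch of that idea: prepend a source tag to every training example before the corpora are merged. The tag format, corpus names, and helper functions are illustrative assumptions, not the exact scheme used in the paper.

```python
# Hypothetical sketch: prepend a source prompt to each pre-training example
# before merging corpora from multiple sources. The "[SOURCE: ...]" tag format
# is an assumption for illustration, not the paper's exact prompt design.

from typing import Dict, List


def add_source_prompt(text: str, source: str) -> str:
    """Prefix a raw training example with an explicit source prompt."""
    return f"[SOURCE: {source}] {text}"


def build_prompted_corpus(corpora: Dict[str, List[str]]) -> List[str]:
    """Merge per-source corpora into one list, tagging each example with its source."""
    prompted: List[str] = []
    for source, examples in corpora.items():
        prompted.extend(add_source_prompt(text, source) for text in examples)
    return prompted


if __name__ == "__main__":
    # Toy corpora standing in for diverse pre-training sources.
    corpora = {
        "wikipedia": ["Deep learning is a subfield of machine learning."],
        "web_crawl": ["10 weird tricks to speed up your laptop!"],
    }
    for line in build_prompted_corpus(corpora):
        print(line)
```

At fine-tuning or inference time, the same source tag would be prepended to inputs so the model can condition on the most relevant source, in line with the coordinated pre-training idea the abstract describes.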

Authors (10)
  1. Yipei Xu (3 papers)
  2. Dakuan Lu (7 papers)
  3. Jiaqing Liang (62 papers)
  4. Xintao Wang (132 papers)
  5. Yipeng Geng (2 papers)
  6. Yingsi Xin (3 papers)
  7. Hengkui Wu (3 papers)
  8. Ken Chen (29 papers)
  9. Ruiji Zhang (3 papers)
  10. Yanghua Xiao (151 papers)
