Determine overlap between OpenTelecom dataset and Llama‑3 pretraining corpus
Ascertain whether the OpenTelecom pre-training dataset, used to adapt general-purpose language models to the telecom domain, contains material that also appears in the pre-training corpus of Meta Llama 3-8B. Clarifying this potential overlap matters for continual pre-training decisions and for evaluation fairness.
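Since Meta has not released the Llama-3 pretraining corpus, overlap cannot be checked by direct set intersection. One workable proxy is a memorization heuristic: text the model saw during pretraining tends to receive unusually low loss. The sketch below (Python with Hugging Face transformers; the model ID is the real gated checkpoint, but the sample document and any flagging threshold are illustrative assumptions, not part of the original task) scores documents by per-document perplexity under Llama 3-8B.

```python
# Sketch of a perplexity-based "seen in pretraining" heuristic.
# Assumption: unusually low perplexity relative to the corpus median is
# treated as weak evidence of membership in the pretraining data.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "meta-llama/Meta-Llama-3-8B"  # gated checkpoint; requires HF access

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
model.eval()

@torch.no_grad()
def doc_perplexity(text: str, max_tokens: int = 1024) -> float:
    """Causal-LM perplexity of a single document under the model."""
    ids = tokenizer(
        text, return_tensors="pt", truncation=True, max_length=max_tokens
    ).input_ids.to(model.device)
    # labels == input_ids: the model shifts labels internally to score
    # next-token prediction over the whole document
    loss = model(ids, labels=ids).loss
    return float(torch.exp(loss))

# Hypothetical usage: replace with real OpenTelecom documents and flag
# low-perplexity outliers for manual inspection.
docs = ["3GPP TS 38.331 specifies the RRC protocol for 5G NR ..."]
for d in docs:
    print(round(doc_perplexity(d), 2))
```

A low-perplexity outlier is only weak evidence of overlap; since both OpenTelecom and the Llama-3 corpus draw heavily on the open web, exact n-gram matching of OpenTelecom documents against public web snapshots such as Common Crawl would give a complementary, more direct signal.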
References
For Llama3-8B, we choose not to further pre-train it for two main reasons: i) the pre-training dataset of the Llama-3 series contains about 15T tokens, while our pre-training dataset is largely built from open documents on the web, so it is difficult to determine whether our data were already used to train Llama3-8B; ii) our hardware does not allow us to continue pre-training efficiently at a reasonable batch size.
— TelecomGPT: A Framework to Build Telecom-Specfic Large Language Models
(arXiv:2407.09424, Zou et al., 12 Jul 2024), Section 6 (Training Detail), Domain-Specific Continual Pretraining