Determine overlap between OpenTelecom dataset and Llama‑3 pretraining corpus

Determine whether the OpenTelecom pre-training dataset, used to adapt general-purpose language models to the telecom domain, contains material that also appears in the pre-training corpus of Meta's Llama 3-8B. Clarifying this potential overlap matters both for deciding whether continual pre-training on Llama 3-8B is worthwhile and for ensuring that evaluation results are not inflated by data leakage.

Background

In the training details, the authors explain why they did not perform continual pre-training on the Llama 3-8B base model. One stated reason is uncertainty about whether their telecom pre-training corpus, which is largely drawn from open web documents, overlaps with the proprietary Llama 3 pre-training data. This uncertainty bears on whether further continual pre-training is appropriate and on how results should be interpreted, since undetected overlap would amount to data leakage.

Clarifying any overlap between the OpenTelecom corpus and the Llama 3 pretraining dataset would improve transparency and ensure that claims about domain adaptation are evaluated fairly, without inadvertently re-measuring content the base model has already seen.
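Because the Llama 3 pretraining corpus is not publicly released, a direct document-level intersection cannot be computed. A common proxy is a membership-inference style check that scores how predictable candidate documents are under the base model (for example, a Min-K% Prob style heuristic). The following is a minimal, illustrative sketch only, not the authors' method; it assumes access to the meta-llama/Meta-Llama-3-8B checkpoint on Hugging Face and a hypothetical local sample file opentelecom_sample.txt, and high scores are weak evidence of overlap, not proof.

    # Sketch of a Min-K% Prob style membership heuristic: score how "familiar"
    # Llama 3-8B is with an OpenTelecom document. Model id and file path are
    # assumptions for illustration, not values from the paper.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    MODEL_NAME = "meta-llama/Meta-Llama-3-8B"  # assumed (gated) HF checkpoint id

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_NAME, torch_dtype=torch.bfloat16, device_map="auto"
    )
    model.eval()

    @torch.no_grad()
    def min_k_prob_score(text: str, k: float = 0.2, max_tokens: int = 1024) -> float:
        """Average log-probability of the lowest-k fraction of token log-probs.

        Higher scores mean the text is more 'expected' by the model, which is
        consistent with (but does not prove) its presence in the pretraining data.
        Assumes the document tokenizes to more than one token.
        """
        ids = tokenizer(
            text, return_tensors="pt", truncation=True, max_length=max_tokens
        ).input_ids.to(model.device)
        logits = model(ids).logits                       # (1, seq_len, vocab)
        log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
        # Log-probability assigned to each actual next token.
        token_logps = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)[0]
        n = max(1, int(k * token_logps.numel()))
        lowest, _ = torch.topk(token_logps, n, largest=False)
        return lowest.mean().item()

    # Compare scores on OpenTelecom documents against control text known to
    # post-date the model's training cutoff to calibrate what counts as "high".
    telecom_doc = open("opentelecom_sample.txt").read()  # hypothetical sample
    print("Min-K% score:", min_k_prob_score(telecom_doc))

In practice such scores are only meaningful relative to a calibration set (e.g., documents known to be outside the training data), and they cannot substitute for access to the actual Llama 3 corpus.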

References

For Llama3-8B, we choose to not further pre-train it for two main reasons: i) the pre-trained dataset of Llama-3 series contains 15 TB tokens, our pre-training dataset is largely built on open documents on the web, it is uncertain to determine if our dataset is already used to train Llama3-8B; ii) our hardware limit us to efficiently continue pre-train in a reasonable batch size.

TelecomGPT: A Framework to Build Telecom-Specfic Large Language Models (2407.09424 - Zou et al., 12 Jul 2024) in Section 6 (Training Detail), Domain-Specific Continual Pretraining