Unknown pretraining data for evaluated LLaMA models

Determine the identity and composition of the datasets used to pretrain the LLaMA1, LLaMA2, and LLaMA2-Chat models whose QuIP-quantized checkpoints are evaluated for token-level generalization bounds on the Amber dataset, so that the empirical risk evaluation can be contextualized relative to the training distribution.

Background

The paper evaluates token-level generalization bounds for publicly available, QuIP-quantized LLaMA models (including LLaMA1, LLaMA2, and LLaMA2-Chat) on the Amber dataset. Because these models are released without full disclosure of their pretraining corpora, the authors note that the exact pretraining data is unknown.

Although the bounds do not require the evaluation dataset to match the pretraining data, identifying the pretraining datasets would clarify the relationship between the training distribution and the evaluation corpus and could inform interpretation of generalization behavior.
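The point that the bounds remain valid regardless of the pretraining corpus rests on the fact that the empirical risk is just the mean per-token loss on whatever evaluation corpus is supplied. A minimal sketch of that computation (not the paper's implementation; the function name and example log-probabilities are illustrative assumptions):

```python
import math

def empirical_token_risk(token_log_probs):
    """Average negative log-likelihood per token, in bits per token.

    `token_log_probs` holds natural-log probabilities a model assigns to
    each token of the evaluation corpus (e.g., Amber). Nothing here
    depends on whether that corpus overlaps the pretraining data: the
    empirical risk is simply the mean loss over the tokens provided.
    """
    if not token_log_probs:
        raise ValueError("need at least one token")
    nll_nats = -sum(token_log_probs) / len(token_log_probs)
    return nll_nats / math.log(2)  # convert nats to bits

# Hypothetical per-token log-probabilities from a quantized checkpoint.
log_probs = [math.log(0.25), math.log(0.5), math.log(0.125)]
risk_bits = empirical_token_risk(log_probs)  # (2 + 1 + 3) / 3 = 2.0 bits/token
```

In practice the log-probabilities would come from the QuIP-quantized checkpoint's forward pass over the evaluation tokens; the averaging step is the same.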

References

Although we do not know what data was used to pretrain the LLaMA models, our bounds remain valid since they do not require the models to be trained on the same data that the empirical risk is evaluated on.

Unlocking Tokens as Data Points for Generalization Bounds on Larger Language Models (2407.18158 - Lotfi et al., 25 Jul 2024) in Section 5.2, Non-vacuous Bounds for Pretrained LLMs: GPT2, LLaMA1 and LLaMA2