Unknown pretraining data for evaluated LLaMA models
Ascertain the identity and composition of the datasets used to pretrain the LLaMA1, LLaMA2, and LLaMA2-Chat models whose QuIP-quantized checkpoints are evaluated for token-level generalization bounds on the Amber dataset, to contextualize the empirical risk evaluation relative to the training distribution.
References
Although we do not know what data was used to pretrain the LLaMA models, our bounds remain valid since they do not require the models to be trained on the same data that the empirical risk is evaluated on.
— Unlocking Tokens as Data Points for Generalization Bounds on Larger Language Models
(arXiv:2407.18158, Lotfi et al., 25 Jul 2024), Section 5.2, Non-vacuous Bounds for Pretrained LLMs: GPT2, LLaMA1 and LLaMA2
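For context, the quoted claim rests on the bound holding for any fixed hypothesis evaluated on data it was not necessarily trained on. A minimal sketch of the generic form such a token-level bound can take is given below; the symbols \hat{R}(h), P(h), \delta, m, and \Delta are illustrative assumptions, not notation copied from the quoted paper:

    R(h) \le \hat{R}(h) + \Delta \sqrt{ \frac{ \log \frac{1}{P(h)} + \log \frac{1}{\delta} }{ 2m } }

Here \hat{R}(h) is the empirical risk of the (quantized) checkpoint h measured over m evaluation tokens (here, Amber), P(h) is a prior over compressed hypotheses, \Delta bounds the per-token loss range, and the inequality holds with probability at least 1 - \delta provided h is fixed independently of the evaluation tokens. Nothing in this form requires h to have been trained on those tokens, which is why unknown pretraining data does not invalidate the bound; it does, however, leave open how closely the Amber evaluation distribution matches the (unknown) training distribution.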