Non-Vacuous Generalization Bounds for Large Language Models (2312.17173v3)
Abstract: Modern large language models (LLMs) can contain billions of parameters, raising the question of whether they can generalize beyond their training data or simply parrot their training corpora. We provide the first non-vacuous generalization bounds for pretrained LLMs, indicating that LLMs are capable of discovering regularities that generalize to unseen data. In particular, we derive a compression bound that is valid for the unbounded log-likelihood loss using prediction smoothing, and we extend the bound to handle subsampling, accelerating bound computation by orders of magnitude on massive datasets. To achieve the extreme level of compression required for non-vacuous bounds, we devise SubLoRA, a simple low-dimensional nonlinear parameterization that leads to non-vacuous generalization bounds for models with nearly a billion parameters. Finally, we use our bounds to understand LLM generalization and find that larger models have better generalization bounds and are more compressible than smaller models.
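To make the prediction-smoothing idea concrete: mixing the model's next-token distribution with a uniform distribution over the vocabulary caps the per-token log-loss, which is what lets a bounded-loss compression bound be applied to an otherwise unbounded log-likelihood. The sketch below is a minimal illustration under that assumption; the mixing weight `alpha`, the function name `smoothed_log_loss`, and the example vocabulary size are illustrative choices, not the paper's exact construction.

```python
import numpy as np

def smoothed_log_loss(p_model, target_idx, alpha=0.1):
    """Per-token negative log-likelihood under prediction smoothing.

    The model's next-token distribution p_model (a length-V probability
    vector) is mixed with the uniform distribution over the vocabulary.
    Because every smoothed probability is at least alpha / V, the loss
    is bounded above by log(V / alpha), so bounded-loss generalization
    bounds become applicable.
    """
    V = len(p_model)
    p_smooth = (1.0 - alpha) * p_model + alpha / V  # mixture with uniform
    return -np.log(p_smooth[target_idx])

# Example: even when the model assigns (near-)zero probability to the
# true token, the smoothed loss stays finite and below log(V / alpha).
V = 50_000
p = np.zeros(V)
p[0] = 1.0  # model puts all mass on token 0
print(smoothed_log_loss(p, target_idx=123, alpha=0.1))  # ~13.12
print(np.log(V / 0.1))                                  # upper bound ~13.12
```

In the paper's bound, this bounded smoothed loss is paired with the compressed description length of the model (obtained via the SubLoRA parameterization); the exact prior, constants, and subsampling correction are detailed in the paper itself.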