Understanding prompt engineering may not require rethinking generalization (2310.03957v1)
Abstract: Zero-shot learning in prompted vision-language models, the practice of crafting prompts to build classifiers without an explicit training process, has achieved impressive performance in many settings. This success presents a seemingly surprising observation: these methods suffer relatively little from overfitting, i.e., when a prompt is manually engineered to achieve low error on a given training set (thus rendering the method no longer actually zero-shot), the approach still performs well on held-out test data. In this paper, we show that we can explain such performance well via recourse to classical PAC-Bayes bounds. Specifically, we show that the discrete nature of prompts, combined with a PAC-Bayes prior given by a language model, results in generalization bounds that are remarkably tight by the standards of the literature: for instance, the generalization bound of an ImageNet classifier is often within a few percentage points of the true test error. We demonstrate empirically that this holds for existing handcrafted prompts and prompts generated through simple greedy search. Furthermore, the resulting bound is well-suited for model selection: the models with the best bound typically also have the best test performance. This work thus provides a possible justification for the widespread practice of prompt engineering, even if it seems that such methods could potentially overfit the training data.
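To make the abstract's argument concrete, the following is a minimal sketch of how a PAC-Bayes bound of this flavor could be evaluated for a single discrete prompt. It assumes (these are not the paper's exact constants or code) a point-mass posterior on the chosen prompt, a prior whose log-probability of that prompt comes from a language model, and the standard kl-inverse form of the bound, solved numerically by bisection:

```python
import math

def kl_bernoulli(q, p):
    # Binary relative entropy kl(q || p) between Bernoulli means q and p.
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def pac_bayes_bound(train_err, log_prior_prob, n, delta=0.05):
    """Upper-bound the test error of one discrete prompt.

    train_err      : empirical error of the prompt on n training points
    log_prior_prob : log-probability the LM prior assigns to the prompt
                     (so -log_prior_prob is the complexity term)
    n              : number of training examples
    delta          : confidence parameter

    Inverts kl(train_err || p) <= (-log P(prompt) + log(2 sqrt(n)/delta)) / n
    by bisection to find the smallest valid upper bound p on test error.
    """
    budget = (-log_prior_prob + math.log(2 * math.sqrt(n) / delta)) / n
    lo, hi = train_err, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if kl_bernoulli(train_err, mid) > budget:
            hi = mid  # mid is already a valid upper bound; tighten it
        else:
            lo = mid
    return hi
```

For example, a prompt with 5% training error, LM log-probability around -50, and n = 50,000 yields a bound only a percentage point or so above the training error, illustrating why discrete prompts with a strong prior give bounds "within a few percentage points of the true test error."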