Mind Your Format: Towards Consistent Evaluation of In-Context Learning Improvements (2401.06766v3)
Abstract: LLMs demonstrate a remarkable capability for learning to solve new tasks from a few examples. The prompt template, i.e., the way the input examples are formatted to obtain the prompt, is an important yet often overlooked aspect of in-context learning. In this work, we conduct a comprehensive study of the template format's influence on in-context learning performance. We evaluate the impact of the prompt template across 21 models (from 770M to 70B parameters) and 4 standard classification datasets. We show that a poor template choice can reduce the performance of the strongest models and inference methods to the level of random guessing. More importantly, the best templates do not transfer between different setups, or even between models of the same family. Our findings show that the currently prevalent approach to evaluation, which ignores template selection, may give misleading results because different works use different templates. As a first step towards mitigating this issue, we propose Template Ensembles, which aggregate model predictions across several templates. This simple test-time augmentation boosts average performance while remaining robust to the choice of a random set of templates.
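Below is a minimal sketch of how a test-time template ensemble could be implemented: class probabilities are computed under several prompt templates and averaged before taking the argmax. This is an illustration, not the authors' reference implementation; the model name ("gpt2"), the label set, and the three templates are placeholders, and the paper's exact aggregation rule may differ.

```python
# Sketch of a test-time "template ensemble": average label probabilities
# obtained from several prompt templates, then predict the argmax label.
# Model, labels, and templates below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

LABELS = ["negative", "positive"]                            # e.g., binary sentiment
TEMPLATES = [                                                # hypothetical templates
    "Review: {text}\nSentiment: ",
    "Input: {text}\nLabel: ",
    "{text}\nThe sentiment of this review is ",
]

@torch.no_grad()
def label_probs(prompt: str) -> torch.Tensor:
    """Probability of each label's first token as the next token after the prompt."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    next_token_logits = model(ids).logits[0, -1]             # logits over the vocabulary
    label_ids = [tokenizer(" " + label).input_ids[0] for label in LABELS]
    return torch.softmax(next_token_logits[label_ids], dim=-1)

def template_ensemble_predict(text: str) -> str:
    """Average label probabilities over all templates and pick the most likely label."""
    probs = torch.stack([label_probs(t.format(text=text)) for t in TEMPLATES])
    return LABELS[int(probs.mean(dim=0).argmax())]

print(template_ensemble_predict("A thoroughly enjoyable film."))
```

Averaging probabilities (rather than majority voting) keeps the aggregation smooth when individual templates are uncertain; either choice fits the general idea of ensembling over templates described in the abstract.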