Large Language Models Are Latent Variable Models: Explaining and Finding Good Demonstrations for In-Context Learning (2301.11916v4)
Abstract: In recent years, pre-trained large language models (LLMs) have demonstrated a remarkable inference-time few-shot learning capability known as in-context learning. However, existing literature has highlighted the sensitivity of this capability to the selection of few-shot demonstrations, and current explanations of how the capability arises from regular LLM pretraining objectives remain disconnected from real-world LLMs. This study examines the in-context learning phenomenon through a Bayesian lens, viewing real-world LLMs as latent variable models. On this premise, we propose an algorithm that selects optimal demonstrations from a set of annotated data using a small LM, and then directly generalizes the selected demonstrations to larger LMs. We demonstrate significant improvements over baselines, averaged over eight GPT models on eight real-world text classification datasets. We also demonstrate the real-world usefulness of our algorithm on GSM8K, a math word problem dataset. Our empirical findings support our hypothesis that LLMs implicitly infer a latent variable containing task information.
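The latent variable framing in the abstract can be summarized by the marginalization below. The notation is illustrative only (it is not quoted from the paper): theta denotes a latent variable carrying task information, the (x_i, y_i) pairs are the k demonstrations, and x is the test input.

```latex
% Latent-variable view of in-context learning (illustrative notation):
% the LM's prediction marginalizes over a latent task variable theta
% that is implicitly inferred from the k demonstrations.
P\bigl(y \mid x, \{(x_i, y_i)\}_{i=1}^{k}\bigr)
  = \int_{\Theta} P(y \mid x, \theta)\,
    P\bigl(\theta \mid \{(x_i, y_i)\}_{i=1}^{k}\bigr)\, d\theta
```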
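The demonstration-selection algorithm is only summarized in the abstract, so the sketch below is a hedged reconstruction rather than the authors' exact method. It assumes candidates are scored with a small causal LM (here GPT-2 via Hugging Face transformers) by the likelihood the LM assigns to correct labels on a small dev set when a candidate is used as context; the prompt template, scoring rule, and helper names (`label_log_likelihood`, `score_demonstration`) are assumptions introduced for illustration.

```python
# Hedged sketch: rank candidate demonstrations with a small LM, then reuse the
# top-ranked ones unchanged with a larger LM. The scoring rule is an assumption.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")            # the "small LM"
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def label_log_likelihood(prompt: str, label: str) -> float:
    """Log-probability the small LM assigns to `label` as a continuation of `prompt`."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    label_ids = tokenizer(label, return_tensors="pt").input_ids
    full_ids = torch.cat([prompt_ids, label_ids], dim=1)
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)    # logits[t] predicts token t+1
    label_positions = range(prompt_ids.shape[1] - 1, full_ids.shape[1] - 1)
    return sum(log_probs[t, full_ids[0, t + 1]].item() for t in label_positions)

def score_demonstration(demo, dev_set) -> float:
    """Average correct-label log-likelihood on `dev_set` with `demo` as the only context."""
    demo_block = f"Input: {demo[0]}\nLabel: {demo[1]}\n\n"
    return sum(
        label_log_likelihood(demo_block + f"Input: {x}\nLabel:", f" {y}")
        for x, y in dev_set
    ) / len(dev_set)

# Toy, purely illustrative data; the top-k demonstrations selected here would
# then be passed verbatim to a larger LM at inference time.
candidates = [("the movie was wonderful", "positive"), ("a dull, tedious film", "negative")]
dev_set = [("an absolute delight", "positive"), ("i wanted my money back", "negative")]
top_k = sorted(candidates, key=lambda d: score_demonstration(d, dev_set), reverse=True)[:1]
print("selected demonstrations:", top_k)
```

The key transfer step described in the abstract is that ranking happens once with the small model, and the selected demonstrations are reused as-is when prompting larger models.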
Authors: Xinyi Wang, Wanrong Zhu, Michael Saxon, Mark Steyvers, William Yang Wang