
Towards Informative Few-Shot Prompt with Maximum Information Gain for In-Context Learning (2310.08923v1)

Published 13 Oct 2023 in cs.CL

Abstract: LLMs possess the capability to engage in In-Context Learning (ICL) by leveraging a few demonstrations pertaining to a new downstream task as conditions. However, this particular learning paradigm suffers from high instability stemming from substantial variances induced by factors such as the input distribution of selected examples, their ordering, and prompt formats. In this work, we demonstrate that even when all these factors are held constant, the random selection of examples still results in high variance. Consequently, we aim to explore the informative ability of data examples by quantifying the Information Gain (IG) obtained in prediction after observing a given example candidate. We then propose to sample those with maximum IG. Additionally, we identify the presence of template bias, which can lead to unfair evaluations of IG during the sampling process. To mitigate this bias, we introduce a Calibration Before Sampling strategy. The experimental results illustrate that our proposed method can yield an average relative improvement of 14.3% across six classification tasks using three LLMs.


Summary

Enhancing Few-Shot Prompting with Information Gain Maximization

This paper addresses the instability of In-Context Learning (ICL) in LLMs arising from variances induced by factors such as input distribution, demonstration ordering, and prompt formats. The authors demonstrate that even when these factors are controlled, the random selection of examples still leads to high variance in performance. To mitigate this, the paper introduces a novel method to quantify the informative ability of data examples by measuring the Information Gain (IG) obtained in prediction after observing a candidate example, and proposes sampling examples with maximum IG. The authors also identify and address the presence of template bias through a Calibration Before Sampling strategy. Experimental results on six classification tasks using three LLMs show an average relative improvement of 14.3%.

Problem Formulation and Information Gain

The paper focuses on retrieving prompts from an unlabeled text dataset $\mathcal{D}_{unlab} = \{x_i\}_{i=1}^N$ for a specific task, aligning with true few-shot learning. The approach involves using a pre-trained LLM to predict all candidate examples in $\mathcal{D}_{unlab}$, resulting in a prediction set $\mathcal{Y} = \{\mathbf{y}_i\}_{i=1}^N$, where $\mathbf{y}_i$ represents the normalized predicted label distribution given input $x_i$. The objective is to select a subset $\{x_j\}_{j=1}^K$ from $\mathcal{D}_{unlab}$, where $K \ll N$, to facilitate $K$-shot learning.

The core concept is to measure the informative ability of data examples by quantifying the Information Gain (IG) of prediction. IG is defined as the information obtained about the predicted label distribution $Y$ when observing one example candidate $X = x_{ob}$ in $\mathcal{D}_{unlab}$:

$$IG(Y, x_{ob}) = H(Y) - H(Y \mid x_{ob})$$

where $H(Y)$ is the information entropy of $Y$ and $H(Y \mid x_{ob})$ is the conditional entropy of $Y$ given the observation $x_{ob}$. Since $H(Y)$ remains constant for a given task, the problem is reframed as selecting examples with minimum conditional entropy $H(Y \mid x_{ob})$.

Figure 1: An overview of the proposed method, detailing the sampling time and test time processes.
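The selection rule above, picking the $K$ candidates with minimum conditional entropy (equivalently, maximum IG), can be sketched as follows. The entropy computation follows the standard definition; the toy predictions are illustrative numbers, not data from the paper:

```python
import numpy as np

def conditional_entropy(label_probs):
    """Entropy H(Y|x_ob) of one normalized predicted label distribution."""
    p = np.clip(label_probs, 1e-12, 1.0)  # guard against log(0)
    return float(-np.sum(p * np.log(p)))

def select_max_ig(predictions, k):
    """Select the k candidates with maximum IG. Since H(Y) is constant
    for a fixed task, this means minimum conditional entropy."""
    entropies = [conditional_entropy(y) for y in predictions]
    return sorted(np.argsort(entropies)[:k].tolist())

# Toy predictions for N=4 candidates over 3 labels.
preds = np.array([
    [0.98, 0.01, 0.01],  # confident -> low H(Y|x_ob), high IG
    [0.34, 0.33, 0.33],  # near-uniform -> high H(Y|x_ob), low IG
    [0.70, 0.20, 0.10],
    [0.90, 0.05, 0.05],
])
print(select_max_ig(preds, k=2))  # indices of the two most informative candidates
```

Running this selects the two most confidently predicted candidates (indices 0 and 3), matching the intuition that low conditional entropy signals high information gain.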

Template Bias and Calibration Before Sampling

The paper identifies the presence of template bias, whereby the LLM tends to favor specific answers on the basis of the template alone, even in the absence of demonstrations. This bias can lead to unfair evaluations of IG during the sampling process. To address it, the authors introduce a Calibration Before Sampling (CBS) strategy, which applies a normalization function $\sigma$ and a weight matrix $\mathbf{W}$ to calibrate each predicted label distribution before its conditional entropy is computed, so that the template's inherent preference does not distort the IG comparison.
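The exact form of $\mathbf{W}$ is not specified in this excerpt. A plausible sketch, borrowing the diagonal content-free calibration of "Calibrate Before Use" (Zhao et al., 2021), rescales each prediction by the inverse of the model's output on a content-free input (e.g. "N/A") filled into the same template; the numbers below are hypothetical:

```python
import numpy as np

def calibrate(pred, content_free_pred):
    """Counteract template bias by rescaling a prediction with the model's
    output on a content-free input: W = diag(p_cf)^(-1), then renormalize
    (the normalization plays the role of sigma)."""
    W = np.diag(1.0 / np.clip(content_free_pred, 1e-12, None))
    q = W @ pred
    return q / q.sum()

# Hypothetical numbers: the template alone pushes the model toward label 0.
p_cf = np.array([0.7, 0.2, 0.1])    # prediction on a content-free input
p    = np.array([0.6, 0.25, 0.15])  # prediction on a real candidate
print(calibrate(p, p_cf))
```

After calibration the spurious preference for label 0 is removed (the argmax shifts away from the template-favored label), so conditional entropies computed on calibrated distributions compare candidates fairly.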


Authors (2)
