
Improving Probability-based Prompt Selection Through Unified Evaluation and Analysis

Published 24 May 2023 in cs.CL (arXiv:2305.14877v2)

Abstract: Previous work on prompt engineering for LLMs has introduced various gradient-free, probability-based prompt selection methods that aim to choose the optimal prompt among candidates for a given task, but these methods have not been compared with one another comprehensively and fairly. In this paper, we propose a unified framework to interpret and evaluate the existing probability-based prompt selection methods through extensive experiments on 13 common and diverse NLP tasks. We find that each existing method can be interpreted as a variant of the method that maximizes mutual information between the input and the predicted output (MI). Building on this finding, we develop several combinatorial variants of MI and increase the effectiveness of prompt selection from 87.79% to 94.98%, measured as the ratio of the performance of the selected prompt to that of the optimal oracle prompt. Furthermore, since all of these methods rely on the model's output probability distribution, which may be biased, we propose a novel calibration method, Calibration by Marginalization (CBM), that is orthogonal to the existing methods and raises the prompt selection effectiveness of the best method to 96.85%, achieving 99.44% of the oracle prompt F1 without calibration.
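The MI criterion referenced in the abstract can be sketched as follows: score each candidate prompt by the mutual information between inputs and predicted outputs, MI(X; Y) = H(E_x[p(y|x)]) − E_x[H(p(y|x))], and select the prompt with the highest score. The abstract does not spell out how CBM works, so `calibrate_by_marginalization` below is a hypothetical stand-in that divides each label's probability by its marginal over the inputs and renormalizes; the function names and the exact calibration form are assumptions for illustration, not the paper's method.

```python
import numpy as np

def calibrate_by_marginalization(probs: np.ndarray) -> np.ndarray:
    """Illustrative marginalization-based calibration (assumed form, not
    necessarily the paper's CBM): divide each label's probability by its
    marginal over the inputs, then renormalize each row."""
    marginal = probs.mean(axis=0)                 # estimate of p(y) over inputs
    scores = probs / (marginal + 1e-12)
    return scores / scores.sum(axis=1, keepdims=True)

def mutual_information(probs: np.ndarray) -> float:
    """MI(X; Y) = H(E_x[p(y|x)]) - E_x[H(p(y|x))] for one candidate prompt.

    probs has shape (n_inputs, n_labels): row i is the model's output
    distribution over answer labels for input i under that prompt."""
    eps = 1e-12
    marginal = probs.mean(axis=0)                 # E_x[p(y|x)]
    h_marginal = -float(np.sum(marginal * np.log(marginal + eps)))
    h_conditional = float(np.mean(-np.sum(probs * np.log(probs + eps), axis=1)))
    return h_marginal - h_conditional

def select_prompt(prompt_to_probs: dict) -> str:
    """Pick the candidate prompt whose calibrated output distributions
    maximize mutual information between inputs and predicted outputs."""
    return max(
        prompt_to_probs,
        key=lambda p: mutual_information(
            calibrate_by_marginalization(prompt_to_probs[p])
        ),
    )
```

A prompt whose predictions are confident but balanced across labels (high marginal entropy, low conditional entropy) scores higher than one that yields near-uniform distributions for every input, matching the intuition that an informative prompt makes the output depend strongly on the input.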

