UPRISE: Universal Prompt Retrieval for Improving Zero-Shot Evaluation (2303.08518v4)

Published 15 Mar 2023 in cs.CL

Abstract: LLMs are popular for their impressive abilities, but the need for model-specific fine-tuning or task-specific prompt engineering can hinder their generalization. We propose UPRISE (Universal Prompt Retrieval for Improving zero-Shot Evaluation), which tunes a lightweight and versatile retriever that automatically retrieves prompts for a given zero-shot task input. Specifically, we demonstrate universality in a cross-task and cross-model scenario: the retriever is tuned on a diverse set of tasks, but tested on unseen task types; we use a small frozen LLM, GPT-Neo-2.7B, for tuning the retriever, but test the retriever on different LLMs of much larger scales, such as BLOOM-7.1B, OPT-66B and GPT3-175B. Additionally, we show that UPRISE mitigates the hallucination problem in our experiments with ChatGPT, suggesting its potential to improve even the strongest LLMs. Our model and code are available at https://github.com/microsoft/LMOps.

UPRISE: Universal Prompt Retrieval for Improving Zero-Shot Evaluation

The paper "UPRISE: Universal Prompt Retrieval for Improving Zero-Shot Evaluation" introduces an approach to improving the zero-shot performance of LLMs through universal prompt retrieval. The method seeks to overcome the limitations of model-specific fine-tuning and task-specific prompt engineering.

Key Contributions

  1. Universal Prompt Retriever: The paper proposes UPRISE, a lightweight, versatile retriever that automatically selects prompts from a pre-constructed pool for a given zero-shot task input. The retriever is tuned with feedback from a small frozen LLM, GPT-Neo-2.7B, and remains effective when its retrieved prompts are used with much larger models such as BLOOM-7.1B, OPT-66B, and GPT3-175B (a minimal retrieval sketch follows this list).
  2. Cross-Task and Cross-Model Generalization: UPRISE is evaluated on task types held out from retriever training and on LLMs other than the one used for tuning, highlighting its ability to extend beyond its training tasks and transfer across model scales. Generalizing from a small tuning model to much larger inference models is crucial for practical deployment.
  3. Hallucination Mitigation: The approach also addresses the hallucination problem in models like ChatGPT, showing improved accuracy on fact-checking tasks and underscoring UPRISE's potential to enhance even the strongest LLMs.
  4. Structured Evaluation: A comprehensive evaluation is conducted across tasks including Reading Comprehension, Closed-book QA, Paraphrase Detection, Natural Language Inference, and Sentiment Analysis, with numerical results indicating consistent performance gains in zero-shot settings.
  5. Training Data Diversity: The paper examines how the diversity of the retriever's training data affects universal prompt retrieval, finding that training on tasks with diverse question and answer formats yields better generalization.
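
To make the retrieval step concrete, here is a minimal sketch of UPRISE-style inference under simplifying assumptions: an off-the-shelf sentence encoder stands in for the paper's tuned retriever, and a tiny hard-coded prompt pool stands in for the pool built from the training tasks. The encoder name, pool contents, and helper functions are illustrative, not the authors' released implementation.

```python
# Hedged sketch of UPRISE-style inference: retrieve prompts for a zero-shot input
# and prepend them before querying a frozen LLM. The encoder below is a generic
# stand-in; UPRISE tunes its own retriever with feedback from frozen GPT-Neo-2.7B.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed available

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder for the tuned retriever

# Illustrative prompt pool; in the paper it is built from diverse training tasks.
prompt_pool = [
    "Question: Is the sky blue on a clear day? Answer: yes",
    "Premise: A man is cooking. Hypothesis: A person prepares food. Label: entailment",
    "Review: The film was a delight from start to finish. Sentiment: positive",
]
pool_emb = encoder.encode(prompt_pool, normalize_embeddings=True)

def retrieve_prompts(task_input: str, k: int = 2) -> list[str]:
    """Score the pool against the test input and return the top-k prompts."""
    query_emb = encoder.encode([task_input], normalize_embeddings=True)
    scores = pool_emb @ query_emb.T            # cosine similarity (normalized vectors)
    top_idx = np.argsort(-scores.ravel())[:k]
    return [prompt_pool[i] for i in top_idx]

def build_prompt(task_input: str, k: int = 2) -> str:
    """Prepend retrieved demonstrations to the zero-shot input."""
    demos = retrieve_prompts(task_input, k)
    # The concatenation can be sent to any frozen LLM (GPT-Neo-2.7B during tuning,
    # BLOOM-7.1B, OPT-66B, or GPT3-175B at evaluation time).
    return "\n\n".join(demos + [task_input])

print(build_prompt("Review: I could not stop yawning. Sentiment:"))
```

In the method described by the paper, the retriever is tuned so that prompts the frozen GPT-Neo-2.7B finds most helpful for a training input are ranked highest; at test time the retrieve-and-concatenate step stays the same, and only the downstream LLM changes.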

Implications and Future Research

The introduction of UPRISE has several implications for the field of AI and LLMs:

  • Scalability and Efficiency: A universal prompt retrieval approach reduces the need for repeated model-specific fine-tuning, leading to more scalable and resource-efficient solutions.
  • Adaptive AI Systems: UPRISE lays the groundwork for adaptive AI systems capable of handling a wide range of tasks without prior task-specific adjustments.
  • Research Directions: Future work could extend the universality of prompt retrieval to more complex AI systems involving multimodal information or external knowledge bases. Further, exploring prompt retrieval's impact on few-shot learning provides an exciting avenue for research.

Conclusion

This paper provides a thoughtful examination of universal prompt retrieval as a means to enhance zero-shot evaluation. By focusing on cross-task and cross-model generalization, UPRISE offers a promising direction for improving the flexibility and efficiency of LLMs in diverse applications.

Authors (10)
  1. Daixuan Cheng
  2. Shaohan Huang
  3. Junyu Bi
  4. Yuefeng Zhan
  5. Jianfeng Liu
  6. Yujing Wang
  7. Hao Sun
  8. Furu Wei
  9. Denvy Deng
  10. Qi Zhang