Instances Need More Care: Rewriting Prompts for Instances with LLMs in the Loop Yields Better Zero-Shot Performance (2310.02107v4)
Abstract: LLMs have revolutionized zero-shot task performance, mitigating the need for task-specific annotations while improving task generalizability. Despite these advancements, current methods that rely on trigger phrases such as "Let's think step by step" remain limited. This study introduces PRomPTed, an approach that optimizes the zero-shot prompt for each individual task instance in an "LLMs in the loop" manner. Our comprehensive evaluation across 13 datasets and 10 task types based on GPT-4 reveals that PRomPTed significantly outperforms both naive zero-shot approaches and a strong baseline ("Output Refinement") that refines the task output instead of the input prompt. Our experiments also confirm that this advantage generalizes to the relatively weaker GPT-3.5. Even more intriguingly, we find that using GPT-3.5 to rewrite prompts for the stronger GPT-4 not only matches but occasionally exceeds the efficacy of using GPT-4 itself as the prompt rewriter. Our research thus offers value not only in enhancing zero-shot LLM performance but also in potentially enabling the supervision of LLMs by their weaker counterparts, a capability that has attracted much recent interest. Finally, additional experiments confirm that these advantages extend to open-source LLMs such as Mistral 7B and Mixtral 8x7B.
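The per-instance "LLMs in the loop" idea from the abstract can be sketched as a small loop in which a rewriter LLM revises the prompt for one task instance before a task LLM answers it. This is a minimal illustrative sketch only: the function names (`call_llm`, `prompted_zero_shot`) and the stubbed model behavior are assumptions, not the paper's actual implementation or API.

```python
# Hedged sketch of a per-instance prompt-rewriting cycle ("LLMs in the loop").
# `call_llm` is a stub standing in for a real LLM API call; in practice the
# rewriter could be a weaker model (e.g., GPT-3.5) and the task model a
# stronger one (e.g., GPT-4), as the paper explores.

def call_llm(model: str, prompt: str) -> str:
    """Stub LLM call: the rewriter clarifies the prompt, the task model answers it."""
    if model == "rewriter":
        return f"Rewritten for clarity: {prompt}"
    return f"Answer to: {prompt}"

def prompted_zero_shot(instance: str, rounds: int = 1) -> str:
    """Rewrite the zero-shot prompt for this one instance, then solve it.

    Note this rewrites the *input prompt*, unlike the "Output Refinement"
    baseline, which would instead feed the task model's answer back for revision.
    """
    prompt = instance
    for _ in range(rounds):
        # The rewriter LLM tailors the prompt to this specific instance...
        prompt = call_llm("rewriter", prompt)
    # ...and the task LLM answers the rewritten prompt.
    return call_llm("task", prompt)
```

A usage example: `prompted_zero_shot("What is the capital of France?")` runs one rewrite round and then queries the (stubbed) task model with the rewritten prompt.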
Authors: Saurabh Srivastava, Chengyue Huang, Weiguo Fan, Ziyu Yao