Unleashing the Potential of Large Language Models as Prompt Optimizers: An Analogical Analysis with Gradient-based Model Optimizers (2402.17564v2)

Published 27 Feb 2024 in cs.CL

Abstract: Automatic prompt optimization is an important approach to improving the performance of LLMs. Recent research demonstrates the potential of using LLMs as prompt optimizers, which can generate improved task prompts via iterative refinement. In this paper, we propose a novel perspective to investigate the design of LLM-based prompt optimizers, by drawing an analogy with gradient-based model optimizers. To connect these two approaches, we identify two pivotal factors in model parameter learning: update direction and update method. Focused on the two aspects, we borrow the theoretical framework and learning methods from gradient-based optimization to design improved strategies for LLM-based prompt optimizers. By systematically analyzing a rich set of improvement strategies, we further develop a capable Gradient-inspired LLM-based Prompt Optimizer called GPO. At each step, it first retrieves relevant prompts from the optimization trajectory as the update direction. Then, it utilizes the generation-based refinement strategy to perform the update, while controlling the edit distance through a cosine-based decay strategy. Extensive experiments demonstrate the effectiveness and efficiency of GPO. In particular, GPO brings an additional improvement of up to 56.8% on Big-Bench Hard and 55.3% on MMLU compared to baseline methods.

The paper "Unleashing the Potential of LLMs as Prompt Optimizers: An Analogical Analysis with Gradient-based Model Optimizers" introduces a novel perspective on designing LLM-based prompt optimizers by drawing an analogy with gradient-based model optimizers. The paper identifies two pivotal factors in model parameter learning: update direction and update method and then borrows theoretical frameworks and learning methods from gradient-based optimization to design improved strategies for LLM-based prompt optimizers. The authors develop a Gradient-inspired LLM-based Prompt Optimizer called GPO and demonstrate its effectiveness and efficiency through experiments.

Here's a more detailed breakdown:

Introduction

The paper addresses the challenge of prompt engineering for LLMs, which is difficult because LLMs are highly sensitive to the wording of their prompts. Automatic prompt optimization has been proposed to improve the task performance of LLMs. Recent work formulates the optimization problem in natural language and uses LLMs themselves as prompt optimizers. The paper aims to investigate the design of the meta-prompts that guide such optimizers. The authors are inspired by the success of gradient-based optimizers in model optimization and connect the two approaches via an analogical analysis.

Analogical Analysis

The authors draw inspiration from gradient-based model optimizers to conduct a systematic analysis of LLM-based prompt optimizers. The key idea is to draw connections between model optimization and prompt optimization to improve existing LLM-based prompt optimizers.

Task Formulation: The paper defines the prompt optimization problem as finding the optimal task prompt $p^*$ that maximizes performance on a task dataset $\mathcal{D}$ using an LLM as the task model $\mathcal{M}_T$. This optimization is performed by an LLM-based prompt optimizer $\mathcal{M}_O$, which requires a meta-prompt to guide the optimization process. The problem is formulated as:

$p^* = \mathop{\arg\max}\limits_{p \sim \mathcal{M}_O} \ \mathbb{E}_{\langle x, y \rangle \in \mathcal{D}} \left[ F(\mathcal{M}_T(x; p), y) \right]$,

where:

  • $p$ is the prompt generated by the LLM-based prompt optimizer $\mathcal{M}_O$
  • $\mathcal{M}_T(x; p)$ represents the output from the task model for input $x$ conditioned on the prompt $p$
  • $F(\cdot)$ calculates the task performance based on some measurement.
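
To make the objective concrete, here is a minimal sketch of selecting the best prompt under this formulation. The helpers `task_model`, `metric`, and `dataset` are hypothetical stand-ins for $\mathcal{M}_T$, $F(\cdot)$, and $\mathcal{D}$, not the paper's implementation.

```python
def evaluate_prompt(prompt, dataset, task_model, metric):
    """Average F(M_T(x; p), y) over the <x, y> pairs in the dataset."""
    scores = [metric(task_model(x, prompt), y) for x, y in dataset]
    return sum(scores) / len(scores)


def select_best_prompt(candidate_prompts, dataset, task_model, metric):
    """Approximate the argmax over prompts sampled from the optimizer LLM M_O."""
    return max(
        candidate_prompts,
        key=lambda p: evaluate_prompt(p, dataset, task_model, metric),
    )
```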

Analogical Prompt Optimization Strategies: The paper identifies two key factors: update direction and update method.

  • Update Direction:
    • Analogical "Gradient" Forms: The paper considers two forms that implicitly serve a gradient-like function:
      • Prompt+performance: including the last-round task prompt and the corresponding model performance in the meta-prompt.
      • Prompt+performance+reflection: additionally leveraging the reflection capability of LLMs.
    • Analogical "Momentum" Forms: The paper enhances the basic form of the meta-prompt by leveraging the intermediate results accumulated during the prompt optimization process:
      • Summarization-based trajectory: summarizing the intermediate results from the optimization trajectory.
      • Retrieval-based trajectory: dynamically retrieving $k$ pieces of gradients from the optimization trajectory (see the retrieval sketch after this list), using one of three criteria:
        • Recency: selecting the $k$ nearest gradients
        • Relevance: selecting the $k$ most relevant gradients
        • Importance: selecting the $k$ most important gradients
  • Update Method:
    • Prompt Variation Control: The paper controls the variation degree of prompt optimization, measured by the edit distance between the task prompts at consecutive iterations:
      • Decay-based constraint: gradually reducing the maximum edit distance.
      • Warmup-based constraint: gradually increasing the maximum edit distance to its initially set value during the first 5% of steps.
    • Prompt Refinement Strategy: The paper introduces two methods to update the task prompt:
      • Editing-based refinement: directly editing the last-round task prompt to improve performance.
      • Generation-based refinement: leveraging the in-context learning capability of LLMs to generate refined task prompts.
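
As noted in the retrieval-based trajectory item above, here is a hedged sketch of the three retrieval criteria. The trajectory entry format (with `embedding` and `performance` fields), the use of embedding cosine similarity for relevance, and the performance-based scoring of "importance" are illustrative assumptions rather than the paper's exact implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_recency(trajectory, k):
    """Recency: the k most recent gradients from the optimization trajectory."""
    return trajectory[-k:]


def retrieve_relevance(trajectory, query_embedding, k):
    """Relevance: the k gradients whose embeddings are closest to the current prompt's."""
    ranked = sorted(
        trajectory,
        key=lambda entry: cosine_similarity(entry["embedding"], query_embedding),
        reverse=True,
    )
    return ranked[:k]


def retrieve_importance(trajectory, k):
    """Importance: the k gradients with the highest recorded performance (one plausible scoring)."""
    return sorted(trajectory, key=lambda e: e["performance"], reverse=True)[:k]
```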

Analogical Analysis Experiments: The paper conducts experiments to analyze the effectiveness of different strategies for update direction and update method. A dataset is selected from each type of task in Big-Bench Hard (BBH) to create a lite BBH benchmark for the analysis: i) Navigate (binary choice); ii) Movie Recommendation (multiple choice); iii) Object Counting (numeric response); iv) Word Sorting (free response). Llama-2-7b-chat is employed as the task model and gpt-3.5-turbo as the prompt optimizer.

GPO: Gradient-inspired LLM-based Prompt Optimizer

The authors present a novel gradient-inspired LLM-based prompt optimizer called GPO. GPO performs prompt optimization through a multi-step iterative process. At each step, the LLM first generates multiple candidate task prompts based on a meta-prompt and then the task prompt with the best performance is selected for the next iteration. The meta-prompt consists of two key components: update direction and update method. For the update direction, the approach leverages the retrieval-based optimization trajectory. For the update method, the approach employs the generation-based refinement strategy and also implements the cosine-based decay strategy to control the edit distance between task prompts at consecutive iterations.
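
A minimal sketch of one GPO-style optimization step along these lines is shown below. The helpers `optimizer_llm`, `build_meta_prompt`, `score_prompt`, and `edit_distance` are assumed to be provided, and the cosine-decay formula is a standard annealing schedule rather than necessarily the paper's exact one.

```python
import math


def cosine_decay(step, total_steps, max_distance, min_distance=0.0):
    """Cosine-based decay of the allowed edit distance over the optimization run."""
    scale = (1 + math.cos(math.pi * step / total_steps)) / 2
    return min_distance + (max_distance - min_distance) * scale


def gpo_step(current_prompt, trajectory, step, total_steps,
             optimizer_llm, build_meta_prompt, score_prompt, edit_distance,
             num_candidates=8, k=3, max_distance=50):
    # Update direction: retrieve k past "gradients" from the optimization trajectory
    # (recency retrieval shown here as a simple stand-in).
    retrieved = trajectory[-k:]

    # Update method: generation-based refinement under a decaying edit-distance budget.
    budget = cosine_decay(step, total_steps, max_distance)
    meta_prompt = build_meta_prompt(current_prompt, retrieved, budget)

    candidates = [optimizer_llm(meta_prompt) for _ in range(num_candidates)]
    candidates = [c for c in candidates if edit_distance(c, current_prompt) <= budget]
    candidates.append(current_prompt)  # keep the incumbent if no candidate survives

    # Select the best-performing candidate prompt for the next iteration.
    return max(candidates, key=score_prompt)
```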

Experiments

The paper sets up experiments to evaluate the performance of GPO across various tasks and evaluation settings.

Experimental Setup: The paper selects datasets from three groups of tasks: Big-Bench Hard (BBH) and GSM8K for complex reasoning tasks, MMLU for knowledge-intensive tasks, and WSC and WebNLG for common NLP tasks. Several representative methods are selected for comparison, including existing LLM-based prompt optimizers and one adapted from gradient-based model optimization: (1) SGDM (the adapted gradient-based baseline), (2) APE, (3) APO, (4) OPRO, (5) PE2. The evaluation metrics include the average accuracy over all subtasks for BBH and MMLU, accuracy for GSM8K, and ROUGE-L for WSC and WebNLG.
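
As a small illustration of the BBH/MMLU aggregate metric described here, the sketch below computes per-subtask accuracy and then averages across subtasks; the data layout is assumed for illustration.

```python
def subtask_accuracy(predictions, references):
    """Exact-match accuracy within a single subtask."""
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


def benchmark_accuracy(results_by_subtask):
    """Average accuracy over subtasks, e.g. {"navigate": (preds, refs), ...}."""
    accuracies = [subtask_accuracy(p, r) for p, r in results_by_subtask.values()]
    return sum(accuracies) / len(accuracies)
```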

Main Results: The results show that GPO achieves the best performance across all tasks. Under various evaluation settings for the lite BBH benchmark, GPO not only excels in the "Instruction" setting but also yields gains in the "Instruction + Demonstration" setting for both the base model and the instruction-tuned variant.

Detailed Analysis: The paper conducts a detailed analysis of GPO from the following aspects: the impact of model selection, the efficiency of optimization, the impact of initial prompts, and the generalizability of optimized prompts.

Related Work

The work is related to prompt engineering and optimization and LLM-based prompt optimizers.

Conclusion

The paper presents GPO, a novel gradient-inspired LLM-based prompt optimizer. It utilizes LLMs to automatically optimize prompts, drawing inspiration from gradient-based model optimization techniques. Through extensive experiments, GPO demonstrates remarkable capabilities for prompt optimization across diverse tasks, models, and evaluation settings and surpasses competitive baselines while consuming fewer tokens.

Authors (6)
  1. Xinyu Tang
  2. Xiaolei Wang
  3. Wayne Xin Zhao
  4. Siyuan Lu
  5. Yaliang Li
  6. Ji-Rong Wen