Large Language Models Are Human-Level Prompt Engineers (2211.01910v2)

Published 3 Nov 2022 in cs.LG, cs.AI, and cs.CL

Abstract: By conditioning on natural language instructions, LLMs have displayed impressive capabilities as general-purpose computers. However, task performance depends significantly on the quality of the prompt used to steer the model, and most effective prompts have been handcrafted by humans. Inspired by classical program synthesis and the human approach to prompt engineering, we propose Automatic Prompt Engineer (APE) for automatic instruction generation and selection. In our method, we treat the instruction as the "program," optimized by searching over a pool of instruction candidates proposed by an LLM in order to maximize a chosen score function. To evaluate the quality of the selected instruction, we evaluate the zero-shot performance of another LLM following the selected instruction. Experiments on 24 NLP tasks show that our automatically generated instructions outperform the prior LLM baseline by a large margin and achieve better or comparable performance to the instructions generated by human annotators on 19/24 tasks. We conduct extensive qualitative and quantitative analyses to explore the performance of APE. We show that APE-engineered prompts can be applied to steer models toward truthfulness and/or informativeness, as well as to improve few-shot learning performance by simply prepending them to standard in-context learning prompts. Please check out our webpage at https://sites.google.com/view/automatic-prompt-engineer.


LLMs are Human-Level Prompt Engineers: An Overview

This essay provides an overview of the research presented in "Large Language Models Are Human-Level Prompt Engineers" by Zhou et al. The paper introduces Automatic Prompt Engineer (APE), a method that uses LLMs to generate and select instructions automatically so as to maximize task performance across a range of natural language tasks.

Methodology

The authors frame prompt engineering as a natural language program synthesis problem, structured as a black-box optimization task. This approach involves two primary components: proposal generation and scoring.
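The black-box objective can be written compactly. In the sketch below, the symbols are notation assumed for illustration: $\rho$ is a candidate instruction, $(Q, A)$ is an input-output pair drawn from the task data, and $f(\rho, Q, A)$ is the per-example score obtained when the model is prompted with $\rho$ and $Q$.

```latex
\rho^{\star} = \arg\max_{\rho} f(\rho)
             = \arg\max_{\rho} \, \mathbb{E}_{(Q, A)} \big[ f(\rho, Q, A) \big]
```

Because the search space of instructions is effectively unbounded, the maximization is approximated by scoring a finite pool of LLM-proposed candidates rather than solved exactly.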

Proposal Generation: An LLM generates candidate instructions from input-output demonstrations. Two modes of generation are employed:
- Forward Mode Generation: The demonstrations are placed in a prompt that ends where the instruction should begin, and the LLM completes the prompt with a candidate instruction.
- Reverse Mode Generation: An LLM with infilling capabilities fills in a blank left for the instruction, so the instruction does not have to appear at the end of the prompt.
Scoring Function: Each candidate instruction is scored by how well a target model performs the task when following it. The scores considered include execution accuracy and the log probability of the desired outputs.
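As a minimal sketch of the two components above, the snippet below builds a forward-mode proposal prompt from demonstrations and scores a candidate instruction by execution accuracy. The template wording, the function names (`build_forward_prompt`, `execution_accuracy`), and the `toy_model` stand-in are assumptions for illustration, not the paper's implementation; a real setup would call an LLM where `toy_model` appears.

```python
from typing import Callable, List, Tuple

Demo = Tuple[str, str]  # (input, output) pair


def build_forward_prompt(demos: List[Demo]) -> str:
    """Build a forward-mode proposal prompt (template wording is an assumption)."""
    lines = [
        "I gave a friend an instruction. Based on the instruction they",
        "produced the following input-output pairs:",
        "",
    ]
    for x, y in demos:
        lines.append(f"Input: {x}")
        lines.append(f"Output: {y}")
        lines.append("")
    # The prompt ends where the instruction should start, so an LLM
    # completing this text proposes a candidate instruction.
    lines.append("The instruction was to")
    return "\n".join(lines)


def execution_accuracy(instruction: str, demos: List[Demo],
                       model: Callable[[str, str], str]) -> float:
    """Fraction of demos where the model's output matches the reference exactly."""
    if not demos:
        return 0.0
    hits = sum(model(instruction, x).strip() == y.strip() for x, y in demos)
    return hits / len(demos)


def toy_model(instruction: str, x: str) -> str:
    """Hypothetical stand-in for an LLM: obeys only an 'uppercase' instruction."""
    return x.upper() if "uppercase" in instruction.lower() else x


demos = [("cat", "CAT"), ("dog", "DOG")]
prompt = build_forward_prompt(demos)
good = execution_accuracy("Convert the input to uppercase.", demos, toy_model)
bad = execution_accuracy("Repeat the input.", demos, toy_model)
```

A log-probability score would replace the exact-match test with the model's log-likelihood of the reference output, which requires access to token probabilities and is therefore not shown here.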

APE refines the candidate pool with an iterative Monte Carlo search: each round keeps the highest-scoring instructions and uses an LLM to generate new candidates similar to them.
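The refinement loop can be sketched as follows. This is an illustrative outline under stated assumptions, not the authors' code: `iterative_search`, its parameters (`rounds`, `keep`, `per_parent`), and the toy `toy_score` and `toy_resample` functions are all hypothetical, and in practice both scoring and resampling would be LLM calls.

```python
import random
from typing import Callable, List


def iterative_search(initial: List[str],
                     score: Callable[[str], float],
                     resample: Callable[[str, random.Random], str],
                     rounds: int = 3,
                     keep: int = 2,
                     per_parent: int = 3,
                     seed: int = 0) -> str:
    """Keep the top-scoring candidates each round and resample variants of them."""
    rng = random.Random(seed)
    pool = list(initial)
    for _ in range(rounds):
        # Rank the current pool and keep only the best `keep` candidates.
        survivors = sorted(pool, key=score, reverse=True)[:keep]
        # Propose variants around each survivor. Survivors stay in the pool,
        # so the best score never decreases between rounds.
        variants = [resample(p, rng) for p in survivors for _ in range(per_parent)]
        pool = survivors + variants
    return max(pool, key=score)


# Toy stand-ins (assumptions): the score rewards useful keywords, and the
# "resampler" appends a random word instead of asking an LLM for a paraphrase.
KEYWORDS = ("uppercase", "every", "letter")
VOCAB = ["uppercase", "every", "letter", "please", "now"]


def toy_score(instruction: str) -> float:
    return sum(k in instruction for k in KEYWORDS) / len(KEYWORDS)


def toy_resample(instruction: str, rng: random.Random) -> str:
    return instruction + " " + rng.choice(VOCAB)


start = ["Rewrite the input", "Copy the input"]
best = iterative_search(start, toy_score, toy_resample)
```

Keeping the survivors in the pool is a design choice made here for simplicity; it guarantees that the returned instruction scores at least as well as the best initial candidate.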

Results

The research rigorously evaluates APE across several benchmarks, including the Instruction Induction benchmark and curated BIG-Bench tasks. Key findings from the evaluations are:

Instruction Induction Benchmark: APE-generated instructions outperform the prior LLM baseline by a large margin and are better than or comparable to human-written instructions on most of the 24 tasks (19/24 according to the abstract above).
BIG-Bench Tasks: APE matched or improved on the default human-written prompts on 17 out of 21 tasks.
TruthfulQA: APE showed significant improvements in generating prompts that guided LLMs towards higher truthfulness and informativeness in responses.
Zero-shot Chain-of-Thought: APE was also used to search for a better zero-shot chain-of-thought prompt, improving LLM performance on reasoning tasks such as arithmetic word problems.
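To make the zero-shot chain-of-thought setting concrete, the sketch below shows where a reasoning trigger sits in the prompt. The `Q:`/`A:` layout and the helper name `build_cot_prompt` are assumptions for illustration; "Let's think step by step." is the widely used trigger from earlier zero-shot chain-of-thought work, and the second candidate is a made-up alternative of the kind a search over triggers might propose.

```python
from typing import List


def build_cot_prompt(question: str,
                     trigger: str = "Let's think step by step.") -> str:
    """Format a zero-shot chain-of-thought prompt with a reasoning trigger."""
    return f"Q: {question.strip()}\nA: {trigger.strip()}"


# Candidate triggers to compare; the second one is hypothetical.
candidates: List[str] = [
    "Let's think step by step.",
    "Let's solve this carefully, one step at a time.",
]
question = "If 3 pens cost 6 dollars, what do 5 pens cost?"
prompts = [build_cot_prompt(question, t) for t in candidates]
```

Each prompt would then be sent to the model, and the trigger whose completions score best on held-out problems would be kept.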

Implications

The implications of this research are both practical and theoretical:

Practical Implications: APE reduces the human effort required to craft effective prompts, making LLMs easier to apply across diverse tasks. Better prompts may also lower the cost of deploying LLMs, although the search itself requires additional model calls.
Theoretical Implications: The paper treats LLMs as capable of natural language program synthesis, extending their use beyond text generation to proposing and refining prompts. Viewing LLMs as executors of natural language programs may influence future research directions in AI.

Future Directions

The research opens several avenues for future exploration:

Enhanced Scoring Functions: Fine-tuning and developing more sophisticated scoring metrics could yield even better alignment between generated prompts and task requirements.
Transferability of Prompts: Testing whether instructions found by APE for one model remain effective on other LLMs would clarify how model-specific the optimized prompts are.
Integration with In-Context Learning: Combining APE with few-shot in-context learning promises to boost performance by optimizing the interplay between instructions and contextual examples.
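The combination with in-context learning described in the abstract amounts to prepending the selected instruction to a standard few-shot prompt. A minimal sketch, assuming an `Input:`/`Output:` layout and a hypothetical helper name `build_few_shot_prompt`:

```python
from typing import List, Tuple


def build_few_shot_prompt(instruction: str,
                          demos: List[Tuple[str, str]],
                          query: str) -> str:
    """Prepend an instruction to standard in-context demonstrations."""
    parts = [instruction.strip(), ""]
    for x, y in demos:
        parts += [f"Input: {x}", f"Output: {y}", ""]
    # The final block leaves the output blank for the model to complete.
    parts += [f"Input: {query}", "Output:"]
    return "\n".join(parts)


prompt = build_few_shot_prompt(
    "Write the antonym of the given word.",  # hypothetical selected instruction
    [("hot", "cold"), ("up", "down")],
    "fast",
)
```

The instruction and the demonstrations are kept as separate arguments so that either can be varied independently when studying how they interact.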

Conclusion

The research presented by Zhou et al. marks a significant advancement in the automation of prompt engineering using LLMs. APE demonstrates the potential to surpass human-led prompt generation, significantly enhancing the usability and efficiency of LLMs. The implications of this work extend beyond current capabilities, suggesting a future where AI models can autonomously adapt and optimize their behaviors through intelligent prompt engineering.