
SPELL: Semantic Prompt Evolution based on a LLM (2310.01260v1)

Published 2 Oct 2023 in cs.CL and cs.AI

Abstract: Prompt engineering is a new paradigm for enhancing the performance of trained neural network models. To optimize text-style prompts, existing methods usually operate on small portions of the text step by step, which either breaks fluency or cannot adjust the prompt globally. Since LLMs have a powerful ability to generate coherent text token by token, can we utilize LLMs to improve prompts? Based on this motivation, in this paper, considering a trained LLM as a text generator, we attempt to design a black-box evolution algorithm for automatically optimizing texts, namely SPELL (Semantic Prompt Evolution based on a LLM). The proposed method is evaluated with different LLMs and evolution parameters on different text tasks. Experimental results show that SPELL can indeed rapidly improve prompts. We further explore the evolution process and discuss the limitations, potential possibilities, and future work.

This paper introduces SPELL (Semantic Prompt Evolution based on a LLM), a black-box algorithm for automatically optimizing text-based prompts for natural language processing tasks using LLMs. The core idea is to leverage the text generation capabilities of an LLM within an evolutionary framework to create better-performing and coherent prompts for a fixed target model.

Methodology: SPELL Framework

SPELL operates as an evolutionary algorithm maintaining a population of prompt candidates. It iteratively improves this population through two main steps: reproduction and selection.

  1. Reproduction:
    • This step generates new prompt candidates (offspring) based on existing prompts (parents) from the current population.
    • Parents are selected from the population (using the selection mechanism described below).
    • A "meta-prompt" (MPrMP_r) is constructed to guide an LLM (referred to as GG) in generating a new, improved prompt. This meta-prompt includes:
      • Task description ([i_{tsk}]).
      • Definition of a prompt and its goal ([i_{prompt}]).
      • Instructions for reproduction, suggesting modification strategies like word replacement, voice conversion, adding/deleting words ([i_{rep}]).
      • Examples of parent prompts ([pro_i]) along with their fitness scores ([fit_i]).
      • Final generation request ([i_{final}]).
      • Additional instructions, including a specific format for extracting the generated prompt (e.g., enclosed in curly braces {}) ([i_{additional}]).
    • The LLM (G) processes the meta-prompt MP_r and generates a response.
    • An extraction function (EXT) isolates the newly generated prompt from the LLM's response based on the specified format (e.g., text within {}).
    • This process relies on the LLM's ability to understand the context (task, parent prompts, scores) and generate a semantically coherent and potentially better prompt in a global, prompt-wise manner.
  2. Selection:
    • This step chooses which individuals (prompts) survive to form the next generation's population and which are selected as parents for reproduction.
    • It uses a roulette wheel selection mechanism based on fitness scores.
    • The fitness of a prompt is typically its performance (e.g., accuracy) on an evaluation set (D_train in the paper's few-shot setting) when used with the target model M.
    • The selection probability P_{S,i} for an individual prompt I_i is calculated as P_{S,i} = exp(fitness(I_i)) / Σ_{j ∈ CAN} exp(fitness(I_j)), where CAN is the set of candidate prompts (current population + offspring).
    • Individuals are sampled based on these probabilities. This gives fitter individuals a higher chance of selection while still permitting less fit individuals a small chance, promoting diversity. A minimal code sketch of the reproduce-and-select loop follows this list.
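
To make the framework concrete, here is a minimal Python sketch of one SPELL round. The meta-prompt wording, the two-parent choice, and the `llm_generate` and `evaluate` callables are illustrative assumptions rather than the paper's exact implementation; what follows the paper is the structure: assemble MP_r from its parts, extract the offspring from the curly-brace format, and select with exp-fitness roulette probabilities.

```python
import math
import random
import re

def build_meta_prompt(task_desc, scored_parents):
    """Assemble the reproduction meta-prompt MP_r: task description,
    prompt definition/goal, reproduction hints, scored parent prompts,
    final request, and the output-format instruction. The wording here
    is illustrative, not the paper's exact template."""
    parent_block = "\n".join(
        f"Prompt: {p}\nScore: {fit:.4f}" for p, fit in scored_parents
    )
    return (
        f"Task: {task_desc}\n"
        "A prompt is a short text prepended to the input to help a "
        "target model solve the task; your goal is to write a better one.\n"
        "You may replace words, convert the voice, or add/delete words.\n"
        f"Existing prompts and their scores:\n{parent_block}\n"
        "Please write one new, improved prompt.\n"
        "Enclose the new prompt in curly braces."
    )

def extract_prompt(response):
    """EXT: pull the generated prompt out of the LLM response using
    the curly-brace format requested in the meta-prompt."""
    match = re.search(r"\{(.+?)\}", response, flags=re.DOTALL)
    return match.group(1).strip() if match else None

def roulette_select(candidates, fitness, k):
    """Roulette-wheel sampling with P_{S,i} proportional to
    exp(fitness(I_i)), as in the selection formula above.
    Sampling is with replacement; a real run might deduplicate."""
    weights = [math.exp(fitness[c]) for c in candidates]
    return random.choices(candidates, weights=weights, k=k)

def spell_round(population, fitness, task_desc, llm_generate, evaluate):
    """One reproduce-then-select iteration. `llm_generate` (the LLM G)
    and `evaluate` (fitness of a prompt on D_train with the target
    model M) are assumed interfaces."""
    parents = roulette_select(population, fitness, k=2)  # parent count is a free parameter
    response = llm_generate(build_meta_prompt(
        task_desc, [(p, fitness[p]) for p in parents]))
    child = extract_prompt(response)
    candidates = list(dict.fromkeys(population))  # keep order, drop duplicates
    if child is not None:
        if child not in fitness:
            fitness[child] = evaluate(child)
        if child not in candidates:
            candidates.append(child)
    return roulette_select(candidates, fitness, k=len(population))
```

One design note: with fitness on a 0-1 accuracy scale, the exponential weighting only modestly favors a prompt that is one accuracy point ahead (e.g., exp(0.8) vs. exp(0.7) yields roughly a 52/48 split between two candidates), which matches the paper's point that weaker individuals retain a small survival chance and diversity is preserved.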

Experiments

  • Setup: Experiments were conducted on SST-2, RTE, and AG's News tasks in a 16-shot setting. The target model was RoBERTa-large, and the default LLM for prompt generation was Llama-2-Chat-7B. The population size was 20, evolved over 500 rounds.
  • Results:
    • SPELL improved prompt performance over zero-shot baselines but showed lower test accuracy than several state-of-the-art white-box (e.g., fine-tuning, LM-BFF, DART) and black-box (e.g., BBT, RLPrompt, TEMPERA) methods on the reported tasks.
    • The optimization process converged but exhibited significant fluctuations (instability) across different runs.
    • The generated prompts were observed to be coherent and semantically meaningful, unlike some character-level optimization methods. For example, "Classify the following sentence." evolved into "Can you determine the sentiment of this sentence for me?".
    • The choice of LLM for generation significantly impacted performance. ERNIE-Bot and Llama-2-Chat-13B performed better than Llama-2-Chat-7B, while BLOOMZ failed to follow instructions.
  • Ablation Studies:
    • Using accuracy as the fitness metric worked better than using cross-entropy loss (see the fitness sketch after this list).
    • Population size affects the exploration/exploitation balance.
    • Larger training sets (larger k) generally led to better performance, likely due to more reliable fitness estimation.
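
For the fitness ablation above, here is a minimal sketch of the accuracy-based fitness (the variant that worked better than cross-entropy loss). The `predict` callable is an assumed wrapper around the target model M, e.g., RoBERTa-large with a verbalizer mapping model output to labels; it is not part of the paper's code.

```python
def accuracy_fitness(prompt, examples, predict):
    """Fitness = accuracy on the k-shot set D_train when `prompt` is
    applied. `predict(prompt, text)`, returning a predicted label, is
    an assumed interface for the target model M."""
    correct = sum(1 for text, label in examples
                  if predict(prompt, text) == label)
    return correct / len(examples)
```

A larger k makes this estimate less noisy, which is consistent with the observation that bigger training sets improve results.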

Conclusion and Discussion

SPELL demonstrates a novel approach to prompt optimization by integrating an LLM's generative power into an evolutionary algorithm. It enables global, semantic-level prompt modifications, resulting in coherent prompts and relatively rapid optimization compared to some methods requiring thousands of rounds. However, the method shows instability, and its effectiveness is highly dependent on the capability of the LLM used for generation. The paper suggests future work could focus on adapting LLMs specifically for this task, stabilizing the optimization process, and extending the approach beyond text. The appendix notes that contemporaneous works like OPRO and EVOPROMPT achieved better results, potentially because they targeted LLMs as the end task model, which might be more amenable to prompts generated by other LLMs compared to smaller models like RoBERTa.

Authors (2)
  1. Yujian Betterest Li (5 papers)
  2. Kai Wu (134 papers)
Citations (8)