This paper (Opsahl-Ong et al., 2024) addresses the critical challenge of optimizing LLM Programs (LMPs), which are multi-stage pipelines of LLM calls used to solve complex tasks. Currently, building effective LMPs often relies on extensive manual prompt engineering, a trial-and-error process that becomes increasingly difficult as the number of modules grows. The research focuses on developing automated methods to optimize the prompts – specifically the free-form instructions and few-shot demonstrations – for all modules in an LMP simultaneously, without requiring module-level labels, gradients, or log probabilities.
The core problem is framed as finding a configuration of instructions and demonstrations across all modules that maximizes a downstream performance metric on a given training dataset. This is challenging because the search space is vast, performance signals are only available at the final output stage (credit assignment across modules is difficult), and practical applications often have limited data and computational budgets for evaluating the full LMP.
The paper proposes a general optimization framework (Algorithm 1) where an optimizer iteratively:
- Initializes its state based on the training data and initial program prompts.
- Proposes a new set of prompt configurations (instructions and/or demonstrations) for one or more modules.
- Evaluates the proposed configuration by running the full LMP on the training data (or a mini-batch of it) and measuring the downstream metric.
- Updates its internal state based on the evaluation score.
- Extracts the best-performing configuration found during the process.
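A minimal sketch of this loop in Python might look like the following. The helper callables (run_program, metric, propose) and the mini-batch scoring are hypothetical stand-ins for the concrete strategies discussed next; this illustrates the framework, not the paper's exact implementation.

```python
import random
from typing import Any, Callable, Sequence

def optimize_program(
    run_program: Callable[[dict, Any], Any],  # runs the LMP under a given prompt configuration
    metric: Callable[[Any, Any], float],      # scores the final output against a training example
    propose: Callable[[list], dict],          # proposes a new prompt configuration from the history
    trainset: Sequence[Any],
    num_trials: int = 30,
    minibatch_size: int = 25,
) -> dict:
    """Sketch of the general loop: propose a configuration, evaluate it on a
    mini-batch with the downstream metric, record the score, and return the
    best configuration seen. All arguments are hypothetical stand-ins."""
    history: list = []                         # optimizer state: (configuration, score) pairs
    best_cfg, best_score = None, float("-inf")

    for _ in range(num_trials):
        cfg = propose(history)                 # new instructions/demos for one or more modules
        batch = random.sample(list(trainset), min(minibatch_size, len(trainset)))
        score = sum(metric(run_program(cfg, ex), ex) for ex in batch) / len(batch)
        history.append((cfg, score))           # update the optimizer's state
        if score > best_score:
            best_cfg, best_score = cfg, score

    return best_cfg                            # in practice, top configurations are re-checked on more data
```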
To make this process tractable, the authors identify two key challenges and propose strategies:
1. The Proposal Problem: How to efficiently generate high-quality candidate prompts?
- Bootstrapping Demonstrations: Rejection sampling successful input/output traces from the LMP running on training data, i.e., keeping only runs whose final output scores well under the metric and using their intermediate module inputs/outputs as candidate few-shot examples for each module. This leverages the LMP's own behavior to create data (see the proposal sketch after this list).
- Grounding: Providing context to an LM (used as a "proposer" or "meta-optimizer") when asking it to generate instructions. This context can include summaries of the dataset's characteristics, the LMP's control flow, examples of successful bootstrapped traces, and a history of previously evaluated instructions and their scores. This helps the proposer LM generate instructions relevant to the specific task and program dynamics.
- Learning To Propose: Meta-optimizing the hyperparameters of the instruction and demonstration proposal process (e.g., the temperature of the proposer LM, whether to include specific grounding elements) using a Bayesian model. This adapts the proposal strategy to the specific task and program.
2. The Credit Assignment Problem: How to determine which module's prompt changes contributed to the overall performance change?
- Greedy: Optimizing modules one at a time (found to be inefficient in preliminary experiments).
- Surrogate: Using a Bayesian optimization model (such as the Tree-structured Parzen Estimator, TPE) to model the relationship between prompt configurations across modules and the overall performance metric. This model learns to predict the quality of combinations and guides the search towards promising regions; it is a core component of the proposed MIPRO optimizer. Evaluating on mini-batches of data further improves efficiency by reducing the cost of each evaluation step.
- History-Based: Providing a history of prompt configurations and their overall performance scores to the proposer LM, relying on the LM to infer which changes were beneficial (as in OPRO).
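To make the proposal strategies concrete, here is a minimal sketch of bootstrapping demonstrations and assembling a grounded meta-prompt for the proposer LM. The trace format, helper names (run_program_with_trace), and prompt wording are assumptions for illustration rather than the paper's exact implementation.

```python
def bootstrap_demos(run_program_with_trace, metric, trainset, max_demos=16, threshold=1.0):
    """Rejection-sample successful traces: run the (unoptimized) LMP on training
    inputs and keep per-module input/output pairs only when the final output
    scores at or above `threshold`. `run_program_with_trace` is an assumed helper
    returning (final_output, trace), where trace maps module name -> (inputs, outputs)."""
    demos = {}                                    # module name -> list of candidate few-shot examples
    for example in trainset:
        output, trace = run_program_with_trace(example)
        if metric(output, example) >= threshold:  # keep only successful runs
            for module, io_pair in trace.items():
                demos.setdefault(module, [])
                if len(demos[module]) < max_demos:
                    demos[module].append(io_pair)
    return demos

def grounded_instruction_prompt(dataset_summary, program_summary, module_demos, history):
    """Assemble a grounded meta-prompt for the proposer LM: a dataset summary, a
    program summary, a few bootstrapped examples for this module, and previously
    proposed instructions with their scores. The wording is illustrative only."""
    tried = "\n".join(f"- score {score:.2f}: {instr}" for instr, score in history)
    examples = "\n".join(f"- {inputs} -> {outputs}" for inputs, outputs in module_demos[:3])
    return (
        f"Dataset summary:\n{dataset_summary}\n\n"
        f"Program summary:\n{program_summary}\n\n"
        f"Examples of successful module behavior:\n{examples}\n\n"
        f"Previously proposed instructions and their scores:\n{tried}\n\n"
        "Propose a new instruction for this module that is likely to score higher."
    )
```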
Based on these strategies, the paper introduces and evaluates several optimizers:
- Bootstrap Random Search: A baseline that optimizes only few-shot demonstrations by randomly sampling from a pool of bootstrapped examples. Simple and surprisingly effective.
- Module-Level OPRO: An extension of OPRO where an LM proposes instructions for each module, using module-specific histories and assuming equal credit assignment across modules.
- MIPRO (Multi-prompt Instruction Proposal Optimizer): The main proposed method. It jointly optimizes instructions and demonstrations across multiple modules, using a Bayesian surrogate model (TPE) for credit assignment and mini-batch evaluation for efficiency (a minimal TPE-based sketch follows this list).
- MIPRO variants:
- 0-Shot MIPRO: Optimizes only instructions using the MIPRO framework.
- Bayesian Bootstrap: Optimizes only demonstrations using the MIPRO framework.
- MIPRO++: Uses a Bayesian model to meta-optimize the hyperparameters of the instruction and demonstration proposal strategies themselves (Learning To Propose).
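MIPRO's surrogate-based credit assignment can be illustrated with an off-the-shelf TPE implementation. The sketch below uses Optuna (whose TPESampler implements the Tree-structured Parzen Estimator) to jointly pick one instruction and one demonstration set per module and to score each combination on a mini-batch. The module names, candidate pools, and the placeholder evaluation function are assumptions; the paper's actual MIPRO implementation is part of DSPy rather than built on Optuna.

```python
import random
import optuna

# Hypothetical candidate pools produced by the proposal step: instructions from the
# grounded proposer LM, demonstration sets (indexed here) from bootstrapping.
candidates = {
    "generate_query":  {"instructions": ["Instruction A", "Instruction B"], "demo_sets": [0, 1, 2]},
    "generate_answer": {"instructions": ["Instruction C", "Instruction D"], "demo_sets": [0, 1, 2]},
}

def evaluate_on_minibatch(config):
    """Assumed helper: install `config` into the LM program, run it on a random
    mini-batch of training examples, and return the mean metric score.
    A random placeholder is used here so the sketch runs end to end."""
    return random.random()

def objective(trial):
    # The TPE surrogate models which *combination* of per-module choices tends to
    # yield a high downstream score -- this is how credit assignment is handled.
    config = {
        module: {
            "instruction": trial.suggest_categorical(f"{module}_instr", pool["instructions"]),
            "demo_set":    trial.suggest_categorical(f"{module}_demos", pool["demo_sets"]),
        }
        for module, pool in candidates.items()
    }
    return evaluate_on_minibatch(config)

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=50)
print(study.best_params)  # highest-scoring joint configuration found
```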
For practical evaluation, the authors introduce a benchmark of six diverse tasks, including multi-hop question answering (HotPotQA), multi-hop claim verification (HoVer), classification (Iris, Heart Disease), and natural language inference (ScoNe). These tasks involve LMPs with 1 to 4 modules and up to 4 LM calls, designed to stress different aspects of optimization (multi-stage, conditional rules, complex reasoning). The benchmark uses Llama3-8B as the task LM and GPT-3.5/GPT-4o as the proposer LM.
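For a sense of what such an LMP looks like, here is a sketch of a two-module multi-hop QA program in the style of DSPy, the library these optimizers target. The signatures, module names, and hop count are illustrative, exact APIs vary across DSPy versions, and running it requires configuring a task LM and retriever via dspy.settings.

```python
import dspy

class MultiHopQA(dspy.Module):
    """A two-module program in the spirit of the multi-hop benchmark tasks: one LM
    call writes a search query, a retriever fetches passages, and a second LM call
    produces the answer. Each module's instruction and demonstrations are exactly
    what the prompt optimizers tune."""

    def __init__(self, num_hops=2):
        super().__init__()
        self.num_hops = num_hops
        self.generate_query = dspy.ChainOfThought("context, question -> search_query")
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")
        self.retrieve = dspy.Retrieve(k=3)

    def forward(self, question):
        context = []
        for _ in range(self.num_hops):
            query = self.generate_query(context=context, question=question).search_query
            context += self.retrieve(query).passages
        return self.generate_answer(context=context, question=question)
```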
Key Practical Lessons from Experiments:
- Bootstrapped demonstrations are highly impactful: Optimizing few-shot examples generated from successful program runs is often more effective than optimizing instructions alone, suggesting that concrete examples provide crucial guidance to the LM task model.
- Joint optimization is generally best: MIPRO, which optimizes both instructions and demonstrations together, tends to outperform methods optimizing only one or the other.
- Instructions matter for complex rules: Instruction optimization is particularly valuable for tasks with intricate conditional logic (like HotPotQA Conditional) that are hard to learn solely from few-shot examples. Starting with a seed instruction outlining these rules is helpful as current optimizers struggle to infer complex rules from scratch.
- Grounding helps, but tailor it: Providing context (dataset summaries, program summaries, etc.) to the instruction proposer LM generally improves performance, but the optimal set of grounding information varies by task. MIPRO++'s ability to learn which grounding elements are important for a specific task is beneficial.
- Optimization is a complex space: The relative performance of different optimizers (e.g., OPRO variants vs. MIPRO variants) can be mixed, suggesting that factors like optimization budget and task characteristics influence which method is most effective.
The paper demonstrates that MIPRO and related strategies significantly improve the performance of Llama3-8B on these diverse LMP tasks compared to unoptimized baselines and other methods like OPRO extensions, achieving improvements of up to 13% accuracy. The optimizers and benchmark are planned for release in the DSPy library, enabling practitioners to apply these techniques to their own LMPs.
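Applying the released optimizer to a program like the MultiHopQA sketch above would look roughly as follows. MIPROv2 is the name under which MIPRO ships in DSPy, but argument names (e.g. auto, the metric signature) and LM client setup differ across DSPy versions, and the model names here are placeholders, so treat this as an assumed usage sketch rather than exact API.

```python
import dspy
from dspy.teleprompt import MIPROv2

# Assumed setup: configure a task LM (model name is a placeholder).
dspy.settings.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# A downstream metric over final outputs only -- no module-level labels required.
def exact_match(example, prediction, trace=None):
    return example.answer.strip().lower() == prediction.answer.strip().lower()

trainset = [
    dspy.Example(question="...", answer="...").with_inputs("question"),
    # ... more labeled examples
]

optimizer = MIPROv2(metric=exact_match, auto="light")  # preset optimization budget
optimized_program = optimizer.compile(MultiHopQA(), trainset=trainset)
optimized_program.save("multihop_optimized.json")
```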
Implementation Considerations:
- Computational Cost: Running LMPs for evaluation is the main cost. Mini-batch evaluation in MIPRO helps reduce this, allowing more trials within a fixed budget of total LM calls (a worked budget example follows this list). The number of trials required depends on the task complexity and desired performance gain.
- LM Selection: The choice of both the task LM (within the LMP) and the proposer LM (used by the optimizer) is crucial. A capable task LM is needed for the program to function, and a capable proposer LM is needed to generate useful instructions and demonstrations, especially for complex tasks or when grounding is used.
- Data Requirements: While intermediate labels are not needed, a training dataset of inputs (and final outputs/metadata for the metric) is required to evaluate the LMP performance. The quality and size of this dataset influence the optimizer's effectiveness, particularly for bootstrapping demonstrations.
- Initialization: The quality of initial seed instructions or the pool of bootstrapped demonstrations can impact the optimization trajectory. The paper notes that current methods have limited ability to infer complex task rules without a good seed instruction.
- Hyperparameter Tuning: Optimizers have their own hyperparameters (e.g., number of demonstrations, mini-batch size, proposer LM temperature). MIPRO++ offers a way to automate tuning some of these, but manual tuning might still be needed.
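As a purely illustrative budget calculation (the numbers are hypothetical, not from the paper): with a training set of 200 examples and a budget of roughly 4,000 full-program executions, scoring every candidate on the full set permits only 20 trials, whereas mini-batches of 25 examples permit about 160 trials, leaving spare budget to periodically re-score the most promising configurations on the full set.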
In summary, this research provides a practical framework and concrete algorithms, notably MIPRO, for automatically optimizing the prompts of multi-stage LLM programs. By addressing the proposal and credit assignment challenges through strategies like bootstrapping, grounding, surrogate modeling, and meta-optimization, the work moves beyond manual prompt engineering and enables more robust and efficient development of complex AI systems built with LMs. The accompanying benchmark facilitates future research and development in this area.