Deciding the set of prompts for multi-prompt evaluation

Determine a principled method for selecting the set of prompt templates to use when applying PromptEval to evaluate a large language model, ensuring that the estimated performance distribution across prompts is meaningful for robust comparison and reporting.

Background

The paper introduces PromptEval, an efficient method for estimating the distribution of an LLM's performance across many prompt templates. This enables robust evaluation using distributional summaries such as quantiles, rather than relying on a single prompt, a practice known to produce prompt-sensitive scores and inconsistent leaderboard rankings.
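To make the distributional view concrete, the minimal sketch below computes quantile summaries over per-prompt accuracy estimates. The array values and variable names are illustrative placeholders, not outputs of PromptEval itself; the assumption is only that some per-prompt performance estimates are available.

```python
import numpy as np

# Hypothetical per-prompt accuracy estimates for one model on one benchmark
# (values are made up for illustration).
prompt_accuracies = np.array([0.62, 0.71, 0.68, 0.55, 0.73, 0.66, 0.70, 0.59])

# Distributional summaries across prompt templates, instead of a single-prompt score.
quantiles = {q: float(np.quantile(prompt_accuracies, q))
             for q in (0.05, 0.25, 0.50, 0.95)}
spread = float(prompt_accuracies.max() - prompt_accuracies.min())

print(f"median accuracy across prompts: {quantiles[0.50]:.3f}")
print(f"5th-percentile (near worst-case) accuracy: {quantiles[0.05]:.3f}")
print(f"prompt sensitivity (max - min): {spread:.3f}")
```

Summaries like the median or a low quantile characterize how a model behaves over the whole prompt set, which is exactly why the choice of that set matters for the resulting comparison.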

While PromptEval provides the estimation machinery, the authors explicitly note that deciding which prompt templates to include in the evaluation remains an open question. This choice is crucial because the prompt set shapes the estimated performance distribution and, in turn, the comparisons drawn across models.

References

"However, several questions remain: how to decide on the set of prompts for evaluation and how to best utilize our distribution estimates for comparison in various contexts."

Polo et al., 2024, "Efficient multi-prompt evaluation of LLMs" (arXiv:2405.17202), Section 6, Conclusion.