Deciding the set of prompts for multi-prompt evaluation
Determine a principled method for selecting the set of prompt templates to use when applying PromptEval to evaluate a large language model, ensuring that the estimated performance distribution across prompts is meaningful for robust comparison and reporting.
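To make the object of study concrete, here is a minimal sketch of what "estimated performance distribution across prompts" means in practice: score a model under several prompt templates and summarize the spread of per-template accuracies with quantiles. The model, templates, and scoring function are hypothetical stand-ins, not PromptEval's actual API; the open question above is precisely how the set of templates should be chosen so that this distribution is meaningful.

```python
import random
import statistics

def score(template: str, example: str) -> float:
    """Stand-in for running the model on a templated example and scoring
    the output (1.0 = correct, 0.0 = incorrect). Here: a seeded coin flip."""
    rng = random.Random(hash((template, example)))
    return float(rng.random() < 0.7)

def performance_distribution(templates, examples):
    """Per-template mean accuracy, sorted: the distribution whose shape
    (median, tails) PromptEval-style evaluation aims to estimate."""
    return sorted(
        statistics.mean(score(t, x) for x in examples) for t in templates
    )

def quantile(sorted_vals, q):
    """Nearest-rank quantile of the per-prompt accuracies."""
    idx = min(len(sorted_vals) - 1, int(q * len(sorted_vals)))
    return sorted_vals[idx]

templates = [f"template-{i}" for i in range(20)]  # hypothetical prompt set
examples = [f"example-{j}" for j in range(50)]    # hypothetical eval items
dist = performance_distribution(templates, examples)
print(f"median={quantile(dist, 0.5):.2f}  5th pct={quantile(dist, 0.05):.2f}")
```

Reporting quantiles (e.g. median and 5th percentile) rather than a single accuracy is what makes the prompt-set choice matter: a narrow or biased set of templates yields a distribution whose tails say little about robustness.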
Sponsor
References
However, several questions remain: how to decide on the set of prompts for evaluation and how to best utilize our distribution estimates for comparison in various contexts.
— Efficient multi-prompt evaluation of LLMs
(arXiv:2405.17202, Polo et al., 2024), Section 6 (Conclusion)