Using performance distribution estimates for model comparison across contexts
Develop comparison criteria and decision rules that use the performance distribution estimates across prompt templates produced by PromptEval to compare large language models in different application contexts, specifying how distributional features (e.g., quantiles) should be used for robust evaluation.
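As a minimal sketch of one such decision rule, assuming PromptEval yields a vector of per-template performance estimates for each model: prefer the model whose lower quantile (here the 5th percentile) is higher, trading mean performance for robustness to weak templates. The function name, quantile choice, and synthetic data below are illustrative assumptions, not from the paper.

```python
import numpy as np

def compare_by_quantile(perf_a, perf_b, q=0.05):
    """Compare two models by a lower quantile of their estimated
    per-prompt-template performance distributions.

    perf_a, perf_b: arrays of estimated performance (e.g., accuracy),
    one entry per prompt template, as a PromptEval-style estimate
    might provide for each model (hypothetical interface).
    q: quantile of interest; a low q (e.g., 0.05) emphasizes
    robustness to poorly performing templates.
    """
    qa = np.quantile(perf_a, q)
    qb = np.quantile(perf_b, q)
    winner = "A" if qa >= qb else "B"
    return winner, qa, qb

# Synthetic per-template performance estimates for two models.
rng = np.random.default_rng(0)
est_a = rng.beta(8, 2, size=100)   # higher mean, wider spread
est_b = rng.beta(12, 4, size=100)  # slightly lower mean, tighter

winner, qa, qb = compare_by_quantile(est_a, est_b, q=0.05)
print(f"5th-percentile performance: A={qa:.3f}, B={qb:.3f} -> prefer {winner}")
```

Other contexts would motivate other distributional features: the median for typical-case comparison, or an upper quantile when users are expected to tune prompts toward each model's best behavior.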
References
However, several questions remain: how to decide on the set of prompts for evaluation and how to best utilize our distribution estimates for comparison in various contexts.
— "Efficient multi-prompt evaluation of LLMs" (Polo et al., 2024, arXiv:2405.17202), Section 6, Conclusion