Using performance distribution estimates for model comparison across contexts

Develop comparison criteria and decision rules that use PromptEval's estimates of the performance distribution across prompt templates to compare large language models in different application contexts, specifying how distributional features (e.g., quantiles) should be employed for robust evaluation; see the sketch below.
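As one possible illustration (an assumption, not a method from the paper), a context-dependent decision rule could compare models at a quantile chosen by the application: a risk-averse deployment compares low quantiles, a typical-use deployment compares medians. A minimal sketch, assuming each model has been reduced to a vector of estimated per-template performances:

```python
import numpy as np

# Hypothetical mapping from application context to the quantile used for comparison.
CONTEXT_QUANTILE = {
    "risk_averse": 0.05,   # worst-case-leaning: compare 5th percentiles
    "typical_use": 0.50,   # compare medians
    "best_case":   0.95,   # compare 95th percentiles
}

def compare_models(perf_a, perf_b, context="risk_averse", margin=0.01):
    """Return 'A', 'B', or 'tie' by comparing the context-appropriate quantile
    of each model's estimated per-template performance distribution."""
    q = CONTEXT_QUANTILE[context]
    qa, qb = np.quantile(perf_a, q), np.quantile(perf_b, q)
    if qa - qb > margin:
        return "A"
    if qb - qa > margin:
        return "B"
    return "tie"

# Placeholder estimates for two models over 100 prompt templates.
rng = np.random.default_rng(0)
est_a, est_b = rng.beta(8, 2, size=100), rng.beta(7, 2, size=100)
print(compare_models(est_a, est_b, context="risk_averse"))
```

The margin parameter is also an assumption: it keeps the rule from flipping between models whose quantile estimates differ by less than the estimation noise one is willing to tolerate.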

Background

PromptEval estimates the full distribution of performance across prompt templates and supports robust statistics such as quantiles (e.g., the median, 5th, and 95th percentiles). The authors demonstrate the empirical accuracy of these estimates on several benchmarks and provide consistency guarantees.
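To make this concrete, here is a minimal sketch (not from the paper) of reading such robust summary statistics off an estimated distribution; the plain array of per-template performance estimates is an assumed stand-in for PromptEval's actual output format.

```python
import numpy as np

def distribution_summary(per_template_perf, quantiles=(0.05, 0.5, 0.95)):
    """Summarize an estimated performance distribution over prompt templates
    with quantiles (5th percentile, median, 95th percentile by default),
    which are more robust to template sensitivity than the mean alone."""
    perf = np.asarray(per_template_perf, dtype=float)
    return {f"q{int(round(100 * q))}": float(np.quantile(perf, q)) for q in quantiles}

# Placeholder data: estimated accuracy of one model under 200 prompt templates.
estimates = np.random.default_rng(1).beta(9, 3, size=200)
print(distribution_summary(estimates))  # e.g., {'q5': ..., 'q50': ..., 'q95': ...}
```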

Despite having access to these distributional estimates, the authors explicitly state that it remains unresolved how best to use them for model comparison in varying contexts, beyond the illustrative use of quantiles presented in their paper.

References

"However, several questions remain: how to decide on the set of prompts for evaluation and how to best utilize our distribution estimates for comparison in various contexts."

Polo et al. (2024), "Efficient multi-prompt evaluation of LLMs" (arXiv:2405.17202), Section 6 (Conclusion).