Quantify confidence for benchmark-based inference to populations
Determine the confidence level and confidence interval that apply when statistics computed from a specific benchmark, treated as a sample of evaluation conditions, are used to infer parameters of the entire population of evaluation conditions in real-world applications. With these quantities established, benchmark-derived estimates for real-world evaluation systems can be rigorously validated.
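To make the quantities in question concrete, the following is a minimal sketch of a percentile-bootstrap confidence interval for a benchmark's mean score. It rests on a strong assumption that the source does not establish: that the benchmark items behave like an independent, identically distributed sample from the population of evaluation conditions. The function name `bootstrap_mean_ci`, its parameters, and the scores are hypothetical and chosen only for illustration. The sketch shows what such an interval would look like if the assumption held; it does not resolve the open problem.

```python
import random
import statistics


def bootstrap_mean_ci(scores, confidence=0.95, n_resamples=2000, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `scores`.

    ASSUMPTION (not established by the source): the benchmark items are an
    i.i.d. sample from the population of evaluation conditions.
    Returns (point_estimate, lower_bound, upper_bound).
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(scores)
    # Resample the benchmark items with replacement and record each mean.
    resampled_means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    alpha = 1.0 - confidence
    # Take the alpha/2 and 1 - alpha/2 empirical quantiles of the means.
    lower_index = int((alpha / 2) * (n_resamples - 1))
    upper_index = int((1 - alpha / 2) * (n_resamples - 1))
    return (
        statistics.fmean(scores),
        resampled_means[lower_index],
        resampled_means[upper_index],
    )


# Hypothetical per-condition scores from one benchmark (illustrative only).
benchmark_scores = [0.71, 0.64, 0.80, 0.58, 0.69, 0.75, 0.62, 0.77, 0.66, 0.73]
point, lower, upper = bootstrap_mean_ci(benchmark_scores)
print(f"mean={point:.3f}, 95% CI=({lower:.3f}, {upper:.3f})")
```

If the benchmark was not drawn at random from the population of evaluation conditions, the interval describes only resampling variability within the benchmark and says little about real-world coverage, which is the gap this problem identifies.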
References
Thirdly, in real-world applications, we use the statistic of a sample (a specific benchmark) to infer the parameters of the entire population. However, we do not know their confidence levels and intervals.
— Evaluatology: The Science and Engineering of Evaluation
(2404.00021 - Zhan et al., 19 Mar 2024) in Subsection "The reflections on state-of-the-art and state-of-the-practise benchmarks and evaluation" (Section 5.3)