Quantify confidence for benchmark-based inference to populations

Determine the confidence level and confidence interval that apply when statistics computed from a specific benchmark, treated as a sample of evaluation conditions, are used to infer parameters of the entire population of evaluation conditions encountered in real-world applications. With these quantities established, benchmark-derived estimates for real-world evaluation systems can be rigorously validated.

Background

The paper defines a benchmark as a pragmatic evaluation condition used to infer parameters of a real-world evaluation system by sampling from a larger population of evaluation conditions. Earlier sections (e.g., universal methodology in complex scenarios) emphasize the need to use confidence levels and intervals to assess how well a pragmatic evaluation model predicts parameters of a perfect evaluation model.

In the reflections on current practice, the authors note that widely used benchmarks (e.g., in AI and CPU evaluation) often rely on sample-based inference without quantifying the confidence level and interval of the estimates for the target population. This gap prevents rigorous validation of benchmark-derived conclusions and motivates the explicit need to establish confidence metrics for such inferences.
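To make concrete what attaching a confidence level and interval to a benchmark statistic could look like, the following is a minimal sketch rather than the paper's method. It applies a percentile bootstrap to the mean score over a benchmark's evaluation conditions. The function name `bootstrap_mean_ci` and the scores are hypothetical. The sketch also assumes the benchmark's conditions are an independent random sample from the target population, an assumption that real benchmarks often do not satisfy and that is part of what makes the problem open.

```python
import random
import statistics


def bootstrap_mean_ci(scores, confidence=0.95, n_resamples=5000, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `scores`.

    Assumption (not from the paper): `scores` are i.i.d. draws from the
    population of evaluation conditions.
    """
    if len(scores) < 2:
        raise ValueError("need at least two observations")
    rng = random.Random(seed)
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    alpha = 1.0 - confidence
    lo_idx = int((alpha / 2) * (n_resamples - 1))
    hi_idx = int((1 - alpha / 2) * (n_resamples - 1))
    return statistics.fmean(scores), means[lo_idx], means[hi_idx]


# Hypothetical per-condition scores from one benchmark (illustrative only).
benchmark_scores = [0.71, 0.64, 0.80, 0.58, 0.77, 0.69, 0.73, 0.62]
point, low, high = bootstrap_mean_ci(benchmark_scores)
print(f"mean={point:.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

The interval is only as trustworthy as the sampling assumption behind it: if the benchmark's conditions were hand-picked rather than randomly drawn, the stated confidence level does not transfer to the real-world population.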

References

Thirdly, in real-world applications, we use the statistic of a sample---a specific benchmark--- to infer the parameters of the entire population. However, we do not know their confidence levels and intervals.

— Evaluatology: The Science and Engineering of Evaluation (2404.00021 - Zhan et al., 19 Mar 2024) in Subsection "The reflections on state-of-the-art and state-of-the-practise benchmarks and evaluation" (Section 5.3)
