Extent of Genuine Capability vs. Benchmark Defects in HLE Performance Differences

Determine the extent to which observed cross-model performance differences on the original Humanity's Last Exam (HLE) benchmark are attributable to genuine model capability gaps rather than to sensitivity to benchmark defects such as ambiguous or underspecified problem statements, incorrect reference answers, or inconsistencies between rationales and final answers.

Background

Humanity's Last Exam (HLE) is widely used to assess model reasoning on challenging, multi-domain questions, but community audits have highlighted annotation noise that could bias evaluation. The paper notes that, prior to its verification effort, the prevalence and impact of such errors on HLE had not been systematically quantified.

This uncertainty motivates the development of HLE-Verified to improve reliability. The broader research question, however, remains open: how much of the observed performance variation across models on the original HLE reflects true capability differences, and how much reflects artifacts induced by benchmark defects?
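
As a concrete illustration of one way this question might be probed (a minimal sketch, not the paper's method): given per-item correctness labels for each model and per-item defect flags from an audit, one can compare each model's accuracy on the full benchmark against a defect-filtered subset and check whether the model ranking is stable. The model names and the correctness and defect-flag data below are invented for illustration.

# Minimal sketch: does filtering defect-flagged items change the model
# ranking? All data below is hypothetical and for illustration only.
from scipy.stats import spearmanr

# correctness[model][i] = 1 if the model answered item i correctly (hypothetical).
correctness = {
    "model_a": [1, 0, 1, 1, 0, 1, 0, 1],
    "model_b": [1, 0, 0, 1, 0, 1, 1, 0],
    "model_c": [0, 1, 1, 0, 0, 1, 0, 1],
}
# defect_flag[i] = True if an audit flagged item i as ambiguous or mislabeled (hypothetical).
defect_flag = [False, True, False, False, True, False, True, False]

def accuracy(scores, keep):
    # Accuracy restricted to the items where keep[i] is True.
    kept = [s for s, k in zip(scores, keep) if k]
    return sum(kept) / len(kept)

keep_all = [True] * len(defect_flag)
keep_clean = [not flagged for flagged in defect_flag]

acc_full = {m: accuracy(s, keep_all) for m, s in correctness.items()}
acc_clean = {m: accuracy(s, keep_clean) for m, s in correctness.items()}

models = sorted(correctness)
rho, _ = spearmanr([acc_full[m] for m in models],
                   [acc_clean[m] for m in models])
print("full-set accuracy:    ", acc_full)
print("clean-subset accuracy:", acc_clean)
print(f"rank correlation (full vs. clean): rho = {rho:.2f}")
# A low rho (an unstable ranking) would suggest that sensitivity to benchmark
# defects, rather than capability, drives part of the observed differences.

On real evaluation data, a high rank correlation between the full-set and clean-subset leaderboards would indicate that defects mostly add noise without reordering models, whereas a low correlation would indicate that benchmark defects materially distort cross-model comparisons.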

References

To date, such concerns have not been systematically quantified for HLE, nor has their impact on evaluation outcomes been rigorously characterized. Consequently, it remains unclear to what extent observed performance differences on HLE reflect genuine capability gaps versus sensitivity to benchmark defects.

HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam (2602.13964 - Zhai et al., 15 Feb 2026) in Section 2.2 (HLE as an evaluation substrate)