Extent of Genuine Capability vs. Benchmark Defects in HLE Performance Differences
Determine the extent to which observed cross-model performance differences on the original Humanity's Last Exam (HLE) benchmark are attributable to genuine model capability gaps rather than sensitivity to benchmark defects such as ambiguous or underspecified problem statements, incorrect reference answers, or inconsistencies between rationales and final answers.
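One way to make this question concrete is to compare the accuracy gap between two models on all items with the gap on only those items not flagged as defective. The sketch below is illustrative only and is not a method taken from the cited paper. The `Item` record, its `defective` flag (assumed to come from some prior audit), and the `gap_decomposition` helper are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    item_id: str
    defective: bool  # hypothetical audit flag: ambiguous statement, wrong answer, etc.


def accuracy(correct: dict[str, bool], item_ids: list[str]) -> float:
    """Fraction of the given items that a model answered correctly."""
    if not item_ids:
        raise ValueError("cannot compute accuracy over an empty item set")
    return sum(correct[i] for i in item_ids) / len(item_ids)


def gap_decomposition(
    items: list[Item],
    correct_a: dict[str, bool],
    correct_b: dict[str, bool],
) -> dict[str, float]:
    """Compare the A-minus-B accuracy gap on all items vs. non-defective items."""
    all_ids = [it.item_id for it in items]
    clean_ids = [it.item_id for it in items if not it.defective]
    gap_all = accuracy(correct_a, all_ids) - accuracy(correct_b, all_ids)
    gap_clean = accuracy(correct_a, clean_ids) - accuracy(correct_b, clean_ids)
    # gap_shift is the part of the observed gap that disappears once
    # flagged items are excluded (an assumption-laden proxy, not a causal estimate).
    return {"gap_all": gap_all, "gap_clean": gap_clean, "gap_shift": gap_all - gap_clean}


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    items = [
        Item("q1", False),
        Item("q2", False),
        Item("q3", True),
        Item("q4", True),
    ]
    model_a = {"q1": True, "q2": False, "q3": True, "q4": True}
    model_b = {"q1": True, "q2": False, "q3": False, "q4": False}
    print(gap_decomposition(items, model_a, model_b))
```

A large `gap_shift` relative to `gap_all` would suggest that much of the observed difference depends on flagged items, although this simple comparison cannot by itself separate capability from defect sensitivity.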
References
To date, such concerns have not been systematically quantified for HLE, nor has their impact on evaluation outcomes been rigorously characterized. Consequently, it remains unclear to what extent observed performance differences on HLE reflect genuine capability gaps versus sensitivity to benchmark defects.
— HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam
(2602.13964 - Zhai et al., 15 Feb 2026) in Section 2.2 (HLE as an evaluation substrate)