True capability floor of frontier models on HLE
Determine the true capability floor of frontier large language models on Humanity’s Last Exam (HLE). This requires rigorously separating genuine model capability from two effects that yield small but non-zero accuracy even in its absence: inference noise and chance-correct answers on multiple-choice questions. The goal is to ensure that small fluctuations in near-zero performance are not misinterpreted as substantive progress.
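One way to make the guessing effect concrete is to compare an observed score against what blind guessing alone would produce. The following is a minimal sketch, not a method from the paper. It assumes that only multiple-choice questions can be answered correctly by chance, that every multiple-choice question has the same number of options, and that guesses are uniform and independent. The function names and all counts are hypothetical and do not reflect HLE's actual composition.

```python
import math


def chance_floor(n_total, n_mc, n_options):
    """Expected overall accuracy from uniform guessing on multiple-choice items only.

    Assumes non-multiple-choice items are never answered correctly by chance.
    """
    return (n_mc / n_options) / n_total


def binom_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p) with 0 < p < 1, summed in log space.

    Log space avoids overflow in the binomial coefficient for large n.
    """
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p)
        )
        total += math.exp(log_pmf)
    return min(total, 1.0)


def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95% coverage)."""
    p_hat = correct / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Hypothetical benchmark: 1000 questions, 200 of them 5-way multiple choice.
    n_total, n_mc, n_options = 1000, 200, 5
    print("chance floor:", chance_floor(n_total, n_mc, n_options))

    # Hypothetical model result: 48 of the 200 multiple-choice items correct.
    mc_correct = 48
    print("P(>= observed | pure guessing):",
          binom_tail(n_mc, 1 / n_options, mc_correct))
    print("Wilson interval on MC accuracy:", wilson_interval(mc_correct, n_mc))
```

Under these assumptions, a score whose interval still contains the chance floor, or whose tail probability under pure guessing is not small, gives little evidence of capability beyond guessing. The sketch does not address inference noise, which would require repeated runs of the same model.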
References
However, we stress the true capability floor of frontier models on the dataset will remain an open question and small inflections close to zero accuracy are not strongly indicative of progress.
— Humanity's Last Exam
(2501.14249 - Phan et al., 24 Jan 2025) in Section 4.2 (Quantitative Results), Accuracy paragraph