
True capability floor of frontier models on HLE

Determine the true capability floor of frontier large language models on Humanity’s Last Exam (HLE), rigorously separating genuine capability from inference noise and multiple-choice guessing effects that produce small non-zero scores, so that minor fluctuations near zero accuracy are not misinterpreted as substantive progress.


Background

Humanity’s Last Exam (HLE) is designed to challenge frontier LLMs with difficult, closed-ended questions that current systems largely fail. Despite strict filtering intended to exclude questions solvable by existing models, evaluations still show non-zero accuracy, which the authors attribute to inference noise, inconsistent guessing, and below-random performance on multiple-choice items.

Because these stochastic effects can yield small but non-zero scores, the authors caution that modest changes near the zero-accuracy regime may not reflect real capability gains. Precisely quantifying the true capability floor—the level of performance attributable to genuine competence rather than noise—remains unresolved and is important for interpreting progress on HLE over time.
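As a rough illustration of why purely stochastic answering alone can yield a non-zero score, the minimal Python sketch below computes the expected accuracy floor from uniform random guessing on the multiple-choice portion of an exam. The multiple-choice fraction and option count used here are illustrative assumptions, not HLE's reported statistics.

```python
# Illustrative sketch: expected accuracy floor from random guessing on the
# multiple-choice portion of a benchmark. The parameter values below are
# hypothetical placeholders, not HLE's actual composition.

def guessing_floor(frac_multiple_choice: float, num_options: int) -> float:
    """Expected accuracy if every multiple-choice item is answered uniformly
    at random and every open-ended item is answered incorrectly."""
    return frac_multiple_choice * (1.0 / num_options)

# Example: if 20% of items were 5-option multiple choice, blind guessing
# would already yield about 4% accuracy, so scores in that range cannot be
# distinguished from noise without a more careful analysis.
print(f"{guessing_floor(0.20, 5):.2%}")
```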

References

However, we stress the true capability floor of frontier models on the dataset will remain an open question and small inflections close to zero accuracy are not strongly indicative of progress.

Humanity's Last Exam (2501.14249 - Phan et al., 24 Jan 2025) in Section 4.2 (Quantitative Results), Accuracy paragraph