Capability floor of frontier LLMs on HLE

Determine the true capability floor of frontier large language models on the Humanity’s Last Exam benchmark, i.e., the baseline accuracy these models can reliably achieve on this dataset despite stochastic decoding and chance guessing.

Background

Humanity’s Last Exam (HLE) is a multi-modal, closed-ended academic benchmark built to stump current state-of-the-art LLMs with expert-authored, carefully reviewed questions. Because questions are pre-filtered against frontier models and held to strict review standards, models generally score very low on HLE; even so, non-deterministic decoding and chance guessing can yield non-zero observed accuracy.
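To illustrate why chance alone produces a non-zero score, the sketch below uses a simple binomial model of uniform random guessing. The question count and number of answer options are hypothetical placeholders, not HLE's actual composition (HLE mixes exact-match and multiple-choice formats), so the numbers are purely illustrative.

```python
import math

def chance_floor(n_questions: int, n_choices: int) -> tuple[float, float]:
    """Expected accuracy and one-standard-deviation fluctuation for
    uniform random guessing on n_questions items with n_choices options
    each, under a simple binomial model."""
    p = 1.0 / n_choices
    std = math.sqrt(p * (1.0 - p) / n_questions)
    return p, std

# Hypothetical example: 1000 five-option multiple-choice items.
p, std = chance_floor(1000, 5)
print(f"chance accuracy ~ {p:.1%} +/- {std:.1%}")
```

Under these assumed numbers, guessing alone yields roughly 20% accuracy with about a one-percentage-point standard deviation, which is why small score inflections near the chance level are weak evidence of progress.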

In reporting quantitative results, the authors caution that small fluctuations near zero accuracy may not indicate real progress. They explicitly note that the true capability floor, the minimal performance level frontier LLMs reliably achieve on HLE, remains unresolved, motivating the need to characterize this floor precisely before interpreting score changes near zero.

References

However, we stress the true capability floor of frontier models on the dataset will remain an open question and small inflections close to zero accuracy are not strongly indicative of progress.

Humanity's Last Exam (2501.14249 - Phan et al., 24 Jan 2025) in Section 4.2 (Quantitative Results), paragraph "Accuracy"