Capability floor of frontier LLMs on HLE
Determine the true capability floor of frontier large language models on the Humanity’s Last Exam (HLE) benchmark: the baseline accuracy these models reliably achieve on the dataset once run-to-run stochasticity and chance guessing are accounted for.
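One way to make "despite stochasticity and chance guessing" concrete is to compare a measured accuracy against a chance baseline using a confidence interval. The following is a minimal sketch under stated assumptions, not the benchmark authors' method: it assumes only multiple-choice questions can be answered correctly by uniform random guessing, and it uses a Wilson score interval to express sampling uncertainty. The function names, the multiple-choice fraction, the option count, and all counts in the usage lines are hypothetical placeholders, not figures from the paper.

```python
import math
from typing import Tuple


def chance_accuracy(frac_multiple_choice: float, num_options: int) -> float:
    """Expected accuracy from uniform random guessing.

    Assumption: only multiple-choice questions can be guessed correctly;
    exact-match questions contribute nothing to the chance baseline.
    """
    return frac_multiple_choice / num_options


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = correct / total
    denom = 1.0 + z * z / total
    centre = (p + z * z / (2.0 * total)) / denom
    half = z * math.sqrt(p * (1.0 - p) / total + z * z / (4.0 * total * total)) / denom
    return centre - half, centre + half


def exceeds_chance(correct: int, total: int, baseline: float, z: float = 1.96) -> bool:
    """True only if the lower bound of the interval clears the chance baseline."""
    low, _ = wilson_interval(correct, total, z)
    return low > baseline


# Hypothetical inputs: 20% multiple-choice questions with 5 options each.
baseline = chance_accuracy(frac_multiple_choice=0.2, num_options=5)

# Hypothetical results on 1000 questions.
print(exceeds_chance(correct=50, total=1000, baseline=baseline))
print(exceeds_chance(correct=80, total=1000, baseline=baseline))
```

Under these placeholder numbers, a score only slightly above the baseline does not clear it once the interval is taken into account, which is the sense in which small inflections near zero accuracy are weak evidence of progress. A fuller treatment would also average over repeated sampling runs, since the interval above captures question-sampling noise but not decoding stochasticity.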
References
However, we stress the true capability floor of frontier models on the dataset will remain an open question and small inflections close to zero accuracy are not strongly indicative of progress.
— Humanity's Last Exam
(2501.14249 - Phan et al., 24 Jan 2025) in Section 4.2 (Quantitative Results), paragraph "Accuracy"