Humanity's Last Exam (2501.14249v7)

Published 24 Jan 2025 in cs.LG, cs.AI, and cs.CL

Abstract: Benchmarks are important tools for tracking the rapid advancements in LLM capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.

Summary

  • The paper presents HLE, a new benchmark designed to rigorously evaluate LLMs using diverse, closed-ended academic questions.
  • It employs a multi-stage review process with expert evaluations to ensure question difficulty, verifiability, and adherence to strict quality standards.
  • The evaluation shows state-of-the-art LLMs scoring below 10% accuracy while expressing high confidence in their incorrect answers, revealing a significant gap between current AI models and expert human performance.

The paper "Humanity's Last Exam" introduces HLE (Humanity's Last Exam), a multi-modal benchmark designed to evaluate LLM capabilities at the frontier of human knowledge. The benchmark addresses the saturation of existing benchmarks like MMLU, where state-of-the-art LLMs achieve over 90% accuracy, limiting the ability to measure advancements in LLM capabilities accurately. HLE is designed to be the final closed-ended academic benchmark of its kind with broad subject coverage.

HLE comprises 2,500 questions across diverse subjects, including mathematics, the humanities, and the natural sciences. Developed globally by subject-matter experts, the benchmark features multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but not quickly answered via internet retrieval. Current LLMs demonstrate low accuracy and poor calibration on HLE, highlighting a gap between their capabilities and expert human performance on closed-ended academic questions. The dataset is publicly released to inform research and policymaking.

The paper details the benchmark's development, emphasizing a multi-stage review process to ensure question difficulty and quality. Before submission, each question is tested against state-of-the-art LLMs to verify its difficulty; questions are rejected if LLMs can answer them correctly. Submitted questions then undergo a two-stage review process: an initial feedback round with multiple graduate-level reviewers, followed by organizer and expert-reviewer approval to ensure quality and adherence to the submission criteria. A public review period is planned post-release to gather community feedback and correct any issues in the dataset.
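
The pre-submission difficulty check described above can be pictured as a simple filter over a panel of frontier models. The sketch below is illustrative only; the `ask_model` and `is_correct` helpers and the model panel are hypothetical placeholders, not the authors' actual tooling.

```python
# Illustrative sketch of a pre-submission difficulty filter (helper names are hypothetical).
from typing import Callable, List

FRONTIER_MODELS: List[str] = ["model-a", "model-b", "model-c"]  # placeholder panel of frontier LLMs

def passes_difficulty_check(
    question: str,
    reference_answer: str,
    ask_model: Callable[[str, str], str],     # (model_name, question) -> model's answer
    is_correct: Callable[[str, str], bool],   # (model_answer, reference_answer) -> graded result
) -> bool:
    """Forward a question to human review only if every frontier model misses it."""
    for model in FRONTIER_MODELS:
        answer = ask_model(model, question)
        if is_correct(answer, reference_answer):
            return False  # rejected: at least one current model already answers it correctly
    return True
```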

The paper evaluates state-of-the-art LLMs on HLE, revealing low accuracy (less than 10%) across models and high Root Mean Square (RMS) calibration errors (above 80%). This indicates that models provide incorrect answers with high confidence rather than acknowledging uncertainty, a pattern indicative of confabulation.
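
RMS calibration error compares a model's stated confidence with its observed accuracy. A minimal binned version is sketched below; the equal-width ten-bin scheme is an assumption for illustration and not necessarily the paper's exact protocol.

```python
import numpy as np

def rms_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Binned RMS calibration error: the root of the size-weighted mean squared gap
    between average confidence and accuracy within each confidence bin.
    (Equal-width bins are an assumption used here for illustration.)"""
    confidences = np.asarray(confidences, dtype=float)  # model-reported confidence in [0, 1]
    correct = np.asarray(correct, dtype=float)          # 1.0 if the judged answer was correct, else 0.0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total, n = 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences >= lo) & ((confidences < hi) if hi < 1.0 else (confidences <= hi))
        if in_bin.any():
            gap = confidences[in_bin].mean() - correct[in_bin].mean()
            total += (in_bin.sum() / n) * gap ** 2
    return float(np.sqrt(total))
```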

The benchmark includes two question formats: exact-match questions, where models provide an exact string as output, and multiple-choice questions, where the model selects one of five or more answer choices. Ten percent of questions require comprehending both text and an image reference. Strict submission criteria ensure questions are precise, unambiguous, solvable, and non-searchable, preventing reliance on memorization or simple retrieval methods. Submissions must be original work or syntheses of published information, requiring graduate-level expertise or knowledge of highly specific topics. LaTeX notation is supported, and answers are kept short and verifiable for automated grading.
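
The two answer formats imply a simple per-question record. The dataclass below is a hypothetical illustration of such a schema; the field names are assumptions and do not reflect the dataset's actual column names.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class HLEQuestion:
    """Hypothetical per-question record illustrating the two formats described above."""
    question: str                        # full prompt text, which may contain LaTeX notation
    answer: str                          # short, verifiable gold answer for automated grading
    answer_type: str                     # "exact_match" or "multiple_choice"
    choices: Optional[List[str]] = None  # five or more options when answer_type == "multiple_choice"
    image: Optional[str] = None          # image reference for the ~10% of multi-modal questions
    subject: str = ""                    # e.g., mathematics, humanities, natural sciences
```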

To attract high-quality submissions, a $500,000 USD prize pool was established, with prizes of $5,000 USD for each of the top 50 questions and $500 USD for each of the next 500 questions, as determined by organizers. This incentive structure, combined with the opportunity for paper co-authorship, aims to draw participation from qualified experts.
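
As a quick arithmetic check, the two prize tiers sum exactly to the announced pool:

$$50 \times \$5{,}000 + 500 \times \$500 = \$250{,}000 + \$250{,}000 = \$500{,}000.$$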

The paper notes that over 70,000 attempts were logged during question validation, resulting in approximately 13,000 questions that stumped LLMs and were forwarded to expert human review. Reviewers, who hold graduate degrees in their fields, score submissions against standardized rubrics and offer feedback. The review process involves two rounds: a first round of iterative refinement with 1-3 reviews per question, and a second round in which good and outstanding questions are identified and approved by organizers and reviewers for inclusion in the final HLE dataset.

In the evaluation setup, additional frontier multi-modal LLMs were evaluated on the final HLE dataset using a standardized system prompt that structures model responses into explicit reasoning followed by a final answer. GPT-4o was used as a judge to verify model predictions against the reference answers while accounting for equivalent formats (e.g., decimals vs. fractions or reasonable estimations).
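
The judging step can be sketched as prompting GPT-4o to compare a model's final answer with the reference answer. The prompt wording and the `chat` helper below are hypothetical placeholders, not the paper's actual judge prompt or harness.

```python
# Illustrative sketch of LLM-based answer judging (prompt text and `chat` helper are hypothetical).
def judge_with_gpt4o(question: str, gold_answer: str, model_answer: str, chat) -> bool:
    """Ask a judge model whether the prediction matches the reference answer,
    treating equivalent formats (e.g., 0.5 vs. 1/2) as correct."""
    prompt = (
        "You are grading an exam answer.\n"
        f"Question: {question}\n"
        f"Correct answer: {gold_answer}\n"
        f"Model answer: {model_answer}\n"
        "Reply 'yes' if the model answer is equivalent to the correct answer "
        "(treat decimals/fractions and minor formatting differences as equivalent), otherwise reply 'no'."
    )
    verdict = chat(model="gpt-4o", prompt=prompt)  # hypothetical thin wrapper around a chat-completion API
    return verdict.strip().lower().startswith("yes")
```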

Key quantitative results include low accuracy across all frontier models, highlighting the gap between current LLMs and expert-level academic capabilities. Low scores are partially by design, as the dataset collection process filters out questions that existing models can answer correctly. Models exhibit non-zero accuracy due to inherent noise in model inference. Poor calibration is observed across all models, reflected in high RMS calibration error scores, with models frequently providing incorrect answers with high confidence.

The paper analyzes the number of completion tokens used across models, revealing that reasoning-oriented models consume substantially more inference compute. The paper emphasizes that future models should not only improve accuracy but also strive for compute optimality.

The paper discusses the potential for rapid benchmark saturation, referencing prior instances where models progressed from near-zero to near-perfect performance in short timeframes. It posits that models could exceed 50% accuracy on HLE by the end of 2025. High accuracy on HLE would demonstrate expert-level performance on closed-ended, verifiable questions and cutting-edge scientific knowledge but would not necessarily imply autonomous research capabilities or AGI (Artificial General Intelligence). HLE tests structured academic problems rather than open-ended research or creative problem-solving abilities.

The paper concludes that HLE provides a clear measure of AI progress, creating a common reference point for scientists and policymakers to assess AI capabilities.

The authors offered optional co-authorship to all question submitters with an accepted question in HLE (including both public and private splits). Authorship order is ranked based on the number of accepted questions in HLE.

The paper includes a detailed appendix that lists the affiliations of data contributors. The top fifty most popular subjects in HLE are listed, although there are over a hundred subjects in the overall dataset.

The paper details the instructions given to human reviewers, including instructions for both review round 1 and review round 2.
