Humanity’s Last Exam: Benchmarking AI Frontiers

Updated 5 July 2025
  • Humanity’s Last Exam is a multi-modal benchmark designed to assess the upper limits of AI on closed-ended, expert-level academic questions.
  • It is curated by global experts using strict review processes and automated scoring to ensure precise, unambiguous challenges across diverse disciplines.
  • Evaluations reveal that current state-of-the-art AI models perform significantly below expert human levels, highlighting key gaps in reasoning and uncertainty calibration.

Humanity’s Last Exam refers to a class of rigorous, frontier benchmarks—typified by the Humanity’s Last Exam (HLE) dataset—that are engineered to precisely measure the upper limits of artificial intelligence systems, particularly LLMs, on closed-ended academic tasks. The concept emerges from a context in which prior benchmarks, such as MMLU and coding leaderboards, have been saturated by LLMs achieving human-expert or superhuman performance, thus creating the need for a new gold standard to evaluate, monitor, and regulate progress at the frontier of human knowledge.

1. Motivation and Definition

"Humanity’s Last Exam" is a multi-modal benchmark designed to serve as the apex test of LLM capabilities on closed-ended academic questions. It deliberately positions itself as the “final” closed-ended benchmark of its kind, constructed by a global consortium of subject-matter experts. The dataset consists of 2,500 questions that cover dozens of disciplines, including advanced mathematics, natural sciences, humanities, computer science, and arts.

The defining characteristics of HLE are:

  • Each question is precise, unambiguous, and resists solution via internet retrieval.
  • The focus is on world-class questions at the edge of human expertise, verified for difficulty using both leading LLMs and human review.
  • All questions are formatted for automatic grading, either as multiple-choice (with five or more plausible distractors) or as exact-match short answers.

HLE is intended not merely as a diagnostic tool for current systems, but as a common baseline for tracking progress, guiding research agendas, and informing policymaking about frontier AI capabilities.
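
For illustration, a single HLE-style item in either answer format could be represented as follows. This is a minimal sketch; the field names and types are assumptions for exposition, not the official HLE schema.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class HLEQuestion:
    """Illustrative record for one HLE-style question (field names are assumed)."""
    question_id: str
    subject: str                         # e.g. "mathematics", "law", "chemistry"
    prompt: str                          # full question text, possibly containing LaTeX
    answer_type: str                     # "exact_match" or "multiple_choice"
    answer: str                          # canonical answer string, or the correct option label
    choices: Optional[List[str]] = None  # five or more options, only for multiple-choice items
    image: Optional[bytes] = None        # optional image payload for multi-modal questions

    def is_multiple_choice(self) -> bool:
        return self.answer_type == "multiple_choice"
```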

2. Dataset Composition and Review Process

HLE encompasses broad subject coverage, targeting over one hundred academic areas. Its compilation process features several stages:

  • Global recruitment of field experts to author questions meeting strict technical and difficulty standards.
  • A multi-stage review pipeline: preliminary tests against advanced LLMs screen out questions answerable by retrieval or simple heuristics, followed by expert vetting to ensure solution correctness, unambiguity, and genuine expert-level difficulty (a toy version of this screening step is sketched after this list).
  • The dataset is divided into roughly 80% exact-match (closed-form or short-answer) questions and 20% multiple-choice, allowing for robust, automated, and reproducible evaluation.
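
As a forward-referenced illustration of that screening stage, the toy function below keeps only questions that at most a given number of candidate models answer correctly. The models, threshold, and grading rule here are placeholders, not the pipeline actually used to build HLE.

```python
import random
from typing import Callable, Dict, List

# A "model" is abstracted as a callable mapping a question prompt to an answer string.
Model = Callable[[str], str]

def screen_questions(questions: List[Dict], models: List[Model],
                     grade: Callable[[str, str], bool],
                     max_solvers: int = 0) -> List[Dict]:
    """Keep only questions answered correctly by at most `max_solvers` of the models."""
    survivors = []
    for q in questions:
        solved_by = sum(grade(model(q["prompt"]), q["answer"]) for model in models)
        if solved_by <= max_solvers:
            survivors.append(q)
    return survivors

# Toy usage: a "model" that guesses randomly between two strings.
toy_model: Model = lambda prompt: random.choice(["0", "1"])
questions = [{"prompt": "Euler characteristic of the torus?", "answer": "0"}]
print(screen_questions(questions, [toy_model], grade=lambda a, b: a.strip() == b.strip()))
```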

Table: HLE question types

| Type | Fraction | Example (as described in the data) |
| --- | --- | --- |
| Exact-match | ~80% | Provide the Euler characteristic of a given manifold |
| Multiple-choice | ~20% | Identify the correct legal doctrine from five options |

Questions range from technical mathematics to nuanced history, legal doctrine, and scientific reasoning, and typically require multi-step, symbolic, or deeply contextual understanding.
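
As a rough sketch of how the composition could be inspected programmatically, the snippet below loads the public question set with the Hugging Face datasets library and tallies question types. The repository identifier ("cais/hle") and the "answer_type" field name are assumptions and may differ from the released version.

```python
from collections import Counter

from datasets import load_dataset  # pip install datasets

# Load the public HLE questions; the repository id below is an assumption.
hle = load_dataset("cais/hle", split="test")

# Tally question formats; the "answer_type" field name is likewise assumed.
type_counts = Counter(example["answer_type"] for example in hle)
total = sum(type_counts.values())

for answer_type, count in type_counts.most_common():
    print(f"{answer_type}: {count} questions ({count / total:.1%})")
```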

3. Model Evaluation and Observed Performance

Evaluations of top multi-modal LLMs—including GPT-4o, Grok 2, Claude 3.5 Sonnet, Gemini 1.5 Pro, Gemini 2.0 Flash Thinking, o1, and DeepSeek-R1—demonstrate that current state-of-the-art systems perform significantly below expert human levels on HLE. Reported accuracies on the public set of questions are as follows (see Table 1 in the source):

| Model | Accuracy (%) |
| --- | --- |
| GPT-4o | ~9.4 |
| Grok 2 | ~7.3 |
| Claude 3.5 | ~5–7 |
| Gemini 1.5 Pro | ~5–7 |
| o1 | ~4.8 |
| DeepSeek-R1 | ~3.3 |

In addition to low accuracy, models exhibit pronounced calibration errors: when required to state their confidence (e.g., “Confidence: 0–100%”), they are often markedly overconfident, with RMS calibration errors exceeding 80%. This combination reveals that current leaders in the LLM field not only fail to solve expert-level tasks at scale but also struggle to reliably assess their own uncertainty.
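
One standard way to quantify this calibration gap is a binned RMS calibration error over the stated confidences. The sketch below assumes per-question confidences in percent and binary correctness flags; it is a generic formulation, not necessarily the exact metric used in the source.

```python
import numpy as np

def rms_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Binned RMS calibration error, returned in percent.

    confidences: stated confidences in percent (0-100).
    correct: 1 if the answer was graded correct, else 0.
    """
    conf = np.asarray(confidences, dtype=float) / 100.0
    acc = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)

    squared_gaps, weights = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf >= lo) & (conf < hi) if hi < 1.0 else (conf >= lo) & (conf <= hi)
        if in_bin.any():
            gap = conf[in_bin].mean() - acc[in_bin].mean()  # positive gap = overconfidence
            squared_gaps.append(gap ** 2)
            weights.append(in_bin.sum())

    return float(np.sqrt(np.average(squared_gaps, weights=weights)) * 100.0)

# Example: a model that is very confident but usually wrong is badly calibrated.
print(rms_calibration_error([95, 90, 99, 85], [0, 0, 1, 0]))
```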

4. Benchmarking Strategies and Evaluation Protocols

Evaluation on HLE leverages strict formatting and automated scoring. Exact-match tasks require the extraction of a single, succinct answer string for comparison to a canonical solution; multiple-choice questions demand the explicit selection of one option, preventing ambiguity in grading.
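
A minimal grader consistent with this description might normalize answer strings before comparison, as sketched below. The normalization rules are assumptions for illustration, not the official HLE grading procedure.

```python
import re

def normalize(text: str) -> str:
    """Lowercase, trim, and collapse internal whitespace."""
    return re.sub(r"\s+", " ", text.strip().lower())

def grade_exact_match(predicted: str, canonical: str) -> bool:
    """Exact-match grading after light normalization (an illustrative rule)."""
    return normalize(predicted) == normalize(canonical)

def grade_multiple_choice(selected_option: str, correct_option: str) -> bool:
    """Multiple-choice grading: compare the selected option label, e.g. 'B' vs 'B'."""
    return normalize(selected_option) == normalize(correct_option)

assert grade_exact_match("  2 ", "2")
assert not grade_multiple_choice("A", "C")
```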

System prompts for evaluation enforce a structured output, such as:

Explanation: <detailed reasoning>
Exact Answer: <succinct answer>
Confidence: <%>

This rigid protocol ensures that evaluation is objective, machine-checkable, and tractable at scale. Many questions use formal notation, including LaTeX, to maintain precision on mathematical or scientific content; for instance, mathematical identities or proofs are rendered in LaTeX to avoid ambiguity.
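
Given responses that follow the template above, the labeled fields can be recovered with simple pattern matching. The sketch below assumes the model reproduces the template exactly and that the explanation fits on one line; a production harness would need to be more tolerant.

```python
import re

def parse_structured_response(text: str) -> dict:
    """Extract the Explanation, Exact Answer, and Confidence fields from a templated reply."""
    fields = {}
    for label, key in [("Explanation", "explanation"),
                       ("Exact Answer", "answer"),
                       ("Confidence", "confidence")]:
        match = re.search(rf"^{label}:\s*(.+)$", text, flags=re.MULTILINE)
        fields[key] = match.group(1).strip() if match else None
    if fields["confidence"] is not None:
        # Strip a trailing percent sign so the confidence can be used numerically.
        fields["confidence"] = float(fields["confidence"].rstrip("% "))
    return fields

response = """Explanation: The manifold is a torus, so its Euler characteristic is 0.
Exact Answer: 0
Confidence: 72%"""
print(parse_structured_response(response))
```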

The dataset’s multi-modal nature allows for both text-only and image-accompanied prompts, further expanding its scope relative to standard benchmarks.
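
For image-accompanied questions, one common approach is to inline a base64-encoded image alongside the text in an OpenAI-style chat message, as sketched below. The message structure follows that widely used convention and is an assumption about any particular evaluation harness.

```python
import base64
from typing import Optional

def build_multimodal_message(question_text: str, image_bytes: Optional[bytes] = None) -> dict:
    """Build one user message carrying the question text plus an optional inline image."""
    content = [{"type": "text", "text": question_text}]
    if image_bytes is not None:
        encoded = base64.b64encode(image_bytes).decode("ascii")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{encoded}"},
        })
    return {"role": "user", "content": content}
```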

5. Research and Policy Implications

HLE provides a transparent, public metric that exposes the persistent gap between contemporary LLMs and expert human performance on non-retrieval, non-generic academic reasoning. It establishes a high-water mark for AI evaluation, enabling concrete research goals to be set (e.g., improved reasoning, calibration, and domain transfer) and offering policymakers an empirical reference point for assessing AI systems.

In educational technology, HLE functions as a stress-test for intelligent tutoring and grading systems, with the promise that advances on HLE could translate into more robust and trustworthy academic AI tools. In research, it pinpoints domains and question types where models most severely underperform, suggesting targeted directions for model architecture, training data, and evaluation-method refinement.

A plausible implication is that advances on HLE (or similar "last" benchmarks) may serve as signposts for AI alignment, safety strategies, and regulation, given their demonstrable measurement of “frontier” capability rather than “average” task performance.

6. Domain-Specific Successors

A natural next step beyond HLE is the emergence of similarly constructed, domain-specific "last exams." For instance, Humanity’s Last Code Exam (HLCE) focuses on elite code generation for competitive programming problems, and Humanity’s Last Exam: Biomedicine evaluates biomedical capabilities. These benchmarks mirror the core HLE properties: selection for high difficulty, precise verification, and resistance to superficial retrieval.

Empirical results across these related exams consistently show that, while models may exhibit superhuman performance on "standard" benchmarks, they remain far below expert human performance on these last-exam datasets. This reinforces the notion that the creation and public availability of rigorous, expert-curated benchmarks remain essential for tracking real progress in AI capabilities.

7. Conclusion

Humanity’s Last Exam encapsulates the highest standards of AI evaluation on closed-ended academic tasks, marking a paradigm shift in how AI progress is measured at the edge of human knowledge. Its low model accuracies and high calibration errors reveal both the achievements and the stark remaining challenges for AI in expert reasoning. HLE’s design and role underscore the need for ongoing development of demanding, unambiguous benchmarks, not only to challenge current systems but also to inform both technical progress and sociotechnical governance in the age of advanced artificial intelligence.