
Humanity's Last Exam (HLE) Benchmark

Updated 15 November 2025
  • Humanity's Last Exam benchmark is a rigorous, closed-book test featuring 2,500 expert-curated questions over 100+ subdomains to gauge advanced LLM reasoning.
  • It emphasizes graduate-level difficulty and resists trivial internet lookup by requiring multi-step, domain-expert problem solving.
  • Evaluation protocols use automated pass@1 accuracy and calibration metrics, highlighting substantial performance gaps between current LLMs and human experts.

Humanity’s Last Exam (HLE) is an advanced closed-book academic benchmark designed as a definitive yardstick for evaluating the frontier capabilities of LLMs and research agents. Consisting of 2,500 expert-curated short-answer questions spanning more than 100 subdomains—mathematics, sciences, humanities, and engineering—HLE establishes a rigorous challenge at or above graduate level. Unlike earlier benchmarks that have become saturated (e.g., MMLU), HLE’s question design explicitly resists trivial internet lookup, focusing on questions answerable by domain experts but typically beyond current LLM generalization and reasoning abilities. Scores on HLE remain substantially below human expert levels, with the best contemporary systems approaching 30–50% accuracy under tool-augmented or multi-agent protocols, while individual human experts exceed 98% in controlled settings.

1. Benchmark Motivation and Historical Context

The inception of HLE was motivated by the rapid rise in LLM performance, which rendered legacy benchmarks like MMLU ineffective for measuring incremental advances, as SOTA models routinely surpass 90% accuracy. HLE was developed as a “final” closed-ended academic benchmark prioritizing:

  • Broad subject representation, with 2,500 questions sourced from mathematics (450), physics (220), computer science (210), chemistry (160), biology (140), humanities (180), engineering (200), law/policy (100), trivia/puzzles (120), and over 90 other specialist areas (Phan et al., 24 Jan 2025).
  • Graduate-level or specialist difficulty: items were vetted by a pool of ≥1,000 subject-matter experts from 500+ institutions, ensuring a human-expert “ceiling” (≳98% accuracy).
  • Resistance to superficial retrieval: Solutions are constructed so that answers cannot be trivially found online.

The benchmark's release fostered a paradigm shift toward high-difficulty, expert-only benchmarks, providing a robust test bed as reflected in subsequent agentic and reasoning LLM research.

2. Dataset Composition and Question Design

HLE’s question bank comprises 2,500 standalone items, primarily in short-answer (exact-match string) format (~80%), with the remainder multiple-choice (≥5 options, ~20%); roughly 10% of items additionally include multimodal input (text + image) (Phan et al., 24 Jan 2025, Vanhoyweghen et al., 19 Aug 2025). Each question was filtered to ensure:

  • Unambiguous, single-answer solutions.
  • Eligibility for automated grading—i.e., string matching, numeric normalization (fractions to decimals), and LLM-judge verification (a minimal grading sketch follows this list).
  • Task formulation requiring one or more steps of domain-knowledge reasoning (“Prove that every subgroup of index 2 is normal,” “What is the social impact of the 1954 Brown v. Board of Education decision?”).
  • Explicit resistance to trivial lookup (content not indexed online at release).
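
The snippet below is a minimal sketch of the exact-match grading with numeric normalization described in the list; the helper names are hypothetical, and in the actual pipeline non-matching answers would be escalated to an LLM judge rather than simply rejected.

```python
from fractions import Fraction

def normalize(answer: str) -> str:
    """Normalize an answer string: trim, lowercase, and convert simple
    fractions (e.g. '3/4') to a canonical decimal form where possible."""
    text = answer.strip().lower()
    try:
        return str(float(Fraction(text)))   # '3/4' -> '0.75', '2' -> '2.0'
    except (ValueError, ZeroDivisionError):
        return text

def grade_exact_match(prediction: str, reference: str) -> bool:
    """Return True when the normalized prediction matches the reference.
    In practice, non-matching pairs would be passed to an LLM judge."""
    return normalize(prediction) == normalize(reference)

# Numeric normalization lets '3/4' and '0.75' count as the same answer.
assert grade_exact_match("3/4", "0.75")
assert not grade_exact_match("square root of 2", "1.414")
```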

For agent-research evaluations, each question is issued verbatim to the agent; all search queries and retrieval artifacts are logged for verification (Han et al., 12 Aug 2025). A subset (“Bio/Chem Gold,” N=149) is used for advanced biological/chemical reasoning challenges (Tang et al., 25 Sep 2025, Cheng et al., 4 Aug 2025).

3. Evaluation Protocols and Metrics

HLE evaluation relies on automated, LLM-powered judgment with strict matching and normalization protocols. The primary metric is pass@1 accuracy:

$$\text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} c_i$$

where $c_i \in \{0,1\}$ is the correctness indicator assigned by the “HLE Judge”—typically OpenAI o3-mini (temperature=1.0, max_completion_tokens=4096) or an analogous automated grader (Phan et al., 24 Jan 2025, Han et al., 12 Aug 2025, Chen et al., 28 Oct 2025). Confidence intervals are computed via bootstrapping. For comparative studies, Bayesian credible intervals (Beta priors/posteriors) are employed (Spelda et al., 13 Aug 2025). Calibration is assessed using root-mean-square calibration error (RMSCE):

$$\mathrm{RMSCE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (c_i - y_i)^2} \times 100\%$$

where $y_i$ is the model’s self-reported confidence for question $i$.
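
The following sketch shows how these metrics can be computed from per-question correctness indicators and self-reported confidences. The binning-free RMSCE mirrors the formula above, and the Beta(1,1) prior used for the credible interval is an illustrative assumption rather than the cited studies' exact choice.

```python
import numpy as np
from scipy import stats

def pass_at_1(correct: np.ndarray) -> float:
    """pass@1 accuracy: mean of the 0/1 correctness indicators c_i."""
    return float(correct.mean())

def rmsce(correct: np.ndarray, confidence: np.ndarray) -> float:
    """Root-mean-square calibration error in %, comparing each correctness
    indicator c_i with the model's stated confidence y_i."""
    return float(np.sqrt(np.mean((correct - confidence) ** 2)) * 100.0)

def bootstrap_ci(correct: np.ndarray, n_boot: int = 10_000, alpha: float = 0.05):
    """Percentile-bootstrap confidence interval for pass@1 accuracy."""
    rng = np.random.default_rng(0)
    resamples = rng.choice(correct, size=(n_boot, correct.size), replace=True)
    return tuple(np.quantile(resamples.mean(axis=1), [alpha / 2, 1 - alpha / 2]))

def beta_credible_interval(correct: np.ndarray, alpha: float = 0.05):
    """Bayesian credible interval for accuracy under a Beta(1,1) prior
    (an illustrative choice, not necessarily the one used in the papers)."""
    k, n = int(correct.sum()), correct.size
    posterior = stats.beta(1 + k, 1 + n - k)
    return posterior.ppf(alpha / 2), posterior.ppf(1 - alpha / 2)

# Toy run: 2,500 questions, ~9% accuracy, overconfident self-reports.
rng = np.random.default_rng(1)
c = (rng.random(2500) < 0.09).astype(float)
y = np.clip(rng.normal(0.85, 0.10, 2500), 0.0, 1.0)
print(pass_at_1(c), rmsce(c, y), bootstrap_ci(c), beta_credible_interval(c))
```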

Additionally, categories such as chain-of-thought length, intra-chain sentiment volatility, and lexicographic hedging signals are analyzed as post-hoc calibration proxies (Vanhoyweghen et al., 19 Aug 2025).
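
As a rough illustration of how such proxies might be extracted from a reasoning trace, the snippet below computes chain-of-thought length and a lexical-hedging density; the hedging lexicon is purely hypothetical and not the signal set used in the cited study, and sentiment volatility would additionally require a sentiment model.

```python
import re

# Purely illustrative hedging lexicon; the cited work defines its own signals.
HEDGE_TERMS = ("probably", "perhaps", "might", "i think", "not sure", "likely")

def cot_proxies(chain_of_thought: str) -> dict:
    """Extract simple post-hoc calibration proxies from a chain of thought:
    its length in sentences and the density of lexical hedging cues."""
    sentences = [s for s in re.split(r"[.!?]+\s*", chain_of_thought) if s]
    text = chain_of_thought.lower()
    hedges = sum(text.count(term) for term in HEDGE_TERMS)
    return {
        "cot_length": len(sentences),
        "hedge_count": hedges,
        "hedges_per_sentence": hedges / max(len(sentences), 1),
    }

print(cot_proxies("The group has order 6. I think the subgroup is normal. Probably yes."))
```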

4. Model Performance and the Human-Expert Gap

Performance on the public text-only HLE split (N≈2,250) demonstrates persistent capability gaps between LLMs and expert humans:

Model                        Accuracy (%)   RMSCE (%)
GPT-4o                       2.9            90.4
Grok 2                       3.9            92.5
Claude 3.5 Sonnet            4.2            87
Gemini 1.5 Pro               4.8            91.1
DeepSeek-R1                  9.4            81.8
o1                           8.9            92
AgentFrontier-30B-A3B        28.6           n.a.
SFR-DR-20B                   28.7           n.a.
X-Masters                    32.1           n.a.
Eigen-1 (Bio/Chem, N=149)    48.3           n.a.

Human expert accuracy (time-controlled) reliably exceeds 98%. Agentic and multi-agent pipelines (X-Masters, Eigen-1) and curriculum fine-tuning approaches have established new state-of-the-art results, with the continual improvement (from <10% to >30%) attributed to explicit tool use, multi-agent orchestration, and zone-of-proximal-development-guided training (Chen et al., 28 Oct 2025, Tang et al., 25 Sep 2025, Nguyen et al., 8 Sep 2025, Chai et al., 7 Jul 2025). However, calibration error on HLE remains high (>78%), indicating overconfidence even when answers are incorrect (Vanhoyweghen et al., 19 Aug 2025).

5. Agentic Methods, Multi-Agent Workflows, and Scaling Effects

Test-time workflows reflect the complexity of HLE’s reasoning demands:

  • Multi-agent orchestration: Scattering (diversity via parallel solver agents) and stacking (depth via critic/rewriter/selector agents) raise pass@1 to 32.1% (X-Masters) (Chai et al., 7 Jul 2025); a schematic sketch of this pattern follows the list.
  • Self-adaptive reasoning frameworks: CLIO (Cognitive Loop via In-Situ Optimization) equips vanilla LLMs with recursive reasoning, uncertainty monitoring, and belief-graph aggregation, yielding 22.37% on Bio/Med questions and up to ≈30.27% with further ensemble “MoreThinking” (Cheng et al., 4 Aug 2025).
  • Benchmark-driven selection/curriculum effects: Models fine-tuned or reinforced on HLE tasks improve substantially relative to non-exposed counterparts, suggesting that benchmarks themselves function as adaptive curricula (Spelda et al., 13 Aug 2025, Chen et al., 28 Oct 2025).
  • Tool integration: Agents invoke Python libraries, web search, and custom code interpreters to enhance reasoning (Chai et al., 7 Jul 2025, Nguyen et al., 8 Sep 2025, Chen et al., 28 Oct 2025). Usage statistics indicate that higher-performing agents make more tool invocations and produce more refined reasoning.
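
The sketch below illustrates the scatter-then-stack control flow referenced in the first bullet; the prompts and the `call_llm` client are stand-ins, and this is not the X-Masters implementation.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def solve_with_scatter_and_stack(
    question: str,
    call_llm: Callable[[str], str],   # any text-in/text-out model client
    n_solvers: int = 4,
) -> str:
    """Scatter-then-stack orchestration sketch: parallel solvers for
    diversity, then a critic pass and a selector pass for depth."""
    # Scattering: run several solver agents in parallel on the same question.
    solver_prompt = f"Solve step by step, then give a short final answer:\n{question}"
    with ThreadPoolExecutor(max_workers=n_solvers) as pool:
        drafts = list(pool.map(call_llm, [solver_prompt] * n_solvers))

    # Stacking: critique each draft, then let a selector pick one answer.
    critiques = [call_llm(f"Critique this solution for errors:\n{d}") for d in drafts]
    candidates = "\n\n".join(
        f"Candidate {i}:\n{d}\nCritique:\n{c}"
        for i, (d, c) in enumerate(zip(drafts, critiques))
    )
    return call_llm(
        f"Question:\n{question}\n\n{candidates}\n\nReturn only the best final answer."
    )

# Usage with a stand-in model (replace with a real API client):
echo_model = lambda prompt: f"[model output for: {prompt[:40]}...]"
print(solve_with_scatter_and_stack("What is 2^10?", echo_model, n_solvers=2))
```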

Scaling studies (pass@N, best-of-N sampling) show accuracy gains at roughly linear growth in compute cost, peaking at ~37% with best-of-3 strategies under o3-high (Drori et al., 14 Feb 2025). However, inference costs and question complexity limit practical scaling.
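
For pass@N reporting specifically, a common choice is the unbiased estimator popularized by the HumanEval benchmark; whether the cited study uses exactly this estimator is not stated, so the sketch below is illustrative.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k for one question, given n samples
    of which c are correct: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(per_question_counts: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over a benchmark, from (n_samples, n_correct) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in per_question_counts) / len(per_question_counts)

# Toy example: 3 questions, 8 samples each, with 0, 1, and 3 correct samples.
print(mean_pass_at_k([(8, 0), (8, 1), (8, 3)], k=3))
```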

6. Data Contamination, Integrity and Auditing

Search-time data contamination (STC) emerges as an acute concern for benchmark validity. STC occurs when web-retrieval agents surface exact or near-duplicate benchmark items with answers—most often from open repositories like HuggingFace—enabling agents to copy instead of genuinely infer (Han et al., 12 Aug 2025). Empirical results:

Agent                  Contamination rate   Accuracy (contaminated)   Accuracy (clean)   Accuracy drop ΔA (blocked)
Sonar Pro              3.36%                55%                       43%                ≈12%
Sonar Deep Research    3.36%                60%                       38%                ≈22%

Blocking contaminated sources (e.g., huggingface.co) yields a ~15% accuracy reduction on previously contaminated items. Recommended mitigations include multi-stage filters, canary strings, time-cutoff enforcement, full trajectory logging, and real-time substring audits; reporting contamination rates and post-mitigation accuracy is likewise necessary to maintain benchmark integrity.
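
A minimal sketch of the kind of source blocking, canary checking, and verbatim-substring auditing recommended above; the blocklist, canary string, window size, and log schema are illustrative assumptions rather than the cited study's tooling.

```python
from urllib.parse import urlparse

BLOCKED_DOMAINS = {"huggingface.co"}   # illustrative blocklist
CANARY = "HLE-CANARY-2f3a"             # illustrative canary string

def is_contaminated(retrieved_url: str, retrieved_text: str, question: str) -> bool:
    """Flag a retrieval hit as search-time contamination if it comes from a
    blocked repository, contains the benchmark canary, or reproduces a long
    verbatim substring of the question."""
    domain = urlparse(retrieved_url).netloc.lower()
    if any(domain == d or domain.endswith("." + d) for d in BLOCKED_DOMAINS):
        return True
    if CANARY in retrieved_text:
        return True
    # Crude verbatim audit: any 80-character window of the question appearing
    # verbatim in the retrieved page counts as near-duplicate leakage.
    windows = (question[i:i + 80] for i in range(0, max(len(question) - 80, 0) + 1, 40))
    return any(w in retrieved_text for w in windows)

def audit(log: list[dict]) -> float:
    """Fraction of logged retrieval events flagged as contaminated;
    each log entry is assumed to carry 'url', 'text', and 'question' fields."""
    flags = [is_contaminated(e["url"], e["text"], e["question"]) for e in log]
    return sum(flags) / len(flags) if flags else 0.0
```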

7. Limitations and Best Practices

HLE is strictly closed-ended and does not assess a system’s open-ended research creativity, interactive learning, or adaptive teaming. Limitations include:

  • Saturation of “closed-book” question types may eventually raise obsolescence risks as model generalization increases (Phan et al., 24 Jan 2025, Han et al., 12 Aug 2025).
  • Persistent calibration errors suggest the need for improved uncertainty measures.
  • Contamination risks necessitate transparent filtering, retrieval audits, and centralized logging for trustworthy benchmarking.

Best practices advocated across sources:

  1. Editor’s term: “Swiss cheese filtering”—layered domain/date/content filters on retrieval sources.
  2. Explicit reporting of search configurations and mitigation experiments.
  3. Public release of retrieval logs for external auditing.
  4. Post-hoc error analysis differentiating reasoning vs. knowledge gaps (Eigen-1: failure overlap >85%) (Tang et al., 25 Sep 2025).

Future benchmarks are expected to couple HLE’s static closed-ended difficulty with dynamic, open-ended tasks—interactive theorem proving, code synthesis, longitudinal research workflows—providing more holistic evaluations of agentic intelligence (Phan et al., 24 Jan 2025). Standardizing agentic monitoring, calibration-by-lexical-hint detection, and hybrid human–AI steering protocols remain active areas for further research.


In summary, Humanity's Last Exam delivers a technically rigorous, high-difficulty challenge that remains unsolved by existing LLMs and agents. It serves as both an evaluation suite and—through its curricular influence—a driver of frontier model and agent development. Its explicit focus on closed-ended reasoning, rigorous data integrity, and comprehensive subject coverage make HLE a core asset for advancing and safely assessing state-of-the-art AI capabilities.
