Humanity’s Last Exam Benchmark
- Humanity’s Last Exam is a comprehensive closed-ended benchmark that challenges AI with 2,500 expert-level questions spanning over 100 academic disciplines.
- It combines multi-modal questions with multi-stage expert review to resist retrieval shortcuts and expose significant gaps in AI reasoning.
- The benchmark drives AI research by inspiring multi-agent, tool-augmented workflows that improve calibration, problem decomposition, and overall reasoning performance.
Humanity’s Last Exam denotes both a conceptual and a technical milestone at the frontier of human knowledge, embodied primarily by the Humanity’s Last Exam (HLE) benchmark. HLE is framed as the definitive closed-ended academic benchmark, engineered to test LLMs on rigorous, expert-level questions that span the breadth and depth of advanced human academic expertise. Designed by a global coalition of subject-matter experts, the benchmark crystallizes the challenge of objectively assessing whether AI systems can match or surpass top human experts in closed-form academic reasoning. HLE, and related constructs such as the Humanity's Last Code Exam (HLCE), are central to the current evaluation and development of general-purpose AI and scientific agents.
1. Motivations and Origins
HLE was introduced in response to the saturation and declining utility of earlier LLM benchmarks such as MMLU, on which state-of-the-art models surpassed 90% accuracy, making further progress difficult to evaluate. With LLMs rapidly closing the gap on existing tests, there was a critical need for a new benchmark that would retain its discriminative power as models advanced. HLE is described as the intended “final closed-ended academic benchmark,” representing a conceptual boundary for measuring model performance on tasks that remain firmly in the domain of top human experts (Phan et al., 24 Jan 2025).
2. Structure and Content
The HLE benchmark comprises 2,500 questions spanning over a hundred academic disciplines—mathematics, natural sciences, humanities, social sciences, law, and more. Key features include:
- Question Types:
- Exact-match: Models must produce succinct answers exactly matching gold solutions.
- Multiple-choice: Typically five or more options, with a question approved only if LLM performance is no better than chance (a grading sketch follows this list).
- Review Process:
- Questions are screened in several phases. An initial automated LLM difficulty check (over 70,000 attempts) is followed by expert review in two rounds, ensuring that final items demand expert reasoning beyond retrieval or memorization.
- Multi-Modality:
- About 10% of questions require integration of textual and visual information.
- Resistance to Retrieval:
- Problems are crafted to avoid simple look-up, with explicit solutions, unambiguous grading, and no reliance on extraneous web queries (Phan et al., 24 Jan 2025).
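To make the grading and screening rules concrete, the following is a minimal Python sketch, assuming a simplified string-normalization grader and a plain accuracy threshold. The data structures and function names here are illustrative, not the benchmark's actual pipeline, which relies on structured answer formats and model-assisted judging rather than this comparison.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Question:
    prompt: str
    answer: str                          # gold answer
    choices: Optional[list[str]] = None  # None => exact-match question

def grade(question: Question, model_answer: str) -> bool:
    """Simplified exact-match grading: normalize whitespace and case, then compare."""
    norm = lambda s: re.sub(r"\s+", " ", s).strip().lower()
    return norm(model_answer) == norm(question.answer)

def passes_difficulty_screen(question: Question, model_answers: list[str]) -> bool:
    """Approve an item only if sampled frontier-model attempts do no better than chance
    (multiple-choice) or fail outright (exact-match) before it goes to expert review."""
    accuracy = sum(grade(question, a) for a in model_answers) / len(model_answers)
    if question.choices is None:
        return accuracy == 0.0            # exact-match: every sampled attempt must fail
    chance = 1.0 / len(question.choices)  # e.g. 0.2 with five options
    return accuracy <= chance
```

Under this rule, adding answer options lowers the chance threshold, so the screen becomes stricter, while exact-match items are approved only when every sampled attempt fails.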
3. LLM Performance and Metrics
Contemporary state-of-the-art LLMs display notably low performance on HLE. Recent evaluations report:
| Model | HLE Accuracy |
|---|---|
| GPT-4o | 3.3% |
| Grok 2 | 3.8% |
| Gemini 2.0 Flash | 6.2% |
| Other frontier LLMs | 9–10% |
Calibration—the alignment between predicted confidence and empirical accuracy—is also critically poor, with RMS calibration errors frequently exceeding 90%. Models tend to provide confidently incorrect outputs rather than indicate uncertainty, underscoring a substantial gap in reliable AI reasoning at the highest academic level (Phan et al., 24 Jan 2025).
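As context for these figures, RMS calibration error can be computed by binning answers according to the model's stated confidence and taking the confidence-weighted root-mean-square gap between mean confidence and empirical accuracy in each bin. The sketch below uses one common binned formulation; it is not taken from the source, and the paper's exact binning scheme may differ.

```python
import numpy as np

def rms_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Binned RMS calibration error: weighted RMS gap between mean stated
    confidence and empirical accuracy across confidence bins."""
    confidences = np.asarray(confidences, dtype=float)  # model-reported confidence in [0, 1]
    correct = np.asarray(correct, dtype=float)          # 1.0 if the answer was graded correct
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total, n = 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences >= lo) & (confidences < hi) if hi < 1.0 else (confidences >= lo)
        if mask.any():
            gap = confidences[mask].mean() - correct[mask].mean()
            total += (mask.sum() / n) * gap ** 2
    return float(np.sqrt(total))

# Example: a model that answers with high confidence but is usually wrong
# shows a large calibration error.
print(rms_calibration_error([0.95, 0.9, 0.99, 0.97], [0, 0, 1, 0]))  # ≈ 0.70
```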
An illustration (referenced as Figure 1 in the source) shows HLE’s sharply increased difficulty relative to previous benchmarks.
4. Implications for AI Research and Policymaking
HLE sets a new standard for evaluating progress in AI reasoning and domain knowledge:
- Research Impact:
- HLE challenges models to advance in multi-modal reasoning, deep problem decomposition, and self-calibration.
- It exposes the limitations of current training and inference paradigms in generating expert-level, closed-form answers.
- HLE provides a robust platform for iterative improvement, as current model scores sit well below expert human baselines, leaving substantial margin for growth (Phan et al., 24 Jan 2025).
- Policy Relevance:
- HLE offers an objective, public, and quantifiable measure of how close (or far) advanced models are from expert human generalists.
- It is therefore suited as a reference for governance, safety, and societal impact discussions—notably, claims about “AGI” capabilities can now be referenced against a gold standard.
5. Extensions: Humanity’s Last Code Exam (HLCE)
HLCE is an analog to HLE in the domain of program synthesis and algorithmic reasoning (Li et al., 15 Jun 2025). It consists of 235 highly challenging problems drawn from the finals of the International Collegiate Programming Contest (ICPC) and the International Olympiad in Informatics (IOI). Models such as o4-mini(high) and Gemini-2.5 Pro achieve pass@1 rates of just 15.9% and 11.4%, respectively. The benchmark incorporates interactive challenges and a unique “self-recognition” task, which measures a model’s ability to assess the correctness of its own code:
| Model | pass@1 (%) | Self-Recognition AUC |
|---|---|---|
| o4-mini(high) | 15.85 | 0.63 |
| ChatGPT-4o | 8.08 | 0.84 |
Self-recognition performance is only weakly correlated with code-generation skill, highlighting shortcomings in AI introspection. Test-time scaling laws indicate that longer or more thoughtful generations continue to improve results, signaling further improvement potential even for current models (Li et al., 15 Jun 2025).
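The pass@1 figures above belong to the pass@k family of metrics standard in code-generation evaluation. Below is a minimal sketch of the widely used unbiased estimator (Chen et al., 2021); whether HLCE computes its scores with exactly this estimator is an assumption rather than something stated in the source.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: given n sampled solutions of which c pass all
    tests, estimate the probability that at least one of k random samples passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 2 of 16 sampled programs pass the judge's tests.
print(pass_at_k(n=16, c=2, k=1))  # 0.125
```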
6. Scientific AI Agents and Agentic Workflows
Recent work evaluates HLE not only as a static benchmark but as a proving ground for multi-agent, tool-augmented, scientific workflows (Chai et al., 7 Jul 2025). The X-Master agent, an open-source tool-augmented reasoning system, combines standard LLM inference with dynamic Python code execution, web search, and document parsing:
- Solver: Generates solutions, invoking code or tools as needed.
- Critic and Rewriter: Iteratively refine candidate solutions with analytic feedback.
- Selector: Chooses the best output among generated candidates.
This “scattered-and-stacked” workflow yielded a SOTA score of 32.1% on HLE, surpassing prior closed-source records (26.6% by OpenAI, 26.9% by Google DeepMind Deep Research). Each incremental module in the workflow (Solver, Critic, Rewriter, Selector) delivered substantial gains, with results detailed in tabular form in the source (Chai et al., 7 Jul 2025).
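The outline below is a conceptual sketch of such a scattered-and-stacked workflow, not the actual X-Master implementation: call_llm is a hypothetical placeholder for a role-prompted model call, and the real system additionally interleaves Python code execution, web search, and document parsing inside each role.

```python
def call_llm(role: str, prompt: str) -> str:
    """Hypothetical placeholder for a role-prompted LLM call (solver, critic, rewriter, selector)."""
    raise NotImplementedError("plug in an actual model / tool-calling backend here")

def solve_question(question: str, n_solvers: int = 4, n_rounds: int = 2) -> str:
    # Scatter: independent Solver runs produce diverse candidate solutions,
    # each free to invoke tools (code execution, web search) inside the backend.
    candidates = [call_llm("solver", question) for _ in range(n_solvers)]

    # Stack: Critic feedback drives Rewriter passes that iteratively refine each candidate.
    for _ in range(n_rounds):
        critiques = [call_llm("critic", f"{question}\n\nCandidate:\n{c}") for c in candidates]
        candidates = [
            call_llm("rewriter", f"{question}\n\nCandidate:\n{c}\n\nCritique:\n{k}")
            for c, k in zip(candidates, critiques)
        ]

    # Select: a final pass picks the strongest candidate as the submitted answer.
    numbered = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
    return call_llm("selector", f"{question}\n\nCandidates:\n{numbered}")
```

The “scatter” phase buys diversity through independent Solver runs; the “stack” phases (Critic, Rewriter, Selector) spend additional test-time compute converting that diversity into a single refined answer.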
7. Future Perspectives
Humanity’s Last Exam represents a pivotal checkpoint in the quest for generalist AI. While routine academic benchmarks are approaching saturation, HLE and its extensions reassert meaningful evaluation by foregrounding questions that resist shortcut learning and retrieval-based solutions. Recent breakthroughs in tool-augmentation and agentic workflows demonstrate that meaningful progress—measured as substantial percentage-point gains on HLE—is achievable, especially when models are empowered to leverage external tools, web knowledge, and iterative refinement.
A plausible implication is that continued improvements will demand not only larger and richer models but the integration of agentic, multi-step workflows that simulate the exploratory dynamics of human reasoning and scientific inquiry. The release and adoption of HLE and HLCE as universal benchmarks are expected to refine not only the measurement of AI capability but the design of future systems themselves.
References
- "Humanity's Last Exam" (Phan et al., 24 Jan 2025)
- "Humanity's Last Code Exam: Can Advanced LLMs Conquer Human's Hardest Code Competition?" (Li et al., 15 Jun 2025)
- "SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation: Can We Lead on Humanity's Last Exam?" (Chai et al., 7 Jul 2025)