HLE: Humanity’s Last Exam
- HLE is a multifaceted concept that serves as both an AI benchmark with expert-level exam questions and a metaphor for assessing civilization's response to existential threats.
- It employs a rigorously curated, multi-modal exam format with exact-match and multiple-choice questions to expose limitations in advanced models via quantitative metrics such as sub-10% accuracy.
- HLE emphasizes the urgent need for improved safety measures and transformative educational reforms to bridge the gap between human expertise and current machine capabilities.
Humanity’s Last Exam (HLE) is a multifaceted concept emerging in the context of advanced artificial intelligence, the existential risks confronting the species, and the limits of human and machine cognition. As both metaphor and operational benchmark, HLE encapsulates the scientific, technological, sociological, and epistemological “final test” humanity faces at the boundaries of its capabilities and survival. It refers simultaneously to: (1) a class of high-difficulty evaluation benchmarks intended to measure AI systems at or beyond the human expert frontier, (2) the ensemble of existential threats for which civilization’s ability to adapt and respond is “examined,” and (3) philosophical frameworks interrogating the nature and sufficiency of human information processing in the digital age. The term is concretized through rigorous datasets and methodologies, notably the eponymous “Humanity’s Last Exam” benchmark for LLMs, but is also referenced more broadly in simulation theory, existential risk research, and the pedagogical transformation wrought by AI.
1. Conceptual Foundations and Definitions
HLE as a formal benchmark is motivated by the saturation of previous AI evaluation datasets—for example, models achieving >90% accuracy on MMLU—leading to a need for harder, expert-level tests that systematically challenge both AI and human participants (Phan et al., 24 Jan 2025). At a more theoretical level, “Humanity’s Last Exam” signifies the point at which humanity is collectively tested on whether it can preserve its function and agency in the face of high-impact threats and novel digital pathologies (Softky, 2017, Jiang et al., 2022).
This dual framing—one operational, the other existential—yields definitions that lie at the intersection of academic rigor (in benchmarking and evaluation) and global survival (in risk and resilience). HLE thus serves as both a yardstick for advanced cognition (human and machine) and a metaphor for the comprehensive, perhaps final, assessment of civilization under unprecedented conditions.
2. The HLE Academic Benchmark: Structure, Development, and Coverage
The core instantiation of HLE as an academic benchmark is a multi-modal, globally curated suite of 2,500 challenging exam-style questions spanning mathematics, natural sciences, engineering, humanities, and trivia (Phan et al., 24 Jan 2025). Its distinguishing features include:
- Format: 80% exact-match short answer (automatically graded), 20% multiple-choice (≥5 options); 10% of questions are multi-modal (requiring image and text integration).
- Difficulty: Only items that “stump” state-of-the-art LLMs (i.e., performance below random chance on multiple-choice questions, or far below human expert levels) survive filtering and are included.
- Curation: Developed by over 1,000 subject-matter experts from 500+ institutions, with multi-round peer review screening for clarity, absence of ambiguity, and non-searchability.
- Verification: All answers are unambiguous and easily verifiable but resist solution via memorization or retrieval.
- Incentives: Open access, co-authorship, and monetary prizes for top question contributions ensure ongoing quality control and dataset expansion.
This benchmark is engineered to probe the “upper frontier” of formal reasoning, technical knowledge, and multi-step problem-solving, going well beyond current general benchmarks. It is intentionally designed as a “final closed-ended academic benchmark” to expose how far current LLMs remain from the expert frontier and to prevent the field from mistaking benchmark saturation for comprehensive progress.
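To make the format concrete, below is a minimal sketch of how an HLE-style item and its automatic grading might be represented. The field names (question, answer, answer_type, choices, image) and the normalized string-match grader are illustrative assumptions, not the benchmark’s actual schema or grading pipeline.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class HLEItem:
    """Illustrative record for one HLE-style question (field names are assumed)."""
    question: str
    answer: str                              # gold answer: unambiguous and verifiable
    answer_type: str                         # "exact_match" or "multiple_choice"
    choices: Optional[Sequence[str]] = None  # >=5 options when multiple choice
    image: Optional[bytes] = None            # present for the ~10% multi-modal items

def grade(item: HLEItem, model_output: str) -> bool:
    """Simplified automatic grading: normalized exact match against the gold answer."""
    normalize = lambda s: " ".join(s.strip().lower().split())
    return normalize(model_output) == normalize(item.answer)

# Example: an exact-match item is counted correct only on a (normalized) verbatim match.
item = HLEItem(question="What is the Euler characteristic of the torus?",
               answer="0", answer_type="exact_match")
print(grade(item, " 0 "))  # True
```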
3. AI Benchmark Results, Model Limitations, and Calibration
Evaluation of state-of-the-art LLMs (including GPT-style architectures from OpenAI and research models from Google DeepMind) on HLE has revealed critical limitations:
- Models typically achieve <10% accuracy, with leading open-source agentic workflows (e.g., X-Master) breaching the 30% threshold for the first time (Chai et al., 7 Jul 2025).
- Confidence calibration is poor on HLE; models output incorrect answers with stated confidence grossly disconnected from observed accuracy, as shown by RMS calibration errors >80–90% (Phan et al., 24 Jan 2025). The RMS calibration error is the bin-weighted root-mean-square gap between average confidence and observed accuracy, $\sqrt{\sum_{b} \frac{|B_b|}{n}\left(\mathrm{conf}(B_b) - \mathrm{acc}(B_b)\right)^2}$, where the sum runs over confidence bins $B_b$ of $n$ total predictions (see the computational sketch after this list).
- Specialized agentic architectures that integrate code generation, external tool APIs, and multi-agent critique (e.g., X-Master, STELLA, CLIO) offer significant but still partial gains, reaching ~32% (X-Master), 26% (STELLA), and 22% (CLIO, biology/medicine subset) accuracy (Chai et al., 7 Jul 2025, Jin et al., 1 Jul 2025, Cheng et al., 4 Aug 2025); a simplified sketch of this generate-execute-critique pattern appears at the end of this section.
- Scaling laws at inference time demonstrate only sublinear improvement with extended computation, indicating fundamental reasoning and generalization bottlenecks even in high-capacity models (Li et al., 15 Jun 2025).
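The following is a minimal computational sketch of the binned RMS calibration error referenced above, assuming equal-width confidence bins; the bin count and binning scheme are illustrative choices rather than the exact configuration used in the HLE evaluation.

```python
import numpy as np

def rms_calibration_error(confidences, correct, n_bins=10):
    """Binned RMS calibration error: sqrt of the bin-weighted mean squared gap
    between average stated confidence and observed accuracy."""
    conf = np.asarray(confidences, dtype=float)
    corr = np.asarray(correct, dtype=float)
    n = len(conf)
    # Assign each prediction to an equal-width confidence bin over [0, 1].
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    err = 0.0
    for b in range(n_bins):
        mask = bins == b
        if not mask.any():
            continue
        gap = conf[mask].mean() - corr[mask].mean()  # confidence minus accuracy in this bin
        err += (mask.sum() / n) * gap ** 2
    return float(np.sqrt(err))

# Overconfident toy model: high stated confidence, mostly wrong -> RMS error near 0.7 (70%).
print(rms_calibration_error([0.95, 0.90, 0.99, 0.85], [0, 0, 1, 0]))
```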
These results expose a stark gap between saturation on prior benchmarks and genuine expert-level reasoning, reinforcing the role of HLE as a true stress test for advanced AI.
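To illustrate the agentic pattern described above, here is a heavily simplified generate-execute-critique loop; the call_llm placeholder and the control flow are assumptions for exposition and do not reflect the actual X-Master, STELLA, or CLIO implementations.

```python
import subprocess, sys, tempfile

def call_llm(prompt: str) -> str:
    """Placeholder for a model API call; wire this to whatever LLM backend is available."""
    raise NotImplementedError

def agentic_answer(question: str, max_rounds: int = 3) -> str:
    """Generate code, execute it as an external tool, and let a critic decide when to stop."""
    transcript = f"Question: {question}\n"
    for _ in range(max_rounds):
        # 1. Generation: ask the model for executable code that works toward the answer.
        code = call_llm(transcript + "Write Python code that computes the answer.")
        # 2. Tool use: run the code in a subprocess and capture its output.
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)
        result = subprocess.run([sys.executable, f.name],
                                capture_output=True, text=True, timeout=60)
        transcript += f"Tool output:\n{result.stdout or result.stderr}\n"
        # 3. Critique: a second pass judges whether the evidence suffices.
        verdict = call_llm(transcript + "Is this sufficient to answer? Reply DONE or REVISE.")
        if verdict.strip().upper().startswith("DONE"):
            break
    return call_llm(transcript + "State the final answer concisely.")
```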
4. HLE in the Context of Existential and Civilizational Risk
Outside the computational domain, “Humanity’s Last Exam” operates as a unifying metaphor for the critical, convergent challenges—termed existential risks or “Great Filters”—that threaten human continuity (Jiang et al., 2022). Probabilistic models from the literature assign mean survival horizons such as the following:
| Threat | Mean survival estimate |
|---|---|
| Nuclear war | ~60 years |
| Pandemics | ~16 years |
| Climate change | ~193 years |
| Asteroid impact | ~1754 years |
| AI takeover | ~40 years (with uncertainty) |
These statistical estimates are underpinned by models leveraging polynomial extinction probabilities, Taylor series, and regression over technological indicators (e.g., FLOPS compared to biological brains). The HLE metaphor thus analogizes existential risk mitigation to a collective, high-stakes examination, where failure results in irreversible loss of agency or extinction.
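As a simplified, back-of-the-envelope illustration of how such horizons relate to annual risk (not the cited papers’ actual models), a constant annual extinction probability p implies a geometrically distributed time to catastrophe with mean 1/p years:

```python
# Simplified illustration (not the cited models): under a constant annual
# extinction probability p, the time to catastrophe is geometric with mean 1/p years.
def mean_survival_years(annual_probability: float) -> float:
    return 1.0 / annual_probability

# An assumed ~1.7% annual risk corresponds to a mean horizon of roughly 60 years,
# the same order of magnitude as the nuclear-war entry in the table above.
print(round(mean_survival_years(0.017)))  # ~59
```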
Policy and governance implications are prominent: to “pass” HLE in this sense requires coordinated international action, aggressive risk mitigation (across arms control, biosafety, climate, AI regulation, and planetary defense), and deep reforms in education and societal behavior to promote long-term resilience.
5. Informational and Cognitive Perspectives
A separate but connected thread frames HLE through the lens of information biology and the informational structure of life (Softky, 2017). This paradigm asserts that human cognitive and sensorimotor function is decaying under the impact of low-entropy, temporally degraded digital signals—threatening the calibration required for adaptive intelligence. The HLE here is reconceptualized as humanity’s struggle to:
- Maintain the balance between selection-amplification (narrowing) and mutation-diffusion (broadening) forces in evolving information systems.
- Restore high-bandwidth, high-entropy sensory environments (via “paleo” practices, social resonance, and embodied interaction).
- Counteract negative feedback loops—termed “pinging” epidemics—arising from mediated digital communication, which degrade both individual and societal homeostasis.
- Foster environments where affection, trust, and entropy-rich stimuli reacquire primacy over digitally induced fragmentation.
The underlying quantitative claim is that rich, temporally precise information, coupled with resonance, is foundational for resilient cognition.
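A generic illustration of the entropy contrast this argument rests on (not a model from the cited work): a signal crushed to a few repeated levels carries far less Shannon entropy per sample than a broadband, variable one.

```python
import numpy as np

def shannon_entropy_bits(samples, n_levels=256):
    """Empirical Shannon entropy (bits/sample) of a signal quantized to n_levels."""
    hist, _ = np.histogram(samples, bins=n_levels, range=(0.0, 1.0))
    p = hist[hist > 0] / hist.sum()
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(0)
rich = rng.random(100_000)           # broadband, high-variability signal
flat = np.round(rich * 3) / 3        # the same signal crushed to 4 levels
print(shannon_entropy_bits(rich), shannon_entropy_bits(flat))  # ~8 bits vs ~2 bits
```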
6. Epistemological and Pedagogical Implications
HLE has redefined the boundaries of both machine and human evaluation. In AI pedagogy and exam generation:
- LLMs can generate and solve complex multi-part exam questions (“From Human Days to Machine Seconds” (Drori et al., 2022)), automating what once required expert human labor. This demands new educational strategies, such as embedding meta-level critical-thinking tasks and requiring students to “examine the answers” for correctness and originality, training them to engage critically with AI-generated outputs.
In information retrieval and QA systems, HLE analogs have fostered “exam-based” evaluation frameworks that focus on answerability and coverage of key informational facets, moving away from mere passage relevance towards comprehensive content assessment (Farzi et al., 1 Feb 2024).
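A minimal sketch of the exam-based evaluation idea, in the spirit of such frameworks rather than the cited implementation: a candidate passage is scored by the fraction of exam questions it allows a QA backend to answer, instead of by graded relevance alone. The candidate_passages and my_qa_model names in the usage comment are hypothetical.

```python
from typing import Callable, Sequence, Tuple

def exam_coverage(passage: str,
                  exam: Sequence[Tuple[str, str]],
                  answer_with: Callable[[str, str], str]) -> float:
    """Fraction of exam questions answerable from the passage alone.

    `exam` holds (question, gold_answer) pairs; `answer_with(passage, question)`
    is any QA backend (an LLM, an extractive reader, ...) restricted to the passage.
    """
    answered = sum(
        1 for question, gold in exam
        if answer_with(passage, question).strip().lower() == gold.strip().lower()
    )
    return answered / len(exam)

# Usage (hypothetical names): rank retrieved passages by facet coverage, not raw relevance.
# best = max(candidate_passages, key=lambda p: exam_coverage(p, exam, answer_with=my_qa_model))
```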
7. Future Directions and Open Challenges
The current state of HLE benchmarking and existential analysis reveals a research and societal agenda characterized by:
- Ongoing need for more robust, multi-step reasoning architectures capable of interacting with complex external tools, adapting via dynamic cognitive loops, and transparently exposing their decision-making for human oversight (Jin et al., 1 Jul 2025, Cheng et al., 4 Aug 2025).
- Imperative to bridge the calibration gap, such that high-confidence predictions map accurately onto observed performance.
- Integration of agentic systems (e.g., X-Master, STELLA, CLIO) as both research testbeds and practical partners in science, engineering, and biomedical discovery.
- Synthesis between risk management research and cognitive/AI evaluation: progress in one area informs strategic imperatives in the other.
Collectively, HLE stands as a focal point for the intersection of advanced benchmarking, human–AI collaboration, existential risk mitigation, and the epistemic limits of computation and society. Its dual operational and metaphorical usage not only drives progress in artificial intelligence but also crystallizes the challenges of maintaining function, integrity, and agency at the edge of human capability.