Exam-based Benchmarks
- Exam-based benchmarks are structured evaluation tools that use standardized exam questions to assess AI's knowledge, reasoning, and problem-solving abilities.
- They incorporate diverse formats—multiple-choice, open-ended, code-based—and cover disciplines such as science, mathematics, law, and programming.
- These benchmarks enable precise comparison against human experts while addressing challenges like data leakage through dynamic sampling and adversarial filtering.
Exam-based benchmarks are structured evaluation instruments that use exam-style questions or tasks—often curated from real educational, professional, or competitive assessments—to systematically probe the knowledge, reasoning, and practical skills of AI systems. These benchmarks span question formats (multiple-choice, short answer, exact match, open-ended, practical code, or multimodal reasoning), subject areas (science, mathematics, law, programming, culture-specific content), and abstraction levels (factual recall, calculation, multi-hop reasoning, and creative problem solving). Their adoption is motivated by the dual goals of (1) providing a transparent, interpretable measure of progress at or above the human-expert frontier and (2) enforcing a rigorously standardized task distribution which supports fine-grained comparison across AI systems, algorithms, and training paradigms.
1. Rationale and Foundational Principles
Exam-based benchmarks provide a mechanism to align AI system evaluation with familiar and societally valued standards, leveraging the infrastructure of educational or professional exams for constructing objective tasks. Their appeal derives from several features:
- Standardization: Exam questions are often crafted and vetted by domain experts according to curricular or professional norms, ensuring consistency in difficulty and coverage.
- Breadth and Depth: Collections like Humanity’s Last Exam (HLE) (Phan et al., 24 Jan 2025), EESE (Wang et al., 22 Jul 2025), and SciEx (Dinh et al., 14 Jun 2024) span hundreds of subjects and cover a distribution of cognitive tasks from foundational recall to novel synthesis and reasoning.
- Automatability: Many exam formats permit automatic grading through exact-match, multiple choice, or precise numeric responses, enabling scalable evaluation and leaderboard tracking.
- Human Comparability: Direct comparison with human benchmarks—whether average students, experts, or competition winners—is possible when the same questions are administered to both populations, as in code competitions (Li et al., 15 Jun 2025), legal exams (II et al., 2022), or open university science exams (Dinh et al., 14 Jun 2024).
However, as discussed in (Davis, 2014), standardized exam formats are not without limitations. Importantly, what is challenging for human test-takers may not align with the knowledge structures or failure modes of AI systems, which can exploit dataset idiosyncrasies or lack commonsense reasoning.
2. Construction Methodologies
Exam-based benchmarks can be constructed through several complementary methods, each with corresponding trade-offs in coverage, validity, and resistance to overfitting:
a. Sourcing from Real-World Exams
Benchmarks like SeaExam (Liu et al., 10 Feb 2025) and HLE (Phan et al., 24 Jan 2025) are built by systematically curating questions from official standardized tests, competitive programming contests (as in HLCE (Li et al., 15 Jun 2025)), or university examinations (Dinh et al., 14 Jun 2024). Key validation steps include:
- Expert Review: Subject-matter experts vet questions for correctness, clarity, and discriminative power.
- Filtering and Adversarial Curation: Questions that can be solved by current state-of-the-art models are removed (HLE (Phan et al., 24 Jan 2025)), ensuring that the benchmark reflects the active research frontier.
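A minimal sketch of such an adversarial filtering pass is shown below. The `query_model` callable and the question schema are illustrative placeholders, not the actual HLE pipeline:

```python
# Adversarial filtering sketch: retain only questions that none of the
# current frontier models answer correctly. The question schema and the
# `query_model` interface are illustrative assumptions.
from typing import Callable, Dict, List

def adversarial_filter(
    questions: List[Dict],                   # each item: {"prompt": str, "answer": str}
    frontier_models: List[str],
    query_model: Callable[[str, str], str],  # (model_name, prompt) -> model answer
) -> List[Dict]:
    retained = []
    for q in questions:
        # A question survives only if every frontier model gets it wrong.
        solved = any(
            query_model(m, q["prompt"]).strip().lower() == q["answer"].strip().lower()
            for m in frontier_models
        )
        if not solved:
            retained.append(q)
    return retained
```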
b. Synthetic and Automated Exam Generation
Approaches like ExamGAN (Wu et al., 2021) employ generative adversarial networks conditioned on knowledge mastery representations to synthesize exam scripts matching desired distributions of difficulty, topic coverage, and expected discriminability. This supports rapid construction of large-scale, balanced evaluation sets with statistical properties mirroring real student populations.
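ExamGAN's generative machinery is beyond a short sketch, but the underlying assembly constraint of producing a script whose topic coverage and mean difficulty match a target can be illustrated with a simple rejection sampler over an item bank (the function, field names, and calibrated difficulty scores are all hypothetical):

```python
# Rejection-sampling sketch for constraint-matched exam assembly; a simplified
# stand-in for illustration, not the ExamGAN model itself.
import random
from typing import Dict, List

def assemble_exam(
    item_bank: List[Dict],        # each item: {"topic": str, "difficulty": float in [0, 1]}
    topic_quota: Dict[str, int],  # desired number of items per topic
    target_difficulty: float,     # desired mean difficulty of the script
    tolerance: float = 0.05,
    max_tries: int = 1000,
) -> List[Dict]:
    by_topic = {t: [q for q in item_bank if q["topic"] == t] for t in topic_quota}
    for _ in range(max_tries):
        # Draw the quota for each topic, then accept the script only if its
        # mean difficulty lands within the tolerance band.
        script = [q for topic, k in topic_quota.items()
                  for q in random.sample(by_topic[topic], k)]
        mean_difficulty = sum(q["difficulty"] for q in script) / len(script)
        if abs(mean_difficulty - target_difficulty) <= tolerance:
            return script
    raise RuntimeError("No script satisfied the difficulty constraint")
```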
c. Dynamic and Leakage-Resistant Sampling
The Ever-Evolving Science Exam (EESE) (Wang et al., 22 Jul 2025) introduces a dynamic methodology: a massive, non-public EESE-Pool (>100,000 QA pairs) is periodically sampled to generate public, leakage-resilient evaluation subsets (e.g., 500 instances). This periodic refreshing, combined with non-public pools and adversarial vetting, reduces the risk that models encounter evaluation data during pre-training (“data leakage”), thus preserving the integrity of test-time assessment.
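A minimal sketch of periodic, leakage-resistant sampling is given below; the release-tag seeding and subset size are illustrative assumptions rather than the published EESE procedure:

```python
# Deterministic per-release sampling from a private pool: each refresh draws a
# different public subset, so items memorized from one release do not carry over.
import hashlib
import random
from typing import Dict, List

def sample_public_subset(
    private_pool: List[Dict],  # non-public QA pairs (e.g., >100,000 items)
    release_tag: str,          # e.g., "2025-Q3"; changes every refresh cycle
    subset_size: int = 500,
) -> List[Dict]:
    # Seed on the release tag: reproducible internally, different across releases.
    seed = int(hashlib.sha256(release_tag.encode()).hexdigest(), 16) % (2**32)
    rng = random.Random(seed)
    return rng.sample(private_pool, subset_size)
```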
d. Multimodal and Multilingual Coverage
Modern exam benchmarks include images, diagrams, or audio (SciEx (Dinh et al., 14 Jun 2024), SFE (Zhou et al., 12 Jun 2025)) and cover multiple languages (SeaExam (Liu et al., 10 Feb 2025), SciEx (Dinh et al., 14 Jun 2024)). This tests both the multimodal reasoning capabilities and linguistic generalization of advanced AI models.
3. Evaluation Protocols and Grading Mechanisms
Grading in exam-based benchmarks is contingent on the question format and the cognitive targets of interest:
a. Automated Grading
For closed-ended formats (multiple choice, exact match, numeric), automated grading is standard. Typical metrics include accuracy (percent of correct answers), pass@k in code benchmarks (Xie et al., 31 Mar 2024, Li et al., 15 Jun 2025), precision/recall on answerable questions (Farzi et al., 1 Feb 2024), or similarity to gold responses.
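Two of these metrics are simple enough to sketch directly: exact-match accuracy and the unbiased pass@k estimator commonly used in code-generation evaluation, where n solutions are sampled per problem and c of them pass the tests:

```python
from math import comb
from typing import List

def exact_match_accuracy(predictions: List[str], references: List[str]) -> float:
    """Fraction of predictions matching the reference after light normalization."""
    correct = sum(p.strip().lower() == r.strip().lower()
                  for p, r in zip(predictions, references))
    return correct / len(references)

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples, drawn
    from the n generated solutions (c of which are correct), passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 sampled solutions per problem, 3 pass the tests; estimate pass@5.
print(pass_at_k(n=20, c=3, k=5))
```

Averaging pass@k over all problems in the benchmark yields the headline score reported on leaderboards.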
b. Human Expert Grading
Freeform and open-ended answers, especially in science or programming exams (SciEx (Dinh et al., 14 Jun 2024), SFE (Zhou et al., 12 Jun 2025)), require human expert evaluation. In SciEx, university lecturers assign partial scores and qualitative feedback, enabling nuanced assessment of depth, correctness, and presentation. This is resource-intensive but yields robust evaluation across subjective and complex domains.
c. LLM-as-a-Judge
To scale freeform grading, “LLM-as-a-judge” (Dinh et al., 14 Jun 2024, Bai et al., 2023) applies strong LLMs, prompted with the question, candidate answer, and reference solution, to assign a score. Chain-of-thought rationales and carefully designed few-shot prompts substantially improve alignment with expert judgment, with some experiments yielding Pearson correlations above 0.94 between LLM and human scores (Dinh et al., 14 Jun 2024).
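A minimal LLM-as-a-judge loop might look like the sketch below, where `call_llm` is a placeholder for any LLM API and the prompt template, scoring scale, and score-parsing convention are illustrative choices; the small Pearson helper is the kind of meta-evaluation used to check agreement with human graders:

```python
from typing import Callable, List

JUDGE_PROMPT = """You are grading a university exam answer.
Question:
{question}

Reference solution:
{reference}

Candidate answer:
{candidate}

Reason step by step about correctness and completeness, then end with a
final line of the form 'SCORE: <number between 0 and 10>'."""

def judge_answer(question: str, reference: str, candidate: str,
                 call_llm: Callable[[str], str]) -> float:
    """Score a freeform answer with an LLM judge and parse the trailing SCORE line."""
    completion = call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate))
    for line in reversed(completion.splitlines()):
        if line.strip().upper().startswith("SCORE:"):
            return float(line.split(":", 1)[1].strip())
    raise ValueError("Judge response contained no SCORE line")

def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation, used to compare judge scores with expert scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)
```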
d. Peer or Decentralized Examination
To mitigate bias from any single examiner (human or model), peer-evaluation frameworks aggregate judgments from multiple models (e.g., voting schemes over GPT-4, Claude, Vicuna, etc. (Bai et al., 2023)) or employ pairwise ranking, which requires O(n log n) comparisons, to estimate robust overall scores.
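The aggregation itself can be as simple as majority voting over pairwise preferences, with a comparison sort supplying the O(n log n) bound; the judge interface below is a hypothetical placeholder:

```python
from functools import cmp_to_key
from typing import Callable, List

Judge = Callable[[str, str], str]  # returns "A" or "B" for a pair of answers

def peer_preference(answer_a: str, answer_b: str, judges: List[Judge]) -> int:
    """Majority vote across judge models; negative means answer_a is preferred."""
    votes_a = sum(1 for judge in judges if judge(answer_a, answer_b) == "A")
    return -1 if votes_a >= len(judges) - votes_a else 1

def rank_answers(answers: List[str], judges: List[Judge]) -> List[str]:
    """Rank answers by aggregated pairwise preference using a comparison sort,
    which issues O(n log n) pairwise judgments."""
    return sorted(answers, key=cmp_to_key(lambda a, b: peer_preference(a, b, judges)))
```

Because aggregated preferences can be non-transitive, practical systems often add redundancy, such as repeated comparisons, on top of this basic scheme.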
4. Benchmark Impact and Model Differentiation
Exam-based benchmarks have been critical in revealing performance ceilings and capabilities:
- Difficulty Frontier: Benchmarks like HLE (Phan et al., 24 Jan 2025) and HLCE (Li et al., 15 Jun 2025) show that while foundation models have saturated earlier tests (e.g., >90% on MMLU), their accuracy drops dramatically (often below 10% or pass@1 of ~15%) on adversarially filtered, expert-level questions or advanced programming challenges.
- Discipline-Specific Analysis: Comprehensive categorization enables fine-grained competence mapping; e.g., EESE (Wang et al., 22 Jul 2025) can differentiate model strengths (math, physics, biology, etc.) and track scaling trends, while SFE (Zhou et al., 12 Jun 2025) decomposes scientific reasoning into perception, attribute understanding, and comparative reasoning.
- Practical Relevance: Performance on realistic assessment tasks underpins claims about readiness for deployment in domains like legal reasoning (II et al., 2022), university-level science (Dinh et al., 14 Jun 2024), and regionally localized applications (Liu et al., 10 Feb 2025).
The table below summarizes selected benchmarks and their distinguishing features:
| Benchmark | Coverage/Source | Modalities | Grading | Notable Findings |
|---|---|---|---|---|
| Humanity's Last Exam (HLE) (Phan et al., 24 Jan 2025) | 2.5K expert-crafted, adversarially filtered questions | Text, images | Auto (MC/exact match), experts | <10% accuracy for SOTA models, high calibration error |
| SciEx (Dinh et al., 14 Jun 2024) | University CS exams (German, English) | Text, images, math | Experts, LLM judges | LLMs average <60%; GPT-4V near human level; freeform answers remain hard |
| HLCE (Li et al., 15 Jun 2025) | ICPC/IOI problems, 2010–2024 | Code generation, interactive I/O | Execution-based, AUC | Pass@1 <16% for best models; self-recognition gap |
| SeaExam (Liu et al., 10 Feb 2025) | Official Southeast Asian school exams | Multilingual text (local languages) | Auto (MC), human | Accuracy lower in local languages than on English translations, revealing gaps |
| EESE (Wang et al., 22 Jul 2025) | 100K+ QA pool, periodically sampled | Text, all sciences | Auto, expert | Dynamic sampling separates model strengths; leakage-resistant |
5. Challenges, Limitations, and Design Considerations
Despite their advantages, exam-based benchmarks must confront several technical and conceptual issues:
- Data Leakage and Contamination: Widely released or static benchmarks become incorporated into model pretraining data, invalidating claims of generalization. EESE’s periodic, non-public sampling (Wang et al., 22 Jul 2025) and HLE’s private question pool address this risk.
- Benchmark-Driven Selection: As argued in (Spelda et al., 13 Aug 2025), the use of benchmarks as part of training curricula—intentionally or due to data contamination—undermines their efficacy as unseen evaluations. What was intended as a measurement tool becomes a learning curriculum, potentially inflating performance.
  - Models like DeepSeek‑R1 demonstrate improved results after exposure to benchmark tasks during training, exemplifying this pitfall.
  - To mitigate this, future efforts must maintain contaminated/uncontaminated splits and adapt evaluation pipelines (e.g., held-out, non-public, or rolling benchmarks); a minimal contamination check is sketched after this list.
- Alignment with Human Concepts of Difficulty: As documented in (Davis, 2014) and (Dinh et al., 14 Jun 2024), question difficulty as perceived by humans may not correlate with AI difficulty, since AIs can exploit surface cues unnoticed by test designers or lack domain-transfer abilities.
- Coverage versus Depth: Scaling exam benchmarks to cover multiple disciplines, cognitive levels, languages, and modalities (EESE, SFE, SciEx) is essential for forward compatibility, but it increases annotation, review, and maintenance overhead.
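For the contamination issue raised above, a common first-pass heuristic is n-gram overlap between benchmark items and the training corpus; the sketch below uses a 13-token window and whitespace tokenization purely as illustrative choices:

```python
from typing import List, Set, Tuple

def ngrams(text: str, n: int = 13) -> Set[Tuple[str, ...]]:
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(question: str, training_docs: List[str], n: int = 13) -> bool:
    """Flag a benchmark item if any n-gram also occurs in a training document."""
    q_grams = ngrams(question, n)
    return any(q_grams & ngrams(doc, n) for doc in training_docs)

def split_by_contamination(questions: List[str], training_docs: List[str]):
    """Partition items into uncontaminated and contaminated evaluation splits."""
    clean, contaminated = [], []
    for q in questions:
        (contaminated if is_contaminated(q, training_docs) else clean).append(q)
    return clean, contaminated
```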
6. Extensions, Innovations, and Future Directions
Emerging trends in exam-based benchmarks refine and extend core methodologies:
- Automated and Adaptive Question Generation: Models like ExamGAN (Wu et al., 2021) dynamically generate exam scripts with targeted difficulty and coverage, facilitating ongoing expansion of evaluation pools without solely relying on human input.
- Execution-Based and Practical Benchmarks: In domains such as code generation, frameworks like CodeBenchGen (Xie et al., 31 Mar 2024) emphasize functional evaluation via automated test cases, raising the bar for practical, reproducible assessment (a minimal execution harness is sketched after this list).
- Iterative and Dynamic Benchmarking: Periodic updating (EESE), adversarial filtering (HLE), and non-public test pools create moving targets resistant to memorization and overfitting.
- Peer- and LLM-Driven Grading: Techniques leveraging LLMs for grading and meta-evaluation (LLM-as-a-Judge (Dinh et al., 14 Jun 2024), decentralized peer examination (Bai et al., 2023)) enable benchmarking on open-ended, subjective, or creative tasks at scale, with robust correlation to expert assessment.
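A stripped-down execution-based check, in the spirit of such frameworks but not their actual harness, runs each candidate solution against unit tests in an isolated subprocess:

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_candidate(solution_code: str, test_code: str, timeout_s: float = 10.0) -> bool:
    """Execute a candidate solution plus its unit tests in a subprocess.

    Real harnesses add sandboxing (containers, resource and network limits);
    this sketch only isolates execution in a temporary directory with a timeout.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate_with_tests.py"
        script.write_text(solution_code + "\n\n" + test_code)
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        return result.returncode == 0
```

Aggregating these boolean outcomes per problem yields the per-problem pass counts consumed by the pass@k estimator sketched in Section 3.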
7. Societal Impact and Policy Relevance
Exam-based benchmarks anchor public and regulatory discourse on AI progress. By mapping model performance to widely recognized standards (e.g., the Bar Exam (II et al., 2022), “medal-level” programming (Li et al., 15 Jun 2025), or advanced science exams (Wang et al., 22 Jul 2025)), policy discussions on AI deployment, capability bounds, and safety find a concrete reference point. Their ongoing refinement and global, multilingual expansion (SeaBench/SeaExam (Liu et al., 10 Feb 2025), EESE) reinforce their centrality in cross-cultural and interdisciplinary assessment of next-generation AI systems.
Exam-based benchmarks represent an evolving synthesis of robust evaluation protocols, rigorous expert-driven content, and scalable, dynamic infrastructures. They play an indispensable role in mapping, differentiating, and ultimately advancing the frontier of scientific, technical, and cognitive capabilities in artificial intelligence systems.