JEE for STEM: Benchmarking AI Evaluations
- JEE for STEM is a canonical framework using authentic JEE Advanced exam questions to assess high-level STEM problem solving in AI.
- It integrates multimodal, bilingual, and diagram-rich items that require long-horizon deductive reasoning and cross-domain concept integration.
- Benchmark evaluations employ metrics such as Pass@1, marks percentage, and self-correction to highlight both AI capabilities and performance gaps.
The Joint Entrance Examination (JEE) Advanced serves as a canonical assessment of pre-college STEM proficiency and high-level scientific reasoning in the Indian educational context. Recent years have seen the emergence of JEE-based benchmarks, notably mmJEE-Eval and JEEBench, as principal test suites for evaluating LLMs and vision-LLMs (VLMs) on challenging, multistep STEM problem solving. These benchmarks foreground long-horizon deductive reasoning, cross-lingual understanding, and robust interpretability of multimodal scientific input, providing a rigorous alternative to existing breadth-focused datasets.
1. Dataset Composition and Structure
Both mmJEE-Eval (Mukherjee et al., 12 Nov 2025) and JEEBench (Arora et al., 2023) harness authentic JEE Advanced exam problems to target high-level pre-university Physics, Chemistry, and Mathematics domains.
| Benchmark | Years Covered | Total Questions | Modalities | Bilinguality | Image/Diagram % |
|---|---|---|---|---|---|
| mmJEE-Eval | 2019–2025 | 1,460 | Images, LaTeX | English, Hindi | 30.3 |
| JEEBench | 2016–2023 | 515 | Text, LaTeX | English | n/a |
Key structural features:
- Subject balance: mmJEE-Eval allocates 492 Chemistry (33.7%), 492 Mathematics (33.7%), and 476 Physics (32.6%) questions; JEEBench distributes 236 Mathematics, 123 Physics, and 156 Chemistry items.
- Bilinguality and format: mmJEE-Eval provides questions in both English and Hindi, delivered as native exam booklet images, supporting rigorous cross-lingual evaluation.
- Multimodality: 30.3% of mmJEE-Eval questions require explicit image interpretation—including diagrams of physical systems, chemical structures, and graphical data—while all remaining items retain screenshot-based presentation to probe OCR and visual-layout understanding.
- Answer types: numerical, single- and multiple-answer MCQ, and matching formats; items follow official negative-marking rules and demand progressive multi-step reasoning (a representative item schema is sketched after this list).
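The exact data format is specific to each release; the following is a minimal sketch, assuming a flat per-item record, of how such questions might be represented in an evaluation harness. All field names and the +4/-2/0 marking default are illustrative assumptions, not the published mmJEE-Eval or JEEBench schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class JEEItem:
    """Illustrative benchmark item record (field names are assumptions)."""
    item_id: str                            # e.g., "2023-paper1-q07" (hypothetical ID scheme)
    subject: str                            # "physics" | "chemistry" | "mathematics"
    year: int                               # exam year, used for held-out splits
    language: str                           # "en" | "hi" (mmJEE-Eval is bilingual)
    answer_type: str                        # "mcq_single" | "mcq_multiple" | "numerical" | "matching"
    question_image: Optional[str] = None    # path to the exam-booklet screenshot, if image-based
    question_text: Optional[str] = None     # LaTeX/plain-text rendering, if available
    gold_answer: str = ""                   # canonical answer key
    marking: dict = field(default_factory=lambda: {"correct": 4, "wrong": -2, "skip": 0})
```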
2. Problem Characteristics and Scientific Reasoning Demands
JEE-derived benchmarks exemplify tasks that blend advanced content knowledge, algebraic manipulation, and scientific method application:
- Long-horizon reasoning: Problems typically require multi-step deduction—from principle identification to mathematical modeling to precise computation.
- Conceptual integration: Items integrate topics across subdomains, e.g., stoichiometry with thermodynamics or trigonometric composition with calculus.
- Exam-style constraints: Negative marking and variety in item format (MCQ-single/multiple, free-response) drive decision-theoretic considerations absent from many benchmarks.
Representative problems exemplify these features:
- Physics: Kinetic-energy computation for a pendulum subjected to a horizontal impulse, requiring diagram interpretation and projection calculations (a generic worked sketch follows this list).
- Chemistry: Stoichiometry of gas evolution with LaTeX chemical equations, testing both reasoning and OCR capabilities.
- Mathematics: Multi-layered limit composition involving symbolic differentiation and LaTeX parsing.
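As a generic illustration of this multi-step structure (not a reproduced exam item), consider a pendulum bob of mass m on a string of length l, initially at rest, that receives a horizontal impulse J; the symbols m, l, J, and θ are placeholders for exposition:

```latex
% Impulse sets the initial speed and kinetic energy:
v_0 = \frac{J}{m}, \qquad KE_0 = \frac{J^2}{2m}
% Energy conservation at angular displacement \theta from the vertical:
KE(\theta) = \frac{J^2}{2m} - m g l \, (1 - \cos\theta)
```

Even this simplified version chains principle identification (impulse-momentum theorem), modeling (energy conservation with the geometric height gain l(1 - cos θ)), and computation; the actual exam items extend this pattern over several more steps.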
In JEEBench, sub-topic diversity is pronounced, ranging across fields such as static equilibrium in physics, complex integration in mathematics, and redox balancing in chemistry. Problems routinely demand 5–10 minutes of expert-level multi-step reasoning (Arora et al., 2023).
3. Model Evaluation Protocols and Performance Metrics
Both benchmarks implement evaluation criteria modeled after authentic exam practice:
- Held-out splits: JEEBench uses a 2016–2021 validation set for calibration; mmJEE-Eval withholds a full year (190 questions, 2025) for leakage control and future-proofing.
- Performance metrics:
- Pass@1: Accuracy on a single sampled response.
- Marks (%): Raw exam-style scoring with negative marking.
- Pass@k: Accuracy over the best of k model samples (k = 3, 5, 10); a computation sketch follows this list.
- Chain-of-thought (CoT) and self-consistency: Aggregating multiple stepwise sampled solutions.
- Error presence/correction (EP/EC): Model's introspective capacity to detect and amend its own mistakes.
- Cross-lingual and visual robustness: Evaluating performance on image-based items and bilingual (English–Hindi) question pairs.
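A minimal computational sketch of two of these metrics, assuming n sampled responses per question and a flat +4/-2/0 marking scheme (the actual JEE Advanced marking rules vary by question type and year; the benchmarks follow the official per-item schemes):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: probability that at least one of k responses
    drawn from n samples (of which c are correct) is correct. Requires k <= n."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def exam_marks(predictions: dict, answer_key: dict,
               correct: int = 4, wrong: int = -2, skip: int = 0) -> int:
    """Exam-style score with negative marking. `predictions` maps question id
    to the model's answer, or None for an abstention; the +4/-2/0 defaults are
    an illustrative assumption, not the official scheme for every item type."""
    total = 0
    for qid, gold in answer_key.items():
        pred = predictions.get(qid)
        if pred is None:
            total += skip
        elif pred == gold:
            total += correct
        else:
            total += wrong
    return total
```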
Results from mmJEE-Eval’s 2025 held-out split demonstrate substantial capability gaps:
- Frontier VLMs (GPT-5, Gemini 2.5 Pro/Flash): 77–84% Pass@1; 80.8–83.3% marks (Mukherjee et al., 12 Nov 2025).
- Open-source models (Qwen3 VL 235B): ~46% Pass@1; 37% marks.
- Mid-scale open models plateau: 22–40% Pass@1.
- Error correction meta-cognition is minimal: 5.2% for GPT-5, while simple Pass@3 sampling leads to ≈30% gains.
| Model | Pass@1 (%) | Marks (%) | Self-correction (%) |
|---|---|---|---|
| GPT-5 | 79.5 | 80.8 | 5.2 |
| Gemini 2.5 Flash | 79.8 | 83.3 | — |
| Qwen3 VL 235B | ~46 | 37 | — |
On JEEBench, aggregate accuracies are lower still:
- GPT-4 with self-consistency: 38.9% overall (46.8% Chemistry, 28.0% Mathematics, 33.5% Physics).
- Stepwise chain-of-thought prompting yields a gain of 4.2 percentage points over direct answering.
- Post-hoc confidence thresholding, calibrated via held-out sets, enhances decision reliability under negative marking (Arora et al., 2023).
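One plausible rationale for such thresholds (the paper calibrates them empirically on held-out data) is the expected-value criterion under negative marking: if the model's top answer has empirical confidence p, attempting the question is worthwhile in expectation only when

```latex
% Generic marking scheme: +s for a correct attempt, -w for a wrong attempt.
\mathbb{E}[\text{marks}] = p\,s - (1 - p)\,w \;\ge\; 0
\quad\Longleftrightarrow\quad
p \;\ge\; \frac{w}{s + w}
% Example: s = 3, w = 1 (an illustrative single-correct scheme) gives p >= 1/4.
```

where s and w are the reward and penalty for the given item type; these symbols and the example values are illustrative rather than drawn from the benchmarks' official marking tables.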
4. Cognitive and Modality Failure Modes
Model evaluation with JEE-based benchmarks identifies failure signatures not discernible with simpler datasets:
- Conceptual retrieval errors: Misapplication or omission of core scientific principles (accounting for 34% of GPT-4 CoT failures).
- Algebraic/computational errors: Calculation mistakes, misapplied equations, and sign errors (~30%).
- Grounding errors: Correct scientific concepts mapped to incorrect mathematical formulations (15%).
- Visual-symbolic integration: Notably, inferior OCR pipelines (e.g., generic EasyOCR) induce 50-point mark drops, while specialized provider OCR can recover about 25 points in frontier VLMs (Mukherjee et al., 12 Nov 2025).
- Meta-cognitive deficits: Automated self-verifiers fail to recognize or correct a majority of their own errors. EP/EC success remains around 3–5%, with substantial boosts achievable using sampling-based aggregation (Pass@3) rather than introspective methods.
This suggests that advances in scaling and architectural complexity alone do not close reasoning gaps without explicit integration of multi-step context management and domain-specific symbolic grounding.
5. Strategies for Maximizing Automated STEM Problem Solving
Best practices for LLM-based solution pipelines on JEE-level benchmarks are codified as follows (Arora et al., 2023):
- Use explicit chain-of-thought prompting ("Let's think step by step") to induce logical sequencing.
- Aggregate multiple sampled CoTs and majority-vote (self-consistency) for robust answer extraction.
- Employ calibrated post-hoc confidence thresholds, computed as sample frequencies per answer choice, to manage negative-marking trade-offs (see the pipeline sketch after this list).
- Avoid sole reliance on LLM self-critique or verifier models, which often fail to detect critical reasoning mistakes.
- Structure prompts to require explicit principle citation, all intermediate derivations, and outcome verification (e.g., dimensional analysis).
- Integrate external calculator APIs for algebraic computation, monitoring for chain integrity.
- Employ annually refreshed, held-out validation sets to recalibrate all thresholding and mitigate data leakage.
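A minimal sketch of the second and third recommendations, assuming a hypothetical sampling function `sample_cot_answer(question)` that returns one extracted final answer per chain-of-thought sample; the threshold `tau` would be recalibrated on the held-out year mentioned in the last point:

```python
from collections import Counter
from typing import Callable, Optional

def self_consistent_answer(
    question: str,
    sample_cot_answer: Callable[[str], str],  # hypothetical helper: one CoT sample -> final answer
    n_samples: int = 10,
    tau: float = 0.4,                         # confidence threshold, calibrated on held-out data
) -> Optional[str]:
    """Self-consistency with abstention: majority-vote over sampled chains of
    thought, skipping the question (return None) when the top answer's sample
    frequency falls below `tau`, to manage negative-marking risk. Prompt
    construction, answer extraction, and model API calls are left abstract."""
    answers = [sample_cot_answer(question) for _ in range(n_samples)]
    best, count = Counter(answers).most_common(1)[0]
    return best if count / n_samples >= tau else None
```

The abstention branch implements the expected-value trade-off sketched earlier: answer only when empirical confidence clears the calibrated threshold.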
6. Implications for STEM Education and Benchmark Evolution
JEE-based benchmarks are positioned as reference environments for both AI research and educational technology:
- Benchmarking AI: mmJEE-Eval’s multimodal, bilingual, diagram-rich structure enforces rigorous separation between surface pattern-matching and genuine scientific reasoning (Mukherjee et al., 12 Nov 2025). Year-wise splits enable longitudinal studies of both curricular progression and tutoring regime efficacy.
- Automated instruction: Embedding these items into intelligent tutoring platforms supports nuanced diagnosis—distinguishing conceptual, computational, and perceptual errors.
- Cross-lingual and accessibility advancements: Bilingual data empowers development of Hindi-medium STEM tutors, addressing linguistic equity concerns.
- Sustained novelty: Expansion plans include continuous annual addition of JEE Advanced questions, inclusion of other Indian and international entrance exam content, and introduction of additional Indic languages to probe zero-shot cross-lingual and cross-curricular transfer.
A plausible implication is that as VLMs and LLMs cross the threshold from pattern matching to authentic scientific reasoning, JEE-derived benchmarks will serve as a primary proving ground for both educational and foundational AI evaluation.
7. Benchmark Significance and Future Directions
JEE-centered STEM benchmarks decisively escalate the reasoning and content mastery demands placed upon contemporary AI systems. Unlike MMLU or MathVista, where frontier VLMs cluster at ceiling (78–85% accuracy), JEE-based initiatives like mmJEE-Eval systematically expose deficits in conceptual integration, multi-step reasoning, and visual-symbolic grounding. They enable granular differentiation of architectures, training methodologies, and cross-modal fusion strategies.
As benchmark designers incorporate additional modalities (e.g., interactive or animated lab-style problems), as well as broader linguistic and curricular scope, these frameworks are likely to remain central to both STEM education research and the advancement of high-complexity AI reasoning systems (Mukherjee et al., 12 Nov 2025, Arora et al., 2023).