JEE for STEM: Benchmarking AI Evaluations

Updated 26 November 2025
  • JEE for STEM is a canonical framework using authentic JEE Advanced exam questions to assess high-level STEM problem solving in AI.
  • It integrates multimodal, bilingual, and diagram-rich items that require long-horizon deductive reasoning and cross-domain concept integration.
  • Benchmark evaluations employ metrics such as Pass@1, marks percentage, and self-correction to highlight both AI capabilities and performance gaps.

The Joint Entrance Examination (JEE) Advanced serves as a canonical assessment of pre-college STEM proficiency and high-level scientific reasoning in the Indian educational context. Recent years have seen the emergence of JEE-based benchmarks, notably mmJEE-Eval and JEEBench, as principal test suites for evaluating large language models (LLMs) and vision-language models (VLMs) on challenging, multistep STEM problem solving. These benchmarks foreground long-horizon deductive reasoning, cross-lingual understanding, and robust interpretation of multimodal scientific input, providing a rigorous alternative to existing breadth-focused datasets.

1. Dataset Composition and Structure

Both mmJEE-Eval (Mukherjee et al., 12 Nov 2025) and JEEBench (Arora et al., 2023) harness authentic JEE Advanced exam problems to target high-level pre-university Physics, Chemistry, and Mathematics domains.

| Benchmark | Years Covered | Total Questions | Modalities | Bilinguality | Image/Diagram % |
|---|---|---|---|---|---|
| mmJEE-Eval | 2019–2025 | 1,460 | Images, LaTeX | English, Hindi | 30.3 |
| JEEBench | 2016–2023 | 515 | Text, LaTeX | English | n/a |

Key structural features:

  • Subject balance: mmJEE-Eval allocates 492 Chemistry (33.7%), 492 Mathematics (33.7%), and 476 Physics (32.6%) questions; JEEBench distributes 236 Mathematics, 123 Physics, and 156 Chemistry items.
  • Bilinguality and format: mmJEE-Eval provides questions in both English and Hindi, delivered as native exam booklet images, supporting rigorous cross-lingual evaluation.
  • Multimodality: 30.3% of mmJEE-Eval questions require explicit image interpretation—including diagrams of physical systems, chemical structures, and graphical data—while all remaining items retain screenshot-based presentation to probe OCR and visual-layout understanding.
  • Answer types: Numerical, MCQ-single/multiple, and matching, with exam-faithful negative marking and progressive multi-step reasoning (an illustrative item record is sketched after this list).
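
As a concrete illustration of how such an item might be represented programmatically, the following minimal sketch defines a record type for one question; the field names, ID format, and marking values are assumptions for illustration, not the published mmJEE-Eval or JEEBench schema.

```python
# Hypothetical record layout for one benchmark item (illustrative only; field
# names and marking values are assumptions, not a released dataset schema).
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class JEEItem:
    item_id: str                # e.g. "2023-P1-Q12" (hypothetical ID format)
    year: int                   # exam year, used for held-out year-wise splits
    subject: str                # "Physics" | "Chemistry" | "Mathematics"
    question_type: str          # "MCQ-single" | "MCQ-multiple" | "numerical" | "matching"
    language: str               # "en" | "hi" for bilingual benchmarks
    image_path: Optional[str]   # native exam-booklet screenshot, if multimodal
    statement_latex: str        # text/LaTeX transcription of the question
    options: List[str] = field(default_factory=list)  # empty for numerical items
    answer: str = ""            # gold answer key
    marks_correct: float = 3.0  # positive marks (schemes vary by year and format)
    marks_wrong: float = -1.0   # negative-marking penalty (illustrative value)
```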

2. Problem Characteristics and Scientific Reasoning Demands

JEE-derived benchmarks exemplify tasks that blend advanced content knowledge, algebraic manipulation, and scientific method application:

  • Long-horizon reasoning: Problems typically require multi-step deduction—from principle identification to mathematical modeling to precise computation.
  • Conceptual integration: Items integrate topics across subdomains, e.g., stoichiometry with thermodynamics or trigonometric composition with calculus.
  • Exam-style constraints: Negative marking and variety in item format (MCQ-single/multiple, free-response) drive decision-theoretic considerations absent from many benchmarks; the expected-value condition is sketched after this list.
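
To make the decision-theoretic element concrete: under a scheme awarding $m_{+}$ marks for a correct response and deducting $m_{-}$ marks for an incorrect one, attempting a question has positive expected value only when the solver's confidence $p$ satisfies

$$p\,m_{+} - (1-p)\,m_{-} > 0 \quad\Longleftrightarrow\quad p > \frac{m_{-}}{m_{+}+m_{-}},$$

so, for instance, a +3/−1 scheme (used here purely as an illustration) implies an attempt threshold of $p > 0.25$; calibrated confidence therefore matters as much as raw accuracy.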

Representative problems exemplify these features:

  • Physics: Kinetic energy computation for a pendulum subjected to a horizontal impulse, requiring diagram interpretation and projection calculations (a worked sketch of this type follows the list).
  • Chemistry: Stoichiometry of gas evolution with LaTeX chemical equations, testing both reasoning and OCR capabilities.
  • Mathematics: Multi-layered limit composition involving symbolic differentiation and LaTeX parsing.
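
As a sketch of the multi-step structure such items demand (an illustrative textbook-style example, not a question drawn from either benchmark): a bob of mass $m$ on a string of length $\ell$ receives a horizontal impulse $J$; the impulse–momentum theorem gives the initial speed, and energy conservation gives the kinetic energy once the string makes angle $\theta$ with the vertical:

$$v_0 = \frac{J}{m}, \qquad K(\theta) = \frac{1}{2} m v_0^2 - m g \ell\,(1-\cos\theta) = \frac{J^2}{2m} - m g \ell\,(1-\cos\theta).$$

Even this compressed example chains principle identification (impulse–momentum, energy conservation), diagram-level geometry (the height gain $\ell(1-\cos\theta)$), and symbolic computation.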

In JEEBench, sub-topic diversity is pronounced, ranging across fields such as static equilibrium in physics, complex integration in mathematics, and redox balancing in chemistry. Problems routinely demand 5–10 minutes of expert-level multi-step reasoning (Arora et al., 2023).

3. Model Evaluation Protocols and Performance Metrics

Both benchmarks implement evaluation criteria modeled after authentic exam practice:

  • Held-out splits: JEEBench uses a 2016–2021 validation set for calibration; mmJEE-Eval withholds a full year (190 questions, 2025) for leakage control and future-proofing.
  • Performance metrics (a minimal scoring sketch follows this list):
    • Pass@1: Accuracy on a single sampled response.
    • Marks (%): Raw exam-style scoring with negative marking.
    • Pass@k: Accuracy over the best of k model samples (k = 3,5,10).
    • Chain-of-thought (CoT) and self-consistency: Aggregating multiple stepwise sampled solutions.
    • Error presence/correction (EP/EC): Model's introspective capacity to detect and amend its own mistakes.
    • Cross-lingual and visual robustness: Evaluating performance on image-based and bilingually paired (English–Hindi) items.
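
A minimal sketch of how the Pass@1, Pass@k, and marks metrics can be computed from sampled model answers is given below; the marking constants and helper names are illustrative assumptions, not the benchmarks' released evaluation code.

```python
# Sketch of exam-style scoring and Pass@k aggregation over sampled answers.
# Marking values (+3 / -1) are illustrative, not a specific year's official scheme.
from typing import Iterable, List, Tuple

def pass_at_k(samples: List[str], answer: str, k: int) -> bool:
    """True if any of the first k sampled answers matches the gold answer."""
    return answer in samples[:k]

def question_marks(prediction: str, answer: str,
                   m_correct: float = 3.0, m_wrong: float = -1.0) -> float:
    """Exam-style marks for one question; blank responses score zero."""
    if not prediction:
        return 0.0
    return m_correct if prediction == answer else m_wrong

def aggregate(results: Iterable[Tuple[List[str], str]], k: int = 3) -> dict:
    """results yields (sampled_answers, gold_answer) pairs, one per question."""
    n = pass1 = passk = 0
    marks = max_marks = 0.0
    for samples, answer in results:
        n += 1
        pass1 += int(samples[0] == answer)           # Pass@1 uses the first sample
        passk += int(pass_at_k(samples, answer, k))  # best-of-k success
        marks += question_marks(samples[0], answer)
        max_marks += 3.0
    return {"Pass@1": pass1 / n,
            f"Pass@{k}": passk / n,
            "Marks %": 100.0 * marks / max_marks}
```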

Results from mmJEE-Eval’s 2025 held-out split demonstrate substantial capability gaps:

  • Frontier VLMs (GPT-5, Gemini 2.5 Pro/Flash): 77–84% Pass@1; 80.8–83.3% marks (Mukherjee et al., 12 Nov 2025).
  • Open-source models (Qwen3 VL 235B): ~46% Pass@1; 37% marks.
  • Mid-scale open models plateau: 22–40% Pass@1.
  • Error correction meta-cognition is minimal: 5.2% for GPT-5, while simple Pass@3 sampling leads to ≈30% gains.

| Model | Pass@1 (%) | Marks (%) | Self-correction (%) |
|---|---|---|---|
| GPT-5 | 79.5 | 80.8 | 5.2 |
| Gemini 2.5 Flash | 79.8 | 83.3 | — |
| Qwen3 VL 235B | ~46 | 37 | — |

On JEEBench, aggregate accuracies are as follows:

  • GPT-4 with self-consistency: 38.9% overall (46.8% Chemistry, 28.0% Mathematics, 33.5% Physics).
  • Prompting with stepwise CoT yields +4.2pp over plain answers.
  • Post-hoc confidence thresholding, calibrated via held-out sets, enhances decision reliability under negative marking (Arora et al., 2023).

4. Cognitive and Modality Failure Modes

Model evaluation with JEE-based benchmarks identifies failure signatures not discernible with simpler datasets:

  • Conceptual retrieval errors: Misapplication or omission of core scientific principles (accounting for 34% of GPT-4 CoT failures).
  • Algebraic/computational errors: Calculation mistakes, misapplied equations, and sign errors (~30%).
  • Grounding errors: Correct scientific concepts mapped to incorrect mathematical formulations (15%).
  • Visual-symbolic integration: Notably, inferior OCR pipelines (e.g., generic EasyOCR) induce 50-point mark drops, while specialized provider OCR can recover about 25 points in frontier VLMs (Mukherjee et al., 12 Nov 2025); a backend-swap sketch follows this list.
  • Meta-cognitive deficits: Automated self-verifiers fail to recognize or correct a majority of their own errors. EP/EC success remains around 3–5%, with substantial boosts achievable using sampling-based aggregation (Pass@3) rather than introspective methods.
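
Because the OCR front end has such a large effect, a useful diagnostic is to hold the solver fixed and swap transcription backends. The sketch below assumes an EasyOCR baseline; provider_ocr is a hypothetical placeholder for a specialized OCR service, not a real API.

```python
# Minimal backend-swap harness for probing the visual-symbolic bottleneck.
# provider_ocr is a hypothetical placeholder; only the EasyOCR calls are real.
import easyocr

def easyocr_transcribe(image_path: str) -> str:
    """Generic OCR baseline: concatenate recognized text fragments."""
    reader = easyocr.Reader(["en", "hi"])               # English + Hindi exam booklets
    fragments = reader.readtext(image_path, detail=0)   # detail=0 returns plain strings
    return "\n".join(fragments)

def provider_ocr(image_path: str) -> str:
    """Placeholder for a specialized provider OCR service (assumed, not specified here)."""
    raise NotImplementedError

def transcribe(image_path: str, backend: str = "easyocr") -> str:
    """Route a question image through the chosen OCR backend before text-only solving."""
    return easyocr_transcribe(image_path) if backend == "easyocr" else provider_ocr(image_path)
```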

This suggests that advances in scaling and architectural complexity alone do not close reasoning gaps without explicit integration of multi-step context management and domain-specific symbolic grounding.

5. Strategies for Maximizing Automated STEM Problem Solving

Best practices for LLM-based solution pipelines on JEE-level benchmarks are codified as follows (Arora et al., 2023):

  1. Use explicit chain-of-thought prompting ("Let's think step by step") to induce logical sequencing.
  2. Aggregate multiple sampled CoTs and majority-vote (self-consistency) for robust answer extraction.
  3. Employ calibrated post-hoc confidence thresholds, computed as sample frequencies per answer choice, to manage negative marking trade-offs; a vote-and-threshold sketch follows this list.
  4. Avoid sole reliance on LLM self-critique or verifier models, which often fail to detect critical reasoning mistakes.
  5. Structure prompts to require explicit principle citation, all intermediate derivations, and outcome verification (e.g., dimensional analysis).
  6. Integrate external calculator APIs for algebraic computation, monitoring for chain integrity.
  7. Employ annually refreshed, held-out validation sets to recalibrate all thresholding and mitigate data leakage.
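
A compact sketch of items 2 and 3, combining self-consistency voting with a calibrated abstention threshold for negative marking, is shown below; the prompt prefix, threshold value, and the commented-out sampler call are illustrative assumptions rather than a published pipeline.

```python
# Self-consistency with a calibrated abstention threshold (items 2 and 3 above).
# The threshold and prompt text are illustrative assumptions.
from collections import Counter
from typing import List, Optional

COT_PREFIX = "Let's think step by step.\n"

def self_consistent_answer(samples: List[str],
                           threshold: float = 0.25) -> Optional[str]:
    """Majority-vote over extracted final answers; abstain (return None) when the
    winning answer's sample frequency falls below the calibrated threshold."""
    votes = Counter(s for s in samples if s)   # ignore failed answer extractions
    if not votes:
        return None                            # no usable answer: leave blank
    answer, count = votes.most_common(1)[0]
    confidence = count / len(samples)          # empirical answer frequency
    return answer if confidence >= threshold else None

# Usage with a hypothetical sampler and answer extractor:
# samples = [extract_final_answer(sample_model(COT_PREFIX + question)) for _ in range(8)]
# decision = self_consistent_answer(samples, threshold=0.25)  # None => leave the item blank
```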

6. Implications for STEM Education and Benchmark Evolution

JEE-based benchmarks are positioned as reference environments for both AI research and educational technology:

  • Benchmarking AI: mmJEE-Eval’s multimodal, bilingual, diagram-rich structure enforces rigorous separation between surface pattern-matching and genuine scientific reasoning (Mukherjee et al., 12 Nov 2025). Year-wise splits enable longitudinal studies of both curricular progression and tutoring regime efficacy.
  • Automated instruction: Embedding these items into intelligent tutoring platforms supports nuanced diagnosis—distinguishing conceptual, computational, and perceptual errors.
  • Cross-lingual and accessibility advancements: Bilingual data empowers development of Hindi-medium STEM tutors, addressing linguistic equity concerns.
  • Sustained novelty: Expansion plans include continuous annual addition of JEE Advanced questions, inclusion of other Indian and international entrance exam content, and introduction of additional Indic languages to probe zero-shot cross-lingual and cross-curricular transfer.

A plausible implication is that as VLMs and LLMs cross the threshold from pattern-matching to authentic scientific reasoning, JEE-derived benchmarks will serve as a substantive proving ground for both educational and foundational AI evaluation.

7. Benchmark Significance and Future Directions

JEE-centered STEM benchmarks decisively escalate the reasoning and content mastery demands placed upon contemporary AI systems. Unlike MMLU or MathVista, where frontier VLMs cluster at ceiling (78–85% accuracy), JEE-based initiatives like mmJEE-Eval systematically expose deficits in conceptual integration, multi-step reasoning, and visual-symbolic grounding. They enable granular differentiation of architectures, training methodologies, and cross-modal fusion strategies.

As benchmark designers incorporate additional modalities (e.g., interactive or animated lab-style problems), as well as broader linguistic and curricular scope, these frameworks are likely to remain central to both STEM education research and the advancement of high-complexity AI reasoning systems (Mukherjee et al., 12 Nov 2025, Arora et al., 2023).
