JEEBench: STEM AI Benchmark
- JEEBench is a rigorously curated benchmark with 515 STEM problems from IIT JEE-Advanced, designed to test advanced scientific and mathematical reasoning.
- It covers diverse formats including single/multi-correct MCQs, integer-type, and numeric-type questions, integrating multi-step deduction, algebraic manipulation, and exam strategies like negative marking.
- Benchmark evaluations reveal a significant performance gap between open-source LLMs (≈10–11% accuracy) and proprietary models (up to 38.9%), underscoring the need for improved methodologies.
JEEBench is a rigorously curated benchmark for evaluating the advanced scientific and mathematical reasoning capabilities of large language models and vision-language models (LLMs and VLMs). It consists of challenging problems drawn verbatim from India’s Joint Entrance Examination Advanced (IIT JEE-Advanced), a high-stakes exam designed to select the top ~5% of engineering aspirants nationwide. JEEBench tests not only multi-step deduction, algebraic manipulation, and domain knowledge, but also the integration of exam strategy (including negative marking) and symbolic processing. Since its introduction, it has become a de facto standard for hard pre-university STEM reasoning in the evaluation of generalist and specialized AI systems.
1. Composition and Characteristics
JEEBench, as originally constructed by Arora et al., contains 515 problems spanning Mathematics (236), Physics (123), and Chemistry (156), each hand-extracted from eight years of JEE-Advanced papers (2016–2023) (Arora et al., 2023). Problems with diagrams were excluded; all remaining items were converted to LaTeX and manually proofread to ensure fidelity. The benchmark includes multiple problem formats:
| Answer Type | Description | Count (approx.) |
|---|---|---|
| Single-correct MCQ | One correct answer out of four | 110 |
| Multi-correct MCQ | Any subset of options correct | 186 |
| Integer-type | Non-negative integer | 82 |
| Numeric-type | Real value, ±0.01 tolerance | 137 |
Problems are uniformly formulated in natural language, often augmented by mathematical notation, and require long-horizon, multi-step chains of reasoning. Multi-modal and bilingual extensions exist in mmJEE-Eval (Mukherjee et al., 12 Nov 2025), with visual elements (e.g., diagrams or chemical structures) as part of the problem statement and questions appearing in both English and Hindi.
Difficulty is calibrated to the entrance exam’s selection intent, involving advanced conceptual retrieval, derivation of governing equations from first principles, deep algebraic or calculus-based manipulation, and precise exam-oriented calculation. Spatial reasoning, cross-topic dependencies, and caution regarding negative marking further elevate solution complexity. JEEBench thus targets reasoning far beyond elementary arithmetic or factual recall.
2. Evaluation Protocols and Metrics
Models are typically evaluated on JEEBench using exact-match accuracy, i.e., the fraction of questions on which the predicted answer precisely matches the official key, with partial credit allowed for multi-correct MCQs (Arora et al., 2023). For the original benchmark and text-only settings, the aggregate score is

$$\text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} s_i,$$

where $N$ is the number of questions and $s_i \in [0, 1]$ is the per-question score: 1 for an exact match, 0 for an incorrect answer, and a fractional value for a partially correct multi-correct response.
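For concreteness, a minimal per-question scorer consistent with this protocol might look like the following sketch; the answer-type labels, function name, and exact partial-credit rule for multi-correct MCQs are illustrative assumptions rather than the official evaluation code.

```python
from typing import Set, Union

def score_question(answer_type: str,
                   predicted: Union[str, Set[str], float],
                   gold: Union[str, Set[str], float]) -> float:
    """Illustrative per-question scorer for the four JEEBench answer types.

    Returns a score in [0, 1]. The partial-credit rule for multi-correct
    MCQs (credit proportional to correct options chosen, zero if any wrong
    option is selected) is an assumption, not the official grading code.
    """
    if answer_type == "single_mcq":
        return 1.0 if predicted == gold else 0.0
    if answer_type == "multi_mcq":
        predicted, gold = set(predicted), set(gold)
        if predicted - gold:               # any wrong option voids the credit
            return 0.0
        return len(predicted & gold) / len(gold)
    if answer_type == "integer":
        return 1.0 if int(predicted) == int(gold) else 0.0
    if answer_type == "numeric":           # real value, +/-0.01 tolerance
        return 1.0 if abs(float(predicted) - float(gold)) <= 0.01 else 0.0
    raise ValueError(f"unknown answer type: {answer_type}")

# Exact-match accuracy is then the mean per-question score over the benchmark.
```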
Expanded protocols in multimodal and bilingual variants use “Pass@1” and “Pass@k” metrics—sampling k outputs per problem and scoring as solved if any answer matches the key (Mukherjee et al., 12 Nov 2025). When negative marking is present, adjusted “Marks (%)” and confidence-thresholded scoring (skipping low-confidence questions) are used to better mirror exam strategy (Arora et al., 2023).
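The sampled-output metric can be computed directly from per-sample correctness; in the sketch below (variable names are illustrative), a problem counts as solved under Pass@k if any of its first k sampled answers matches the key.

```python
from typing import List

def pass_at_k(per_problem_correct: List[List[bool]], k: int) -> float:
    """Fraction of problems with at least one correct answer among the first
    k samples; per_problem_correct[i][j] is True if the j-th sampled answer
    for problem i matches the official key."""
    solved = [any(samples[:k]) for samples in per_problem_correct]
    return sum(solved) / len(solved)

# Pass@1 reduces to single-sample accuracy: pass_at_k(results, k=1)
```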
The benchmark does not define official train/val/test splits, but year-based separation (2019–2023 “seen,” 2024–2025 “held out”) is used to avoid contamination in mmJEE-Eval (Mukherjee et al., 12 Nov 2025). Item-difficulty histograms and error breakdowns (conceptual, computation, grounding, instruction-following) are reported in several works.
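Such a contamination-aware split can be reproduced with a simple filter on the exam year; the `year` metadata field and cutoff below are assumptions about how the data is stored, not an official split definition.

```python
from typing import Dict, List, Tuple

def split_by_year(questions: List[Dict], cutoff: int = 2024) -> Tuple[List[Dict], List[Dict]]:
    """Partition questions into a pre-cutoff "seen" set and a post-cutoff
    held-out set to limit training-data contamination (field name assumed)."""
    seen = [q for q in questions if q["year"] < cutoff]
    held_out = [q for q in questions if q["year"] >= cutoff]
    return seen, held_out
```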
3. Baseline Performance and Error Typology
Initial evaluations reveal JEEBench to be substantially more difficult than prior math or science benchmarks (e.g., GSM8K, MATH, MathVista, MMMU). With vanilla or chain-of-thought (CoT) prompting, open-source LLMs typically plateau at ≈10–11% accuracy, establishing an “open frontier” baseline (Arora et al., 2023). Proprietary models perform better, e.g., GPT-4 peaks at 38.9% (with zero-shot CoT and self-consistency), but still falls well below human levels.
Detailed error analyses of 100 GPT-4 CoT responses classify typical failures as:
- Conceptual errors (e.g., missing relevant scientific law): 34/80
- Computation errors (arithmetic slip, algebra): 30/80
- Grounding errors (mis-translating domain knowledge to equations): 15/80
- Flawed correct answers (right answer reached through faulty reasoning): 28%
Even high-performing models exhibit brittle algebra and unreliable risk assessment under negative marking. Exam-mimetic features such as confidence-thresholding and response calibration reduce penalty-induced failures, but do not close the performance gap (Arora et al., 2023).
4. Research Advances Leveraging JEEBench
4.1. Multi-Agent Systems and Modular Societies
- HASHIRU: An agent-society framework employing hierarchical delegation (CEO agent and employee models), hybrid cost-sensitive intelligence, and automated tool invocation achieves an 80% success rate (on a 120-question JEEBench subset), surpassing Gemini 2.0 Flash (68.3%) with statistical significance (p < 0.05) (Pai et al., 1 Jun 2025). The system’s dynamic routing of sub-tasks to symbolic tools and larger LLMs, guided by an explicit economic model, is credited with this performance jump.
- LM²: A three-model “society”—decomposer, solver, and verifier—modularizes question decomposition, subproblem resolution, and mistake checking, trained with policy gradient coordination (Juneja et al., 2024). LM² outperforms prior decomposition and structured CoT approaches by 7.71 percentage points across all JEEBench subject-answer pairs, with ablations revealing significant drops upon removing the verifier, concept lists, or PPO-stage training.
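Both systems partition problem-solving across model roles rather than relying on a single monolithic prompt. The sketch below is a schematic inference-time decompose-solve-verify loop in the spirit of LM²; the `llm` callable, the prompt wording, and the revision loop are illustrative assumptions, and it omits LM²'s policy-gradient training as well as HASHIRU's cost-aware routing and tool invocation.

```python
from typing import Callable

def society_solve(question: str,
                  llm: Callable[[str], str],
                  max_revisions: int = 2) -> str:
    """Schematic decompose-solve-verify loop (not the published LM² code)."""
    # Decomposer: list the concepts and ordered sub-problems.
    plan = llm("List the concepts and ordered sub-problems needed to solve:\n"
               + question)

    # Solver: work through the plan and produce a candidate final answer.
    answer = llm(f"Question:\n{question}\n\nPlan:\n{plan}\n\n"
                 "Solve each sub-problem in order and state the final answer.")

    # Verifier: accept the solution or feed the first detected mistake back.
    for _ in range(max_revisions):
        verdict = llm(f"Question:\n{question}\n\nProposed solution:\n{answer}\n\n"
                      "Reply VALID if the reasoning is sound, otherwise describe "
                      "the first mistake.")
        if verdict.strip().upper().startswith("VALID"):
            break
        answer = llm(f"Question:\n{question}\n\nPrevious attempt:\n{answer}\n\n"
                     f"Reviewer feedback:\n{verdict}\n\nProduce a corrected solution.")
    return answer
```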
4.2. Prompt Engineering and Information Structuring
- Story of Thought (SoT): Incorporates multi-step narrative scaffolding—clarification, narrative generation (analogy, progressive disclosure), and narrative-conditioned solution—yielding the largest absolute accuracy gains (up to +20.4 percentage points) over the best non-narrative baselines for Llama 3 70B on JEEBench (Javadi et al., 2024). Causal narrative organization and metaphor reduce abstraction burden and error-prone algebraic leaps. A prompt-template sketch of this scaffolding appears after this list.
- Trace Inversion: Fine-tuning open models on synthetic reasoning chains (generated by autoregressive inversion from only targets and summaries) delivers a 50–100% relative improvement over answer-only baselines (e.g., Qwen-2.5-7B-Instruct jumps from 11.7% to 42.3% with traces inverted from GPT-5 mini outputs) (Zhang et al., 7 Mar 2026).
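As referenced in the SoT bullet above, the narrative scaffolding can be operationalized as a staged prompt sequence. The stage ordering below follows that description (clarification, narrative generation, narrative-conditioned solving); the instruction wording is an illustrative assumption, not the paper's exact prompts.

```python
from typing import List

def story_of_thought_prompts(question: str) -> List[str]:
    """Three-stage prompt sequence in the spirit of Story of Thought (SoT);
    the wording of each instruction is assumed, not taken from the paper."""
    return [
        # Stage 1: clarification - identify concepts, givens, and the target quantity.
        "Identify the key concepts, the given quantities, and exactly what is "
        f"asked in the following problem:\n{question}",
        # Stage 2: narrative generation - analogy and progressive disclosure.
        "Explain those concepts as a short narrative, using analogies and revealing "
        "information progressively so the causal structure of the problem is explicit.",
        # Stage 3: narrative-conditioned solution.
        "Using the narrative above, solve the problem step by step and state the "
        "final answer.",
    ]
```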
4.3. Multimodal and Bilingual Extensions
- mmJEE-Eval: Extends JEEBench to 1,460 bilingual, multimodal questions incorporating diagrams and physical/chemical structures (Mukherjee et al., 12 Nov 2025). Open-source 400B VLMs achieve only 37–51% on the hardest held-out partition, while closed proprietary models (GPT-5, Gemini 2.5) score 77–83.9%. Performance losses under OCR ablation and weak meta-cognitive error correction highlight how heavily current models rely on both symbolic integration and visual context.
5. Innovations in Training, Prompting, and Tool Use
JEEBench has emerged as a standard testbed for diverse strategies in automated scientific reasoning:
- Decomposition and verification: Modular societies (e.g., LM²) and hierarchical agent stacks (e.g., HASHIRU) partition problem-solving into concept extraction, sub-task generation, solution, and adaptive error-checking, boosting robustness relative to monolithic CoT (Juneja et al., 2024, Pai et al., 1 Jun 2025).
- Narrative and analogy: Structured narrative prompting (SoT) acts as contextual scaffolding, exposing LLMs to causal and analogy-laden reasoning, which empirical ablations show is crucial for cross-domain transfer and for larger models to attain SOTA on JEEBench (Javadi et al., 2024).
- Confidence thresholding: Algorithms that prune low-confidence responses (e.g., via self-consistency thresholds) mitigate negative-marking penalties, improve calibration, and are validated as effective risk-management proxies (Arora et al., 2023); a minimal decision-rule sketch follows this list.
- Trace inversion and distillation: Attack-resilient training enables downstream models to recover a significant portion of reasoning accuracy, even without ground-truth chain-of-thought, by mining answer/summaries from API-exposed commercial LLMs (Zhang et al., 7 Mar 2026).
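As noted in the confidence-thresholding bullet above, the idea reduces to a small decision rule: sample several answers, commit only when agreement is high, and skip otherwise. The sketch below is illustrative; the agreement threshold, mark, and penalty values are assumptions rather than the official JEE marking scheme.

```python
from collections import Counter
from typing import List, Optional

def answer_or_skip(sampled_answers: List[str],
                   min_agreement: float = 0.5) -> Optional[str]:
    """Self-consistency answer selection with abstention: return the majority
    answer only if its vote share reaches min_agreement, else None (skip)."""
    answer, votes = Counter(sampled_answers).most_common(1)[0]
    return answer if votes / len(sampled_answers) >= min_agreement else None

def exam_marks(outcomes: List[Optional[bool]],
               correct_mark: float = 1.0,
               wrong_penalty: float = 0.25) -> float:
    """Exam-style total under negative marking: +correct_mark per correct
    answer, -wrong_penalty per wrong answer, 0 for a skipped question (None);
    mark values are illustrative."""
    return sum(0.0 if o is None else (correct_mark if o else -wrong_penalty)
               for o in outcomes)
```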
6. Benchmark Impact, Limitations, and Future Directions
JEEBench exposes large, persistent accuracy gaps (often 30–40 percentage points) between scaled open models and commercial VLMs that do not manifest on prior benchmarks (e.g., MathVista, MMMU) (Mukherjee et al., 12 Nov 2025). Its exam-oriented design, combining negative marking, multiple subjects, and multimodal items, prevents ceiling effects, resists memorization-based shortcuts, and foregrounds meta-cognitive failures: models self-correct only ~1–5% of their errors.
A plausible implication is that training methodology, data diversity, and symbolic/visual integration are more predictive of JEEBench robustness than model scale alone. New directions emphasize tightly-coupled multimodal and multilingual training, learned verifiers, and native planning modules for risk-aware solution generation (Mukherjee et al., 12 Nov 2025, Juneja et al., 2024).
By maintaining separation from training data, actively resisting contamination, and rewarding complex reasoning chains over pattern-matching, JEEBench stands as a premier, continually evolving touchstone for evaluating and spurring advances in scientific AI reasoning (Arora et al., 2023).