mmJEE-Eval: Bilingual Multimodal STEM Reasoning Benchmark

Updated 5 April 2026

mmJEE-Eval is a bilingual multimodal benchmark that assesses vision–language models using exam-style STEM questions with diagrammatic and symbolic challenges.
The dataset comprises 1,460 rigorously aligned English–Hindi questions from India's JEE Advanced, ensuring balanced coverage across Physics, Chemistry, and Mathematics.
It incorporates negative marking, meta-cognitive probes, and contamination-aware splits to rigorously test scientific reasoning and model calibration.

mmJEE-Eval is a bilingual multimodal benchmark designed to audit the scientific reasoning capabilities of vision–language models (VLMs) in exam-style, high-stakes STEM problem solving. Sourced from India’s Joint Entrance Examination (JEE) Advanced papers from 2019 to 2025, the dataset features 1,460 rigorously aligned English–Hindi questions from Physics, Chemistry, and Mathematics, with significant coverage of diagrammatic and symbolic challenges, negative marking, and meta-cognitive requirements. mmJEE-Eval exposes reasoning, calibration, and multilingual integration deficits in frontier and open VLMs, revealing substantial performance gaps that prior benchmarks do not show (Mukherjee et al., 12 Nov 2025).

1. Motivation and Rationale

mmJEE-Eval addresses limitations of existing multimodal reasoning benchmarks such as MMMU and MathVista, where state-of-the-art VLMs reach high accuracy (78–85%) primarily through pattern-matching and template familiarity rather than genuine scientific reasoning. These prior benchmarks insufficiently test visual-symbolic integration, calibration under negative marking, and meta-cognitive self-correction. mmJEE-Eval was constructed to provide:

Exam-style evaluation with high-stakes marking (including penalties)
Bilingual, natively aligned English and Hindi text, without translation artifacts
Domain-balanced questions across Physics, Chemistry, and Mathematics
Systematic probes for reasoning depth, error calibration, and cross-lingual consistency
Separation of reasoning from memorization via annually refreshed held-out 2025 questions and contamination-aware splits

A key objective is to discriminate performance based on true scientific reasoning, rather than rote template recognition or superficial answer extraction (Mukherjee et al., 12 Nov 2025).

2. Dataset Structure and Construction

The mmJEE-Eval corpus comprises 1,460 questions extracted from the official JEE Advanced examination papers (2019–2025), covering two papers per year and restricted to the years in which the papers were issued bilingually (2019 onward). Each question appears in both English and Hindi, natively typeset and extracted as a high-resolution raster image; automatic translation is explicitly avoided.

Domain Balance: Chemistry (492, 33.7%), Mathematics (492, 33.7%), Physics (476, 32.6%)
Visual Requirement: 30.3% of questions require diagrams (“Image Required”); 69.7% present optional screenshots for OCR stress-testing
Question Types:
- MCQ-Single correct: +3/–1/0 (A–D)
- MCQ-Multiple correct: +4 for fully correct, +1 partial, –2 any incorrect
- Numerical: three- to five-digit or decimal answer (+4/0)
- Matching: mapping left/right columns (+4/–1/0)
Ground Truth Verification: Majority vote among six sources (five coaching institutes plus the official JEE key), with manual adjudication for disputed cases (≥60% consensus required)
Data Splits:
- Training/Development: 2019–2023 (1,156 questions)
- Validation/Test: 2024 (204 questions)
- Held-Out: 2025 (190 questions)
Input Modalities: Each instance includes a full image (text and any diagrams/figures), metadata (year, subject, question type, language), and a LaTeX-formatted answer
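The marking scheme listed under Question Types can be sketched as a small scoring function. This is a minimal illustration, not the benchmark's official evaluation script: the names `score_question`, `marks_percent`, and the question-type labels are hypothetical, "+1 partial" is assumed to mean a flat +1 for any non-empty proper subset of the correct options, and the numerical tolerance of 0.01 is an assumption.

```python
import math

# Maximum marks per question type, taken from the rubric listed above.
MAX_MARKS = {"mcq_single": 3, "mcq_multiple": 4, "numerical": 4, "matching": 4}


def _is_blank(response) -> bool:
    """True when the model gave no answer (None or an empty selection)."""
    return response is None or (hasattr(response, "__len__") and len(response) == 0)


def score_question(qtype: str, key, response=None, abs_tol: float = 0.01) -> int:
    """Marks for one response under the rubric; a blank response scores 0."""
    if qtype not in MAX_MARKS:
        raise ValueError(f"unknown question type: {qtype!r}")
    if _is_blank(response):
        return 0
    if qtype == "mcq_single":
        return 3 if response == key else -1
    if qtype == "mcq_multiple":
        chosen, correct = set(response), set(key)
        if chosen == correct:
            return 4
        if chosen - correct:  # at least one incorrect option selected
            return -2
        return 1  # assumption: flat +1 for a non-empty proper subset
    if qtype == "numerical":
        # Assumption: answers match when they agree within abs_tol.
        ok = math.isclose(float(response), float(key), abs_tol=abs_tol)
        return 4 if ok else 0
    # Remaining type is "matching": +4 correct, -1 incorrect.
    return 4 if response == key else -1


def marks_percent(items) -> float:
    """Marks (%) over an iterable of (qtype, key, response) tuples."""
    items = list(items)
    total = sum(score_question(q, k, r) for q, k, r in items)
    max_total = sum(MAX_MARKS[q] for q, _, _ in items)
    return 100.0 * total / max_total
```

Because wrong answers on three of the four question types carry a penalty, Marks (%) can fall below the score a model would obtain by leaving uncertain questions blank, which accuracy alone does not reveal.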

Temporal splits are recommended to control for potential dataset contamination and robustly evaluate generalization.
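A temporal split of this kind amounts to partitioning records by exam year. The sketch below assumes each record is a dictionary with a `year` field; the field name, the split labels, and the function names are hypothetical and may differ from the released dataset schema.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def split_name(year: int) -> str:
    """Map an exam year to the split described above."""
    if 2019 <= year <= 2023:
        return "train_dev"
    if year == 2024:
        return "val_test"
    if year == 2025:
        return "held_out"
    raise ValueError(f"year outside the 2019-2025 range: {year}")


def temporal_split(records: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group records by split so that no exam year is shared across splits."""
    splits: Dict[str, List[dict]] = defaultdict(list)
    for record in records:
        splits[split_name(record["year"])].append(record)
    return dict(splits)
```

Keeping the 2025 questions entirely separate matters because papers published after a model's training cutoff cannot have been memorized, which is what the contamination comparison relies on.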

3. Evaluation Protocols

mmJEE-Eval quantifies VLM performance using a battery of metrics capturing both classic problem-solving and meta-cognitive competencies:

Pass@1 accuracy: Aggregated over k=10 runs, on both the full set and the 2025 held-out questions
Official “Marks (%)”: Calculated using the actual JEE Advanced rubric, integrating positive and negative marking (see Appendix F in Mukherjee et al., 12 Nov 2025)
Confidence thresholding: Optional use of model output self-consistency to avoid negative marking, especially in MCQs
Meta-cognitive probes:
- Error Presence (EP): Can the VLM detect its own errors?
- Error Correction (EC): Can it correct the error if detected?
- EP→EC: Chained evaluation of self-correction ability
Cross-lingual Analysis: Percentage of questions correctly answered in only English, only Hindi, both, or neither
Contamination Tests: Comparison of performance on 2019–2024 vs. the 2025 held-out questions, to check for memorization
Ablation Studies: Variants controlling for OCR pipeline (EasyOCR vs. Gemma 3 OCR), presence/absence of diagrams, text-only settings
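The confidence-thresholding idea can be made concrete with a short expected-marks argument. If a correct answer earns r marks and a wrong one costs q, answering with probability p of being right has expected value p·r − (1 − p)·q, which is positive only when p > q / (r + q). For single-correct MCQs (+3/–1) that break-even point is 1/4. The sketch below applies this rule using the vote share among sampled answers as a stand-in for p. That proxy is an assumption, as are the function names and the `margin` parameter; the source states only that self-consistency may be used, not how.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def break_even_confidence(reward: float, penalty: float) -> float:
    """Probability of being correct above which answering beats abstaining."""
    return penalty / (reward + penalty)


def answer_or_abstain(
    samples: Sequence[Hashable],
    reward: float,
    penalty: float,
    margin: float = 0.0,
) -> Optional[Hashable]:
    """Return the majority answer, or None to leave the question blank.

    Confidence is estimated as the share of samples agreeing with the
    most common answer (an assumption; models can be overconfident).
    """
    if not samples:
        return None
    answer, votes = Counter(samples).most_common(1)[0]
    confidence = votes / len(samples)
    if confidence > break_even_confidence(reward, penalty) + margin:
        return answer
    return None
```

For example, `answer_or_abstain(["A", "A", "B", "C", "A"], reward=3, penalty=1)` returns `"A"` because the vote share of 0.6 exceeds the 0.25 break-even point, while four samples that all disagree yield an abstention. A positive `margin` makes the policy more conservative when vote share is suspected to overstate true accuracy.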

In addition, a subset of 400 model responses is annotated for reasoning complexity via a human + LLM-as-a-Judge protocol, scoring: conceptual understanding, grounding in visual context, computational accuracy, and instruction-following (0–10 scale per Bloom’s taxonomy).

4. Empirical Results and Comparative Analysis

On the 2025 held-out set, mmJEE-Eval reveals marked stratification not observed in previous benchmarks:

Model Category	Pass@1 Accuracy (2025)	Marks (%)	Comment
Open-source (400B↓)	9–46%	32–37%	Qwen 3 VL: 46.2%
Closed frontier	72.7–79.8%	66–83%	GPT-5: 79.5%, Gemini 2.5 Flash: 79.8%

Key findings:

Open-vs-Closed Disparity: The top open model (Qwen 3 VL) achieves only 46.2% on mmJEE-Eval, versus ~80% for state-of-the-art closed models (GPT-5, Gemini 2.5), a gap of ~35 percentage points that is substantially larger than the 6–10 point gap seen on MMMU/MathVista.
Meta-cognitive Deficits: Error Presence rates span 21.9–79.5%, yet net error correction (EP→EC) remains extremely low: GPT-5 achieves just 5.2% improvement, highlighting pervasive meta-cognitive brittleness.
Self-consistency and Sampling: Pass@3 boosts accuracy ~30%, but even then, self-correction remains inadequate.
Contamination and Generalization: Minimal memorization is detected; open models exhibit stable or slight gains (+0.5–1.9%) from 2024 to 2025, while closed models drop 1.9–5.3%, consistent with the benchmark posing a genuine challenge rather than rewarding recall of seen questions.
Ablation Sensitivity: Closed models are highly sensitive to OCR pipeline quality; Gemini 2.5 Flash, for example, drops from 79.5% (native OCR) to 31.1% under EasyOCR, only partially recovering with improved OCR or reinstated diagrams.

A notable observation is that reasoning quality (0–10 scale) is similar between Llama 4 Scout (7.07) and GPT-5 (7.08), yet actual answer accuracy diverges sharply (40.4% vs. 83.9%), implicating instruction-following and integrated visual-symbolic reasoning as principal limiting factors.

5. Analysis of Reasoning Depth and Failure Modes

mmJEE-Eval includes systematic annotation to separate superficial conceptual fluency from higher-order skills:

Reasoning Complexity: Human + LLM-as-a-Judge protocols demonstrate that while VLMs maintain basic STEM fluency, they struggle with problems demanding deep integration of visual cues, symbolic manipulation, and instruction-following.
Calibration under Negative Marking: Many models are poorly calibrated, failing to withhold answers when uncertain, resulting in negative marks.
Language Sensitivity: Cross-lingual consistency is explicitly assessed; few models achieve robust accuracy in both English and Hindi, and OCR/diagram dependence is substantial.
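The cross-lingual consistency check reduces to sorting each question into one of four buckets according to whether the English and Hindi versions were answered correctly. A minimal sketch follows, assuming per-language results are available as dictionaries from a question identifier to a boolean; the function name, bucket labels, and input layout are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable

BUCKETS = ("both", "english_only", "hindi_only", "neither")


def cross_lingual_breakdown(
    en_correct: Dict[Hashable, bool],
    hi_correct: Dict[Hashable, bool],
) -> Dict[str, float]:
    """Percentage of shared questions falling into each correctness bucket."""
    shared = en_correct.keys() & hi_correct.keys()
    if not shared:
        raise ValueError("no questions are present in both languages")
    counts: Counter = Counter()
    for qid in shared:
        en, hi = en_correct[qid], hi_correct[qid]
        if en and hi:
            counts["both"] += 1
        elif en:
            counts["english_only"] += 1
        elif hi:
            counts["hindi_only"] += 1
        else:
            counts["neither"] += 1
    return {bucket: 100.0 * counts[bucket] / len(shared) for bucket in BUCKETS}
```

A large "english_only" or "hindi_only" share indicates that a model's answer depends on the language of presentation rather than on the underlying problem, which aggregate accuracy per language would hide.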

Error analysis indicates that most accuracy gaps arise in questions demanding joint reasoning over diagrams, mathematical notation, and following elaborate instructions, with visual-symbolic fusion being the most persistent bottleneck.

6. Implications for Benchmark Design and Model Development

mmJEE-Eval demonstrates that high performance on prior multimodal benchmarks does not guarantee scientific reasoning capabilities—especially in authentic multilingual, multimodal, and negative-marking conditions.

Multimodal Integration: The benchmark stresses the need for VLMs to reconcile diagrammatic, textual, and symbolic information in bilingual settings.
Benchmark Granularity: Including real exam-style negative marking and human-aligned subjective assessment surfaces weaknesses not captured by pass/fail classification alone.
Meta-cognition: The challenge of error detection and correction (EP/EC) is exposed as a frontier problem for VLMs.
Dataset Construction: Use of natively bilingual, professionally typeset exam documents with no translation further precludes the "shortcut" of monolingual training.

These findings suggest that progress on such benchmarks requires VLM advances in OCR-robust diagram understanding, multilingual instruction-following, and integrated meta-cognitive routines for error correction.

7. Access, Reproducibility, and Future Research Directions

The mmJEE-Eval benchmark, together with annotation tools, evaluation scripts, and code, is released publicly under the MIT license at https://mmjee-eval.github.io, with distribution via Hugging Face (ArkaMukherjee/mmJEE-Eval). Documentation includes evaluation notebooks illustrating the full metric suite, ablation analyses, and protocols for annual updates, supporting reproducible, community-driven research.

Prominent future directions include:

Research into architectures and pretraining objectives for improved visual-symbolic grounding and OCR resilience
Development of fine-grained, multilingual calibration strategies for answer confidence under negative marking
Mechanisms to internalize meta-cognitive "loops" for self-correction and instruction reconciliation
Extension of the mmJEE-Eval paradigm to other high-stakes, real-world exam scenarios in further language and cultural contexts

By combining exam-style rigor, multimodal complexity, and bilingual fidelity, mmJEE-Eval establishes a new standard for auditing integrated scientific reasoning in VLMs, driving the field beyond surface pattern recognition toward truly integrated, reasoning-capable artificial intelligence (Mukherjee et al., 12 Nov 2025).

References (1)

Mukherjee et al. (2025). mmJEE-Eval: A Bilingual Multimodal Benchmark for Evaluating Scientific Reasoning in Vision-Language Models.
