HiSciBench: Multidisciplinary Science Benchmark
- HiSciBench is a multi-disciplinary benchmark assessing scientific intelligence through a five-level hierarchy that mirrors the research process.
- It integrates multimodal inputs—text, equations, images, and tables—across six core disciplines with 8,735 curated instances to diagnose model capabilities.
- Tailored evaluation metrics reveal performance gaps in citation accuracy, multimodal fusion, and procedural reasoning, guiding future AI model improvements.
HiSciBench is a hierarchical, multi-disciplinary benchmark designed to assess the scientific intelligence of LLMs and multimodal foundation models. It mirrors the real-world scientific workflow, ranging from basic factual recall to hypothesis-driven discovery. HiSciBench uniquely integrates five levels (Scientific Literacy, Literature Parsing, Literature-based Question Answering, Literature Review Generation, Scientific Discovery) and supports multimodal inputs—including text, equations, images, figures, and tables—across six core disciplines: mathematics, physics, chemistry, biology, geography, and astronomy. With 8,735 curated instances, HiSciBench enables granular, dependency-aware diagnosis of model capabilities and failure modes throughout the research pipeline (Zhang et al., 28 Dec 2025).
1. Hierarchical Structure of Scientific Reasoning
HiSciBench operationalizes scientific intelligence as a five-level hierarchy, each capturing a distinct stage in the research process:
- L1: Scientific Literacy (Perceive). Models must recall and explain fundamental concepts. Tasks include answering general multiple-choice questions (e.g., "What is the conservation of momentum?").
- L2: Literature Parsing (Parse). Models extract and translate the content of scientific documents. Subtasks: Document OCR and Parsing (producing Markdown including formulas from PDFs or figures) and Cross-lingual Translation (preserving technical semantics between languages).
- L3: Literature-based Question Answering (Reason). Models perform deep inference within scientific documents. Subtasks split into Monolingual QA (fine-grained questions on single papers) and Cross-lingual QA (questions about English sources posed in other languages).
- L4: Literature Review Generation (Synthesize). Given a topic and a set of core papers, models synthesize a coherent, critical survey, integrating findings and providing analytical context.
- L5: Scientific Discovery (Innovate). Models design and execute computational experiments using structured data, infer patterns, and propose novel hypotheses (e.g., predicting catalysts, analyzing climate data).
Each level is dependency-aware: outputs from each earlier stage feed into the subsequent stage, structurally modeling the cascade of real scientific work.
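The hierarchy and its dependency structure translate naturally into a record schema. Below is a minimal sketch in Python; the field names and types are illustrative assumptions, not the released benchmark's actual format:

```python
from dataclasses import dataclass, field
from enum import Enum


class Level(Enum):
    """The five HiSciBench levels, ordered by their position in the research pipeline."""
    L1_SCIENTIFIC_LITERACY = 1   # Perceive: recall and explain fundamentals
    L2_LITERATURE_PARSING = 2    # Parse: document OCR/parsing and cross-lingual translation
    L3_LITERATURE_QA = 3         # Reason: monolingual and cross-lingual QA on documents
    L4_REVIEW_GENERATION = 4     # Synthesize: topic-guided literature reviews
    L5_SCIENTIFIC_DISCOVERY = 5  # Innovate: data-driven experiments and hypotheses


@dataclass
class BenchmarkInstance:
    """One evaluation item; field names are hypothetical, for illustration only."""
    instance_id: str
    level: Level
    subtask: str                 # e.g. "L3.1 Monolingual QA"
    discipline: str              # one of the six core disciplines
    modalities: list[str]        # e.g. ["text", "equation", "figure", "table"]
    prompt: str                  # question, document page, or topic description
    reference: str               # gold answer, parse, translation, or review material
    depends_on: list[str] = field(default_factory=list)  # IDs of upstream-stage outputs
```

The `depends_on` field captures the dependency-aware design: an L3 question can point back to the L2 parse it is built on, so errors that propagate downstream remain traceable.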
2. Dataset Composition and Multi-disciplinary Scope
The benchmark comprises 8,735 instances constructed from authentic scientific papers across six disciplines:
| Level | Subtask/Type | Instances | Domain Coverage |
|---|---|---|---|
| L1.1 | Multiple-choice QA | 1,200 | 200 per discipline |
| L2.1/2.2 | OCR/Parsing & Translation | 629 + 629 | Math 208, Physics 357, Astronomy 19, Biology 45 |
| L3.1 | Monolingual QA | 5,514 | Math 821, Physics 1,025, Chemistry 886, Astronomy 330, Geography 500, Biology 1,952 |
| L3.2 | Cross-lingual QA | 629 | See L2 for source pages |
| L4.1 | Topic-guided Reviews | 60 | 10 per discipline |
| L5.1 | Data-driven Discovery | 74 | Chemistry 20, Geography 27, Biology 27 |
Disciplinary coverage is approximately balanced: biology (26.6%), physics (26.4%), chemistry (≈14%), mathematics (≈12%), geography (≈11%), and astronomy (≈9%). Overall, 84.7% of instances use multimodal inputs (image-text pairs, equations, tables), and cross-lingual evaluation is embedded in subtasks L2.2 and L3.2.
3. Evaluation Methodologies and Metrics
HiSciBench employs metrics tailored to each task:
- L1 & L3 (QA): Accuracy, defined as $\text{Accuracy} = N_{\text{correct}} / N_{\text{total}}$.
- L2.1 (OCR): Word-level Accuracy, $\text{Acc}_{\text{word}} = N_{\text{matched words}} / N_{\text{reference words}}$.
- L2.2 (Translation): BLEU, $\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$, where $p_n$ are modified $n$-gram precisions and $\text{BP}$ is the brevity penalty.
- L4 (Reviews): Content Quality (five LLM-judge criteria, each rated 1–5: Coverage, Structure, Relevance, Synthesis, Critical Analysis) and Citation Quality (Verifiability Rate, Metadata Accuracy, Faithfulness Rate, Citation Count, Source Count, Source Distribution Entropy, Recency Rate).
- L5 (Discovery): Success Rate, $\text{SR} = N_{\text{success}} / N_{\text{total}}$; "success" requires generated code to run error-free and meet scientific output criteria.
These multi-faceted metrics enable both quantitative aggregation and in-depth analysis of model abilities and weaknesses at each cognitive level.
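The simpler metrics follow directly from these definitions. The snippet below is an illustrative sketch: exact-match accuracy, bag-of-words word-level accuracy, an unsmoothed sentence BLEU with brevity penalty, and the L5 success rate; the benchmark's actual tokenization, matching, and smoothing choices may differ.

```python
import math
from collections import Counter


def accuracy(predictions: list[str], references: list[str]) -> float:
    """L1/L3 accuracy: fraction of exactly matching answers."""
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


def word_accuracy(prediction: str, reference: str) -> float:
    """L2.1 word-level accuracy: matched words over reference words (bag-of-words overlap)."""
    pred_counts, ref_counts = Counter(prediction.split()), Counter(reference.split())
    matched = sum((pred_counts & ref_counts).values())
    return matched / max(sum(ref_counts.values()), 1)


def bleu(prediction: str, reference: str, max_n: int = 4) -> float:
    """L2.2 sentence BLEU: uniform n-gram weights, clipped precisions, brevity penalty."""
    pred, ref = prediction.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        pred_ngrams = Counter(tuple(pred[i:i + n]) for i in range(len(pred) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum((pred_ngrams & ref_ngrams).values())  # clipped n-gram matches
        if overlap == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(overlap / max(sum(pred_ngrams.values()), 1)))
    bp = 1.0 if len(pred) >= len(ref) else math.exp(1 - len(ref) / max(len(pred), 1))
    return bp * math.exp(sum(log_precisions) / max_n)


def success_rate(outcomes: list[bool]) -> float:
    """L5 success rate: runs that execute error-free AND meet the scientific output criteria."""
    return sum(outcomes) / len(outcomes)
```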
4. Integrated, Dependency-Aware Workflow Modeling
HiSciBench is designed with explicit inter-level dependencies:
- D → (L2) T → (L3) A → (L4) R → (L5) C
- D = raw PDFs/images
- T = structured document parses or translated text (output of L2)
- A = answers generated from parsed documents and question sets (output of L3)
- R = literature review synthesized from documents and answers (output of L4)
- C = code/hypothesis generation (output of L5)
Earlier stages supply structured data and extracted knowledge to downstream modules; for example, OCR and translation outputs directly seed QA tasks, reviewed answers inform synthesis, and gaps mapped by review guide computational experimentation.
A plausible implication is that explicit modeling of these dependencies exposes composite failure modes, such as erroneous parsing propagating to incorrect hypothesis generation.
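Read as code, the chain is a simple composition of stages whose intermediate outputs are kept for diagnosis. The sketch below uses hypothetical stage functions standing in for model calls:

```python
from typing import Callable

# Each stage consumes the previous stage's output; signatures are illustrative only.
Stage = Callable[[str], str]


def run_pipeline(raw_document: str, parse: Stage, answer: Stage,
                 review: Stage, discover: Stage) -> dict[str, str]:
    """D -> (L2) T -> (L3) A -> (L4) R -> (L5) C, retaining every intermediate output."""
    parsed = parse(raw_document)       # T: structured parse / translation
    answers = answer(parsed)           # A: literature-grounded answers
    survey = review(answers)           # R: synthesized literature review
    hypothesis = discover(survey)      # C: experiment code or hypothesis
    return {"T": parsed, "A": answers, "R": survey, "C": hypothesis}
```

Because every intermediate output is retained, a failed hypothesis at L5 can be traced back to, say, a formula dropped during L2 parsing rather than being attributed to the discovery stage alone.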
5. Model Performance and Diagnostic Findings
Evaluation results for leading models such as GPT-5 and Deepseek-R1 reveal distinct capability gradients:
| Level | Top Model(s) | Metric | Value(s) | Discipline/Dataset Gaps |
|---|---|---|---|---|
| L1 | GPT-5 | Accuracy | 69.17% | Math ≈ 84%, Biology ≈ 50% |
| L2.1 | GPT-5 | BLEU (Parsing) | 67.61 | Qwen3-VL-8B: 64.76; Intern-VL3.5-8B: 9.53 |
| L2.2 | GPT-5 | BLEU (Translation) | 43.29 | S1-Base-Pro-32B: 41.28 |
| L3.1 | Deepseek-v3 | Accuracy | 96.20% | Vision-language subset: 80.45%; GPT-5: 76.75% |
| L3.2 | GPT-5 | Accuracy | 86.28% (VL) | Deepseek-R1: 79.83% (VL); 67.53% (text-only) |
| L4 | GPT-5 | Content Quality | 4.99 (out of 5) | Citation verifiability: SurveyX 71.4% vs GPT-5 19.3% |
| L5 | GPT-5 | Success Rate | 24.75% | Deepseek-R1: 21.05% |
Models attain high accuracy in basic QA (L3), but performance sharply declines in scientific literacy (L1) and data-driven discovery (L5). Notably, less than 20% of citations generated by general LLMs in literature reviews are verifiable; syntactically correct code produced in L5 often fails scientific validation.
This suggests persistent bottlenecks in multimodal fusion, grounding, citation integrity, and procedural reasoning. Monolingual QA remains the easiest setting, while vision-language and code-heavy tasks exhibit substantial gaps.
6. Insights, Failure Modes, and Recommendations
Analysis of HiSciBench results yields several structural insights:
- Multimodal Fusion Bottleneck (L2–L3): Vision–LLMs trail text-only models on translation and QA; joint pretraining on document layout, formulas, and text is recommended.
- Citation Hallucination (L4): Fluent literature reviews contain many fabricated references (verifiability <20%); integration of retrieval-augmented pipelines or grounding modules is essential.
- Procedural Reasoning Gap (L5): Syntactically valid code often fails scientific criteria; integrating domain-specific tool-use skills (e.g., geospatial libraries, time-series analysis) and automated verification is advised (a minimal verification sketch follows this list).
- Cross-lingual Robustness: Vision-language QA achieves ∼86% accuracy, but text-only cross-lingual QA is limited to 60–70%. Enhanced multilingual pretraining and fine-tuning are needed.
- Hierarchical Training and Diagnostics: Explicit dependency modeling can inform multi-stage curricula (e.g., refining OCR before QA, validating QA for reviews, checking reviews before experiments).
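For the L5 verification point above, a minimal sketch of automated checking is shown below; the `meets_criteria` callback is a hypothetical stand-in for the benchmark's task-specific scientific criteria:

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def verify_generated_code(code: str, meets_criteria, timeout_s: int = 120) -> bool:
    """Run model-generated analysis code in a subprocess and check its output.

    A run counts as a success only if it (1) exits without error and
    (2) its stdout satisfies the task-specific scientific criteria.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "experiment.py"
        script.write_text(code)
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True, text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
        if result.returncode != 0:
            return False  # execution failure
        return meets_criteria(result.stdout)  # scientific validity check
```

Aggregating these boolean outcomes over a task set reproduces the Success Rate defined earlier.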
These findings provide actionable guidance for future foundation model design. An implication is that training regimes mirroring task hierarchy and dependencies may foster more reliable, comprehensive scientific intelligence in models.
7. Role and Outlook of HiSciBench in Model Development
HiSciBench represents the first unified, interpretable standard for assessing scientific reasoning in foundation models from perception to innovation. By benchmarking distinct cognitive abilities in an integrated framework, it enables the identification of performance bottlenecks, diagnostic tracing of inter-stage failures, and targeted improvement strategies. The diversity of disciplines, modalities, and languages embedded in HiSciBench supports broad generalization testing, while its granularity informs both model developers and methodologists seeking robust, discovery-capable AI systems.
Public release of HiSciBench is anticipated to facilitate benchmarking and progress in reliable, multimodal, cross-lingual scientific foundation models, fostering transparent assessment and accelerating advancements in model intelligence and utility for authentic scientific work (Zhang et al., 28 Dec 2025).