HiSciBench: Multidisciplinary Science Benchmark

Updated 4 January 2026
  • HiSciBench is a multi-disciplinary benchmark assessing scientific intelligence through a five-level hierarchy that mirrors the research process.
  • It integrates multimodal inputs—text, equations, images, and tables—across six core disciplines with 8,735 curated instances to diagnose model capabilities.
  • Tailored evaluation metrics reveal performance gaps in citation accuracy, multimodal fusion, and procedural reasoning, guiding future AI model improvements.

HiSciBench is a hierarchical, multi-disciplinary benchmark designed to assess the scientific intelligence of LLMs and multimodal foundation models. It mirrors the real-world scientific workflow, ranging from basic factual recall to hypothesis-driven discovery. HiSciBench uniquely integrates five levels (Scientific Literacy, Literature Parsing, Literature-based Question Answering, Literature Review Generation, Scientific Discovery) and supports multimodal inputs—including text, equations, images, figures, and tables—across six core disciplines: mathematics, physics, chemistry, biology, geography, and astronomy. With 8,735 curated instances, HiSciBench enables granular, dependency-aware diagnosis of model capabilities and failure modes throughout the research pipeline (Zhang et al., 28 Dec 2025).

1. Hierarchical Structure of Scientific Reasoning

HiSciBench operationalizes scientific intelligence as a five-level hierarchy, each capturing a distinct stage in the research process:

  • L1: Scientific Literacy (Perceive). Models must recall and explain fundamental concepts. Tasks include answering general multiple-choice questions (e.g., "What is the conservation of momentum?").
  • L2: Literature Parsing (Parse). Models extract and translate the content of scientific documents. Subtasks: Document OCR and Parsing (producing Markdown including formulas from PDFs or figures) and Cross-lingual Translation (preserving technical semantics between languages).
  • L3: Literature-based Question Answering (Reason). Models perform deep inference within scientific documents. Subtasks split into Monolingual QA (fine-grained questions on single papers) and Cross-lingual QA (questions about English sources posed in other languages).
  • L4: Literature Review Generation (Synthesize). Given a topic and a set of core papers, models synthesize a coherent, critical survey, integrating findings and providing analytical context.
  • L5: Scientific Discovery (Innovate). Models design and execute computational experiments using structured data, infer patterns, and propose novel hypotheses (e.g., predicting catalysts, analyzing climate data).

Each level is dependency-aware: outputs from each earlier stage feed into the subsequent stage, structurally modeling the cascade of real scientific work.
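
For concreteness, the five levels and their ordering can be captured in a small data structure. The sketch below is illustrative only: the level names and verbs come from the benchmark, but the code is not part of any released HiSciBench tooling.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Level:
    """One stage of the HiSciBench hierarchy."""
    tag: str    # e.g. "L1"
    name: str   # level name used by the benchmark
    verb: str   # cognitive action associated with the level

# The five levels in dependency order: each stage consumes artifacts
# produced by the stages before it.
HISCIBENCH_LEVELS = [
    Level("L1", "Scientific Literacy", "Perceive"),
    Level("L2", "Literature Parsing", "Parse"),
    Level("L3", "Literature-based Question Answering", "Reason"),
    Level("L4", "Literature Review Generation", "Synthesize"),
    Level("L5", "Scientific Discovery", "Innovate"),
]

def upstream_of(tag: str) -> list[str]:
    """Return the tags of all levels a given level depends on."""
    tags = [lvl.tag for lvl in HISCIBENCH_LEVELS]
    return tags[: tags.index(tag)]

print(upstream_of("L4"))  # ['L1', 'L2', 'L3']
```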

2. Dataset Composition and Multi-disciplinary Scope

The benchmark comprises 8,735 instances constructed from authentic scientific papers across six disciplines:

| Level | Subtask/Type | Instances | Domain Coverage |
|-------|--------------|-----------|-----------------|
| L1.1 | Multiple-choice QA | 1,200 | 200 per discipline |
| L2.1/2.2 | OCR/Parsing & Translation | 629 + 629 | Math 208, Physics 357, Astronomy 19, Biology 45 |
| L3.1 | Monolingual QA | 5,514 | Math 821, Physics 1,025, Chemistry 886, Astronomy 330, Geography 500, Biology 1,952 |
| L3.2 | Cross-lingual QA | 629 | Same source pages as L2 |
| L4.1 | Topic-guided Reviews | 60 | 10 per discipline |
| L5.1 | Data-driven Discovery | 74 | Chemistry 20, Geography 27, Biology 27 |
The disciplinary spread is reasonably balanced: biology (26.6%), physics (26.4%), chemistry (≈14%), mathematics (≈12%), geography (≈11%), and astronomy (≈9%). 84.7% of instances use multimodal inputs (image-text pairs, equations, tables), and cross-lingual evaluation is embedded in L2.2 and L3.2.
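
To illustrate how such composition statistics could be recomputed from a released instance file, the following sketch assumes a JSON-lines distribution with per-instance "level", "discipline", and "modalities" fields; these field names and the file format are assumptions, since the summary does not specify a distribution format.

```python
import json
from collections import Counter

def load_instances(path: str) -> list[dict]:
    """Read one JSON object per line (hypothetical release format)."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

def composition(instances: list[dict]) -> Counter:
    """Count instances per (level, discipline) pair, mirroring the table above."""
    return Counter((ex["level"], ex["discipline"]) for ex in instances)

def multimodal_share(instances: list[dict]) -> float:
    """Fraction of instances whose inputs include non-text modalities."""
    multimodal = sum(
        1 for ex in instances if set(ex.get("modalities", [])) - {"text"}
    )
    return multimodal / len(instances)
```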

3. Evaluation Methodologies and Metrics

HiSciBench employs metrics tailored to each task:

  • L1 & L3 (QA): Accuracy, defined as $\frac{\#\text{correct}}{\#\text{total}} \times 100\%$.
  • L2.1 (OCR): Word-level Accuracy, $\frac{\text{correct tokens}}{\text{all tokens}}$.
  • L2.2 (Translation): BLEU, $\text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right)$.
  • L4 (Reviews): Content Quality (five LLM-judge criteria, each rated 1–5: Coverage, Structure, Relevance, Synthesis, Critical Analysis) and Citation Quality (Verifiability Rate, Metadata Accuracy, Faithfulness Rate, Citation Count, Source Count, Source Distribution Entropy, Recency Rate).
  • L5 (Discovery): Success Rate (SR), $\text{SR} = \frac{\#\text{tasks successfully executed}}{\#\text{tasks}} \times 100\%$; "success" requires code to run error-free and meet scientific output criteria.

These multi-faceted metrics enable both quantitative aggregation and in-depth analysis of model abilities and weaknesses at each cognitive level.
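
The scalar metrics translate directly into code. The reference implementations below are minimal sketches rather than the official scoring scripts; in particular, the position-wise token alignment used for word-level accuracy is an assumption, and BLEU would normally be computed with an existing package (e.g., sacrebleu) rather than re-implemented.

```python
def accuracy(predictions: list[str], gold: list[str]) -> float:
    """L1/L3 QA: share of exactly matching answers, in percent."""
    correct = sum(p.strip() == g.strip() for p, g in zip(predictions, gold))
    return 100.0 * correct / len(gold)

def word_level_accuracy(pred_tokens: list[str], ref_tokens: list[str]) -> float:
    """L2.1 OCR/parsing: correct tokens over all reference tokens.
    Position-wise comparison is an assumption; the paper may align tokens differently."""
    correct = sum(p == r for p, r in zip(pred_tokens, ref_tokens))
    return correct / len(ref_tokens)

def success_rate(task_outcomes: list[bool]) -> float:
    """L5 discovery: share of tasks whose generated code ran error-free
    and satisfied the scientific output criteria, in percent."""
    return 100.0 * sum(task_outcomes) / len(task_outcomes)
```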

4. Integrated, Dependency-Aware Workflow Modeling

HiSciBench is designed with explicit inter-level dependencies:

  • D → (L2) T → (L3) A → (L4) R → (L5) C
    • D = raw PDFs/images
    • T = structured document parses or translated text (output of L2)
    • A = answers generated from parsed documents and question sets (output of L3)
    • R = literature review synthesized from documents and answers (output of L4)
    • C = code/hypothesis generation (output of L5)

Earlier stages supply structured data and extracted knowledge to downstream modules; for example, OCR and translation outputs directly seed the QA tasks, QA answers inform review synthesis, and gaps identified in the review guide computational experimentation.
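
A minimal sketch of how this dependency chain could be wired for diagnostic evaluation is shown below; the stage interfaces are assumptions for illustration, not a published HiSciBench API.

```python
from typing import Callable

def run_pipeline(
    raw_documents: list[bytes],            # D: raw PDFs/images
    parse: Callable[[list[bytes]], str],   # L2: OCR/parsing/translation -> T
    answer: Callable[[str], dict],         # L3: document-grounded QA -> A
    review: Callable[[str, dict], str],    # L4: literature review synthesis -> R
    discover: Callable[[str], str],        # L5: code/hypothesis generation -> C
) -> dict:
    """Run the D -> T -> A -> R -> C chain, keeping every intermediate artifact."""
    parsed_text = parse(raw_documents)      # T
    answers = answer(parsed_text)           # A
    survey = review(parsed_text, answers)   # R
    hypothesis = discover(survey)           # C
    # Retaining intermediate artifacts allows a failure at L5 to be traced
    # back to, e.g., an OCR error introduced at L2.
    return {"T": parsed_text, "A": answers, "R": survey, "C": hypothesis}
```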

A plausible implication is that explicit modeling of these dependencies exposes composite failure modes, such as erroneous parsing propagating to incorrect hypothesis generation.

5. Model Performance and Diagnostic Findings

Evaluation results for leading models such as GPT-5 and Deepseek-R1 reveal distinct capability gradients:

| Level | Top Model(s) | Metric | Value | Notable Gaps / Comparisons |
|-------|--------------|--------|-------|----------------------------|
| L1 | GPT-5 | Accuracy | 69.17% | Math ≈ 84%, Biology ≈ 50% |
| L2.1 | GPT-5 | BLEU (Parsing) | 67.61 | Qwen3-VL-8B: 64.76; Intern-VL3.5-8B: 9.53 |
| L2.2 | GPT-5 | BLEU (Translation) | 43.29 | S1-Base-Pro-32B: 41.28 |
| L3.1 | Deepseek-v3 | Accuracy | 96.20% | Vision-language subset: 80.45%; GPT-5: 76.75% |
| L3.2 | GPT-5 | Accuracy | 86.28% (VL) | Deepseek-R1: 79.83% (VL); 67.53% (text-only) |
| L4 | GPT-5 | Content Quality | 4.99 (out of 5) | Citation verifiability: SurveyX 71.4% vs GPT-5 19.3% |
| L5 | GPT-5 | Success Rate | 24.75% | Deepseek-R1: 21.05% |

Models attain high accuracy in basic QA (L3), but performance sharply declines in scientific literacy (L1) and data-driven discovery (L5). Notably, less than 20% of citations generated by general LLMs in literature reviews are verifiable; syntactically correct code produced in L5 often fails scientific validation.

These results point to persistent bottlenecks in multimodal fusion, grounding, citation integrity, and procedural reasoning. Monolingual text QA remains the easiest setting, while vision-language and code-heavy tasks exhibit substantial gaps.

6. Insights, Failure Modes, and Recommendations

Analysis of HiSciBench results yields several structural insights:

  • Multimodal Fusion Bottleneck (L2–L3): Vision-language models trail text-only models on translation and QA; joint pretraining on document layout, formulas, and text is recommended.
  • Citation Hallucination (L4): Fluent literature reviews contain many fabricated references (verifiability <20%); integration of retrieval-augmented pipelines or grounding modules is essential.
  • Procedural Reasoning Gap (L5): Syntactically valid code often fails scientific criteria; integrating domain-specific tool-use skills (e.g., geospatial libraries, time-series analysis) and automated verification is advised (see the sketch after this list).
  • Cross-lingual Robustness: Vision-language QA achieves ∼86% accuracy, but text-only cross-lingual QA is limited to 60–70%. Enhanced multilingual pretraining and fine-tuning are needed.
  • Hierarchical Training and Diagnostics: Explicit dependency modeling can inform multi-stage curricula (e.g., refining OCR before QA, validating QA for reviews, checking reviews before experiments).
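
As an illustration of the automated verification recommended for L5, the sketch below executes model-generated code in a subprocess and then applies a task-specific scientific check, mirroring the Success Rate definition in Section 3; the check_output hook and the temporary-directory convention are hypothetical, not HiSciBench's actual grader.

```python
import subprocess
import sys
import tempfile
from pathlib import Path
from typing import Callable

def verify_generated_code(
    code: str,
    check_output: Callable[[Path], bool],  # task-specific scientific criteria (hypothetical hook)
    timeout_s: int = 120,
) -> bool:
    """Return True only if the code runs error-free AND passes the scientific check."""
    workdir = Path(tempfile.mkdtemp())
    script = workdir / "experiment.py"
    script.write_text(code, encoding="utf-8")
    try:
        proc = subprocess.run(
            [sys.executable, str(script)],
            cwd=workdir, capture_output=True, text=True, timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False              # hung code counts as a failure
    if proc.returncode != 0:
        return False              # must run error-free
    return check_output(workdir)  # must also meet the scientific output criteria
```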

These findings provide actionable guidance for future foundation model design. An implication is that training regimes mirroring task hierarchy and dependencies may foster more reliable, comprehensive scientific intelligence in models.

7. Role and Outlook of HiSciBench in Model Development

HiSciBench represents the first unified, interpretable standard for assessing scientific reasoning in foundation models from perception to innovation. By benchmarking distinct cognitive abilities in an integrated framework, it enables the identification of performance bottlenecks, diagnostic tracing of inter-stage failures, and targeted improvement strategies. The diversity of disciplines, modalities, and languages embedded in HiSciBench supports broad generalization testing, while its granularity informs both model developers and methodologists seeking robust, discovery-capable AI systems.

Public release of HiSciBench is anticipated to facilitate benchmarking and progress in reliable, multimodal, cross-lingual scientific foundation models, fostering transparent assessment and accelerating advancements in model intelligence and utility for authentic scientific work (Zhang et al., 28 Dec 2025).
