Hydro-SE Bench: Domain-Specific LLM Evaluation
- Hydro-SE Bench is a specialized framework that assesses LLM performance in Hydro-Science and Engineering across nine subfields and three cognitive dimensions.
- It employs 4,000 precisely designed multiple-choice questions to evaluate fundamental concepts, applied reasoning, and calculation skills in realistic engineering scenarios.
- Results show commercial LLMs score 0.74–0.80 and small models 0.41–0.68, highlighting the need for domain-adaptive training and robust evaluation methods.
The Hydro-SE LLM evaluation benchmark (Hydro-SE Bench) is a large-scale, domain-specific framework for systematically evaluating the knowledge and reasoning capabilities of LLMs in Hydro-Science and Engineering (Hydro-SE) applications. Developed to address the inadequacy of generic benchmarks in capturing the complexity and criticality of Hydro-SE, Hydro-SE Bench offers a fine-grained assessment across core subfields, three cognitive levels, and diverse engineering scenarios. Its methodological rigor and comprehensive coverage establish it as a foundational resource for quantitative model analysis, targeted model adaptation, and domain-specific deployment guidance (Hu et al., 3 Dec 2025).
1. Scope and Objectives
Hydro-Science and Engineering constitutes a multidisciplinary field underpinning water security, hydropower generation, flood/drought mitigation, and ecological stewardship. The integration of scientific knowledge (hydrology, meteorology, geotechnics, hydraulics) with engineering expertise (structure design, power systems, risk management) is essential for robust decision-making but introduces evaluative complexity due to the field's vast domain-specific lexicon, frequently updated standards, and scenario-based application requirements.
Hydro-SE Bench was designed with the following objectives:
- Quantify LLM competence in Hydro-SE–specific knowledge, reasoning, and calculation tasks, extending beyond generic scientific capability.
- Reveal strengths and weaknesses in LLM performance at both the subfield and cognitive level granularity.
- Provide actionable guidance for future domain-adaptive training, supervised fine-tuning, and safe practical deployment.
2. Benchmark Structure and Content
Hydro-SE Bench encompasses 4,000 multiple-choice questions (MCQs) that comprehensively span nine secondary subfields and three cognitive dimensions:
| Subfield | Example Topics | Distribution by Cognitive Level |
|---|---|---|
| Background Knowledge (BK) | Water-cycle terms, resource facts | A ≈ 34%, B ≈ 33%, C ≈ 33% |
| Industry Standards (IS) | Regulatory codes, flood zone classification | A ≈ 30%, B ≈ 35%, C ≈ 35% |
| Hydrology and Water Resources (HWR) | Infiltration, runoff, flood models | A ≈ 36%, B ≈ 31%, C ≈ 33% |
| Geotechnical Engineering (GE) | Soil–foundation, seepage, filters | A ≈ 32%, B ≈ 34%, C ≈ 34% |
| Hydraulic Structures and Equipment (HSE) | Dams, turbines, penstocks | A ≈ 31%, B ≈ 33%, C ≈ 36% |
| Engineering Safety and Management (ESM) | Supervision, testing, risk management | A ≈ 33%, B ≈ 34%, C ≈ 33% |
| Hydraulics and River Dynamics (HRD) | Open-channel flow, sediment transport | A ≈ 35%, B ≈ 32%, C ≈ 33% |
| Meteorology (M) | Dew-point, monsoons, subtropical highs | A ≈ 34%, B ≈ 33%, C ≈ 33% |
| Power System (PS) | Short-circuit, converters, generator efficiency | A ≈ 36%, B ≈ 31%, C ≈ 33% |
- Single-answer questions (SCQ): 2,700 (67.5%)
- Multiple-answer questions: 1,300 (32.5%)
- Cognitive levels:
- Type A: Basic conceptual knowledge
- Type B: Scenario-based engineering application
- Type C: Reasoning and calculation
This distribution ensures each subfield is evaluated in foundational knowledge, applied context, and quantitative reasoning. Each question was crafted to be representative, precise, and relevant to current Hydro-SE practice.
3. Sample Questions and Domain Formulations
Hydro-SE Bench incorporates questions at all cognitive levels in each subfield. Representative samples include:
- Basic Conceptual Knowledge (Type A, HWR):
"During the hydrological cycle, which process serves as the primary linkage between surface water and groundwater, and is strongly influenced by soil type, vegetation, and rainfall intensity?" (Correct answer: Infiltration)
- Reasoning & Calculation (Type C, HRD):
"In calculating the water-surface profile of a natural river reach with discharge , cross-sectional area and wetted perimeter , what is the average hydraulic radius ?" (Correct answer: )
Core formulas integral to Type C questions exemplify the mathematical rigor required for accurate evaluation:
- Continuity equation (steady flow): $Q = A_1 v_1 = A_2 v_2$
- Hydraulic radius: $R = A / P$
- Bernoulli's equation: $z_1 + \dfrac{p_1}{\gamma} + \dfrac{\alpha_1 v_1^2}{2g} = z_2 + \dfrac{p_2}{\gamma} + \dfrac{\alpha_2 v_2^2}{2g} + h_w$
- Turbine power output: $P = \rho g Q H \eta$, with overall efficiency $\eta = \eta_{\text{turbine}}\,\eta_{\text{generator}}$
- Short-circuit conversion (short-circuit capacity $S_k$ to per-unit system reactance at base power $S_B$): $X_{*} = S_B / S_k$
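To make the Type C arithmetic concrete, the following minimal Python sketch (illustrative only; the function names and numeric inputs are invented and are not part of the benchmark tooling) evaluates two of the formulas above:

```python
# Illustrative check of two Type C formulas; all inputs are invented.

def hydraulic_radius(area_m2: float, wetted_perimeter_m: float) -> float:
    """Hydraulic radius R = A / P."""
    return area_m2 / wetted_perimeter_m

def turbine_power_w(discharge_m3s: float, head_m: float, efficiency: float,
                    rho: float = 1000.0, g: float = 9.81) -> float:
    """Turbine power output P = rho * g * Q * H * eta, in watts."""
    return rho * g * discharge_m3s * head_m * efficiency

# A = 120 m^2, P = 60 m  ->  R = 2.0 m
print(hydraulic_radius(120.0, 60.0))            # 2.0
# Q = 50 m^3/s, H = 30 m, eta = 0.9  ->  about 13.24 MW
print(turbine_power_w(50.0, 30.0, 0.9) / 1e6)   # 13.2435
```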
4. Evaluation Methodology
Sixteen LLMs, comprising 10 commercial large-parameter models (30B–100B+ parameters) and 6 small-parameter open-source models (7B–72B), were benchmarked. The evaluation pipeline consisted of:
- Prompting: Deterministic response generation (temperature = 0), requiring explicit reasoning for each answer.
- Answer Extraction: Employing a secondary LLM to extract choice letters from the generated content.
- Scoring: Exact match comparison against a gold-standard answer key.
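A minimal sketch of this three-stage pipeline appears below. It is an illustration under stated assumptions, not the authors' released code: `query_model` is a hypothetical stand-in for any LLM API (deterministic generation at temperature 0 is assumed to happen inside it), and a regex fallback replaces the secondary-LLM answer extractor described above.

```python
import re
from typing import Callable

def extract_choice(generated: str) -> str | None:
    """Pull the final answer letter(s) from a model's free-form response.

    The benchmark uses a secondary LLM for extraction; a regex fallback
    is shown here only to keep the sketch self-contained.
    """
    matches = re.findall(r"\b([A-D](?:\s*,\s*[A-D])*)\b", generated.upper())
    if not matches:
        return None
    # Normalize e.g. "A, C" -> "AC" and take the last mention in the text.
    return "".join(sorted(re.findall(r"[A-D]", matches[-1])))

def score_item(model_answer: str | None, gold: str) -> int:
    """Exact match against the gold answer key (1 = correct, 0 = wrong)."""
    return int(model_answer == "".join(sorted(gold.upper())))

def evaluate(questions: list[dict], query_model: Callable[[str], str]) -> float:
    """Run the prompt -> extract -> score loop and return overall accuracy."""
    correct = 0
    for q in questions:
        response = query_model(q["prompt"])  # deterministic, with reasoning
        correct += score_item(extract_choice(response), q["gold"])
    return correct / len(questions)
```

Handling single- and multiple-answer items with the same normalized exact-match comparison keeps scoring uniform across both question formats.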
Metrics include:
- Accuracy: $\text{Accuracy} = N_{\text{correct}} / N_{\text{total}}$, computed overall, by subfield, and by cognitive level.
- Confidence calibration: Models self-reported ordinal confidence (1–5) to generate calibration curves for reliability assessment.
- Sampling stability: Repeated random subsampling of the question set (10–100% of items, 10 iterations) measured benchmark robustness; accuracy estimates varied within ±2% across subsamples.
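The sampling-stability check is straightforward to reproduce in outline. The sketch below uses invented toy scores and standard-library functions only; it is not the published evaluation code.

```python
import random
import statistics

def subsample_stability(scores: list[int], fraction: float,
                        iterations: int = 10, seed: int = 0) -> tuple[float, float]:
    """Mean accuracy and spread over repeated random subsamples."""
    rng = random.Random(seed)
    k = max(1, int(len(scores) * fraction))
    accs = [statistics.mean(rng.sample(scores, k)) for _ in range(iterations)]
    return statistics.mean(accs), max(accs) - min(accs)

# Toy item-level 0/1 scores for 4,000 questions at ~75% accuracy (invented).
rng = random.Random(42)
scores = [1 if rng.random() < 0.75 else 0 for _ in range(4000)]
for frac in (0.1, 0.5, 1.0):
    mean_acc, spread = subsample_stability(scores, frac)
    print(f"{int(frac * 100):3d}% sample: mean={mean_acc:.3f}, spread={spread:.3f}")
```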
5. Performance Results and Scaling Effects
Commercial large-parameter LLMs achieved overall accuracies in the range 0.74–0.80, while small-parameter open-source LLMs scored 0.41–0.68. Subfield-level performance exhibited pronounced variation:
| Subfield | Commercial LLM Mean Accuracy |
|---|---|
| Power System (PS) | ≈ 0.83 |
| Meteorology (M) | ≈ 0.79 |
| Hydraulics and River Dynamics (HRD) | ≈ 0.79 |
| Background Knowledge (BK) | ≈ 0.70 |
| Industry Standards (IS) | ≈ 0.70 |
| Engineering Safety and Management (ESM) | ≈ 0.71 |
| Hydraulic Structures and Equipment (HSE) | < 0.75 |
Model scaling produced the following effects:
- Reasoning & calculation (Type C): Accuracy improved from 0.50 (small models) to 0.78 (large models), a 56% relative gain (see the worked computation after this list).
- Basic concept (A): 23% improvement.
- Engineering application (B): 33% improvement.
- Most significant subfield gains: HRD (+0.23), GE (+0.22).
- Least significant: IS (+0.09), ESM (+0.13).
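For clarity, the 56% Type C figure is a relative gain over the small-model baseline (an absolute improvement of 28 percentage points):

$$\text{relative gain} = \frac{0.78 - 0.50}{0.50} = 0.56 = 56\%.$$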
LLMs exhibited highest accuracy in subfields sharing substantial overlap with natural and physical sciences (PS, M, HRD). Performance dropped in highly domain-specific or regulatory domains (BK, IS, ESM, HSE).
6. Strengths, Weaknesses, and Diagnostic Insights
Strengths:
- High recall of fundamental science and physics (e.g., open-channel hydraulics, turbine thermodynamics).
- Robust multi-step numerical reasoning when domain concepts are widely covered in training corpora.
- Consistency in fields with universal physical principles.
Weaknesses:
- Limited coverage of frequently updated industry standards and region-specific regulatory frameworks.
- Difficulty with specialized terminology and nuanced, context-driven methodological selection.
- Overconfidence in self-assessed correctness, increasing the risk of hallucination or misapplication, particularly for outputs reported at high confidence.
A plausible implication is that LLMs trained on general scientific corpora are insufficient for Hydro-SE professional deployment, especially for tasks requiring interpretive or judgment-intensive application of local industry norms (Hu et al., 3 Dec 2025).
7. Recommendations and Future Directions
Key recommendations for advancing Hydro-SE LLM performance include:
- Domain-adaptive pre-training: Curate current Hydro-SE corpora (textbooks, industrial codes, national standards) to correct for knowledge gaps and currency.
- Supervised expert annotation: Use expert-annotated question–answer datasets, focusing on engineering application and regulatory scenarios, to enhance contextual and reasoning skills.
- Integration of multimodal data: Leverage text, remote sensing imagery, time-series gauges, and interactive scenarios to simulate realistic engineering decision workflows.
- Metrics standardization: Develop metrics for domain reasoning quality, uncertainty quantification, and structured human–AI collaboration.
- Longitudinal benchmarking: Incorporate periodic re-evaluation to monitor progress, inform curation, and drive iterative refinement.
This suggests that robust, trustworthy LLM integration into Hydro-SE practice will depend on addressing both the breadth (knowledge coverage) and depth (applied and regulatory reasoning) of model capabilities through targeted benchmark-driven loops (Hu et al., 3 Dec 2025).
8. Positioning within the Broader Benchmarking Ecosystem
Hydro-SE Bench aligns with emerging best practices for domain-specific LLM evaluation, as exemplified by infrastructure such as AECBench for architecture, engineering, and construction (Liang et al., 23 Sep 2025). These benchmarks prioritize hierarchical cognitive frameworks, rigorous expert-driven dataset construction with multi-tier review, and multi-format evaluation (MCQ, scenario, extraction, generation).
A plausible implication is that the Hydro-SE Bench model—emphasizing cognitive axis stratification, subfield detail, and standardized evaluation—serves as an extensible template for benchmarking in similarly interdisciplinary, safety-critical domains.
References:
- (Hu et al., 3 Dec 2025): Zhao et al., "Evaluating Hydro-Science and Engineering Knowledge of LLMs"
- (Liang et al., 23 Sep 2025): Liu et al., "AECBench: A Hierarchical Benchmark for Knowledge Evaluation of LLMs in the AEC Field"