
UltraDomain Benchmarks: Cross-Domain SciML

Updated 22 November 2025
  • UltraDomain Benchmarks are integrated SciML benchmarks with unified specifications ensuring standardized evaluations across disciplines.
  • They define a 5-tuple structure (P, D, M, S, E) providing clear problem statements, FAIR datasets, reproducible metrics, reference solutions, and precise execution environments.
  • The benchmarks facilitate fair cross-domain comparisons through a rigorous six-category rating rubric and an extensible taxonomy for emerging scientific challenges.

UltraDomain Benchmarks are a class of scientific machine learning (SciML) benchmarks defined to span multiple traditional scientific disciplines while enforcing unified specifications, data formats, and evaluation protocols. They are engineered to simultaneously stress both domain-specific learning tasks and cross-cutting AI/ML and systems motifs, addressing the need for reproducible and standardized cross-domain benchmarking in scientific machine learning (Hawks et al., 6 Nov 2025). This construct advances prior efforts by enabling fair, system-aware algorithm comparisons across disciplines such as physics, chemistry, materials science, biology, and climate science.

1. Formal Definition and Multi-Level Specification

An UltraDomain Benchmark (UDB) is formally defined by a 5-tuple:

$\mathrm{UDB} \equiv (P, D, M, S, E)$

where:

  • $P$ = problem statement with explicit domain constraints (inputs, outputs, invariants)
  • $D$ = dataset (FAIR-compliant, canonical splits, versioned)
  • $M$ = $\{m_1, \ldots, m_k\}$, a set of rigorously defined and reproducible performance metrics
  • $S$ = reference solution (codebase, trained weights, hardware bill-of-materials)
  • $E$ = execution environment specification (containerization or precise package versions)

This abstraction states that every UDB must instantiate:

  • A scientific-level specification (problem, input modality, ground-truth, constraints such as conservation laws),
  • An application-level protocol (dataset splits, metrics, reference solution),
  • A system-level definition (environment, hardware-aware measurements, system constraints like power or latency budgets).
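
As a minimal sketch, assuming no particular implementation language is mandated, the 5-tuple and its multi-level specification could be captured in a structured record along these lines (field names and example values are illustrative, not part of the formal definition):

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class UltraDomainBenchmark:
    """Illustrative container for the UDB 5-tuple (P, D, M, S, E)."""
    problem: str                        # P: task, inputs/outputs, invariants (e.g., conservation laws)
    dataset: Dict[str, str]             # D: FAIR dataset reference (identifier, version, canonical splits)
    metrics: List[str]                  # M: {m_1, ..., m_k}, named reproducible performance metrics
    reference_solution: Dict[str, str]  # S: codebase, trained weights, hardware bill-of-materials
    environment: Dict[str, str]         # E: container image or pinned package versions

# Hypothetical instance for a jet-classification benchmark.
jet_udb = UltraDomainBenchmark(
    problem="Binary jet classification under fixed detector-geometry constraints",
    dataset={"doi": "10.0000/example", "version": "1.0", "splits": "train/val/test manifest"},
    metrics=["ROC-AUC", "accuracy", "inference latency (ms)"],
    reference_solution={"repo": "https://example.org/ref-solution", "weights": "v1.0.ckpt"},
    environment={"container": "docker://example/jet-udb:1.0"},
)
```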

2. The MLCommons Science Benchmarks Ontology: System Architecture

The UltraDomain Benchmark infrastructure is embedded within the MLCommons Science Benchmarks Ontology, which consists of four principal subsystems:

| Subsystem | Function | Key Features |
| --- | --- | --- |
| Submission Portal & Ingestion | Structured metadata capture, auto-validation | FAIR-compliance checks, domain/motif tags |
| Review & Rating Engine | Six-category scoring, coordinated by the MLCommons Science WG | Uniform scoring rubric, endorsement threshold |
| Ontology & Taxonomy Database | Storage of UDBs, metadata, hierarchical classification | Scientific domains, AI/ML/computing motifs |
| User Interface & Selection Engine | REST API/web interface, cluster-based benchmark recommendation | N-dimensional feature clustering |

Data flow proceeds from structured submission through automated validation and human review/rating, into the taxonomy database, before exposure via the user interface and programmatic API.
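
A hedged sketch of the kind of structured metadata record the portal might ingest and auto-validate; the field names and checks below are assumptions for illustration, not the published submission schema:

```python
# Hypothetical submission record; field names are illustrative only.
submission = {
    "name": "Open Catalyst adsorption-energy regression",
    "scientific_domain": "Chemistry",
    "ai_ml_motif": "Regression",
    "computing_motif": "Throughput-Bound",
    "dataset_doi": "10.0000/example-doi",
    "license": "CC-BY-4.0",
    "container": "docker://example/ocp-benchmark:1.0",
}

REQUIRED_FIELDS = {"name", "scientific_domain", "ai_ml_motif",
                   "computing_motif", "dataset_doi", "license", "container"}

def validate(record: dict) -> list:
    """Return a list of schema problems; an empty list means the record passes."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - record.keys())]
    if "dataset_doi" in record and not record["dataset_doi"].startswith("10."):
        problems.append("dataset_doi does not look like a DOI (FAIR 'Findable' check)")
    return problems

print(validate(submission))  # [] for a complete, well-formed record
```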

3. Six-Category Rating Rubric and Quality Control

Each UDB is scored on a six-dimensional rubric (0–5 per category) to enforce rigorous cross-domain standards:

  1. Software Environment (completeness, documentation, containerization)
  2. Problem Specification & Constraints (task definition, system boundaries)
  3. Dataset (FAIR Principles: Findable, Accessible, Interoperable, Reusable)
  4. Performance Metrics (formal metric definition and axis completeness)
    • E.g., MSE: $\frac{1}{n} \sum_i (y_i - \hat{y}_i)^2$
    • ROC-AUC: $\int_0^1 \mathrm{TPR}(\mathrm{FPR}^{-1}(t))\,dt$ (executable forms of both are sketched after this list)
  5. Reference Solution (code, reproducibility, full metric reporting, full hyperparameter/config detail)
  6. Documentation (scientific background, motivation, protocol, publication)
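
For category 4, the formal definitions above can be accompanied by executable forms; a minimal sketch of the two example metrics, assuming NumPy and scikit-learn (neither library is mandated by the rubric):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def mse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean squared error: (1/n) * sum_i (y_i - yhat_i)^2."""
    return float(np.mean((y_true - y_pred) ** 2))

# Regression example
y_true = np.array([0.5, 1.2, -0.3])
y_pred = np.array([0.4, 1.0, -0.1])
print("MSE:", mse(y_true, y_pred))

# Classification example: ROC-AUC, the area under the TPR-vs-FPR curve
labels = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])
print("ROC-AUC:", roc_auc_score(labels, scores))
```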

A mean score above 4.5/5 qualifies a benchmark for “MLCommons Science Endorsement.” The uniform rubric guarantees comparability whether the submission is, for example, a catalytic chemistry generator or a turbulence surrogate model.
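
Assuming the endorsement criterion is a simple unweighted mean of the six category scores (as stated above), the threshold check reduces to:

```python
CATEGORIES = ["software_env", "problem_spec", "dataset",
              "metrics", "reference_solution", "documentation"]

def endorsed(scores: dict, threshold: float = 4.5) -> bool:
    """True if the mean of the six category scores exceeds the endorsement threshold."""
    assert set(scores) == set(CATEGORIES), "all six categories must be scored"
    return sum(scores.values()) / len(scores) > threshold

example = {**dict.fromkeys(CATEGORIES, 5.0), "documentation": 4.0}
print(endorsed(example))  # mean = 4.83 -> True
```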

4. Taxonomy: Scientific Domains, AI/ML Motifs, and Computing Motifs

UltraDomain Benchmarks are classified along three orthogonal axes:

  • Scientific Domain: Physics, Chemistry, Materials Science, Biology, Climate Science
    • E.g., Physics: $f : X \to \{\mathrm{signal}, \mathrm{background}\}$ (jet classification)
    • Chemistry: $f : \mathrm{Graph} \to \mathbb{R}$ (adsorption energy regression)
    • Materials: $f : \mathrm{Crystal} \to \mathbb{R}$, MAE (band-gap prediction)
  • AI/ML Motif: Classification, Regression, Sequence Forecasting, Surrogate Modeling, Generative Modeling, Multimodal Reasoning, Anomaly Detection, Reinforcement Learning, Reasoning/Generalization
  • Computing Motif: Latency-Bound, Memory-Bound, Throughput-Bound, Utilization-Bound

Each UDB is tagged $(d, m, c)$, providing a standardized descriptor. For example, the Open Catalyst Project is labeled (Chemistry, Regression, Throughput-Bound).
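
A minimal sketch of the (d, m, c) descriptor as a typed tuple; the enumerations mirror the axes listed above, while the class and member names are illustrative:

```python
from enum import Enum
from typing import NamedTuple

class Domain(Enum):
    PHYSICS = "Physics"
    CHEMISTRY = "Chemistry"
    MATERIALS_SCIENCE = "Materials Science"
    BIOLOGY = "Biology"
    CLIMATE_SCIENCE = "Climate Science"

class AIMLMotif(Enum):
    CLASSIFICATION = "Classification"
    REGRESSION = "Regression"
    SURROGATE_MODELING = "Surrogate Modeling"
    # ... the remaining motifs from the list above would be added here

class ComputingMotif(Enum):
    LATENCY_BOUND = "Latency-Bound"
    MEMORY_BOUND = "Memory-Bound"
    THROUGHPUT_BOUND = "Throughput-Bound"
    UTILIZATION_BOUND = "Utilization-Bound"

class UDBTag(NamedTuple):
    """The (d, m, c) descriptor attached to each UDB."""
    d: Domain
    m: AIMLMotif
    c: ComputingMotif

open_catalyst = UDBTag(Domain.CHEMISTRY, AIMLMotif.REGRESSION, ComputingMotif.THROUGHPUT_BOUND)
print(open_catalyst)
```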

5. Submission, Review, and Recommendation Workflow

New benchmarks are submitted via a portal comprising structured metadata and undergo automated checks (including dataset FAIR-compliance and schema-validated metadata). Human review is performed by the MLCommons Science Working Group, leveraging the six-category rating rubric, and results are stored in the ontology database.

A user interface (https://mlcommons-science.github.io/benchmark/) and a programmatic API support benchmark selection and recommendation. The clustering-and-recommendation subsystem generates an N-dimensional feature vector for each benchmark solution, including rubric ratings and profiler metrics such as mean power usage, utilization, latency, and throughput. Hierarchical clustering (cosine distance) then identifies the subsets that best match user-stated preferences (e.g., prioritizing low latency and high ROC-AUC).
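
A hedged sketch of how preference-driven selection might be realized: rank normalized benchmark feature vectors by cosine similarity to a user preference vector. The feature layout, normalization, and numbers below are assumptions for illustration, not the deployed implementation:

```python
import numpy as np

# Hypothetical per-benchmark feature vectors, min-max normalized to [0, 1]:
# [rubric mean, ROC-AUC, 1 - latency, throughput]  ("1 - latency" so higher is better)
features = {
    "jet-classification":  np.array([0.90, 0.95, 0.98, 0.40]),
    "climate-forecast":    np.array([0.80, 0.70, 0.10, 0.90]),
    "catalyst-regression": np.array([0.95, 0.85, 0.60, 0.75]),
}

# User preference: prioritize high ROC-AUC and low latency.
preference = np.array([0.2, 1.0, 1.0, 0.2])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

ranked = sorted(features, key=lambda name: cosine(features[name], preference), reverse=True)
print(ranked)  # benchmarks most aligned with the stated preference come first
```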

6. Extensibility, Hardware-Aware Analysis, and Gap-Driven Expansion

The ontology is designed for extensibility to emerging scientific domains and AI/ML motifs. Continuous ingestion of hardware-profiler data allows automated clustering of benchmarks:

$v = [r_1, \ldots, r_6, \bar{\mathrm{power}}, \bar{\mathrm{utilization}}, \mathrm{latency}, \mathrm{throughput}]$

where $r_i$ are rubric category scores. Hierarchical clustering by cosine distance identifies natural groupings (e.g., "low-power quantum chemistry surrogates" vs. "high-throughput climate forecast transformers"). Analysis of feature-space gaps drives targeted calls for under-represented benchmarks (e.g., geophysical inverse problems or quantum materials discovery), creating a feedback loop for continuous ontology refinement.
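
A minimal sketch of this clustering step using SciPy; the library choice, the synthetic numbers, and the per-column scaling caveat are assumptions, while the feature layout follows the vector v above:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

# One vector v per benchmark reference solution:
# [r_1..r_6 rubric scores, mean power (W), mean utilization, latency (ms), throughput (samples/s)]
V = np.array([
    [5, 5, 4, 5, 4, 5,  80, 0.4,   5, 2000],  # low-power, low-latency surrogate
    [5, 4, 5, 5, 5, 4, 350, 0.9, 400,   60],  # power-hungry, high-latency model
    [4, 5, 4, 4, 5, 5,  90, 0.5,   6, 1800],  # profile similar to the first row
])

# In practice each column would likely be rescaled first, since the raw units differ widely.
Z = linkage(pdist(V, metric="cosine"), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # cluster assignment per benchmark, e.g. rows 1 and 3 grouped together
```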

7. Best Practices for Standardization and Reproducibility

To ensure scalability and reproducibility, rigorous guidelines are mandated:

  • Full specification of the 5-tuple $(P, D, M, S, E)$
  • FAIR datasets: persistent identifiers, JSON-LD metadata, canonical splits via manifest
  • Formal metric definitions (LaTeX/code), explicit system constraints (hardware, power/latency/throughput)
  • Reference solution containerization (Docker/Singularity, reproducible runs via docker run)
  • Modular code (data loader, model, training, evaluation) with a single YAML/JSON config (see the sketch after this list)
  • Exhaustive documentation, including command line examples and publication links
  • Version control and continuous integration for integrity checking
  • Scalable design (mini sets for rapid validation of large benchmarks)
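
As an illustration of the modular-code and single-config guideline above, a minimal sketch assuming PyYAML is available; the config keys and component names are hypothetical:

```python
import yaml  # PyYAML, assumed present in the reference-solution container

CONFIG = """
dataset:
  manifest: splits/manifest.json   # canonical train/val/test split
  batch_size: 256
model:
  name: gnn_surrogate
  hidden_dim: 128
training:
  epochs: 50
  lr: 1.0e-3
evaluation:
  metrics: [mse, mae]
"""

def main(cfg: dict) -> None:
    """Dispatch to the modular components (data loader, model, trainer, evaluator)."""
    print("loading splits from", cfg["dataset"]["manifest"])
    print("building model", cfg["model"]["name"])
    print("training for", cfg["training"]["epochs"], "epochs at lr", cfg["training"]["lr"])
    print("reporting metrics:", ", ".join(cfg["evaluation"]["metrics"]))

if __name__ == "__main__":
    main(yaml.safe_load(CONFIG))
```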

This suite of requirements enables the scientific ML community to leverage UltraDomain Benchmarks for standardized, scalable, and reproducible cross-domain studies, supporting reliable algorithm and system-level comparison across the full spectrum of scientific workloads (Hawks et al., 6 Nov 2025).
