
MLCommons Science Benchmarks Ontology

Updated 13 November 2025
  • MLCommons Science Benchmarks Ontology is a community-driven framework that standardizes machine learning benchmarks across scientific domains.
  • It employs an OWL/RDF-based hierarchical taxonomy and a six-category rubric to ensure transparent evaluation and reproducible results.
  • The framework supports federated querying and open submissions, fostering extensibility and collaborative enhancements in emerging ML motifs.

The MLCommons Science Benchmarks Ontology is a standardized, community-driven framework for the organization, evaluation, and federated querying of machine learning benchmarks across diverse scientific domains. Developed to unify previously siloed initiatives such as XAI-BENCH, FastML Science Benchmarks, PDEBench, and SciMLBench, the ontology enforces rigorous classification, transparent rating, and reproducible workflows for cross-domain scientific benchmarking within the MLCommons ecosystem. Its scope encompasses physics, chemistry, materials science, biology, climate science, and earth sciences, providing a scalable foundation for both present and emerging scientific and computing motifs.

1. Formal Specification and Ontological Structure

At its core, the ontology is expressed using OWL/RDF, with a TBox capturing the essential class hierarchy, object properties, and metadata relations. The primary ontological concepts are:

  • Benchmark Classes:
    • $\mathit{Benchmark}$: base OWL class.
    • $\mathit{ScientificBenchmark} \sqsubseteq \mathit{Benchmark}$
    • $\mathit{ApplicationLevelBenchmark} \sqsubseteq \mathit{ScientificBenchmark}$
    • $\mathit{SystemLevelBenchmark} \sqsubseteq \mathit{ScientificBenchmark}$
    • Application- and system-level benchmarks are disjoint: $\mathit{ApplicationLevelBenchmark} \sqcap \mathit{SystemLevelBenchmark} \equiv \bot$.
  • Principal Object Properties:
    • $\mathit{hasDomain}(\mathit{Benchmark}, \mathit{ScientificDomain})$
    • $\mathit{hasAIMotif}(\mathit{Benchmark}, \mathit{AIMotif})$
    • $\mathit{evaluatedBy}(\mathit{Benchmark}, \mathit{Framework})$
    • $\mathit{extendsFramework}(\mathit{ScientificBenchmark}, \mathit{Framework})$
    • $\mathit{hasReferenceSolution}(\mathit{Benchmark}, \mathit{ReferenceSolution})$
    • $\mathit{hasPerformanceMetric}(\mathit{Benchmark}, \mathit{PerformanceMetric})$
  • Metadata Fields (Datatype Properties):
    • $\mathit{benchmarkName}$, $\mathit{description}$, $\mathit{version}$ (all strings)
    • $\mathit{submissionDate}$ (xsd:dateTime)

Example OWL/RDF Individual:

@prefix : <http://mlcommons.org/ontology#> .   # default prefix assumed to match the mc: namespace used in the SPARQL queries below

:PDEBench a :ScientificBenchmark ;
  :benchmarkName "PDEBench" ;
  :hasDomain :ComputationalScience ;
  :hasAIMotif :SurrogateModeling ;
  :hasPerformanceMetric :L2Error ;
  :hasReferenceSolution :FNOBaseline ;
  :evaluatedBy :SciMLBenchFramework .
This formalism allows inferencing (e.g., any $\mathit{ApplicationLevelBenchmark}$ is also a $\mathit{ScientificBenchmark}$), federated querying, and seamless interoperability with other MLCommons ontologies.
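
For concreteness, the corresponding TBox axioms can be written directly in Turtle. The snippet below is a minimal sketch, assuming the mc: namespace used in the SPARQL examples of Section 5; the domain/range assertions simply transcribe the property signatures listed above and are not normative for the published ontology.

@prefix mc:   <http://mlcommons.org/ontology#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

# Class hierarchy and the application/system disjointness axiom
mc:Benchmark                 a owl:Class .
mc:ScientificBenchmark       a owl:Class ; rdfs:subClassOf mc:Benchmark .
mc:ApplicationLevelBenchmark a owl:Class ; rdfs:subClassOf mc:ScientificBenchmark ;
                             owl:disjointWith mc:SystemLevelBenchmark .
mc:SystemLevelBenchmark      a owl:Class ; rdfs:subClassOf mc:ScientificBenchmark .

# Object properties, transcribing the signatures listed above
mc:hasDomain            a owl:ObjectProperty ; rdfs:domain mc:Benchmark ;           rdfs:range mc:ScientificDomain .
mc:hasAIMotif           a owl:ObjectProperty ; rdfs:domain mc:Benchmark ;           rdfs:range mc:AIMotif .
mc:evaluatedBy          a owl:ObjectProperty ; rdfs:domain mc:Benchmark ;           rdfs:range mc:Framework .
mc:extendsFramework     a owl:ObjectProperty ; rdfs:domain mc:ScientificBenchmark ; rdfs:range mc:Framework .
mc:hasReferenceSolution a owl:ObjectProperty ; rdfs:domain mc:Benchmark ;           rdfs:range mc:ReferenceSolution .
mc:hasPerformanceMetric a owl:ObjectProperty ; rdfs:domain mc:Benchmark ;           rdfs:range mc:PerformanceMetric .

# Datatype properties (metadata fields)
mc:benchmarkName  a owl:DatatypeProperty ; rdfs:range xsd:string .
mc:description    a owl:DatatypeProperty ; rdfs:range xsd:string .
mc:version        a owl:DatatypeProperty ; rdfs:range xsd:string .
mc:submissionDate a owl:DatatypeProperty ; rdfs:range xsd:dateTime .

Loading this TBox together with individuals such as :PDEBench in any OWL reasoner yields the subsumption inference described above: every application-level benchmark is classified as a scientific benchmark.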

2. Hierarchical Taxonomy and Classification Scheme

Benchmarks are classified along three orthogonal axes:

  • Domain
  • Data Modality
  • Level (Application or System)

This is formally structured in a nested taxonomy:

  • ScientificBenchmark
    • Domain
      • Physics
        • High-EnergyPhysics
          • ApplicationLevelBenchmark
          • SystemLevelBenchmark
        • ClimateScience
          • ApplicationLevelBenchmark
          • SystemLevelBenchmark
      • Chemistry [...]
      • Biology [...]
      • MaterialsScience [...]
      • EarthScience [...]
    • DataModality
      • SimulationData
      • ExperimentalData
      • ObservationalData
      • Multimodal
    • Level
      • ApplicationLevelBenchmark
      • SystemLevelBenchmark
Each benchmark is thus indexed by domain, motif, and level, supporting compositional filtering and extensible tagging. The scheme’s extensibility directly supports emerging motifs (e.g., “Quantum-ML”, “Astroinformatics”) by the addition of new individuals.
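
As an illustration of compositional filtering across these axes, a SPARQL query could combine level (via class membership), domain, and data modality. The sketch below assumes a property named mc:hasDataModality for the DataModality axis; that predicate is not part of the object-property inventory in Section 1, so its name is hypothetical.

PREFIX mc: <http://mlcommons.org/ontology#>
SELECT ?b ?name WHERE {
  ?b a mc:ApplicationLevelBenchmark ;          # Level axis, via class membership
     mc:hasDomain mc:MaterialsScience ;        # Domain axis
     mc:hasDataModality mc:ExperimentalData ;  # Modality axis (assumed predicate name)
     mc:benchmarkName ?name .
}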

3. Six-Category Rating Rubric

Candidate benchmarks are systematically evaluated against a six-axis rubric, with each axis scored from 0 to 5 points:

  1. Novelty: Assesses introduction of new scientific domain/sub-domain, motif, data modality, first combined motif, and motivation relative to prior art.
  2. Reproducibility: Merges software environment and reference solution requirements, including public code, containerization, zero-touch execution, complete pipelines, and versioned dependencies.
  3. Diversity: Considers canonical splits, scientific sub-domain coverage, difficulty scaling, input modalities, and parametric task variants.
  4. Performance Relevance: Aggregates MetricDefinitionScore (0–3) and MetricQualityScore (0–2):

$r_{\text{perf}} = \text{MetricDefinitionScore} + \text{MetricQualityScore}$

  5. Data Quality: Encodes FAIR principles (findable, accessible, interoperable, reusable) and explicit data splits.
  6. Extensibility: Appraises submission templating, ontology anchoring, API/versioning, plugin capability, and community governance.

The overall rubric score:

$S = \sum_{i=1}^{6} w_i\, r_i, \quad w_i = \tfrac{1}{6}, \quad S \in [0, 5]$

Benchmarks achieving $S \geq 4.5$ qualify for "MLCommons-endorsed" status (the mlcommons:Endorsed tag), receiving portal promotion and formal recognition.
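
For illustration (with hypothetical ratings, not drawn from any real submission), a candidate scored $r = (5, 4, 5, 4, 5, 4)$ across the six axes would receive

$S = \tfrac{1}{6}(5 + 4 + 5 + 4 + 5 + 4) = \tfrac{27}{6} = 4.5,$

just reaching the endorsement threshold; losing a single point on any axis would leave it below the cutoff.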

4. Open Submission and Governance Workflow

Community contributions follow a standardized submission and review protocol:

  • Validation: Input metadata validated against schema (JSON-Schema).
  • Scoring: Rubric scoring computed automatically.
  • Provisional Review: Submissions with $S < 3.0$ are rejected for insufficient quality.
  • Working Group Review: MLCommons Science WG conducts expert panel evaluation.
  • Publication: Approved benchmarks are incorporated into the ontology, tagged if endorsed, and publicized accordingly.

Submission Data Model Example (abridged):

Field          Type             Validation Rule
id             string (UUID)    required, unique
title          string           non-empty
description    string           non-empty
submitter      {name, email}    valid email
domains        [uri]            must be ScientificDomain
aimotif        uri              must be AIMotif
level          enum             {Application, System}
metricDefs     array            must have name and formula
rubricScores   [0..5]           length = 6, float

This process is designed for transparency and reproducibility, leveraging a public submission template and web UI for live score feedback and iteration.
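
Once rubric scores are recorded in the ontology via the mc:rubricScore property used in Section 5, the numeric gates in this workflow can be checked with a simple query. The following is a sketch of the endorsement gate, not the published scoring service; an analogous FILTER(?S < 3.0) would flag submissions for provisional rejection.

PREFIX mc: <http://mlcommons.org/ontology#>
# Candidates clearing the endorsement threshold (S >= 4.5)
SELECT ?b ?name ?S WHERE {
  ?b a mc:ScientificBenchmark ;
     mc:benchmarkName ?name ;
     mc:rubricScore ?S .
  FILTER(?S >= 4.5)
}
ORDER BY DESC(?S)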

5. Querying and Real-World Use Cases

The OWL/RDF architecture enables advanced federated querying via SPARQL. Examples include:

  • Endorsed Climate-Science Regression Benchmarks:

PREFIX mc: <http://mlcommons.org/ontology#>
SELECT ?b ?name WHERE {
  ?b a mc:ScientificBenchmark ;
     mc:hasDomain mc:ClimateScience ;
     mc:hasAIMotif mc:Regression ;
     mc:Endorsed true ;
     mc:benchmarkName ?name .
}

  • System-Level Benchmarks Extending SciMLBench:

PREFIX mc: <http://mlcommons.org/ontology#>
SELECT ?b ?mDef WHERE {
  ?b a mc:SystemLevelBenchmark ;
     mc:extendsFramework mc:SciMLBenchFramework ;
     mc:hasPerformanceMetric ?m .
  ?m mc:metricFormula ?mDef .
}

  • Ranking Candidates by Rubric Score:

PREFIX mc: <http://mlcommons.org/ontology#>
SELECT ?b ?S WHERE {
  ?b a mc:Benchmark ;
     mc:rubricScore ?S .
}
ORDER BY DESC(?S)
LIMIT 10

In routine practice, systems researchers filter for system-level benchmarks and latency-sensitive motifs, while domain scientists retrieve benchmarks by scientific domain and specific metrics such as accuracy. This suggests the ontology’s multifaceted filtering supports both technical benchmarking and scientific inquiry needs.

6. Evolution, Extensibility, and Community Practices

The ontology’s extensible structure accommodates the organic growth of scientific and AI/ML motifs. Novel motifs (e.g., “Neuromorphic”, “EdgeML”) are instantiated by adding individuals to the relevant classes and tagging them in new submissions, governed by ongoing MLCommons oversight. A plausible implication is sustained compatibility with emerging ML and computing paradigms.
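
As a concrete (and purely illustrative) sketch, registering a new motif amounts to asserting one new individual and tagging subsequent submissions against it; both IRIs below are hypothetical placeholders that the working group would fix during review.

@prefix mc: <http://mlcommons.org/ontology#> .

# Hypothetical new motif individual under the AIMotif class
mc:Neuromorphic a mc:AIMotif .

# A later submission can then be tagged against it directly
mc:SpikingVisionBench a mc:ApplicationLevelBenchmark ;   # hypothetical benchmark
    mc:benchmarkName "SpikingVisionBench" ;
    mc:hasDomain mc:Biology ;
    mc:hasAIMotif mc:Neuromorphic .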

Best practices recommend use of the published schema, thorough rubric self-evaluation, and engagement with community governance for roadmap and plugin extensions. Endorsement is algorithmic, promoting only highly rated, reproducible, FAIR, and extensible benchmarks for adoption.

7. Context and Significance

The MLCommons Science Benchmarks Ontology responds to fragmentation in scientific ML benchmarking, offering an authoritative mechanism for standardization, cross-spectrum analysis, and reproducible research impact. It bridges disciplinary silos, ties frameworks to shared semantic anchors, and ensures transparent governance. Its adoption underpins the prioritization of high-quality benchmarks and facilitates the identification of emergent computing patterns and motifs unique to scientific workloads. As such, the ontology serves as a reference architecture and living standard underpinning reproducible, scalable, and impactful ML benchmarking across the sciences.
