The AnIML Ontology: Enabling Semantic Interoperability for Large-Scale Experimental Data in Interconnected Scientific Labs

Published 2 Apr 2026 in cs.AI | (2604.01728v1)

Abstract: Achieving semantic interoperability across heterogeneous experimental data systems remains a major barrier to data-driven scientific discovery. The Analytical Information Markup Language (AnIML), a flexible XML-based standard for analytical chemistry and biology, is increasingly used in industrial R&D labs for managing and exchanging experimental data. However, the expressivity of the XML schema permits divergent interpretations across stakeholders, introducing inconsistencies that undermine the interoperability the AnIML schema was designed to support. In this paper, we present the AnIML Ontology, an OWL 2 ontology that formalises the semantics of AnIML and aligns it with the Allotrope Data Format to support future cross-system and cross-lab interoperability. The ontology was developed using an expert-in-the-loop approach combining LLM-assisted requirement elicitation with collaborative ontology engineering. We validate the ontology through a multi-layered approach: data-driven transformation of real-world AnIML files into knowledge graphs, competency question verification via SPARQL, and a novel validation protocol based on adversarial negative competency questions mapped to established ontological anti-patterns and enforced via SHACL constraints.

Abstract PDF Upgrade to Chat

Authors (10)

Summary

The paper introduces a novel expert-in-the-loop approach using LLM-assisted competency question extraction to formalize and validate experimental data semantics.
It presents a modular ontology framework with core and technique modules that ensure scalable, robust integration of heterogeneous experimental data.
The evaluation, featuring SPARQL queries and SHACL negative tests, demonstrates its effectiveness in mitigating semantic drift and data inconsistencies.

The AnIML Ontology: Semantic Interoperability for Experimental Data in Scientific Labs

Motivation and Problem Statement

The proliferation of heterogeneous scientific data sources and distributed experimental systems has rendered semantic interoperability a critical barrier in computational science. The Analytical Information Markup Language (AnIML) emerged as a de facto XML-based standard for encoding analytical chemistry and biology data. However, the flexibility of the XML schema has proven problematic: divergent practices across stakeholders induce inconsistencies, which fundamentally undermine reliable data integration and automated reasoning. Existing approaches to ontology extraction from XML for such domains fail to formalize the semantics of reusable core entities and context-dependent data linking.

Ontology Engineering Methodology

To overcome the ambiguity inherent in AnIML's schema and enable robust, cross-system interoperability, the AnIML Ontology leverages an expert-in-the-loop engineering pipeline. The process synergistically combines LLM-assisted requirements elicitation with collaborative ontology development:

LLM-Assisted Competency Question (CQ) Extraction: The schema and associated documentation are processed using Gemini 2.5 Pro, generating candidate CQs that reflect intended data modeling requirements.
Expert Validation and Iteration: Domain experts validate, refine, and accept/reject CQs, with rejected items triggering prompt-based guided reformulation cycles.
Ontology Design Patterns (ODPs): The ontology heavily reuses and extends established ODPs (e.g., Set, Sequence, Situation), enabling modularity and compatibility with foundational ontologies such as BFO and DOLCE.
Graphical Modeling: Graffoo notation supports interdisciplinary consensus in module definition and relationship scoping.

Through this approach, over 100 expert-validated CQs were formalized, ensuring precise coverage of key experimental data abstractions.

Ontology Structure and Content

The AnIML Ontology (v1.1.0, OWL 2) is partitioned into two main modules—core and technique—and introduces the AnIML Reference Pattern for explicit linked data modeling.

Core Module

The core encapsulates structural aspects of experimental records:

aml:Document aggregates experimental workflows, sample sets, provenance/audit trail, and signatures.
Sample Management: Employs the Set ODP for representing complex, hierarchical sample/container arrangements, and the Situation pattern to handle physical and logical groupings.
Data Containers: The aml:Category pattern uniformly annotates experimental parameters (aml:Parameter) and multimodal data series (aml:Series), supporting consistent extensibility.
Experiment Workflow: The Sequence ODP structures aml:Experiment and its ordered aml:ExperimentStep constituents, capturing methods, responsible agents, infrastructure, and results.
Provenance and Compliance: A versioned audit trail (aml:AuditTrail) rigorously tracks atomic changes, document states, and agent attributions, mapped explicitly within the class hierarchy.
Figure 1: The AnIML technique module, modeling blueprint definitions, constraint imposition, and metadata for methods in the ontology.

Technique Module

Technique-specific definitions (i.e., ATDDs in AnIML) are handled as formal blueprints:

aml:Technique details the analytical method, versioning, modular hierarchical inheritance, and extensibility via aml:extends.
Specification Classes: Fine-grained control of data types, units, array structures, and visualization constraints is provided by a suite of Specification subclasses (aml:ParameterSpecification, aml:SeriesSpecification, aml:DataPointSpecification).
Encoding Semantics: Explicit modeling of data encoding, binary representations, and dependencies supports instrument-agnostic interpretation and visualization.

AnIML Reference Pattern

XML’s ID/IDREF mechanism for indirect references is insufficient for semantic reasoning. The ontology introduces the AnIML Reference Pattern as an ODP to explicitly model references as first-class entities (aml:AnimlReference):

Instances link with aml:pointsTo, carrying context via aml:hasFunction (role semantics) and aml:startValue/aml:endValue (array slicing semantics).
Reference sets (aml:AnimlReferenceSet) are strictly scoped to a subject container, ensuring contextually-valid linking.
Figure 2: The AnIML reference pattern, elevating ID/IDREF-style links into explicit, context-rich, first-class semantic entities.

Ontology Alignment and Interoperability

Cross-standard interoperability is operationalized via mappings to the Allotrope Foundation Ontologies (AFO), leveraging state-of-the-art ontology alignment systems (LogMap, BERTMap, DeepOnto, OntoAligner):

Scalable Coverage: Instead of exhaustive class-level equivalence, the focus is on pragmatic alignments underpinning experimental data integration.
Curation and Relaxation: Automatically discovered equivalences are curated with manual SKOS redefinitions for semantically looser correspondences.
Numerical Evaluation: Out of 176 candidate alignments, 121 were certified as strict equivalences and 20 as SKOS matches, producing a curated set of 47 unique crosswalks.

Evaluation and Validation Strategy

Evaluation rigorously checks both representational adequacy and unintended model exclusion.

Structural/Logical Verification: OOPS! is applied for anomaly detection; model refinements guarantee axiomatic consistency.
Knowledge Graph Instantiation: Industrial AnIML files (anonymized) are transformed to test instance data fidelity; successful AFO-aligned population demonstrates practical integrability.
CQ Formulation (Positive): 40 CQs are translated into SPARQL queries confirming the ontology’s ability to answer core domain requirements.
Adversarial Negative CQs: A notable methodological advance is the generation of 20 SHACL-enforced negative CQs, targeting empirical anti-pattern categories (e.g., role/type incompatibility, scope misalignment, cycle violations). Each negative requirement includes designed test datasets, mechanistically validating that the ontology excludes undesired states—an approach rarely systematized in ontology engineering for experimental science.

Implications and Future Directions

The AnIML Ontology delivers a formal, extensible, and modular framework for transforming large, heterogeneous scientific data silos into knowledge graphs enabling semantic interoperability. The design promotes direct mapping from legacy AnIML XML, supporting both rapid adoption in industrial environments and rigorous constraint validation. The explicit modeling of references and provenance is essential in automated scientific workflows and compliance-oriented institutional contexts.

On the theoretical side, the application of LLMs as requirement elicitors within iterative expert-validation cycles presents a template for ontology engineering in data-rich, documentation-heavy domains. The adversarial CQ protocol sets a new standard for validation—directly targeting real sources of semantic drift and unintended modeling.

Future developments will prioritize richer integration with established ontologies for processes, agents, units, and provenance, extend crosswalks to OBO/BioPortal standards, and expand validation as new use cases and stakeholder requirements emerge. This work exemplifies how semantic technologies—when paired with expert/AI-in-the-loop engineering and rigorous model testing—can support next-generation autonomous scientific laboratories and data-driven discovery.

Conclusion

The AnIML Ontology sets a technical foundation for scalable, robust semantic interoperability of experimental data in modern scientific laboratories. It systematically addresses the shortcomings of XML-based syntactic standards, operationalizes advanced ontology alignment and validation strategies, and directly serves the interoperability needs of both industrial and academic research environments.

Markdown Report Issue