Minimally-Biased Scientific Ontologies
- Minimally-biased scientific ontologies are structured representations that intentionally reduce subjective theoretical, linguistic, and cultural biases.
- They employ systematic frameworks like Just Enough Ontology Engineering, algebraic operations, and iterative expert feedback for transparent model refinement.
- Their design principles support reproducible data sharing and interdisciplinary collaboration, validated by benchmarking and diverse real-world applications.
A minimally-biased scientific ontology is an explicit, formally structured representation of a domain of scientific knowledge that intentionally minimizes the influence of particular theoretical, methodological, linguistic, or social biases. It achieves this through systematic methodology, broad stakeholder involvement, judicious selection of foundational models, implementation independence, transparent documentation, and architecture and process designs that resist unnecessary complexity and unexamined assumptions.
1. Sources and Taxonomy of Ontological Bias
Ontological bias arises when the structure, language, or constraints of an ontology reflect the subjective choices, perspectives, or limitations of its designers. Eight categories of bias are recognized (Keet, 2021):
- Philosophical bias: Selection of foundational or top-level ontologies (e.g., BFO, DOLCE) imposes explicit metaphysical commitments.
- Purpose bias: Tailoring to specific applications, such as compact data models (Pattern B) or literature annotation (Pattern C), privileges certain use cases over others.
- Science bias: Preference is given to particular scientific theories, as in the disputed “Virus ⊑ Organism” hierarchy.
- Granularity bias: Choices in modeling level of detail (aggregation/omission of subtypes) may be explicit or implicit.
- Linguistic bias: Terminology and language may favor a linguistic or cultural community.
- Socio-cultural bias: Assumptions about social organization may be built into relationships (e.g., modeling infection within a family only through a “spouse” relation).
- Political/religious bias: Ontological terms can reflect political, religious, or ideological slants (e.g., “terrorist organisation”).
- Economic bias: Classifications may serve economic agendas, such as defining obesity as a disease.
The main sources of these biases include foundational commitments, application-driven requirements, domain-specific conventions, and the composition and perspective of the developer team.
2. Methodologies and Frameworks for Bias Minimization
Modern efforts to reduce ontological bias employ a combination of process methodologies, stakeholder management, formalization, and tool support.
- “Just Enough” Ontology Engineering (JEOE) (Maio, 2011): Advocates a lightweight process with iterative refinement, explicitly balancing simplicity and completeness (minimal yet sufficient models), demanding early and continued stakeholder engagement, flexible scope boundary setting, and careful requirement elicitation. Implementation independence ensures ontological decisions are not prematurely constrained by specific formal languages (OWL, RDF, etc.).
- Componentization and Upper Ontologies (Glazunov, 2012, Daponte et al., 2021): Rigorously defining individuals, classes, relations, and axioms, and distinguishing between high-level (upper) ontologies and domain ontologies, makes assumptions transparent and supports reuse without importing hidden theoretical commitments.
- Algebraic Operations and Constraint Minimization (Casanova et al., 2018): Scientific ontologies treated as formal theories may be modularly constructed and refined through operations such as projection, union, intersection, deprecation, and difference. Minimization algorithms remove redundant or implicitly derivable constraints, yielding representations with explicit, necessary axioms only.
- Hybrid AI and Data-Driven Topic Discovery (Pisu et al., 6 Aug 2025, Kumar et al., 2023): These approaches reduce subjective pre-sorting by using pretrained language models (e.g., SciBERT) to classify semantic relations between extracted topics, while integrating statistical literature-based features (co-occurrence, subsumption). Automated topic modeling (BERTopic, UMAP, HDBSCAN) on large multidisciplinary corpora enables identification of conventional and unconventional topics without a priori domain boundaries.
- Literate Programming and Documentation-Driven Approaches (Lord et al., 2015): Simultaneous development of human-readable documentation and ontology code (“lenticular text”) ensures the rationale for modeling decisions is attached to their formalization, allowing later audit of subjective choices.
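The algebraic operations and constraint minimization described above can be illustrated with a deliberately small sketch. It assumes an ontology is reduced to a set of subclass axioms, written as (subclass, superclass) pairs, and that the only inference rule is transitivity; the formal setting in Casanova et al. (2018) is richer than this. All function names and example classes are hypothetical, and deprecation is omitted.

```python
from itertools import product

# Assumption: an ontology is just a set of (subclass, superclass) axioms.
Axiom = tuple[str, str]

def union(a: set[Axiom], b: set[Axiom]) -> set[Axiom]:
    return a | b

def intersection(a: set[Axiom], b: set[Axiom]) -> set[Axiom]:
    return a & b

def difference(a: set[Axiom], b: set[Axiom]) -> set[Axiom]:
    return a - b

def projection(axioms: set[Axiom], vocabulary: set[str]) -> set[Axiom]:
    """Keep only axioms whose terms all belong to the given vocabulary."""
    return {(s, t) for s, t in axioms if s in vocabulary and t in vocabulary}

def closure(axioms: set[Axiom]) -> set[Axiom]:
    """Transitive closure of the subclass relation."""
    closed = set(axioms)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(closed), repeat=2):
            if b == c and (a, d) not in closed:
                closed.add((a, d))
                changed = True
    return closed

def minimize(axioms: set[Axiom]) -> set[Axiom]:
    """Greedily drop every axiom that the remaining axioms already entail."""
    kept = set(axioms)
    for axiom in sorted(axioms):
        rest = kept - {axiom}
        if axiom in closure(rest):
            kept = rest
    return kept

# Two hypothetical module ontologies.
virology = {("Virus", "InfectiousAgent"), ("InfectiousAgent", "Entity"),
            ("Virus", "Entity")}
microbiology = {("Bacterium", "Organism"), ("Organism", "Entity"),
                ("InfectiousAgent", "Entity")}

merged = union(virology, microbiology)
shared = intersection(virology, microbiology)
minimal = minimize(merged)
```

Here `minimal` omits ("Virus", "Entity") because that axiom follows from the other two virology axioms, while the entailed subclass relations stay the same. The greedy pass depends on iteration order when several minimal sets exist, which is why the axioms are sorted first.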
3. Architectural and Technical Design Principles
Several concrete design choices further support bias minimization:
- Separation of Concerns: For instance, in the Research Object (RO) suite (Belhajjame et al., 2014), core container ontologies handle resource aggregation separately from annotation and provenance, allowing for rich metadata to describe the full scientific context while keeping the conceptual model domain-neutral.
- Flexible, Extensible Layers: The SSBD Ontology (Yamagata et al., 4 Aug 2025) adopts a two-tier architecture: a core layer (class-centric, referencing external biomedical ontologies such as GO, CL, UBERON) for minimum publishable metadata, and an instance layer for dataset-specific RDF individuals. Such separation allows rapid sharing without loss of future annotation fidelity.
- Standardized, Community-Curated Controlled Vocabularies: Integrating with established ontologies (supported by OBO Foundry, Planteome, AgroPortal, etc. (Dumschott et al., 2023)) ensures broader, peer-reviewed consensus and reduces individualistic or institution-centric bias.
- Implementation Independence: Conceptual models are constructed and validated absent commitment to any one formalism. For example, the mapping from domain axioms to rule-based representations (as in JEOE) is deferred until the late stages, enabling greater generality and reuse (Maio, 2011).
- Explicit Priority and Provisionality Annotations: Assigning numeric priority to classes or properties and using “potential” annotations for property domain/range (e.g., potentialDomain, potentialRange) in RDF (Fabbri, 2017) allows the ontology to carry information about which components are foundational, peripheral, or illustrative, clarifying which aspects are core scientific consensus and which are more tentative.
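As a rough illustration of priority and provisionality annotations, the sketch below models them with plain Python data classes rather than RDF. The field names, the example properties, and the convention that priority 1 means "foundational" are all assumptions made for this example; they are not taken from Fabbri (2017).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PropertySpec:
    name: str
    priority: int  # assumption: 1 = foundational; larger numbers = more peripheral
    domain: frozenset = frozenset()            # asserted domain classes
    potential_domain: frozenset = frozenset()  # tentative domain classes

def core_view(properties, max_priority):
    """Return only the properties whose priority is at most `max_priority`."""
    return [p for p in properties if p.priority <= max_priority]

def usage_status(prop, subject_class):
    """Classify a use of `prop` on an instance of `subject_class`."""
    if subject_class in prop.domain:
        return "asserted"
    if subject_class in prop.potential_domain:
        return "provisional"
    return "unsupported"

# Hypothetical properties for a dataset-description vocabulary.
properties = [
    PropertySpec("hasSpecimen", priority=1, domain=frozenset({"Dataset"})),
    PropertySpec("hasImagingMethod", priority=1, domain=frozenset({"Dataset"}),
                 potential_domain=frozenset({"Simulation"})),
    PropertySpec("hasFundingSource", priority=3,
                 potential_domain=frozenset({"Dataset"})),
]

core_names = [p.name for p in core_view(properties, max_priority=1)]
status = usage_status(properties[1], "Simulation")
```

Keeping the tentative domain separate from the asserted one lets a consumer accept a provisional use without treating it as settled consensus, and the priority filter yields a smaller core view for users who want only the foundational components.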
4. Process, Validation, and Iterative Refinement
- Broad, Interdisciplinary Stakeholder Involvement: Early and continuous engagement of domain experts, end-users, methodologists, and policy-makers ensures diverse perspectives are represented and that the ontology’s design choices are subject to scrutiny and feedback (Maio, 2011, Estañol et al., 2017).
- Iterative Expert Feedback and Reasoner-Based Testing: As in the obesity case study (Estañol et al., 2017), ontologies are iteratively refined by formal consistency testing (e.g., using Protégé, theorem provers) and by repeated rounds of domain expert review, supporting incremental reduction of both logical and representational bias.
- Automated Selection and Evaluation in Ontology Mapping: Recent advances employ fine-tuned transformers (BERT variants) and rigorous multi-model statistical evaluation to map texts to appropriate ontologies, with cross-validation to detect residual bias in classifier predictions (Korel et al., 2023).
- Automated Enrichment Using LLMs in Low-Resource Domains: For domains that lack curated ontologies, LLMs can generate definitions for automatically extracted concepts and establish semantic relations from co-occurrence, supporting bootstrapping of a structured, minimally biased knowledge base from limited data (Brinner et al., 27 Mar 2025).
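The reasoner-based testing step can be approximated, for illustration only, by a toy consistency check over subclass and disjointness axioms. A real description-logic reasoner handles far more than this; the sketch just flags classes that inherit from two classes declared disjoint. The class names, including "AcellularEntity", are hypothetical.

```python
def superclasses(cls, subclass_of):
    """All classes reachable from `cls` via subclass axioms, including `cls`."""
    seen, stack = {cls}, [cls]
    while stack:
        current = stack.pop()
        for parent in subclass_of.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def unsatisfiable_classes(subclass_of, disjoint_pairs):
    """Map each class to the disjointness axioms its ancestors violate."""
    classes = set(subclass_of) | {p for ps in subclass_of.values() for p in ps}
    problems = {}
    for cls in sorted(classes):
        ancestors = superclasses(cls, subclass_of)
        clashes = [(a, b) for a, b in disjoint_pairs
                   if a in ancestors and b in ancestors]
        if clashes:
            problems[cls] = clashes
    return problems

# Round 1: a draft that places Virus under two disjoint classes.
draft = {
    "Virus": {"Organism", "AcellularEntity"},
    "Bacterium": {"Organism"},
}
disjoint = [("Organism", "AcellularEntity")]
report = unsatisfiable_classes(draft, disjoint)

# Round 2: a revision after (hypothetical) expert review drops one axiom.
revised = {**draft, "Virus": {"AcellularEntity"}}
revised_report = unsatisfiable_classes(revised, disjoint)
```

The first report flags "Virus", and the second is empty. The check only detects the clash; choosing which axiom to retract remains a modeling decision for the domain experts.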
5. Empirical and Benchmark-Driven Justification
- Quantitative Benchmarking: Approaches are validated using custom, domain-relevant benchmarks (e.g., for invasion biology (Brinner et al., 27 Mar 2025)) with multiple tasks and aggregate F1 scores, NDCG, or span-level metrics to evaluate both accuracy and generalizability of ontological models.
- Practical Application and Integration: Real-world use cases—from COVID-19 ontology audits (Keet, 2021), to FAIR data annotation in plant research (Dumschott et al., 2023), to interoperability in bioimaging metadata frameworks (Yamagata et al., 4 Aug 2025)—demonstrate that minimally-biased ontologies enable effective data sharing, semantic search, and interdisciplinary collaboration.
- Tooling and Ecosystem Support: Adoption is facilitated by tools such as the Research Object Manager and RODL (Belhajjame et al., 2014), Protégé plugins for algebraic manipulation (Casanova et al., 2018), and semantic web technical stacks (OWL, RDF, SHACL). By adhering to open standards and integrating with repositories, ontologies maximize reusability and minimize insular or siloed representations.
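For readers unfamiliar with the metrics named above, the following sketch computes set-based F1, a macro average over tasks, and NDCG using their standard definitions. The function names and the choice of linear gain in the DCG formula are assumptions made for this example; the cited benchmarks may compute these scores differently.

```python
import math

def f1_score(gold: set, predicted: set) -> float:
    """Harmonic mean of precision and recall over two sets of items."""
    true_positives = len(gold & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)

def macro_f1(tasks) -> float:
    """Unweighted mean of F1 over (gold, predicted) pairs, one per task."""
    return sum(f1_score(gold, pred) for gold, pred in tasks) / len(tasks)

def dcg(relevances) -> float:
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(relevances) -> float:
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Hypothetical gold and predicted concept sets for one task.
gold = {"invasive species", "propagule pressure", "enemy release"}
predicted = {"propagule pressure", "enemy release", "climate change"}
score = f1_score(gold, predicted)

# Hypothetical graded relevance of a ranked list of candidate ontology terms.
ranking_quality = ndcg([1, 3, 2])
```

In this example two of the three predictions are correct and two of the three gold items are found, so precision, recall, and F1 are all 2/3. The ranking places the most relevant item second, so its NDCG falls below 1.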
6. Philosophical Foundations and Participatory Models
A final, critical foundation for minimizing bias is explicit engagement with philosophical ontology, moving beyond the extremes of “purely scientific” or “purely philosophical” approaches. The participation model (Merrill, 2019) advocates for active, direct collaboration of philosophers with scientists—bringing logical analysis, conceptual rigor, attention to ambiguity, and meta-ontological tools to practical ontology design. This approach supports:
- Principle-driven integration of semantic structures,
- Critical analysis of model overlap, synonymy, and mapping issues,
- Methodological pluralism and explicit meta-data on design commitments.
This cooperative stance supports robust, flexible models suited for complex, evolving scientific domains, while maintaining transparency of underlying assumptions and a pathway for critical review and revision.
By combining formalization, broad stakeholder engagement, iterative expert validation, architecture that separates concerns and preserves implementation independence, and strategies such as algebraic minimization and explicit annotation of provisional components, modern scientific ontologies can substantially limit, and make transparent, the biases inherent in representing complex, multifaceted domains of knowledge. The recent research cited here provides a methodological and technical blueprint for this ongoing endeavor.