
Unified Taxonomy of Scientific Data

Updated 3 September 2025
  • Unified taxonomies of scientific data are comprehensive frameworks that classify, organize, and interlink diverse research outputs through citation analysis and semantic modeling.
  • They employ methods like direct citation, bibliographic coupling, and hierarchical random branching to achieve high topical concentration and improved semantic granularity.
  • These taxonomies enable robust cross-domain interoperability and streamlined research evaluation, and they integrate theory-guided data science with scalable, multimodal representations.

A unified taxonomy of scientific data refers to a comprehensive framework for classifying, organizing, and linking the myriad forms of scientific information and knowledge structures, with the goal of enabling robust cross-domain interoperability, evaluation, and automated discovery. Such taxonomies draw on principles from citation analysis, hierarchical modeling, ontological engineering, statistical theory, and the evolving requirements of large-scale artificial intelligence systems. They are essential to research evaluation, data integration, semantic search, and the coevolution of scientific AI models and their knowledge substrate.

1. Foundations and Structural Principles

Unified scientific data taxonomies rest on a combination of document-centric linkage schemes, hierarchical organization, and formal semantic modeling. Foundational approaches distinguish between topic-level (document-centric) taxonomies and higher-level, discipline-based (journal or domain schema) classifications. Methodological paradigms include:

  • Direct Citation (DC): A document is linked directly to its references, forming first-order connections that inductively build the cumulative, historical record of science. DC taxonomies excel in capturing the integrated evolution of research topics and in producing concentrated, accurate thematic clusters.
  • Bibliographic Coupling (BC): Connects documents sharing overlapping references, providing a snapshot of the current research front, but can obscure underlying historical context.
  • Co-citation (CC): Groups papers frequently cited together, often reflecting retrospective associations as perceived by later authors, but tending to diffuse topical coherence.
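To make the three linkage schemes concrete, here is a minimal Python sketch that derives DC edges, BC weights, and CC weights from a toy map of papers to their references; the data and variable names are purely illustrative.

```python
from itertools import combinations
from collections import defaultdict

# Toy citation records (hypothetical): paper id -> set of cited reference ids.
references = {
    "p1": {"r1", "r2", "r3"},
    "p2": {"r2", "r3", "r4"},
    "p3": {"r1", "r5"},
}

# Direct citation (DC): a first-order edge from each paper to each reference it cites.
dc_edges = {(paper, ref) for paper, refs in references.items() for ref in refs}

# Bibliographic coupling (BC): two papers are linked by how many references they share.
bc_weight = {
    (a, b): len(references[a] & references[b])
    for a, b in combinations(references, 2)
    if references[a] & references[b]
}

# Co-citation (CC): two references are linked by how many later papers cite them together.
cc_weight = defaultdict(int)
for refs in references.values():
    for a, b in combinations(sorted(refs), 2):
        cc_weight[(a, b)] += 1

print(sorted(dc_edges))
print(bc_weight)        # e.g. ("p1", "p2") -> 2 shared references
print(dict(cc_weight))  # e.g. ("r2", "r3") -> cited together by 2 papers
```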

Topic-level (document-based) structures, especially those built via direct citation and optimized via modularity-based clustering algorithms (such as the CWTS smart local moving algorithm), outperform discipline-level (journal cluster) approaches in accuracy and semantic granularity (Klavans et al., 2015).

Hierarchical taxonomies, as characterized in abundance and null distribution studies, can be modeled via non-parametric random branching processes, where the only parameters are the total number of items $n$ and the observed number of nonempty categories $k$ (D'Amico et al., 2016). Such models statistically unify the universal organization of scientific data regardless of specific disciplinary context.

2. Evaluation Metrics and Empirical Accuracy

Accuracy in taxonomy construction is operationalized by evaluating the topical concentration of reference distributions, especially in synthesis or review papers considered as "gold standards" (with at least 100 references). The principal quantitative measure is the Herfindahl index:

$$H_p = \sum_j s_j^2, \qquad s_j = \frac{n_j}{N_p}$$

  • $n_j$: number of references from paper $p$ assigned to cluster $j$
  • $N_p$: total indexed gold-standard references of paper $p$

The mean $H_i$ over all $P$ gold-standard papers (where $i$ indexes the taxonomy under evaluation) is then:

$$H_i = \frac{1}{P}\sum_p H_p$$

Empirical results show that direct citation–based clustering achieves substantially higher reference concentration (e.g., 86.1% of gold-standard papers better concentrated under DC than any other method; best BC only reached 13.7%) (Klavans et al., 2015). Journal-based schemas were markedly inferior, in some cases less concentrated than CC itself.
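As a concrete illustration of the metric, the following sketch computes $H_p$ for hypothetical gold-standard papers and averages them into $H_i$; the cluster assignments are invented for the example.

```python
from collections import Counter

def herfindahl(ref_clusters):
    """H_p = sum_j s_j^2, where s_j is the share of a paper's references in cluster j."""
    counts = Counter(ref_clusters)
    n_total = sum(counts.values())
    return sum((n_j / n_total) ** 2 for n_j in counts.values())

# Hypothetical gold-standard review papers: each maps to the cluster labels assigned
# to its references under some taxonomy i (labels are invented for the example).
gold_papers = {
    "review_a": ["c1"] * 80 + ["c2"] * 15 + ["c3"] * 5,   # highly concentrated
    "review_b": ["c1"] * 40 + ["c4"] * 35 + ["c5"] * 25,  # more dispersed
}

h_per_paper = {p: herfindahl(clusters) for p, clusters in gold_papers.items()}
h_i = sum(h_per_paper.values()) / len(h_per_paper)  # mean over the P gold-standard papers
print(h_per_paper, round(h_i, 3))
```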

This metric-driven, document-based taxonomy directly informs science policy, research evaluation, and resource allocation, underscoring the risks of relying on coarse, discipline-level classifications.

3. Statistical and Generative Models of Hierarchy

A universal, non-parametric framework for hierarchical taxonomy—the random binary branching process—yields predictions of category abundances, missing (unrepresented) categories, and statistical variance in category size distributions (D'Amico et al., 2016). The key generative mechanism is:

  • Iterative binary splitting of a root category until $q$ leaves are obtained
  • Category probability as a function of tree depth $b$:

$$p \sim 2^{-b}$$

with random smearing to account for distributional variability

  • Probability distribution for tree depth:

$$P(b, q) = 2^{-b}\,\frac{C(q-1, b)}{\Gamma(q+1)}$$

where $C(q-1, b)$ is the unsigned Stirling number of the first kind.

This process, when paired with multinomial sampling of $n$ items, reproduces properties observed across real-world taxonomies: few categories with high counts and many with low counts, with variance approximated by

$$\sigma^2_{\ln p} \simeq (\ln 2)^2\,(-3.4 + 2 \ln q)$$

A notable implication is the model's predictive power for estimating the number of unrepresented categories in finite samples, with applications in survey completeness and diversity estimation.
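A minimal simulation sketch of this generative process follows, under one plausible reading of the mechanism (uniformly random leaf splitting, $2^{-b}$ leaf probabilities with lognormal smearing, then multinomial sampling); parameter values and the smearing width are illustrative assumptions.

```python
import numpy as np

def branching_category_probs(q, sigma=0.5, rng=None):
    """Grow a random binary tree to q leaves; each leaf gets p ~ 2^(-depth), then smearing."""
    rng = rng or np.random.default_rng()
    depths = [0]                            # depths of current leaves; start from the root
    while len(depths) < q:
        i = int(rng.integers(len(depths)))  # pick a leaf uniformly at random and split it
        d = depths.pop(i)
        depths += [d + 1, d + 1]
    raw = np.array([2.0 ** -d for d in depths])
    raw *= rng.lognormal(mean=0.0, sigma=sigma, size=q)  # random smearing of leaf weights
    return raw / raw.sum()                  # normalized category probabilities

rng = np.random.default_rng(42)
q, n = 50, 200                              # q categories in the taxonomy, n sampled items
p = branching_category_probs(q, rng=rng)
counts = rng.multinomial(n, p)              # multinomial sampling of the n items
k = int((counts > 0).sum())                 # k = nonempty categories in the finite sample
print(f"{q - k} of {q} categories are unrepresented in a sample of n={n}")
```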

4. Ontologies, Semantic Interoperability, and Knowledge Graphs

Unified taxonomies increasingly rely on explicit ontologies, providing machine-readable, extensible, and semantically precise representations of scientific domains.

  • Science and Technology Ontology (S&TO): An automated, BERTopic-driven topic graph constructed from large-scale scientific corpora (393,991 articles spanning multiple fields), with 5,153 topics and 13,155 semantic relations (Kumar et al., 2023). S&TO links both conventional and unconventional topics across computer science, physics, chemistry, and engineering, employing relational predicates such as "relatedIdentical" (cosine similarity $\geq 0.9$), "superTopicOf" (hierarchy), and "CommonArticles" (shared evidence). The construction pipeline integrates sentence embedding (paraphrase-MiniLM-L12-v2), UMAP-based dimension reduction, and HDBSCAN clustering, with customized c-TF-IDF weighting (a minimal sketch of this pipeline appears after this list):

$$w_{x, c} = |tf_{x, c}| \times \log\left(1 + \frac{A}{f_x}\right)$$

where $tf_{x, c}$ is the frequency of term $x$ in cluster $c$, $f_x$ its global frequency, and $A$ the average number of terms per class.

  • NFDI4DSO Ontology: A BFO-compliant, modular extension for Data Science and AI, grounded in NFDICore and mapped to Basic Formal Ontology (BFO) upper concepts (Gesese et al., 16 Aug 2024). Shortcut properties (e.g., nfdi4dso:spokesperson) are governed by SWRL rules ensuring both practical encoding and semantic rigor:

$$\text{Person}(?p) \wedge \text{Consortium}(?c) \wedge \ldots \rightarrow \text{spokesperson}(?c, ?p)$$

The knowledge graph distinguishes a Research Information Graph (RIG) for organizational metadata and a Research Data Graph (RDG) for aggregated research content, ensuring FAIR compliance.

  • Lightweight Analytic Operation Taxonomies: Efforts to automate analytic workflows leverage abstract, hierarchical operations (aggregation, boolean, data transformations) defined via LaTeX-specified arities and type constraints, with "domain labeling" to bind schema-agnostic operations to concrete attributes (Sterbentz et al., 2023).
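The S&TO-style embedding, dimension-reduction, and clustering pipeline can be approximated with off-the-shelf components; the sketch below uses BERTopic, which bundles sentence embeddings, UMAP, HDBSCAN, and c-TF-IDF weighting. The 20 Newsgroups corpus and all hyperparameters are stand-ins, not the settings used for S&TO.

```python
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

# Stand-in corpus; S&TO was built from ~394k scientific articles, not newsgroup posts.
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes")).data[:2000]

embedder = SentenceTransformer("paraphrase-MiniLM-L12-v2")          # sentence embeddings
umap_model = UMAP(n_neighbors=15, n_components=5, metric="cosine")  # dimension reduction
hdbscan_model = HDBSCAN(min_cluster_size=10, metric="euclidean")    # density-based clusters

# BERTopic applies class-based TF-IDF (c-TF-IDF) on top of the clusters to label topics,
# analogous to the customized weighting w_{x,c} described above.
topic_model = BERTopic(
    embedding_model=embedder,
    umap_model=umap_model,
    hdbscan_model=hdbscan_model,
)
topics, _ = topic_model.fit_transform(docs)
print(topic_model.get_topic_info().head())  # one row per discovered topic
```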

5. The Role of Theory-Guided Data Science

Unified taxonomies benefit from embedding theory-guided frameworks (TGDS), where domain laws, physical constraints, and interpretability are integrated across design, learning, refinement, and hybrid model composition (Karpatne et al., 2016). Key dimensions include:

  • Theory-guided design (structuring models in line with domain principles)
  • Theory-guided learning (imposing priors or constraints, e.g., PDEs, conservation laws)
  • Theory-guided refinement (post-processing to ensure physical plausibility)
  • Hybrid and data assimilation strategies (combining mechanistic models with data-driven corrections)

For example, hydrological ANN design employs modular decomposition by physical sub-process, while DFT functional learning enforces the Euler-Lagrange constraint $\frac{\delta \hat{T}[n_0]}{\delta n_0(r)} = \mu - v(r)$, where $v(r)$ is the external potential and $\mu$ a Lagrange multiplier.
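A minimal sketch of theory-guided learning as a soft constraint: a standard data-fit loss is augmented with a penalty for violating a hypothetical conservation relation. The model, data, and constraint are illustrative assumptions, not taken from the cited hydrology or DFT applications.

```python
import torch
import torch.nn as nn

# Theory-guided learning as a soft constraint: a data-fit loss plus a physics penalty.
# The (hypothetical) constraint is a simple conservation relation: the predicted
# components should sum to the total "mass" of the input, standing in for the
# conservation laws and PDE constraints mentioned above.
model = nn.Sequential(nn.Linear(8, 32), nn.Tanh(), nn.Linear(32, 4))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
mse = nn.MSELoss()
lam = 0.1                                   # weight of the theory term vs. the data term

x = torch.randn(256, 8)                     # synthetic inputs
y = torch.randn(256, 4)                     # synthetic targets
total = x.sum(dim=1, keepdim=True)          # quantity the outputs are meant to conserve

for _ in range(200):
    optimizer.zero_grad()
    pred = model(x)
    data_loss = mse(pred, y)                # fit to observations
    physics_loss = ((pred.sum(dim=1, keepdim=True) - total) ** 2).mean()  # constraint violation
    loss = data_loss + lam * physics_loss
    loss.backward()
    optimizer.step()

print(float(data_loss), float(physics_loss))
```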

TGDS advances unified taxonomic frameworks by ensuring that scientific data representations remain physically meaningful and interpretable, enabling cross-disciplinary generalization.

6. Scaling, Multimodality, and the Evolution of Scientific LLMs

Recent developments highlight the co-evolution of large scientific LLMs (Sci-LLMs) and their data substrates, which demand unified, hierarchical, and modality-aware taxonomies:

  • Modality Coverage: Scientific data span text (papers, notebooks), visual (microscopy, charts), symbolic (SMILES, equations), structured (tables, graphs), time-series (EEG, climate), and multi-omics integrative datasets (Hu et al., 28 Aug 2025).
  • Hierarchical Knowledge Model: Scientific knowledge is structured by layers: factual (raw measurement), theoretical (laws, equations), methodological (experimental/computational processes), modeling/simulation, and insight (transformative synthesis).
  • Model–Data Co-evolution: As Sci-LLMs adopt broader and more curated training mixes (e.g., Galactica, Intern-S1, ChemLLM, LLMPhy), the unified taxonomy must support cross-modal representation, domain-specific semantics, and reasoning that respects these layered relationships.
  • Evaluation Paradigm Shift: From static, recall-based benchmarks to process- and reasoning-oriented assessments, including multi-hop problem solving, source citation, and agent-based workflows.

Semi-automated annotation pipelines, expert validation loops, and integration with agentic frameworks (labs, simulation environments, operating-system-level protocols) (Hu et al., 28 Aug 2025) exemplify how unified taxonomies must be adaptive, extensible, and aligned with both human and AI-driven discovery.
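As one possible concretization of such a modality- and layer-aware taxonomy, the sketch below tags dataset records with the modalities and knowledge layers listed above; the field names and record structure are assumptions for illustration, not a published schema.

```python
from dataclasses import dataclass, field
from enum import Enum

class Modality(Enum):
    TEXT = "text"
    VISUAL = "visual"
    SYMBOLIC = "symbolic"
    STRUCTURED = "structured"
    TIME_SERIES = "time_series"
    MULTI_OMICS = "multi_omics"

class KnowledgeLayer(Enum):
    FACTUAL = "factual"                   # raw measurements
    THEORETICAL = "theoretical"           # laws, equations
    METHODOLOGICAL = "methodological"     # experimental/computational processes
    MODELING = "modeling_simulation"
    INSIGHT = "insight"                   # transformative synthesis

@dataclass
class DatasetRecord:
    name: str
    modalities: list[Modality]
    layer: KnowledgeLayer
    domains: list[str] = field(default_factory=list)

record = DatasetRecord(
    name="eeg_sleep_staging_v1",
    modalities=[Modality.TIME_SERIES],
    layer=KnowledgeLayer.FACTUAL,
    domains=["neuroscience"],
)
print(record)
```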


Summary Table: Core Taxonomy Construction Methods

| Method/Model | Structural Principle | Evaluative Basis |
|---|---|---|
| Direct Citation | Document-to-reference links | Herfindahl index, gold standards |
| Bibliographic Coupling | Shared reference overlap | Herfindahl index |
| Co-citation | Co-cited papers by downstream documents | Herfindahl index |
| Branching Model | Random binary tree, $n$ items, $k$ categories | Variance $\sigma^2_{\ln p}$, null distribution |
| Ontological Engineering | Topic graph, BFO compliance, knowledge graph | Semantic relations, domain coverage |
| TGDS | Theory-data integration, modularization | Scientific consistency, interpretability |
| Sci-LLM Data Taxonomy | Hierarchical, multimodal, co-evolving | Process-based evaluation (chain-of-thought, agentic tasks) |

Unified scientific data taxonomies integrate citation-centric partitioning, statistical hierarchy, explicit ontological schemas, and domain-theoretical grounding, forming the backbone for scalable, accurate, and interoperable scientific knowledge infrastructures. Their refinement and adoption facilitate not only traditional analytics and policy evaluation but also enable automated discovery, large-scale scientific modeling, and closed-loop human–AI collaboration.