
SciToolKG: Scientific Tool Knowledge Graph

Updated 2 November 2025
  • SciToolKG is a structured, extensible knowledge graph that interlinks scientific concepts and software code to support interdisciplinary research.
  • It employs unsupervised methods, including UMAP and DBSCAN, to extract and cluster entities from both natural language and code.
  • The graph enhances AI-driven model comparison, semantic search, and automated reasoning, proving useful for reproducible research and enhanced software understanding.

The Scientific Tool Knowledge Graph (SciToolKG) is a structured, extensible representation of concepts, software, and source code relationships central to scientific modeling and computational research. It is engineered to facilitate semantic search, automated software understanding, and interdisciplinary model comparison through the unsupervised extraction and integration of conceptual entities from both natural language and programmatic artifacts.

1. Foundations and Objectives

SciToolKG builds on the premise that contemporary scientific literature is frequently accompanied by source code (especially in reproducible research and interactive textbooks). This dual-modal corpus enables automated extraction of conceptual entities and their associations to software implementations. The principal objectives are:

  • Creation of a semantic index over both explanatory text and scientific code for rapid comprehension, model comparison, and onboarding.
  • Corpus-wide organization to support advanced semantic search, reasoning, and AI agent integration in open science ecosystems.

SciToolKG deviates from supervised, ontology-based systems and instead uses unsupervised methods grounded in word embeddings, dimensionality reduction, and density-based clustering (Cao et al., 2019).

2. Unsupervised Extraction and Graph Construction

2.1 Data Sources and Preprocessing

The pipeline processes open-source interactive textbooks composed in markdown and Jupyter notebooks (e.g., the Epirecipes Cookbook), which natively interleave explanatory text with executable code. This structure simplifies synchronous extraction of both modalities.

  • Text Extraction: Natural language is parsed using spaCy to decompose each sentence into <subject, verb, object> triples. Subjects and objects become nodes; verbs serve as directed edges labeling semantic relationships.
  • Code Extraction: Source code is parsed for variable and function identifiers, substituting Greek symbols with text for increased alignment to natural language. Regular expressions and custom routines systematically clean both text and code content.
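The code-extraction step can be sketched with Python's standard-library `ast` module (a stand-in for the static parsing and regex normalization described above; the particular Greek-letter map and the sample snippet are illustrative assumptions):

```python
import ast

# Partial Greek-to-text substitution map (assumed; the real pipeline covers more symbols)
GREEK = {"β": "beta", "γ": "gamma", "σ": "sigma", "μ": "mu"}

def normalize(name):
    # Spell out Greek letters so identifiers align with natural-language tokens
    return "".join(GREEK.get(ch, ch) for ch in name)

def extract_identifiers(source):
    """Collect variable and function names from source code
    (function parameters omitted for brevity)."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            names.add(normalize(node.name))
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            names.add(normalize(node.id))
    return sorted(names)

code = "def sir_model(t):\n    β = 0.3\n    recovery_rate = 0.1\n    return β - recovery_rate\n"
print(extract_identifiers(code))  # → ['beta', 'recovery_rate', 'sir_model']
```

The normalized identifiers then enter the same embedding space as the textual entities extracted by spaCy.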

2.2 Embedding and Entity Representation

  • Both textual and code-derived entities are embedded into a shared high-dimensional vector space using word embedding models (e.g., Word2Vec).
  • Entity semantic similarity provides the alignment score between code and conceptual nodes.
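The alignment score is cosine similarity in the shared embedding space. A minimal sketch, using toy 3-d vectors in place of real Word2Vec embeddings (the vectors and entity names are assumptions):

```python
import numpy as np

def alignment_score(code_vec, concept_vec):
    """Cosine similarity used as the code-concept alignment score."""
    return float(code_vec @ concept_vec /
                 (np.linalg.norm(code_vec) * np.linalg.norm(concept_vec)))

# Toy embeddings standing in for Word2Vec vectors (illustrative values)
beta = np.array([0.9, 0.1, 0.0])          # code identifier "beta"
transmission = np.array([0.8, 0.2, 0.1])  # concept "transmission rate"
print(round(alignment_score(beta, transmission), 3))  # → 0.984
```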

2.3 Nonlinear Dimensionality Reduction and Clustering

  • Uniform Manifold Approximation and Projection (UMAP) is applied to the embedding space to reduce dimensionality. UMAP facilitates cluster formation by bringing together entities that are semantically similar, though lexically diverse (e.g., "The Model," "The Model of").
  • Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is then used to discover clusters without a priori knowledge of cluster count, enabling the identification of conceptual archetypes and noise filtering.

Hierarchical entity formation is achieved by:

  • UMAP+DBSCAN applied to subject nodes (outlier/noise removal).
  • DBSCAN (without UMAP) on object nodes to maximize syntactic similarity.
  • Variable/function nodes are linked to concept nodes if their embedding similarity exceeds a chosen threshold.
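The density-based step can be illustrated with a minimal pure-NumPy DBSCAN (the actual pipeline applies UMAP first and would use a library implementation; the toy data, `eps`, and `min_pts` values here are assumptions):

```python
import numpy as np

def dbscan(X, eps=0.5, min_pts=3):
    """Minimal DBSCAN: returns a label per point, -1 for noise, >= 0 for clusters."""
    n = len(X)
    labels = np.full(n, -1)
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
    neighbors = [np.where(dist[i] <= eps)[0] for i in range(n)]
    visited = np.zeros(n, bool)
    cluster = 0
    for i in range(n):
        if visited[i] or len(neighbors[i]) < min_pts:
            continue  # not an unvisited core point
        # Expand a new cluster from core point i
        visited[i] = True
        labels[i] = cluster
        stack = [i]
        while stack:
            j = stack.pop()
            for k in neighbors[j]:
                if labels[k] == -1:
                    labels[k] = cluster  # absorb border or unlabeled point
                if not visited[k]:
                    visited[k] = True
                    if len(neighbors[k]) >= min_pts:
                        stack.append(k)  # core point: keep expanding
        cluster += 1
    return labels

# Two tight groups of embedded entities plus one outlier ("noise")
X = np.array([[0, 0], [0.1, 0], [0, 0.1],
              [5, 5], [5.1, 5], [5, 5.1],
              [10, 10]])
print(dbscan(X))  # → [0 0 0 1 1 1 -1]
```

No cluster count is specified in advance; the outlier is labeled noise and filtered, matching the role DBSCAN plays in the pipeline.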

3. Linking Code Entities to Concepts

Pragmatic naming conventions (e.g., function/variable names) in scholarly software facilitate linking code-level entities to their conceptual representations, exploiting the tendency of scientists to encode semantic information in identifiers.

  • Linking pipeline:
    • For each candidate association (code variable/function–concept node), cosine similarity in embedding space is calculated.
    • Associations are retained only when exceeding a set threshold, which is experimentally tuned (using labeled ground truth) to favor recall—subsequent filtering removes imprecise matches.

Manual annotation of variable-object pairs is employed to evaluate the precision and recall of associations. Precision/recall curves inform threshold selection and are complemented by conductance metrics for interdisciplinary graph evaluation.
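The threshold-tuning loop above can be sketched as follows; the candidate pairs, similarity scores, and ground-truth links are illustrative assumptions:

```python
def precision_recall(predicted, truth):
    """Precision/recall of predicted links against labeled ground truth."""
    tp = len(predicted & truth)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(truth) if truth else 0.0
    return precision, recall

# Hypothetical cosine similarities for candidate (variable, concept) pairs
scores = {("beta", "transmission rate"): 0.91,
          ("gamma", "recovery rate"): 0.84,
          ("t", "transmission rate"): 0.62,
          ("s0", "recovery rate"): 0.40}
truth = {("beta", "transmission rate"), ("gamma", "recovery rate")}

for threshold in (0.3, 0.6, 0.8):
    kept = {pair for pair, s in scores.items() if s > threshold}
    p, r = precision_recall(kept, truth)
    print(threshold, round(p, 2), round(r, 2))
```

A low threshold keeps recall at 1.0 at the cost of precision, matching the stated strategy of favoring recall and filtering imprecise matches afterward.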

4. Graph Structure, Interdisciplinary Fusion, and Evaluation

4.1 Graph Statistics and Pruning

For an exemplar corpus (Epirecipes):

  • 115 object nodes and 93 subject nodes.
  • 4,000–13,000 edges after thresholding.
  • Clusters containing five or fewer nodes are pruned, reducing noise and preserving conceptual relevance.

4.2 Merging Multiple Corpora

Integration of graphs from different domains (e.g., Epirecipes and Statistics with Julia) typically yields disconnected subgraphs, quantified by conductance:

$$\phi(S) = \frac{\sum_{i \in S,\, j \in \bar{S}} a_{ij}}{a(S)}$$

where $a(S) = \sum_{i \in S} \sum_{j \in V} a_{ij}$.

Conductance measures the degree of cross-corpus linkage: a drop in conductance with stringent code-concept similarity thresholds reflects separation of disciplines; higher values indicate greater semantic overlap.
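The conductance formula can be computed directly from an adjacency matrix; the two-triangle toy graph below (one cross-"discipline" edge) is an assumption for illustration:

```python
import numpy as np

def conductance(A, S):
    """phi(S) = cut(S, S_bar) / a(S), where a(S) sums all edge weights out of S."""
    mask = np.zeros(A.shape[0], bool)
    mask[np.asarray(S)] = True
    cut = A[mask][:, ~mask].sum()  # weight crossing from S to its complement
    vol = A[mask].sum()            # a(S): total weight incident to S
    return cut / vol

# Two triangles (nodes 0-2 and 3-5) joined by a single edge (2, 3)
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
A = np.zeros((6, 6))
for i, j in edges:
    A[i, j] = A[j, i] = 1
print(conductance(A, [0, 1, 2]))  # → 1/7: one crossing edge over volume 7
```

Removing the bridging edge would drive conductance to zero, the signature of fully separated corpora.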

4.3 Mapping Quality

Precision ($\frac{tp}{tp + fp}$) and recall ($\frac{tp}{tp + fn}$) metrics on manually labeled links validate association quality and support selection of operating thresholds for practical use.

5. Applications and Functional Benefits

5.1 Enhanced Software Understanding

SciToolKG formalizes the relationship between scientific software and abstract concepts, supporting:

  • Semantic search for model comparison and discovery.
  • Identification of related code modules by conceptual similarity, aiding extension and modification.

5.2 Interdisciplinary Exploration

By merging multiple domain graphs, SciToolKG reveals disciplinary boundaries and overlap—enabling cross-fertilization and transfer of modeling techniques.

5.3 AI and Automated Reasoning

The structured, cross-modal knowledge graph serves as a substrate for intelligent agents, onboarding tools, and systems such as SemanticModels.jl, which automate the identification, retrieval, and extension of existing computational models.

5.4 Onboarding and Collaboration

Clear mappings between code and concepts enable new project participants to rapidly assimilate codebases and their scientific rationale, enhancing collaborative efficacy.

6. Algorithms, Formalisms, and Trade-offs

| Step | Technique/Algorithm | Purpose |
|------|---------------------|---------|
| Text extraction | NLP (spaCy), regex | Identify <subject, verb, object> triples |
| Code extraction | Static parsing, normalization | Identify variables and functions |
| Semantic representation | Word embeddings | Map text/code entities to a shared vector space |
| Dimensionality reduction | UMAP | Group semantically close entities for clustering |
| Clustering | DBSCAN | Discover conceptual entity clusters |
| Association | Similarity thresholding | Link code entities to concepts |
| Evaluation | Precision/recall, conductance | Assess mapping quality and interdisciplinary connections |
| Graph pruning | Node-size filtering | Remove extraneous concepts |

The use of unsupervised, threshold-tuned clustering offers robustness and domain independence but necessitates post-hoc manual filtering and threshold optimization to achieve high recall at acceptable precision.

7. Limitations and Extensibility

  • Corpus-dependence: Extraction quality hinges on the structure and clarity of markdown/Jupyter source corpora.
  • Semantic alignment: Naming conventions, while informative, may deviate from formal ontologies, limiting full disambiguation without additional curation.
  • Extensibility: The pipeline is applicable to new corpora without modification, making it suitable for evolving scientific domains and large-scale deployments.

The approach sets a foundation for domain-adaptive, cross-modal knowledge graphs linking code, models, and scholarly context, extensible as open code and scientific publishing practices mature.


In summary, SciToolKG provides an unsupervised, architecture-driven methodology for organizing and interlinking conceptual and computational facets of scientific research, emphasizing the joint clustering of text and code entities, semantic similarity association, and cross-disciplinary integration. Its design and evaluation are optimized for accuracy, extensibility, and utility in automated reasoning, discovery, and collaborative science (Cao et al., 2019).
