SciToolKG: Scientific Tool Knowledge Graph
- SciToolKG is a structured, extensible knowledge graph that interlinks scientific concepts and software code to support interdisciplinary research.
- It employs unsupervised methods, including UMAP and DBSCAN, to extract and cluster entities from both natural language and code.
- The graph enhances AI-driven model comparison, semantic search, and automated reasoning, proving useful for reproducible research and enhanced software understanding.
The Scientific Tool Knowledge Graph (SciToolKG) is a structured, extensible representation of concepts, software, and source code relationships central to scientific modeling and computational research. It is engineered to facilitate semantic search, automated software understanding, and interdisciplinary model comparison through the unsupervised extraction and integration of conceptual entities from both natural language and programmatic artifacts.
1. Foundations and Objectives
SciToolKG builds on the premise that contemporary scientific literature is frequently accompanied by source code (especially in reproducible research and interactive textbooks). This dual-modal corpus enables automated extraction of conceptual entities and their associations to software implementations. The principal objectives are:
- Creation of a semantic index over both explanatory text and scientific code for rapid comprehension, model comparison, and onboarding.
- Corpus-wide organization to support advanced semantic search, reasoning, and AI agent integration in open science ecosystems.
SciToolKG departs from supervised, ontology-based systems, instead using unsupervised methods grounded in word embeddings, dimensionality reduction, and density-based clustering (Cao et al., 2019).
2. Unsupervised Extraction and Graph Construction
2.1 Data Sources and Preprocessing
The pipeline processes open-source interactive textbooks composed in markdown and Jupyter notebooks (e.g., the Epirecipes Cookbook), which natively interleave explanatory text with executable code. This structure simplifies synchronous extraction of both modalities.
- Text Extraction: Natural language is parsed with spaCy to decompose each sentence into <subject, verb, object> triples. Subjects and objects become nodes; verbs serve as directed edges labeling semantic relationships.
- Code Extraction: Source code is parsed for variable and function identifiers, with Greek symbols spelled out as text to improve alignment with natural language. Regular expressions and custom routines clean both text and code content (a sketch of both passes follows this list).
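A minimal sketch of the two extraction passes is below. The spaCy dependency labels, the Greek-letter map, and all helper names are illustrative assumptions rather than the paper's exact implementation:

```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")  # model choice is assumed; any English pipeline works

def extract_triples(text):
    """Approximate <subject, verb, object> triples via dependency parsing."""
    triples = []
    for sent in nlp(text).sents:
        for token in sent:
            if token.pos_ != "VERB":
                continue
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ in ("dobj", "attr")]
            for s in subjects:
                for o in objects:
                    triples.append((s.text, token.lemma_, o.text))
    return triples

# Hypothetical Greek-to-text map used to normalize code identifiers.
GREEK = {"β": "beta", "γ": "gamma", "μ": "mu", "σ": "sigma"}

def extract_identifiers(code):
    """Collect variable/function identifiers, spelling out Greek symbols first."""
    for symbol, name in GREEK.items():
        code = code.replace(symbol, name)
    return set(re.findall(r"[A-Za-z_]\w*", code))

print(extract_triples("The model describes transmission."))
print(extract_identifiers("β = 0.3; def sir(β, γ): ..."))
```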
2.2 Embedding and Entity Representation
- Both textual and code-derived entities are embedded into a shared high-dimensional vector space using word embedding models (e.g., Word2Vec).
- Cosine similarity between entity embeddings provides the alignment score between code and concept nodes (see the sketch below).
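A sketch of the shared embedding step, assuming gensim's Word2Vec as the embedding model (the paper names Word2Vec only as an example); the toy corpus and the token-averaging scheme are illustrative:

```python
import numpy as np
from gensim.models import Word2Vec

# Toy corpus: tokenized sentences from prose plus tokenized identifiers from code.
corpus = [
    ["susceptible", "population", "decreases", "with", "infection", "rate"],
    ["beta", "is", "the", "transmission", "rate"],
    ["def", "sir_model", "beta", "gamma", "susceptible", "infected"],
]

model = Word2Vec(corpus, vector_size=50, window=3, min_count=1, seed=0)

def embed(entity):
    """Average word vectors over an entity's tokens (out-of-vocabulary tokens skipped)."""
    vecs = [model.wv[t] for t in entity.split() if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else None

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Alignment score between a code identifier and a conceptual phrase.
print(cosine(embed("beta"), embed("transmission rate")))
```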
2.3 Nonlinear Dimensionality Reduction and Clustering
- Uniform Manifold Approximation and Projection (UMAP) is applied to the embedding space to reduce dimensionality. UMAP facilitates cluster formation by bringing together entities that are semantically similar, though lexically diverse (e.g., "The Model," "The Model of").
- Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is then used to discover clusters without a priori knowledge of cluster count, enabling the identification of conceptual archetypes and noise filtering.
Hierarchical entity formation is achieved by:
- UMAP+DBSCAN applied to subject nodes (outlier/noise removal).
- DBSCAN (without UMAP) on object nodes to maximize syntactic similarity.
- Variable/function nodes are linked to concept nodes when their embedding similarity exceeds a chosen threshold (a combined sketch follows this list).
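The two-stage reduction-and-clustering step might look like the following sketch; all hyperparameters (UMAP neighbors and components, DBSCAN eps) are placeholder assumptions that would need tuning per corpus:

```python
import numpy as np
import umap                      # umap-learn package
from sklearn.cluster import DBSCAN

# X: one embedding vector per subject node (a toy random stand-in here).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))

# Step 1: nonlinear reduction pulls semantically close, lexically diverse
# entities ("The Model", "The Model of") toward one another.
X_low = umap.UMAP(n_components=5, n_neighbors=15, min_dist=0.1,
                  random_state=0).fit_transform(X)

# Step 2: density-based clustering needs no a priori cluster count;
# label -1 marks noise points, which are dropped for subject nodes.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_low)

keep = labels != -1
print(f"{keep.sum()} of {len(labels)} subject nodes kept "
      f"in {labels.max() + 1} clusters")
```

Per the pipeline above, subjects get the full UMAP+DBSCAN treatment while object nodes are clustered with DBSCAN alone, directly in the original embedding space.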
3. Linking Code Entities to Concepts
Pragmatic naming conventions (e.g., function/variable names) in scholarly software facilitate linking code-level entities to their conceptual representations, exploiting the tendency of scientists to encode semantic information in identifiers.
- Linking pipeline:
- For each candidate association (code variable/function–concept node), cosine similarity in embedding space is calculated.
- Associations are retained only when the similarity exceeds a set threshold, tuned experimentally against labeled ground truth to favor recall; subsequent filtering removes imprecise matches (a sketch follows this list).
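A hedged sketch of the thresholding step; the default threshold value, the matrix layout, and the function name are assumptions:

```python
import numpy as np

def link_code_to_concepts(code_vecs, concept_vecs, threshold=0.6):
    """Return (code_idx, concept_idx, score) triples whose cosine similarity
    clears the threshold (tuned on labeled pairs to favor recall)."""
    # Row-normalize so a plain dot product equals cosine similarity.
    c = code_vecs / np.linalg.norm(code_vecs, axis=1, keepdims=True)
    k = concept_vecs / np.linalg.norm(concept_vecs, axis=1, keepdims=True)
    sims = c @ k.T
    return [(int(i), int(j), float(sims[i, j]))
            for i, j in np.argwhere(sims >= threshold)]
```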
Manual annotation of variable-object pairs is employed to evaluate the precision and recall of associations. Precision/recall curves inform threshold selection and are complemented by conductance metrics for interdisciplinary graph evaluation.
4. Graph Structure, Interdisciplinary Fusion, and Evaluation
4.1 Graph Statistics and Pruning
For an exemplar corpus (Epirecipes):
- 115 object nodes and 93 subject nodes.
- 4,000–13,000 edges, depending on the chosen similarity threshold.
- Clusters containing five or fewer nodes are pruned, reducing noise while preserving conceptual relevance (a pruning sketch follows).
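One plausible reading of the pruning rule, sketched with networkx; whether pruning acts on DBSCAN clusters or on graph components is an assumption here:

```python
import networkx as nx

def prune_small_clusters(G, min_size=6):
    """Keep only weakly connected components with more than five nodes.
    G is assumed to be a DiGraph (verbs serve as directed edges)."""
    keep = set()
    for component in nx.weakly_connected_components(G):
        if len(component) >= min_size:
            keep |= component
    return G.subgraph(keep).copy()
```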
4.2 Merging Multiple Corpora
Integration of graphs from different domains (e.g., Epirecipes and Statistics with Julia) typically yields disconnected subgraphs, quantified by conductance:
$$\phi(S) = \frac{\mathrm{cut}(S, \bar{S})}{\min\big(\mathrm{vol}(S), \mathrm{vol}(\bar{S})\big)},$$

where $\mathrm{cut}(S, \bar{S})$ counts edges crossing between the two subgraphs and $\mathrm{vol}(S) = \sum_{v \in S} \deg(v)$ is the total degree of the nodes in $S$.
Conductance measures the degree of cross-corpus linkage: a drop in conductance with stringent code-concept similarity thresholds reflects separation of disciplines; higher values indicate greater semantic overlap.
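networkx ships a conductance implementation matching this definition; the sketch below merges two stand-in graphs and measures the cut between them (the stand-in corpora and the single cross-corpus edge are illustrative):

```python
import networkx as nx

# Stand-ins for the two corpus graphs, relabeled to disjoint node sets.
G_a = nx.karate_club_graph()
G_b = nx.relabel_nodes(nx.karate_club_graph(), lambda n: n + 100)
G = nx.compose(G_a, G_b)
G.add_edge(0, 100)              # one cross-corpus concept link

S = set(G_a.nodes)              # nodes originating from corpus A
print(nx.conductance(G, S))     # low value: the disciplines barely connect
```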
4.3 Mapping Quality
Precision ($P = \mathrm{TP}/(\mathrm{TP} + \mathrm{FP})$) and recall ($R = \mathrm{TP}/(\mathrm{TP} + \mathrm{FN})$) metrics on manually labeled links validate association quality and support selection of operating thresholds for practical use.
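A small sketch of how threshold selection could be driven by the labeled links, using scikit-learn's precision-recall sweep; the labels and similarity scores below are toy values:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# y_true: manual labels for candidate variable-concept links;
# scores: their cosine similarities.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
scores = np.array([0.91, 0.40, 0.75, 0.62, 0.55, 0.30, 0.68, 0.71])

precision, recall, thresholds = precision_recall_curve(y_true, scores)
for p, r, t in zip(precision, recall, thresholds):
    print(f"threshold={t:.2f}  P={p:.2f}  R={r:.2f}")
```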
5. Applications and Functional Benefits
5.1 Enhanced Software Understanding
SciToolKG formalizes the relationship between scientific software and abstract concepts, supporting:
- Semantic search for model comparison and discovery.
- Identification of related code modules by conceptual similarity, aiding extension and modification (illustrated below).
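As a toy illustration of such a query, related code entities fall out of simple neighborhood lookups on the graph; the node names here are hypothetical:

```python
import networkx as nx

# Hypothetical slice of SciToolKG: concept nodes linked to code entities.
KG = nx.Graph()
KG.add_edges_from([
    ("infection rate", "beta"),       # concept -- code variable
    ("infection rate", "sir_model"),
    ("recovery rate", "gamma"),
    ("recovery rate", "sir_model"),
])

def related_code(concept):
    """Code entities attached to a concept node."""
    return sorted(KG.neighbors(concept))

print(related_code("infection rate"))  # -> ['beta', 'sir_model']
```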
5.2 Interdisciplinary Exploration
By merging multiple domain graphs, SciToolKG reveals disciplinary boundaries and overlaps, enabling cross-fertilization and the transfer of modeling techniques.
5.3 AI and Automated Reasoning
The structured, cross-modal knowledge graph serves as a substrate for intelligent agents, onboarding tools, and systems such as SemanticModels.jl, which automate the identification, retrieval, and extension of existing computational models.
5.4 Onboarding and Collaboration
Clear mappings between code and concepts enable new project participants to rapidly assimilate codebases and their scientific rationale, enhancing collaborative efficacy.
6. Algorithms, Formalisms, and Trade-offs
| Step | Technique/Algorithm | Purpose |
|---|---|---|
| Text Extraction | NLP (spaCy), regex | Identify <subject, verb, object> triples |
| Code Extraction | Static parsing, normalization | Identify variables and functions |
| Semantic Representation | Word embeddings | Transform text/code to vector space |
| Dimensionality Reduction | UMAP | Group semantically close entities for clustering |
| Clustering | DBSCAN | Discover conceptual entity clusters |
| Association | Similarity thresholding | Link code entities to concepts |
| Evaluation | Precision/Recall, Conductance | Assess mapping quality, interdisciplinary connections |
| Graph Pruning | Node size filtering | Remove extraneous concepts |
The use of unsupervised, threshold-tuned clustering offers robustness and domain independence, but it necessitates post-hoc manual filtering and threshold optimization to achieve high recall at acceptable precision.
7. Limitations and Extensibility
- Corpus-dependence: Extraction quality hinges on the structure and clarity of markdown/Jupyter source corpora.
- Semantic alignment: Naming conventions, while informative, may deviate from formal ontologies, limiting full disambiguation without additional curation.
- Extensibility: The pipeline is applicable to new corpora without modification, making it suitable for evolving scientific domains and large-scale deployments.
The approach sets a foundation for domain-adaptive, cross-modal knowledge graphs linking code, models, and scholarly context, extensible as open code and scientific publishing practices mature.
In summary, SciToolKG provides an unsupervised, architecture-driven methodology for organizing and interlinking conceptual and computational facets of scientific research, emphasizing the joint clustering of text and code entities, semantic similarity association, and cross-disciplinary integration. Its design and evaluation are optimized for accuracy, extensibility, and utility in automated reasoning, discovery, and collaborative science (Cao et al., 2019).