Semi-Reference Distance (SemRefD)
- SemRefD is a directed measure that distinguishes prerequisite ordering between concept pairs by using both hierarchical semantic proximity and non-hierarchical knowledge graph links.
- It integrates category hierarchy relations with DBpedia’s property-based connections, employing a normalized weighting function to capture dependency strength.
- Empirical findings show that high SemRefD scores correlate with expert curriculum sequencing and that SemRefD outperforms traditional symmetric similarity measures.
Semi-Reference Distance (SemRefD) is a directed, knowledge-graph–based measure of prerequisite dependency between two concepts, designed to capture the extent to which one concept serves as a prerequisite for another. SemRefD builds upon the earlier Reference Distance (RefD) by augmenting hierarchical semantic relations with non-hierarchical knowledge graph links, utilizing both the category hierarchy and property edges in a structured knowledge base—specifically, DBpedia. It quantitatively distinguishes between concept pairs that exhibit strong directionality (e.g., A is a prerequisite of B) and those with ambiguous or no ordering. This measure operates at the intersection of semantic similarity analysis, knowledge graph traversal, and educational curriculum modeling, supporting principled prerequisite identification among curriculum concepts (Cheng, 2022).
1. Formal Definition and Theoretical Assumptions
SemRefD is formulated for pairs of concepts $A$, $B$, such that $\mathrm{SemRefD}(A, B)$ is positive if $A$ is a stronger prerequisite for $B$ than vice versa, negative in the converse case, and near zero when no clear ordering exists. It relies on two critical semantic features extracted from a background Knowledge Graph (KG):
- Weighting Function $w$: Quantifies the hierarchical (category-based) semantic proximity between concept pairs based on the shortest path in the KG's category hierarchy.
- Indicator Function $i$: Detects the existence of non-hierarchical, property-path-based links (e.g., cross-references such as `dbo:wikiPageWikiLink` or `skos:broader`).
The method assumes each extracted domain concept is mappable to at least one DBpedia resource, and that traversing DBpedia with a hop limit yields sufficient semantically relevant candidate neighbors while constraining computational complexity and limiting noise.
2. Mathematical Formulation
Given a pair of target concepts $A$ and $B$, let $C$ denote the set of "common" neighbor concepts derived from merging all category and property-path neighbors within the hop limit $k$. For a candidate $c \in C$ and a target $X \in \{A, B\}$, the two defining functions are:
- $w(c, X)$: Nonnegative weight for the hierarchical link between $c$ and $X$ (category path proximity)
- $i(c, X)$: Indicator of any non-hierarchical property path from $c$ to $X$
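Following the RefD template that SemRefD extends, the value can be written as:

$$
\mathrm{SemRefD}(A, B) = \frac{\sum_{c \in C} i(c, A)\, w(c, B)}{\sum_{c \in C} w(c, B)} - \frac{\sum_{c \in C} i(c, B)\, w(c, A)}{\sum_{c \in C} w(c, A)}
$$

The first term is the weighted share of $B$'s neighborhood that links to $A$, and the second term is the converse, so that a positive value marks $A$ as the stronger prerequisite. Each denominator ensures per-concept normalization, keeping each term within $[0, 1]$. Consequently, SemRefD is bounded to $[-1, 1]$ provided both denominator sums are positive. The formula explicitly encodes directionality, contrasting the hierarchical-weighted, property-linked mass relative to each concept.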
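The computation can be sketched in a few lines of Python. This is a minimal illustration rather than the author's implementation: it assumes a RefD-style form in which the score is the weighted share of $B$'s neighborhood that links to $A$, minus the weighted share of $A$'s neighborhood that links to $B$. The function names (`linked_share`, `semrefd`) and all numeric values are hypothetical.

```python
from typing import Mapping, Sequence


def linked_share(candidates: Sequence[str],
                 weights: Mapping[str, float],
                 indicators: Mapping[str, int]) -> float:
    """Weighted share of candidates that carry a property link.

    With nonnegative weights and 0/1 indicators the result lies in [0, 1].
    """
    total = sum(weights.get(c, 0.0) for c in candidates)
    if total <= 0:
        raise ValueError("weights over the candidate set must sum to a positive value")
    linked = sum(weights.get(c, 0.0) * indicators.get(c, 0) for c in candidates)
    return linked / total


def semrefd(candidates: Sequence[str],
            w_a: Mapping[str, float], i_a: Mapping[str, int],
            w_b: Mapping[str, float], i_b: Mapping[str, int]) -> float:
    """Assumed RefD-style score for the ordered pair (A, B).

    w_x[c] is the hierarchical weight between candidate c and concept X;
    i_x[c] is 1 if a selected property path links c to X, else 0.
    """
    # Share of B's weighted neighborhood linking to A, minus the converse.
    return linked_share(candidates, w_b, i_a) - linked_share(candidates, w_a, i_b)


# Hypothetical toy values, not taken from the source.
candidates = ["Algorithm", "Software_engineering", "Unified_Modeling_Language"]
w_a = {"Algorithm": 0.6, "Software_engineering": 0.3, "Unified_Modeling_Language": 0.1}
w_b = {"Algorithm": 0.2, "Software_engineering": 0.5, "Unified_Modeling_Language": 0.3}
i_a = {"Algorithm": 1, "Software_engineering": 1, "Unified_Modeling_Language": 0}
i_b = {"Algorithm": 0, "Software_engineering": 1, "Unified_Modeling_Language": 0}

score = semrefd(candidates, w_a, i_a, w_b, i_b)
print(round(score, 3))
```

With these toy inputs the score is about 0.4, which under the assumed convention would mark $A$ as the more likely prerequisite. Swapping the two concepts flips the sign, which is the property that separates this kind of measure from a symmetric similarity.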
3. Algorithmic Workflow
The computation of SemRefD is organized as a sequential process applied to concept descriptions (e.g., course syllabi):
- Knowledge Graph Setup: Load DBpedia (via SPARQL or local dump).
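- Concept Extraction: Invoke the TextRazor API on target texts to extract top-ranked entity mentions, mapping each to its corresponding DBpedia URI (one entity set per description).
- Candidate Concept Generation: For each entity $e$, retrieve the following neighbors, whose union forms the candidate set $C$:
  - Category neighbors: concepts reachable from $e$ through the category hierarchy within the hop limit $k$
  - Property neighbors: concepts linked to $e$ through a property set $P$ (e.g., `dbo:wikiPageWikiLink`, `skos:broader`)
- Semantic Weighting and Indicator Calculation: For each candidate $c \in C$ and each target $X \in \{A, B\}$:
  - Compute the shortest-path length $d(c, X)$ in the category hierarchy, limited to $k$ hops
  - Compute a raw weight from $d(c, X)$ (e.g., $1/(1+d)$), then normalize it by the sum of raw weights for $X$ to obtain $w(c, X)$
  - Set $i(c, X) = 1$ if any selected property edge exists between $c$ and $X$, else $0$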
- SemRefD Calculation: Insert the above components into the formal definition
- Interpretation: The sign and magnitude of SemRefD dictate direction and confidence in the inferred prerequisite relationship.
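The weighting and indicator steps above can be sketched on a small in-memory graph. This is an illustrative sketch under stated assumptions, not the pipeline used in the source: the category hierarchy is modeled as an undirected adjacency dictionary, property links as a set of directed `(source, target)` pairs, the raw weight is taken to be $1/(1+d)$, and unreachable candidates receive weight 0. The helper names and the toy graph are hypothetical; a real run would query DBpedia instead.

```python
from collections import deque


def hop_distance(graph, source, target, max_hops):
    """Breadth-first shortest path length, or None if target is beyond max_hops."""
    if source == target:
        return 0
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == max_hops:
            continue  # do not expand past the hop limit
        for nxt in graph.get(node, ()):
            if nxt == target:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return None


def weights_and_indicators(candidates, target, category_graph, property_edges, max_hops):
    """Return (normalized weights, 0/1 indicators) of each candidate w.r.t. target."""
    raw = {}
    for c in candidates:
        d = hop_distance(category_graph, c, target, max_hops)
        # Assumed raw weight 1/(1+d); candidates out of range get weight 0.
        raw[c] = 0.0 if d is None else 1.0 / (1.0 + d)
    total = sum(raw.values())
    weights = {c: (v / total if total > 0 else 0.0) for c, v in raw.items()}
    # Assumed direction: a property edge from the candidate to the target.
    indicators = {c: int((c, target) in property_edges) for c in candidates}
    return weights, indicators


# Hypothetical toy category hierarchy (undirected) and property links (directed).
category_graph = {
    "Algorithm": ["Cat:Computer_science"],
    "Programming": ["Cat:Computer_science"],
    "Cat:Computer_science": ["Algorithm", "Programming", "Cat:Software_engineering"],
    "Cat:Software_engineering": ["Cat:Computer_science", "UML"],
    "UML": ["Cat:Software_engineering"],
}
property_edges = {("Algorithm", "Programming")}

weights, indicators = weights_and_indicators(
    ["Algorithm", "UML"], "Programming", category_graph, property_edges, max_hops=3
)
print(weights, indicators)
```

The hop limit is enforced inside the traversal rather than by filtering afterwards, so the cost of each lookup stays bounded by the size of the neighborhood within `max_hops`.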
4. Illustrative Example: Curriculum Prerequisite Detection
In a concrete application to university course prerequisites, courses COMP1100 (Programming as Problem Solving) and COMP2100 (Software Design Methodologies) are analyzed:
- TextRazor extracts {Programming, Problem Solving, Algorithm} for COMP1100 and {Software Design, UML, Object-Oriented Design} for COMP2100
- The candidate set $C$ includes DBpedia URIs for concepts such as Programming, Algorithm, Software Engineering, Unified Modeling Language
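- Shortest category-path lengths $d$ are computed between each candidate and each course concept
- These path lengths are converted to normalized weights $w$
- For cross-links: $i = 1$ where a `dbo:wikiPageWikiLink` edge exists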
- Numerator and denominator terms produce, e.g., $1.0 - 0.2 = 0.8$ (strongly positive)
Empirically, the actual experiment reports a cosine similarity of 0.9479 and a SemRefD score of +13.17 (in their unnormalized implementation) for this course pair (Cheng, 2022).
| Course Pair | Cosine Similarity | SemRefD Score |
|---|---|---|
| COMP1100, COMP2100 | 0.9479 | +13.17 |
| COMP1110, COMP2100 | n/a | +12.06 |
| COMP1100, COMP1110 | n/a | +4.40 |
These figures show that high semantic similarity alone does not resolve sequencing, whereas a high SemRefD score indicates directionality.
5. Parameter Choices and Their Effects
Algorithm parameters govern the trade-off between precision, recall, and computational tractability:
- Hop Limit $k$: The default setting balances recall (of mediated semantic connections) against noise suppression; a higher $k$ increases semantic reach but reduces specificity.
- Property Set Selection: Restricting the set to strong properties (e.g., `skos:broader`, `dbo:wikiPageWikiLink`) increases precision; admitting weaker links (e.g., `rdfs:seeAlso`) may boost recall at the cost of accuracy.
- Weighting Function: The choice between $1/(1+d)$, $1/d$, or exponential decay shapes the influence of close versus distant neighbors; steeper decay intensifies the preference for topologically proximal nodes.
- Normalization: Per-concept normalization (dividing each weight by the sum of weights for that concept) mitigates artifacts from concepts with broad semantic spread (large fan-out).
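The effect of the weighting function can be made concrete with a short comparison. The sketch below normalizes each candidate decay over three hypothetical neighbors at distances 1, 2, and 3 and reports the share assigned to the closest one. The exponential form `exp(-rate * d)` and its `rate` parameter are assumptions chosen for illustration; the source does not specify a particular exponential.

```python
import math


def inverse_plus_one(d: int) -> float:
    return 1.0 / (1.0 + d)


def inverse(d: int) -> float:
    if d <= 0:
        raise ValueError("1/d is undefined for d <= 0")
    return 1.0 / d


def exponential(d: int, rate: float = 1.0) -> float:
    # Hypothetical decay form; `rate` is an illustrative parameter.
    return math.exp(-rate * d)


def normalized(decay, distances):
    """Per-concept normalization: divide each raw weight by the sum of raw weights."""
    raw = [decay(d) for d in distances]
    total = sum(raw)
    return [r / total for r in raw]


distances = [1, 2, 3]  # hypothetical hop distances of three neighbors
shares = {
    name: normalized(fn, distances)[0]  # share given to the closest neighbor
    for name, fn in [
        ("1/(1+d)", inverse_plus_one),
        ("1/d", inverse),
        ("exp(-d)", exponential),
    ]
}
for name, share in shares.items():
    print(f"{name}: closest neighbor share = {share:.3f}")
```

For these distances the closest neighbor's share grows from $1/(1+d)$ to $1/d$ to $e^{-d}$, which illustrates how a steeper decay concentrates weight on topologically proximal nodes.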
6. Empirical Findings and Comparative Assessment
Manual evaluation conducted on three core courses at ANU reveals that all high SemRefD scores (>10, as reported in their unnormalized implementation) associate directly with known prerequisite relationships (Cheng, 2022). In controlled comparisons:
- F1 Improvement: SemRefD outperformed both the original Reference Distance (RefD) and a string-similarity (cosine/BERT) baseline, with an approximate 10 percentage point increase in F1 for concept-pair ordering tasks.
- Directional Disambiguation: In contrast to raw semantic similarity (cosine > 0.94), SemRefD effectively disambiguates sequence, providing a reliable indication (precision > 0.85 when SemRefD > 0) of prerequisite relations.
- Alignment with Human Judgment: High SemRefD scores correspond to expert opinion regarding proper curricular sequencing.
7. Summary and Significance
SemRefD operationalizes the combined use of hierarchical and property-based relations within open knowledge graphs to produce a normalized, directed measure of prerequisite dependency. Its workflow is reproducible, contingent on entity recognition (via TextRazor), knowledge graph traversal (DBpedia), and parameterized computation of weights and indicators. Its utility in educational curriculum analysis demonstrates its efficacy in resolving ambiguities in course sequencing where symmetric similarity measures fail. Implementers are advised to tune hop limit, property set, and weighting normalization to their respective domains and knowledge graph structures (Cheng, 2022).