Semi-Reference Distance (SemRefD)

Updated 13 February 2026

SemRefD is a directed measure that distinguishes prerequisite ordering between concept pairs by using both hierarchical semantic proximity and non-hierarchical knowledge graph links.
It integrates category hierarchy relations with DBpedia’s property-based connections, employing a normalized weighting function to capture dependency strength.
In the reported evaluation, high SemRefD scores align with expert curriculum sequencing, and the measure outperforms traditional symmetric similarity measures at ordering concept pairs.

Semi-Reference Distance (SemRefD) is a directed, knowledge-graph–based measure of prerequisite dependency between two concepts, designed to capture the extent to which one concept serves as a prerequisite for another. SemRefD builds upon the earlier Reference Distance (RefD) by augmenting hierarchical semantic relations with non-hierarchical knowledge graph links, utilizing both the category hierarchy and property edges in a structured knowledge base—specifically, DBpedia. It quantitatively distinguishes between concept pairs that exhibit strong directionality (e.g., A is a prerequisite of B) and those with ambiguous or no ordering. This measure operates at the intersection of semantic similarity analysis, knowledge graph traversal, and educational curriculum modeling, supporting principled prerequisite identification among curriculum concepts (Cheng, 2022).

1. Formal Definition and Theoretical Assumptions

SemRefD is defined for an ordered pair of concepts $(c_1, c_2)$ : SemRefD $(c_1, c_2)$ is positive if $c_2$ is a stronger prerequisite for $c_1$ than vice versa, negative in the converse case, and near zero when no clear ordering exists. It relies on two semantic features extracted from a background Knowledge Graph (KG):

Weighting Function $s(\cdot,\cdot)$ : Quantifies the hierarchical (category-based) semantic proximity between concept pairs based on the shortest-path in the KG's category hierarchy.
Indicator Function $i(\cdot, \cdot)$ : Detects the existence of non-hierarchical, property-path-based links (e.g., cross-references such as dbo:wikiPageWikiLink or skos:broader).

The method assumes each extracted domain concept is mappable to at least one DBpedia resource, and that traversing DBpedia with a hop limit $m=1$ yields sufficient semantically relevant candidate neighbors while constraining computational complexity and limiting noise.

2. Mathematical Formulation

Given a pair of target concepts $c_1$ and $c_2$ , let $C = \{c_j\}_{1\leq j \leq k}$ denote the set of "common" neighbor concepts derived from merging all category and property-path neighbors within the hop limit $m$ . The two defining functions are:

$s(c_j, c_i)$ : Nonnegative weight for the hierarchical link between $c_j$ and $c_i$ (category path proximity)
$i(c_j, c_i) \in \{0,1\}$ : Indicator of any non-hierarchical property path from $c_j$ to $c_i$

The SemRefD value is computed as: $\mathrm{SemRefD}(c_{1},c_{2}) = \frac{\sum_{j=1}^{k} i(c_{j},c_{2})\,s(c_{j},c_{1})} {\sum_{j=1}^{k} s(c_{j},c_{1})} - \frac{\sum_{j=1}^{k} i(c_{j},c_{1})\,s(c_{j},c_{2})} {\sum_{j=1}^{k} s(c_{j},c_{2})}\,.$

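Each denominator normalizes the weights attached to one target concept, so each fraction is a weighted average of indicator values and lies in $[0, 1]$ ; when $s$ is already normalized so that $\sum_j s(c_j, c_i) = 1$ , the denominators equal 1. Consequently, SemRefD is bounded to $[-1, +1]$ provided both denominator sums are positive. The formula encodes directionality by contrasting the share of $c_1$'s hierarchical weight carried by neighbors linked to $c_2$ with the share of $c_2$'s weight carried by neighbors linked to $c_1$ .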
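A minimal Python sketch of the formula above, assuming the weights and indicators have already been computed and are passed as dictionaries keyed by neighbor concept. The function name `semrefd`, its argument layout, and the toy values are illustrative assumptions, not taken from the original implementation.

```python
from typing import Mapping


def semrefd(s1: Mapping[str, float], s2: Mapping[str, float],
            i1: Mapping[str, int], i2: Mapping[str, int]) -> float:
    """SemRefD(c1, c2) from per-neighbor weights and indicators.

    s1[c], s2[c]: hierarchical weights s(c, c1) and s(c, c2)
    i1[c], i2[c]: indicators i(c, c1) and i(c, c2), each 0 or 1
    """
    denom1 = sum(s1.values())
    denom2 = sum(s2.values())
    if denom1 <= 0 or denom2 <= 0:
        raise ValueError("both weight sums must be positive")
    # First term: share of c1's weight on neighbors that link to c2.
    first = sum(i2.get(c, 0) * w for c, w in s1.items()) / denom1
    # Second term: share of c2's weight on neighbors that link to c1.
    second = sum(i1.get(c, 0) * w for c, w in s2.items()) / denom2
    return first - second


# Hypothetical toy values for three common neighbors a, b, c.
s1 = {"a": 0.5, "b": 0.3, "c": 0.2}
s2 = {"a": 0.2, "b": 0.2, "c": 0.6}
i1 = {"a": 0, "b": 0, "c": 1}
i2 = {"a": 1, "b": 1, "c": 0}
score = semrefd(s1, s2, i1, i2)
print(round(score, 3))
```

Swapping the roles of the two concepts flips the sign of the result, which is the antisymmetry the definition implies.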

3. Algorithmic Workflow

The computation of SemRefD is organized as a sequential process applied to concept descriptions (e.g., course syllabi):

Knowledge Graph Setup: Load DBpedia (via SPARQL or local dump).
Concept Extraction: Invoke the TextRazor API on target texts to extract top-ranked entity mentions, mapping each to corresponding DBpedia URIs ( $E_1$ , $E_2$ for each description).
Candidate Concept Generation: For each entity $c \in E_1 \cup E_2$ , retrieve:
- Category neighbors: $\{u \mid (c, \mathrm{dct:subject}, u)\}$
- Property neighbors: $\{u \mid (c, p, u)\}$ for each property $p$ in the selected property set (e.g., dbo:wikiPageWikiLink, skos:broader)
Merging these neighbor sets yields $C$ .
Semantic Weighting and Indicator Calculation: For each $c_j \in C$ and each target $c_i \in \{c_1, c_2\}$ :
- Compute $d_{ij}$ , the shortest-path length between $c_j$ and $c_i$ in the category hierarchy, limited to $m$ hops
- Raw weight $w_{ij} = 1/(1 + d_{ij})$ ; normalized as $s(c_j, c_i) = w_{ij} / \sum_{u=1}^{k} w_{iu}$
- Set $i(c_j, c_i) = 1$ if any selected property edge exists between $c_j$ and $c_i$ , else $0$
SemRefD Calculation: Substitute the weights and indicators into the formula from Section 2.
Interpretation: The sign of SemRefD indicates the direction of the inferred prerequisite relationship, and its magnitude indicates the confidence in it.
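The candidate-generation and weighting steps can be sketched in Python on a small in-memory triple list standing in for DBpedia. Everything here is an assumption made for illustration: the toy triples, the helper names, treating dct:subject edges as undirected when measuring path length, assigning zero weight to candidates beyond the hop limit, and reading the indicator as a directed edge from $c_j$ to $c_i$ .

```python
from collections import deque

HIERARCHY = {"dct:subject"}                            # category edges
PROPERTIES = {"dbo:wikiPageWikiLink", "skos:broader"}  # selected property set

# Hypothetical toy triples; a real run would query DBpedia instead.
TRIPLES = [
    ("Programming", "dct:subject", "Category:Computer_programming"),
    ("Algorithm", "dct:subject", "Category:Computer_programming"),
    ("Software_design", "dct:subject", "Category:Software_engineering"),
    ("Programming", "dbo:wikiPageWikiLink", "Algorithm"),
    ("Algorithm", "dbo:wikiPageWikiLink", "Software_design"),
]


def neighbors(triples, concept, predicates):
    """Objects reachable from `concept` in one hop over `predicates`."""
    return {o for s, p, o in triples if s == concept and p in predicates}


def candidates(triples, entities):
    """Merge category and property neighbors of all entities into C."""
    out = set()
    for e in entities:
        out |= neighbors(triples, e, HIERARCHY)
        out |= neighbors(triples, e, PROPERTIES)
    return out


def hop_distance(triples, src, dst, m):
    """BFS path length over category edges, or None if more than m hops."""
    if src == dst:
        return 0
    adj = {}
    for s, p, o in triples:
        if p in HIERARCHY:  # assumption: edges treated as undirected
            adj.setdefault(s, set()).add(o)
            adj.setdefault(o, set()).add(s)
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == m:
            continue  # hop limit reached; do not expand further
        for nxt in adj.get(node, ()):
            if nxt == dst:
                return d + 1
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return None


def normalized_weights(triples, cands, target, m=1):
    """s(c_j, target): raw weight 1/(1+d), normalized to sum to 1."""
    raw = {}
    for c in cands:
        d = hop_distance(triples, c, target, m)
        raw[c] = 0.0 if d is None else 1.0 / (1.0 + d)  # assumption: 0 beyond m
    total = sum(raw.values())
    return {c: (w / total if total else 0.0) for c, w in raw.items()}


def indicators(triples, cands, target):
    """i(c_j, target) = 1 if a selected property edge runs c_j -> target."""
    linked = {(s, o) for s, p, o in triples if p in PROPERTIES}
    return {c: int((c, target) in linked) for c in cands}


C = candidates(TRIPLES, ["Programming", "Software_design"])
s1 = normalized_weights(TRIPLES, C, "Programming", m=1)
i2 = indicators(TRIPLES, C, "Software_design")
```

With the default $m=1$ only the direct category of Programming receives weight in this toy graph; raising the limit to $m=2$ also admits Algorithm, which shares that category.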

4. Illustrative Example: Curriculum Prerequisite Detection

In a concrete application to university course prerequisites, courses COMP1100 (Programming as Problem Solving) and COMP2100 (Software Design Methodologies) are analyzed:

TextRazor extracts: $E_1 \approx$ {Programming, Problem Solving, Algorithm}, $E_2 \approx$ {Software Design, UML, Object-Oriented Design}
Candidate $C$ includes DBpedia URIs for concepts such as Programming, Algorithm, Software Engineering, Unified Modeling Language
With example path lengths: $d(\mathrm{Programming}\to \mathrm{COMP1100})=1$ ( $w=1/2$ ), $d(\mathrm{Software\ Engineering}\to \mathrm{COMP1100})=2$ ( $w=1/3$ )
Normalized weights: $s(\mathrm{Programming}, \mathrm{COMP1100}) = 0.6$ , $s(\mathrm{SE}, \mathrm{COMP1100}) = 0.4$ , where SE abbreviates Software Engineering
For cross-links: $i(\mathrm{Programming}, COMP2100) = 1$ for “wikiPageWikiLink”
With illustrative values for the two normalized terms of the formula, the score is, e.g., $1.0 - 0.2 = 0.8$ (strongly positive)
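The normalized weights in this example follow from dividing each raw weight by the sum of the raw weights, assuming Programming and Software Engineering are the only two neighbors that contribute weight for COMP1100 in this illustration:

```latex
\begin{align*}
s(\mathrm{Programming}, \mathrm{COMP1100}) &= \frac{1/2}{1/2 + 1/3} = \frac{1/2}{5/6} = \frac{3}{5} = 0.6 \\
s(\mathrm{SE}, \mathrm{COMP1100}) &= \frac{1/3}{1/2 + 1/3} = \frac{1/3}{5/6} = \frac{2}{5} = 0.4
\end{align*}
```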

The reported experiment gives cosine similarity $(\mathrm{COMP1100}, \mathrm{COMP2100}) = 0.9479$ and SemRefD $(\mathrm{COMP1100}, \mathrm{COMP2100}) = +13.17$ ; the latter comes from the author's unnormalized implementation, so it is not confined to $[-1, +1]$ (Cheng, 2022).

Course Pair	Cosine Similarity	SemRefD Score
COMP1100, COMP2100	0.9479	+13.17
COMP1110, COMP2100	n/a	+12.06
COMP1100, COMP1110	n/a	+4.40

These figures suggest that high semantic similarity alone does not resolve sequencing, since cosine similarity is symmetric, whereas a high SemRefD score indicates a direction for the dependency.

5. Parameter Choices and Their Effects

Parameter choices govern the trade-off between precision, recall, and computational tractability:

Hop Limit $m$ : Default is $m=1$ , balancing recall (of mediated semantic connections) and noise suppression; higher $m$ increases semantic reach but reduces specificity.
Property Set Selection: Restricting $i(\cdot, \cdot)$ to strong properties (e.g., skos:broader, dbo:wikiPageWikiLink) increases precision; admitting weaker links (e.g., rdfs:seeAlso) may boost recall but at the cost of precision.
Weighting Function: Choice between $1/(1+d)$ , $1/d$ (undefined at $d=0$ ), or exponential decay (e.g., $\exp(-d)$ ) shapes the influence of close versus distant neighbors; steeper decay strengthens the preference for topologically proximal nodes.
Normalization: Per-concept normalization ( $\sum_j s(c_j,c_i) = 1$ ) mitigates artifacts from concepts with broad semantic spread (large fan-out).
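To make the effect of the weighting function concrete, the following Python sketch compares the three decay choices named above on hypothetical neighbors at distances 1, 2, and 3, after per-concept normalization. The distances and the helper name `normalized` are assumptions chosen for illustration.

```python
import math

# The three decay functions named in the text; 1/d is undefined at d = 0.
DECAYS = {
    "1/(1+d)": lambda d: 1.0 / (1.0 + d),
    "1/d": lambda d: 1.0 / d,
    "exp(-d)": lambda d: math.exp(-d),
}


def normalized(decay, distances):
    """Apply a decay to each distance and rescale so the weights sum to 1."""
    raw = [decay(d) for d in distances]
    total = sum(raw)
    return [w / total for w in raw]


distances = [1, 2, 3]  # hypothetical neighbor distances
shares = {name: normalized(f, distances) for name, f in DECAYS.items()}
for name, s in shares.items():
    print(f"{name:8s}", [round(x, 3) for x in s])
```

On these distances the nearest neighbor's normalized share grows from roughly 0.46 under $1/(1+d)$ to roughly 0.67 under $\exp(-d)$ , which is the sense in which steeper decay favors proximal nodes.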

6. Empirical Findings and Comparative Assessment

Manual evaluation conducted on three core courses at ANU finds that all high SemRefD scores (>10, as reported in the author's unnormalized implementation) correspond to known prerequisite relationships (Cheng, 2022). In controlled comparisons:

F1 Improvement: SemRefD outperformed both the original Reference Distance (RefD) and a string-similarity (cosine/BERT) baseline, with an increase of approximately 10 percentage points in F1 on concept-pair ordering tasks.
Directional Disambiguation: In contrast to raw semantic similarity (cosine > 0.94), SemRefD effectively disambiguates sequence, providing a reliable indication (precision > 0.85 when SemRefD > 0) of prerequisite relations.
Alignment with Human Judgment: High SemRefD scores correspond to expert opinion regarding proper curricular sequencing.

7. Summary and Significance

SemRefD operationalizes the combined use of hierarchical and property-based relations within open knowledge graphs to produce a normalized, directed measure of prerequisite dependency. Its workflow is reproducible, contingent on entity recognition (via TextRazor), knowledge graph traversal (DBpedia), and parameterized computation of weights and indicators. Its utility in educational curriculum analysis demonstrates its efficacy in resolving ambiguities in course sequencing where symmetric similarity measures fail. Implementers are advised to tune hop limit, property set, and weighting normalization to their respective domains and knowledge graph structures (Cheng, 2022).

References

Cheng (2022). Improving Students' Academic Performance with AI and Semantic Technologies.
