
Knowledge Graph Embeddings: Vector Approaches

Updated 17 July 2025
  • Knowledge graph embeddings are vector representations that map entities and relations into a continuous space while preserving structural and semantic information.
  • Generative, multi-modal, and skip-gram-based methods integrate diverse data to enable tasks like link prediction, entity recommendation, and semantic retrieval.
  • Techniques such as random walks, duality regularizers, and parameter efficiency strategies enhance the scalability, interpretability, and effectiveness of these embeddings.

Knowledge graph embeddings (KGEs) are vector embeddings of entities and relations within a knowledge graph (KG) that aim to preserve the graph's inherent structure and semantics in a continuous vector space. These embeddings facilitate various tasks, including link prediction, entity recommendation, and question answering, by enabling efficient computation of semantic relationships.

1. Generative Approaches to Knowledge Graph Embeddings

Generative models provide a probabilistic framework for creating knowledge graph embeddings. For example, GenVector (Yang et al., 2015) is a multi-modal Bayesian embedding model that generates latent topics to produce representations for social network users and knowledge concepts. The model leverages a Normal-Gamma distribution to derive topic-specific parameters for both modalities and uses a Dirichlet prior to sample a topic distribution for each user. This approach connects online social networks to open knowledge bases by co-representing users and concepts in a shared latent topic space.
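
The sketch below illustrates this kind of multi-modal latent-topic generative process (a Dirichlet topic mixture per user, with topic-specific parameters drawn from a Normal-Gamma prior). It is a simplified, assumed reconstruction for illustration, not GenVector's exact model; all dimensions, hyperparameters, and names are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative hyperparameters (assumed, not from the paper).
n_topics, dim = 5, 16
alpha = np.ones(n_topics)             # Dirichlet prior over topics
mu0, lam, a, b = 0.0, 1.0, 2.0, 2.0   # Normal-Gamma hyperparameters

# Topic-specific parameters from a Normal-Gamma prior:
# precision ~ Gamma(a, b), mean | precision ~ Normal(mu0, 1/(lam*precision)).
precision = rng.gamma(a, 1.0 / b, size=(n_topics, dim))
means = rng.normal(mu0, 1.0 / np.sqrt(lam * precision))

def sample_user(n_concepts=3):
    """Sample a user's topic mixture and topic-conditioned representations
    for the user and a few associated knowledge concepts."""
    theta = rng.dirichlet(alpha)               # per-user topic distribution
    z_user = rng.choice(n_topics, p=theta)     # topic for the user's social-network view
    user_vec = rng.normal(means[z_user], 1.0 / np.sqrt(precision[z_user]))
    concept_vecs = []
    for _ in range(n_concepts):
        z_c = rng.choice(n_topics, p=theta)    # topic for each knowledge concept
        concept_vecs.append(rng.normal(means[z_c], 1.0 / np.sqrt(precision[z_c])))
    return theta, user_vec, np.stack(concept_vecs)

theta, user_vec, concept_vecs = sample_user()
print(theta.round(2), user_vec.shape, concept_vecs.shape)
```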

2. Multi-Modal Knowledge Graph Embeddings

Multi-modal KGEs integrate information from various data sources to enhance the quality of embeddings. GenVector (Yang et al., 2015) exemplifies this approach by combining social network data (user network structure and text from publications) with knowledge concepts (textual descriptions from external knowledge bases). By representing both users and concepts in a shared latent topic space, GenVector fuses heterogeneous information, leading to improved semantic representations and better connections between social network users and external knowledge. This approach is broadly applicable in scenarios where data stems from heterogeneous sources.

3. Skip-Gram and Learning-Based Scoring Functions

The skip-gram model, commonly used in word embeddings, has been adapted for KGEs. KG2Vec (Soru et al., 2018) adapts the skip-gram model by treating each triple as a "small sentence" and maximizing the average log probability of tokens within the triple. Instead of using a predefined scoring function, KG2Vec learns it using Long Short-Term Memory (LSTM) networks, processing the sequence of embeddings for a triple and outputting a score between 0 and 1. This approach allows for faster processing and scalability on large knowledge graphs.
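
As an illustration, the PyTorch sketch below scores a (head, relation, tail) triple by passing its embedding sequence through an LSTM and squashing the output to the (0, 1) range, in the spirit of KG2Vec's learned scoring function. Layer sizes, embedding dimensions, and the single-layer architecture are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class TripleScorer(nn.Module):
    """Scores a (head, relation, tail) triple by feeding its embedding
    sequence through an LSTM; a sketch of a learned scoring function
    (all sizes here are illustrative assumptions)."""

    def __init__(self, n_entities, n_relations, dim=100, hidden=64):
        super().__init__()
        self.ent = nn.Embedding(n_entities, dim)
        self.rel = nn.Embedding(n_relations, dim)
        self.lstm = nn.LSTM(dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, heads, rels, tails):
        # Build the length-3 "small sentence" of embeddings: [head, relation, tail].
        seq = torch.stack([self.ent(heads), self.rel(rels), self.ent(tails)], dim=1)
        _, (h_n, _) = self.lstm(seq)            # final hidden state summarizes the triple
        return torch.sigmoid(self.out(h_n[-1])).squeeze(-1)  # plausibility in (0, 1)

scorer = TripleScorer(n_entities=1000, n_relations=50)
score = scorer(torch.tensor([3]), torch.tensor([7]), torch.tensor([42]))
print(score)
```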

4. Exploiting Graph Structure via Random Walks

Random walks on knowledge graphs generate sequential contexts for entities, making word embedding techniques directly applicable. RDF2Vec (Azmy et al., 2019) employs random graph walks to "unfold" RDF graphs into sequences of entities and predicates, treating each walk as a sentence for training a Word2Vec model. Triple2Vec (Fionda et al., 2019) extends this by constructing a triple line graph, where nodes represent triples and edges represent shared entities. RDF2Vec Light (Portisch et al., 2020) reduces computational overhead by applying the method only to a subgraph around the entities of interest.
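
A minimal sketch of this idea, assuming a toy adjacency-list graph and gensim's Word2Vec implementation: generate entity/predicate walks and train a skip-gram model on them. Walk depth, counts, and training hyperparameters are illustrative.

```python
import random
from gensim.models import Word2Vec

# Toy RDF-style graph: entity -> list of (predicate, object) edges (illustrative data).
graph = {
    "dbr:Berlin": [("dbo:country", "dbr:Germany"), ("dbo:type", "dbr:City")],
    "dbr:Germany": [("dbo:capital", "dbr:Berlin"), ("dbo:type", "dbr:Country")],
    "dbr:City": [], "dbr:Country": [],
}

def random_walks(graph, walks_per_entity=10, depth=4, seed=0):
    """Unfold the graph into entity/predicate sequences, one 'sentence' per walk."""
    rng = random.Random(seed)
    walks = []
    for start in graph:
        for _ in range(walks_per_entity):
            walk, node = [start], start
            for _ in range(depth):
                edges = graph.get(node, [])
                if not edges:
                    break
                pred, obj = rng.choice(edges)
                walk.extend([pred, obj])   # interleave predicates and entities
                node = obj
            walks.append(walk)
    return walks

walks = random_walks(graph)
# Skip-gram Word2Vec over the walks (gensim >= 4 uses vector_size).
model = Word2Vec(sentences=walks, vector_size=64, window=5, min_count=1, sg=1, epochs=20)
print(model.wv["dbr:Berlin"][:5])
```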

5. Addressing Semantic Ambiguity and Improving Interpretability

Semantic matching models for KGEs measure the plausibility of triples with inner products, under the assumption that semantically similar entities have similar embeddings. Inner-product scoring alone, however, does not enforce this property. DURA (Wang et al., 2022) addresses it with a duality-induced regularizer that leverages the associated distance-based model to push semantically similar entities toward similar embeddings, improving performance on both static and temporal knowledge graphs. DisenE (Kou et al., 2020) targets interpretability instead: an attention-based mechanism explicitly focuses on the components of an entity embedding that are relevant to a given relation, while additional regularizers ensure each component independently reflects an isolated semantic aspect.
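
The sketch below shows one way such a duality-inspired penalty can be added to a bilinear (CP-style) scoring loss. The exact regularizer used by DURA may differ; the scoring function, weighting, and tensor shapes here are assumptions for illustration only.

```python
import torch

def dura_style_penalty(h, r, t):
    """Duality-inspired regularization sketch: for a bilinear score <h * r, t>,
    penalize the squared norms of the relation-transformed head and of the tail,
    nudging semantically similar entities toward similar embeddings.
    (Illustrative form, not necessarily the paper's exact regularizer.)"""
    return ((h * r).pow(2).sum(dim=-1) + t.pow(2).sum(dim=-1)).mean()

def loss(h, r, t, labels, scorer, lam=0.05):
    scores = scorer(h, r, t)   # e.g. (h * r * t).sum(-1) for a CP-style model
    bce = torch.nn.functional.binary_cross_entropy_with_logits(scores, labels)
    return bce + lam * dura_style_penalty(h, r, t)

# Toy usage with a CP-style bilinear scorer and random embeddings.
cp_score = lambda h, r, t: (h * r * t).sum(dim=-1)
h, r, t = (torch.randn(8, 32, requires_grad=True) for _ in range(3))
labels = torch.randint(0, 2, (8,)).float()
print(loss(h, r, t, labels, cp_score).item())
```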

6. Techniques for Improving Parameter Efficiency

High-performing KGE models often suffer from over-parameterization and increased computational complexity. Kronecker decomposition has been used to reduce the number of parameters in a KGE model while retaining its expressiveness (Demir et al., 2022). By splitting large embedding matrices into smaller matrices during training, this technique reconstructs embeddings on the fly and implicitly reduces redundancy in embedding vectors, encouraging feature reuse.
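
A minimal numerical illustration of the idea, with factor shapes chosen purely for this example: the full embedding matrix is reconstructed as the Kronecker product of two much smaller factors, so only the factors need to be stored and trained.

```python
import numpy as np

# Target embedding matrix of shape (n_entities, dim) = (1000, 64),
# reconstructed on the fly from two much smaller factors via a Kronecker product.
n1, n2 = 100, 10      # n_entities = n1 * n2
d1, d2 = 16, 4        # dim = d1 * d2

rng = np.random.default_rng(0)
A = rng.normal(size=(n1, d1))   # first factor: 1600 parameters
B = rng.normal(size=(n2, d2))   # second factor: 40 parameters

E = np.kron(A, B)               # full matrix: (1000, 64), i.e. 64000 virtual parameters
print(E.shape, A.size + B.size) # (1000, 64) 1640 -> roughly 39x fewer trained parameters
```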

7. Assessing and Enhancing Semantic Accuracy

Assessing the extent to which KGEs capture semantic relations remains a challenge. One approach involves generating word and concept pair datasets from KGs and evaluating how well pre-trained embeddings capture those relations (Denaux et al., 2019). Such analyses reveal that corpus-based embeddings excel at capturing relations between words and concepts, but struggle with purely conceptual relations. To address this limitation, MASCHInE (Hubert et al., 2023) designs heuristics for generating protographs based on RDF/S information, which are then used to learn KGEs that better capture semantics. Another approach, taking inspiration from physics, represents KG elements as points in space subject to attractive and repulsive forces, combined with simulated annealing to achieve stable convergence (Demir et al., 2020).
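
To make the evaluation idea above concrete, the sketch below scores KG-derived word/concept pairs with the cosine similarity of pre-trained embeddings and correlates those scores with gold relatedness judgments via Spearman's rank correlation. The embeddings and pair data here are synthetic placeholders, not from the cited work.

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical pre-trained embeddings and a KG-derived pair dataset with gold
# relatedness scores (illustrative data only).
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["car", "vehicle", "engine", "banana"]}
pairs = [("car", "vehicle", 0.9), ("car", "engine", 0.7), ("car", "banana", 0.1)]

predicted = [cosine(emb[a], emb[b]) for a, b, _ in pairs]
gold = [g for _, _, g in pairs]
rho, _ = spearmanr(predicted, gold)
print(f"Spearman correlation: {rho:.2f}")
```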