KGE-MoS: Mixture Softmax for KGC
- KGE-MoS is a mixture-based output layer for KGE models that overcomes the rank bottleneck in knowledge graph completion tasks.
- It employs multiple softmax heads to create a union of low-dimensional manifolds, enabling a richer and more flexible representation of output distributions.
- Empirical studies show that KGE-MoS achieves higher Mean Reciprocal Rank and lower Negative Log Likelihood while maintaining parameter efficiency across various benchmarks.
KGE-MoS is a mixture-based output layer for knowledge graph embedding (KGE) models, designed to break the expressivity limitations imposed by the rank bottleneck in knowledge graph completion (KGC) tasks. The approach draws inspiration from mixture-of-softmaxes techniques in language modeling, enabling KGC models to capture a richer variety of ranking and probability distributions over large knowledge graphs without substantially increasing parameter count.
1. Rank Bottleneck in Knowledge Graph Completion
Most KGE models for KGC predict the plausibility of triples by encoding the subject ($s$) and relation ($r$) into a query vector and scoring all possible object ($o$) candidates: $\phi(s, r, o) = \mathbf{q}_{s,r}^\top \mathbf{e}_o$, where $\mathbf{q}_{s,r}$ is the query vector resulting from composing the subject embedding and the relation parameters, and $\mathbf{e}_o$ is the embedding of a candidate object entity.
Candidate objects are typically scored via a single vector-matrix multiplication: $\mathbf{s} = \mathbf{E}\,\mathbf{q}_{s,r}$ with $\mathbf{E} \in \mathbb{R}^{N \times d}$, where $N$ is the number of entities and $d$ is the embedding dimension ($d \ll N$ in most practical settings).
This output layer is fundamentally limited in rank: regardless of the number of queries or candidate entities, the output score matrix has rank at most $d$. Consequently, only a low-dimensional subspace of ranking permutations or probability distributions over entities can be realized, forming a rank bottleneck. This phenomenon—also termed the softmax bottleneck—implies that many possible output distributions and rankings cannot be represented at all when $d$ is much smaller than $N$.
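A minimal NumPy sketch (with hypothetical sizes $N$, $d$, and number of queries $M$) illustrates the bottleneck: no matter how many queries are scored, the resulting score matrix cannot exceed rank $d$.

```python
import numpy as np

# Hypothetical sizes: N entities, d-dimensional embeddings, M queries.
N, d, M = 10_000, 64, 500

E = np.random.randn(N, d)      # entity embedding matrix
Q = np.random.randn(M, d)      # query vectors q_{s,r} for M (subject, relation) pairs

scores = Q @ E.T               # (M, N) score matrix: one row of entity scores per query

# The score matrix is the product of an (M, d) and a (d, N) matrix,
# so its rank can never exceed d, no matter how large M or N are.
print(np.linalg.matrix_rank(scores))  # prints 64 (= d) for random embeddings
```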
2. Effects and Diagnosis of the Rank Bottleneck
Theoretical analysis establishes that:
- If the adjacency matrix of the target knowledge graph has rank greater than the embedding dimension $d$, perfect reconstruction of target labelings or distributions is impossible.
- For sign (multi-label) or ranking tasks, many label assignments or rankings (permutation vectors) cannot be represented unless the embedding dimension $d$ is near the number of entities $N$.
- Even after a softmax transformation, the image of possible probability distributions forms a low-dimensional manifold in the output simplex, missing most possible distributions.
Empirical evidence confirms that as the gap between $d$ and $N$ widens, model accuracy and the fidelity of predicted probabilities degrade. On large graphs with tens or hundreds of thousands of entities, standard KGE output layers are unable to model true answer distributions or fine-grained rankings, regardless of the encoder's internal power.
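A small numerical sketch (hypothetical sizes, a random binary target matrix standing in for a slice of the adjacency tensor) makes the diagnosis concrete: when the target matrix has rank above $d$, even the best possible rank-$d$ score matrix, given by a truncated SVD, cannot reproduce it exactly.

```python
import numpy as np

N, d = 200, 16
# Hypothetical binary target matrix (e.g., one relation's slice of the KG adjacency tensor).
A = (np.random.rand(N, N) < 0.05).astype(float)
print("target rank:", np.linalg.matrix_rank(A))   # typically far larger than d

# Best possible rank-d approximation (Eckart-Young): truncate the SVD at d components.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
A_d = (U[:, :d] * S[:d]) @ Vt[:d, :]

# Nonzero residual: no d-dimensional linear output layer can fit these targets exactly.
print("best rank-d reconstruction error:", np.linalg.norm(A - A_d))
```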
3. KGE-MoS: Mixture of Softmaxes Output Layer
To address these limitations, KGE-MoS adapts the mixture of softmaxes ("MoS") output architecture from language modeling to KGC. The standard linear projection and softmax is replaced with a mixture over multiple softmax heads: $P(o \mid s, r) = \sum_{k=1}^{K} \pi_k(\mathbf{q}_{s,r})\, \mathrm{softmax}\big(\mathbf{E}\, g_k(\mathbf{q}_{s,r})\big)_o$, where:
- $K$ is the number of mixture (expert) heads.
- $g_k(\cdot)$ is a component-specific projection (e.g., a linear map or a small nonlinear neural net) applied to the query vector $\mathbf{q}_{s,r}$, with learnable parameters $\theta_k$.
- $\pi_k(\mathbf{q}_{s,r})$ is the mixing weight for the $k$-th component, determined via a softmax over per-component scores, $\pi_k(\mathbf{q}_{s,r}) = \mathrm{softmax}_k\big(\mathbf{w}_k^\top \mathbf{q}_{s,r}\big)$, with learnable vectors $\mathbf{w}_k$.
Each mixture component produces a softmax distribution over entities; the final distribution is their weighted sum. This nonlinear combination produces a union of low-dimensional manifolds, vastly increasing the expressivity of the output layer even if the underlying embedding dimension remains fixed.
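The following PyTorch sketch implements a mixture-of-softmaxes output layer in the spirit described above. The module name, the tanh nonlinearity in the per-component projections, and the shared entity embedding matrix are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureSoftmaxOutput(nn.Module):
    """Mixture-of-softmaxes scoring head for a KGE model (illustrative sketch)."""

    def __init__(self, num_entities: int, dim: int, num_components: int = 4):
        super().__init__()
        self.entity_emb = nn.Embedding(num_entities, dim)        # E: (N, d)
        # One projection g_k per mixture component, applied to the query vector.
        self.projections = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(num_components)]
        )
        # Mixing weights pi_k come from a softmax over K per-component scores of the query.
        self.prior = nn.Linear(dim, num_components)

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        """query: (batch, d) composed subject/relation vector q_{s,r}.
        Returns log P(o | s, r) of shape (batch, N)."""
        E = self.entity_emb.weight                                # (N, d)
        log_pi = F.log_softmax(self.prior(query), dim=-1)         # (batch, K)
        log_components = []
        for k, g_k in enumerate(self.projections):
            logits_k = torch.tanh(g_k(query)) @ E.t()             # (batch, N)
            log_components.append(F.log_softmax(logits_k, dim=-1) + log_pi[:, k:k + 1])
        # Weighted sum of the K component distributions, done in log space for stability.
        return torch.logsumexp(torch.stack(log_components, dim=0), dim=0)
```

Returning log-probabilities lets the head be trained directly with a negative log-likelihood objective, matching the NLL metric reported below.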
4. Theoretical and Empirical Properties
The mixture-of-softmaxes architecture in KGE-MoS provably allows the output layer to represent a much broader family of rankings and probability distributions:
- With increasing $K$, the union of manifolds expands, approaching full coverage of the output simplex as $K$ (and the complexity of the per-component projections) increases.
- For permutations (rankings) and label assignments, the MoS design makes many more outputs feasible than any single rank-limited linear output layer.
- Analytical results show that the minimal embedding dimension necessary for perfect coverage exceeds practical values by orders of magnitude, but KGE-MoS achieves comparable flexibility at low parameter cost: the overhead grows only with $K$ and $d$, marginal compared to the $O(Nd)$ cost of the entity embedding matrix itself (see the back-of-envelope sketch below).
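A back-of-envelope comparison, using hypothetical sizes and assuming linear $d \times d$ per-component projections as in the sketch above, shows why the overhead is marginal relative to the embedding matrix:

```python
# Hypothetical sizes for illustration only.
N, d, K = 100_000, 200, 4

embedding_params = N * d                        # entity embedding matrix E
mos_overhead = K * (d * d + d) + (d * K + K)    # K linear projections + mixing layer

print(f"embeddings:   {embedding_params / 1e6:.1f}M")   # 20.0M
print(f"MoS overhead: {mos_overhead / 1e6:.2f}M")       # ~0.16M
```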
5. Experimental Evaluation and Practical Impact
KGE-MoS has been evaluated on multiple KGC benchmarks spanning small and large knowledge graphs (e.g., FB15k-237, Hetionet, ogbl-biokg, openbiolink) using several base models (DistMult, ComplEx, ConvE). The results demonstrate:
- On large graphs: KGE-MoS consistently improves both Mean Reciprocal Rank (MRR) and Negative Log Likelihood (NLL) over base architectures, and even over higher-dimensional baselines (see the metric sketch after the table below).
- Parameter efficiency: MoS variants at low embedding dimension can match or surpass the accuracy of non-MoS baselines with much higher dimensions.
- Number of experts: Increasing the number of mixture components directly improves expressivity and predictive performance.
- Computational cost: Training step time increases moderately, while inference cost remains nearly unchanged.
| Model | Dataset | $d$ | NLL (↓) | MRR (↑) | Params |
|---|---|---|---|---|---|
| ComplEx | ogbl-biokg | 200 | 4.89 | 0.763 | 22.8M |
| ComplEx-MoS | ogbl-biokg | 200 | 4.70 | 0.780 | 23.2M |
| ComplEx | ogbl-biokg | 1000 | 4.74 | 0.799 | 195.8M |
| ComplEx-MoS | ogbl-biokg | 1000 | 4.42 | 0.824 | 203.8M |
This pattern holds for other architectures and datasets, with performance gains amplifying as dataset scale increases.
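For reference, a minimal sketch of how the two reported metrics can be computed from a model's predicted log-probabilities. It assumes a single gold object per query and optimistic ranks, which simplifies the filtered-ranking protocol typically used in practice.

```python
import torch

def mrr_and_nll(log_probs: torch.Tensor, true_objects: torch.Tensor):
    """log_probs: (batch, N) log P(o | s, r); true_objects: (batch,) gold entity indices."""
    gold = log_probs.gather(1, true_objects.unsqueeze(1))          # (batch, 1) gold log-probs
    # NLL: average negative log-probability assigned to the gold entity.
    nll = -gold.mean()
    # MRR: reciprocal of the gold entity's rank under the predicted scores.
    ranks = (log_probs > gold).sum(dim=1) + 1                      # 1 = best possible rank
    mrr = (1.0 / ranks.float()).mean()
    return mrr.item(), nll.item()
```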
6. Significance for Knowledge Graph Modeling
KGE-MoS addresses a fundamental and previously underexplored constraint in KGE models—the inability to fully represent the combinatorial diversity of possible candidate labelings/rankings in large-scale graphs due to the low-rank output structure. By introducing a mixture of softmaxes, KGE-MoS marries the low parameter cost and scalability of vector-matrix scoring with the expressivity demanded by large KGs and multi-answer or probabilistic completion tasks.
A key implication is that KGE-MoS enables flexible, well-calibrated probability estimates crucial for applications involving reasoning, probabilistic inference, or sampling from the predicted answer distributions. The approach is broadly compatible with existing KGE encoders and can be deployed as a drop-in output layer for modern KGC systems, especially beneficial in scenarios where increasing embedding dimensions is computationally infeasible.
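As a drop-in usage sketch, any encoder that produces a query vector $\mathbf{q}_{s,r}$ can feed the MixtureSoftmaxOutput module from Section 3. The toy DistMult-style encoder, tensor names, and sizes below are placeholders, not the paper's code.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Placeholder encoder: composes subject and relation embeddings into q_{s,r}."""
    def __init__(self, num_entities, num_relations, dim):
        super().__init__()
        self.ent = nn.Embedding(num_entities, dim)
        self.rel = nn.Embedding(num_relations, dim)

    def forward(self, subjects, relations):
        return self.ent(subjects) * self.rel(relations)   # DistMult-style composition

N, R, d = 1_000, 10, 64                                   # hypothetical sizes
encoder = ToyEncoder(N, R, d)
mos_head = MixtureSoftmaxOutput(num_entities=N, dim=d, num_components=4)  # from the Section 3 sketch

subjects = torch.randint(0, N, (32,))
relations = torch.randint(0, R, (32,))
objects = torch.randint(0, N, (32,))

log_probs = mos_head(encoder(subjects, relations))        # (32, N) log P(o | s, r)
loss = nn.functional.nll_loss(log_probs, objects)         # standard NLL training objective
```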
7. Summary of Core Features
| Property | Standard Linear Output | KGE-MoS Output Layer |
|---|---|---|
| Output rank | At most $d$ | Nonlinear, up to full rank |
| Expressivity of output | Limited | Broad, increases with $K$ |
| Parameter overhead | $O(Nd)$: expressivity requires scaling $d$ | Small, grows only with $K$ and $d$, scalable |
| Ranking/probabilistic fit | Degrades in large KGs | Substantially improved in large KGs |
| Practical suitability | Poor scaling | Strong scaling to large and complex graphs |
KGE-MoS thus represents a robust, theoretically substantiated, and empirically validated output layer choice for knowledge graph completion, overcoming the rank bottleneck that has constrained the performance of standard KGE models on large-scale tasks (Badreddine et al., 27 Jun 2025).