KGE-MoS: Mixture of Softmaxes for Knowledge Graph Completion

Updated 1 July 2025
  • KGE-MoS is a mixture-based output layer for KGE models that overcomes the rank bottleneck in knowledge graph completion tasks.
  • It employs multiple softmax heads to create a union of low-dimensional manifolds, enabling a richer and more flexible representation of output distributions.
  • Empirical studies show that KGE-MoS achieves higher Mean Reciprocal Rank and lower Negative Log Likelihood while maintaining parameter efficiency across various benchmarks.

KGE-MoS is a mixture-based output layer for knowledge graph embedding (KGE) models, designed to break the expressivity limitations imposed by the rank bottleneck in knowledge graph completion (KGC) tasks. The approach draws inspiration from mixture of softmaxes techniques in language modeling, enabling KGC models to capture a richer variety of ranking and probability distributions over large knowledge graphs without substantially increasing parameter count.

1. Rank Bottleneck in Knowledge Graph Completion

Most KGE models for KGC predict the plausibility of a triple $(s, r, o)$ by encoding the subject $s$ and relation $r$ into a query vector and scoring all candidate objects $o$: $\phi(s, r, o) = \mathbf{h}_{s,r}^\top \mathbf{e}_o$, where $\mathbf{h}_{s,r}$ is the query vector obtained by composing $\mathbf{e}_s$ with the relation parameters, and $\mathbf{e}_o$ is the embedding of a candidate object entity.

Candidate objects are typically scored via a single vector-matrix multiplication: $\mathbf{z}_{s,r,:} = \mathbf{h}_{s,r} \mathbf{E}^\top$, with $\mathbf{E} \in \mathbb{R}^{|\mathcal{E}| \times d}$, where $|\mathcal{E}|$ is the number of entities and $d$ is the embedding dimension ($d \ll |\mathcal{E}|$ in most practical settings).
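
For concreteness, a minimal NumPy sketch of this scoring step; the sizes and the helper name `score_candidates` are illustrative rather than taken from the paper:

```python
import numpy as np

def score_candidates(h_sr: np.ndarray, E: np.ndarray) -> np.ndarray:
    """Score every candidate object for a batch of (s, r) queries.

    h_sr: (batch, d) query vectors h_{s,r}.
    E:    (num_entities, d) entity embedding matrix.
    Returns the (batch, num_entities) score matrix z = h_sr @ E^T.
    """
    return h_sr @ E.T

# Illustrative sizes: the embedding dimension d is far smaller than |E|.
d, num_entities, batch = 200, 50_000, 32
rng = np.random.default_rng(0)
E = rng.normal(size=(num_entities, d))
h_sr = rng.normal(size=(batch, d))
scores = score_candidates(h_sr, E)  # shape (32, 50000), one row per query
```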

This output layer is fundamentally limited in rank: regardless of the number of queries or candidate entities, the output score matrix has rank at most $d$. Consequently, only a low-dimensional subspace of ranking permutations or probability distributions over entities can be realized, forming a rank bottleneck. This phenomenon, also termed the softmax bottleneck, implies that many possible output distributions and rankings cannot be represented at all when $d$ is much smaller than $|\mathcal{E}|$.

2. Effects and Diagnosis of the Rank Bottleneck

Theoretical analysis establishes that:

  • If the adjacency matrix of the target knowledge graph has rank $r > d+1$, perfect reconstruction of target labelings or distributions is impossible.
  • For sign (multi-label) or ranking tasks, many label assignments or rankings (permutation vectors) cannot be represented unless the embedding dimension is near the number of entities.
  • Even after a softmax transformation, the image of possible probability distributions forms a low-dimensional manifold in the output simplex, missing most possible distributions.

Empirical evidence confirms that as the gap between $d$ and $|\mathcal{E}|$ widens, model accuracy and the fidelity of predicted probabilities degrade. On large graphs with tens or hundreds of thousands of entities, standard KGE output layers are unable to model true answer distributions or fine-grained rankings, regardless of the encoder's internal power.
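
The rank cap is easy to check numerically. The toy sketch below (arbitrary sizes, not from the paper) stacks many query vectors and confirms that the resulting score matrix never exceeds rank $d$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_entities, num_queries = 16, 1_000, 4_000

E = rng.normal(size=(num_entities, d))  # entity embeddings
H = rng.normal(size=(num_queries, d))   # stacked query vectors h_{s,r}

Z = H @ E.T                             # (4000, 1000) score matrix
print(np.linalg.matrix_rank(Z))         # prints 16: rank is capped at d
```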

3. KGE-MoS: Mixture of Softmaxes Output Layer

To address these limitations, KGE-MoS adapts the mixture of softmaxes ("MoS") output architecture from language modeling to KGC. The standard linear projection followed by a softmax is replaced with a mixture over multiple softmax heads:

$$P(O \mid s, r) = \sum_{k=1}^{K} \pi_{s,r,k} \, \mathrm{softmax}\!\left( f_{\theta_k}(\mathbf{h}_{s,r})\, \mathbf{E}^\top \right),$$

where:

  • $K$ is the number of mixture (expert) heads.
  • $f_{\theta_k}$ is a component-specific projection (e.g., a linear map or a small nonlinear neural network) applied to the query vector, with learnable parameters $\theta_k$.
  • $\pi_{s,r,k}$ is the mixing weight for the $k$-th component, determined via:

$$\pi_{s,r,k} = \frac{\exp(\mathbf{h}_{s,r}^\top \omega_k)}{\sum_{k'} \exp(\mathbf{h}_{s,r}^\top \omega_{k'})},$$

with learnable vectors $\omega_k$.

Each mixture component produces a softmax distribution over entities; the final distribution is their weighted sum. This nonlinear combination produces a union of low-dimensional manifolds, vastly increasing the expressivity of the output layer even if the underlying embedding dimension $d$ remains fixed.
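
A hedged PyTorch sketch of such a mixture-of-softmaxes head is given below. The class and argument names are illustrative, and the per-component projections $f_{\theta_k}$ are taken to be plain linear maps, which is one choice consistent with the description above rather than the paper's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoSOutputLayer(nn.Module):
    """Mixture-of-softmaxes scoring head over a shared entity embedding matrix."""

    def __init__(self, d: int, num_entities: int, num_components: int):
        super().__init__()
        self.E = nn.Parameter(torch.randn(num_entities, d) * 0.01)  # entity embeddings
        # Per-component projections f_{theta_k}; assumed linear here.
        self.proj = nn.ModuleList([nn.Linear(d, d) for _ in range(num_components)])
        self.gate = nn.Linear(d, num_components, bias=False)        # mixing vectors omega_k

    def forward(self, h_sr: torch.Tensor) -> torch.Tensor:
        # h_sr: (batch, d) query vectors; returns P(O | s, r) of shape (batch, num_entities).
        pi = F.softmax(self.gate(h_sr), dim=-1)                     # (batch, K) mixing weights
        comps = torch.stack(
            [F.softmax(f(h_sr) @ self.E.T, dim=-1) for f in self.proj], dim=1
        )                                                           # (batch, K, num_entities)
        return (pi.unsqueeze(-1) * comps).sum(dim=1)                # weighted sum of softmaxes

# Example: K = 4 softmax heads over 10k entities with d = 200.
layer = MoSOutputLayer(d=200, num_entities=10_000, num_components=4)
probs = layer(torch.randn(32, 200))  # (32, 10000); each row sums to 1
```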

4. Theoretical and Empirical Properties

The mixture-of-softmaxes architecture in KGE-MoS provably allows the output layer to represent a much broader family of rankings and probability distributions:

  • As $K$ (and the complexity of the per-component projections) increases, the union of manifolds expands, approaching full coverage of the output simplex.
  • For permutation (ranking) and label-assignment tasks, the MoS design makes more outputs feasible than any single rank-limited linear output.
  • Analytical results show that the minimal embedding dimension necessary for perfect coverage exceeds practical values by orders of magnitude, but KGE-MoS achieves comparable flexibility at low parameter cost: its parameter overhead scales as $O(Kd^2)$, marginal compared to scaling the embedding matrices themselves ($O(d(|\mathcal{E}| + |\mathcal{R}|))$); see the sketch after this list.
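
To make the cost comparison concrete, a back-of-the-envelope count with assumed sizes (the numbers below are illustrative, not figures from the paper):

```python
# Illustrative parameter counts; all sizes are assumed values.
d, K = 200, 4
num_entities, num_relations = 90_000, 50

embedding_params = d * (num_entities + num_relations)  # O(d(|E| + |R|)) embedding tables
mos_overhead = K * d * d + K * d                       # O(Kd^2): K linear projections + gate

print(f"embedding tables: {embedding_params:,}")       # 18,010,000
print(f"MoS overhead:     {mos_overhead:,}")           # 160,800 (under 1% extra)
```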

5. Experimental Evaluation and Practical Impact

KGE-MoS has been evaluated on multiple KGC benchmarks spanning small and large knowledge graphs (e.g., FB15k-237, Hetionet, ogbl-biokg, openbiolink) using several base models (DistMult, ComplEx, ConvE). The results demonstrate:

  • On large graphs: KGE-MoS consistently improves both Mean Reciprocal Rank (MRR) and negative log-likelihood (NLL), achieving higher MRR and lower NLL than base architectures and even than higher-dimensional baselines.
  • Parameter efficiency: MoS variants at low embedding dimension can match or surpass the accuracy of non-MoS baselines with much higher dimensions.
  • Number of experts: Increasing the number of mixture components $K$ directly improves expressivity and predictive performance.
  • Computational cost: Training step time increases moderately ($1.7{-}2.7\times$), but inference cost remains nearly unchanged.

| Model | Dataset | $d$ | NLL (↓) | MRR (↑) | Params |
|---|---|---|---|---|---|
| ComplEx | ogbl-biokg | 200 | 4.89 | 0.763 | 22.8M |
| ComplEx-MoS | ogbl-biokg | 200 | 4.70 | 0.780 | 23.2M |
| ComplEx | ogbl-biokg | 1000 | 4.74 | 0.799 | 195.8M |
| ComplEx-MoS | ogbl-biokg | 1000 | 4.42 | 0.824 | 203.8M |

This pattern holds for other architectures and datasets, with performance gains amplifying as dataset scale increases.

6. Significance for Knowledge Graph Modeling

KGE-MoS addresses a fundamental and previously underexplored constraint in KGE models—the inability to fully represent the combinatorial diversity of possible candidate labelings/rankings in large-scale graphs due to the low-rank output structure. By introducing a mixture of softmaxes, KGE-MoS marries the low parameter cost and scalability of vector-matrix scoring with the expressivity demanded by large KGs and multi-answer or probabilistic completion tasks.

A key implication is that KGE-MoS enables flexible, well-calibrated probability estimates crucial for applications involving reasoning, probabilistic inference, or sampling from the predicted answer distributions. The approach is broadly compatible with existing KGE encoders and can be deployed as a drop-in output layer for modern KGC systems, especially beneficial in scenarios where increasing embedding dimensions is computationally infeasible.
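
As a small illustration of the drop-in property, the sketch below shows the fixed-dimension query interface that any output head, including a mixture-of-softmaxes head like the one sketched in Section 3, can consume unchanged; the DistMult-style composition and all names here are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DistMultEncoder(nn.Module):
    """DistMult-style query composition: h_{s,r} = e_s * w_r (element-wise)."""

    def __init__(self, num_entities: int, num_relations: int, d: int):
        super().__init__()
        self.ent = nn.Embedding(num_entities, d)
        self.rel = nn.Embedding(num_relations, d)

    def forward(self, s: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
        return self.ent(s) * self.rel(r)  # (batch, d) query vectors h_{s,r}

# Any output head that accepts (batch, d) queries can be swapped in here,
# e.g. a mixture-of-softmaxes head in place of a plain softmax(h_sr @ E.T) scorer.
encoder = DistMultEncoder(num_entities=10_000, num_relations=50, d=200)
h_sr = encoder(torch.tensor([1, 2, 3]), torch.tensor([0, 4, 7]))  # shape (3, 200)
```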

7. Summary of Core Features

| Property | Standard Linear Output | KGE-MoS Output Layer |
|---|---|---|
| Output rank | At most $d$ | Nonlinear, up to $K \cdot d$ |
| Expressivity of output | Limited | Broad, increases with $K$ |
| Parameter overhead | Linear in $d \cdot \lvert\mathcal{E}\rvert$ | Small ($O(Kd^2)$), scalable |
| Ranking/probabilistic fit | Degrades in large KGs | Substantially improved in large KGs |
| Practical suitability | Poor scaling | Strong scaling to large and complex graphs |

KGE-MoS thus represents a robust, theoretically substantiated, and empirically validated output layer choice for knowledge graph completion, overcoming the rank bottleneck that has constrained the performance of standard KGE models on large-scale tasks (Badreddine et al., 27 Jun 2025).
