DESAlign: Dirichlet Energy-Driven Alignment

Updated 2 June 2026
  • Dirichlet Energy-Driven Semantic Alignment (DESAlign) is a framework for aligning entities in multi-modal knowledge graphs by minimizing Dirichlet energy to handle missing modalities.
  • It employs explicit Euler propagation and energy constraints to prevent over-smoothing and ensure stable, semantically consistent feature propagation.
  • Empirical results demonstrate DESAlign’s superiority over existing methods, showing significant improvements in Hits@1 and MRR under varied modality missingness conditions.

Dirichlet Energy-Driven Semantic Alignment (DESAlign) is a framework for robust multi-modal entity alignment in multi-modal knowledge graphs (MMKGs) that addresses the core challenge of semantic inconsistency due to missing modal attributes. By unifying the learning process and inference under a Dirichlet energy principle, DESAlign minimizes the distortion associated with missing modalities and effectively prevents over-smoothing and performance collapse, thereby advancing the state of the art in multi-modal entity alignment (Wang et al., 2024).

1. Motivation and Problem Context

Multi-Modal Entity Alignment (MMEA) in MMKGs seeks to identify semantically identical entities across knowledge graphs using structural, textual, visual, and other attribute information. In practice, entities frequently lack one or more modalities (e.g., missing images or sparse text), leading to semantic inconsistency: the data representations are misaligned or incomplete between knowledge graphs. Conventional solutions interpolate or impute missing attributes using simple heuristics, such as the sample mean or Gaussian noise, thereby injecting "modality noise" that distorts semantics and triggers over-smoothing (embedding collapse) or unstable performance as missingness increases.

DESAlign proposes a unifying theoretical foundation: semantic smoothness is quantified as Dirichlet energy on the graph, and interpolation of missing modalities should correspond to Dirichlet energy minimization—yielding provably optimal, semantically consistent feature propagation while constraining representation degeneration. This approach replaces ad hoc imputation with principled energy-constrained learning and propagation.

2. Theoretical Foundations

Let $G=(E,R,A,V)$ denote an undirected multi-modal KG where $A$ is the adjacency matrix and $\tilde{A}$ its normalized form; $\Delta = I - \tilde{A}$ is the Laplacian. An entity feature matrix $X\in\mathbb{R}^{N\times d}$ (the embedding $f:E\to\mathbb{R}^d$ applied row-wise) defines the Dirichlet energy: $\mathcal{L}(X) = \operatorname{trace}(X^\top \Delta X) = \frac{1}{2} \sum_{i,j} A_{ij} \|X_i/\sqrt{D_{ii}+1} - X_j/\sqrt{D_{jj}+1}\|^2$, where $D$ is the degree matrix.
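The two forms of the energy can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes that the normalized adjacency is the self-loop form $(D+I)^{-1/2}(A+I)(D+I)^{-1/2}$, which is inferred here from the $D_{ii}+1$ terms in the pairwise formula, and all function names are illustrative.

```python
import numpy as np

def normalized_adjacency(A: np.ndarray) -> np.ndarray:
    """(D+I)^{-1/2} (A+I) (D+I)^{-1/2}; the self-loop normalization is an assumption."""
    A_hat = A + np.eye(A.shape[0])
    inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * inv_sqrt[:, None] * inv_sqrt[None, :]

def dirichlet_energy(A: np.ndarray, X: np.ndarray) -> float:
    """Trace form: trace(X^T (I - A_tilde) X)."""
    laplacian = np.eye(A.shape[0]) - normalized_adjacency(A)
    return float(np.trace(X.T @ laplacian @ X))

def dirichlet_energy_pairwise(A: np.ndarray, X: np.ndarray) -> float:
    """Pairwise form: 1/2 * sum_ij A_ij ||X_i/sqrt(D_ii+1) - X_j/sqrt(D_jj+1)||^2."""
    Z = X / np.sqrt(A.sum(axis=1) + 1.0)[:, None]
    sq_dist = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return 0.5 * float((A * sq_dist).sum())

# Toy 4-node path graph with random 3-dimensional features.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
X = np.random.default_rng(0).normal(size=(4, 3))
print(dirichlet_energy(A, X), dirichlet_energy_pairwise(A, X))
```

Under this normalization the two values agree up to floating-point error, and features proportional to $\sqrt{D_{ii}+1}$ have zero energy, which is the fully smoothed state that over-smoothing drifts toward.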

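Given a partition of the entities into consistent ($c$), inconsistent-type-1 ($o_1$), and missing groups, interpolating the missing features by minimizing $\mathcal{L}(X)$, with the known features held fixed, yields the Euler–Lagrange solution, a linear system in the corresponding block of $\Delta$. Solving it by direct inversion is cubic in cost. Instead, DESAlign uses explicit Euler propagation on the gradient flow of the energy; with unit step size, each update reduces to multiplication by $\tilde{A}$. After each step, the known features are reset to their original values to enforce the boundary condition. As the number of steps grows, the process converges to the optimum.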
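A sketch of the derivation behind these statements follows. The notation is generic and is an assumption here, so it may differ from the paper's own symbols: $k$ indexes rows whose features are held fixed, $u$ indexes rows to be interpolated, $h$ is the step size, and $\Delta_{uu}$ is assumed invertible.

```latex
% Block partition of the Laplacian and the features (k = fixed rows, u = interpolated rows)
\Delta = \begin{pmatrix} \Delta_{kk} & \Delta_{ku} \\ \Delta_{uk} & \Delta_{uu} \end{pmatrix},
\qquad
X = \begin{pmatrix} X_k \\ X_u \end{pmatrix}

% Stationarity of L(X) = trace(X^T \Delta X) with respect to X_u
\nabla_{X_u} \mathcal{L}(X) = 2\,(\Delta_{uk} X_k + \Delta_{uu} X_u) = 0
\;\Longrightarrow\;
X_u = -\Delta_{uu}^{-1} \Delta_{uk} X_k

% Explicit Euler step of size h on the gradient flow \dot{X} = -\Delta X
X^{(t+1)} = X^{(t)} - h\,\Delta X^{(t)} = (I - h\Delta)\,X^{(t)}

% Unit step, using \Delta = I - \tilde{A}, followed by the boundary reset
X^{(t+1)} = \tilde{A}\,X^{(t)}, \qquad X_k^{(t+1)} \leftarrow X_k
```

The closed form needs the inverse of an $N_u \times N_u$ block, which is where the cubic cost comes from; the iterative form needs only repeated multiplication by $\tilde{A}$.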

3. Algorithmic Framework

3.1 Multi-Modal Semantic Learning

The encoder jointly integrates modality-specific and graph-structural embeddings:

  • Structure: 2-layer GAT (2 heads), output dimension 300.
  • Modalities: FC layers with input/output sizes—relations (BoW 1000→300), text (BoW 1000→300), vision (ResNet-152 2048→300).
  • Cross-modal Attention Weighted (CAW) Transformer: Computes an attention weight and a confidence score for each modality of each entity, forming early-fusion and late-fusion embeddings by concatenation with attention-weighted features.
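The fusion step can be pictured with a small sketch. This is an illustration under loud assumptions, not the CAW Transformer itself: the per-modality scores are a hypothetical stand-in for the transformer's attention outputs, and `fuse` is an invented helper name. It shows only the final operation named above, weighting each modality embedding and concatenating the results.

```python
import numpy as np

def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fuse(modal_embs: dict, scores: np.ndarray) -> np.ndarray:
    """modal_embs: modality name -> (N, d) embeddings, in the same order as score columns.
    scores: (N, M) hypothetical raw per-entity modality scores."""
    weights = softmax(scores, axis=1)          # (N, M); each row sums to 1
    parts = [weights[:, [m]] * emb for m, emb in enumerate(modal_embs.values())]
    return np.concatenate(parts, axis=1)       # (N, M * d)

rng = np.random.default_rng(0)
names = ["structure", "relation", "text", "vision"]
embs = {name: rng.normal(size=(5, 300)) for name in names}   # 300-dim per modality
scores = rng.normal(size=(5, len(names)))
fused = fuse(embs, scores)
print(fused.shape)
```

With four 300-dimensional modality embeddings, the fused vector has 1200 dimensions per entity; a modality with a low weight contributes a correspondingly small block.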

3.2 Dirichlet Energy Constraints

Hidden representations pass through linear layers. For each layer, the Dirichlet energy of the output is bounded below and above by the energy of the input scaled by the squared minimal and maximal singular values of the layer's weight matrix; if these factors collapse to zero, the energy vanishes and over-smoothing follows. DESAlign enforces lower and upper energy constraints, controlled by hyperparameters, that limit collapse or over-separation.
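The singular-value bound can be verified numerically for a single linear map. The sketch below is illustrative rather than the paper's training code: it assumes the self-loop normalized Laplacian, uses a square random weight matrix so that the smallest singular value is nonzero, and checks that the energy of $XW$ lies between $\sigma_{\min}^2$ and $\sigma_{\max}^2$ times the energy of $X$.

```python
import numpy as np

def dirichlet_energy(A: np.ndarray, X: np.ndarray) -> float:
    """trace(X^T (I - A_tilde) X) with self-loop normalization (an assumption)."""
    A_hat = A + np.eye(A.shape[0])
    inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_tilde = A_hat * inv_sqrt[:, None] * inv_sqrt[None, :]
    return float(np.trace(X.T @ (np.eye(A.shape[0]) - A_tilde) @ X))

# Ring graph on 6 nodes.
n = 6
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0

rng = np.random.default_rng(0)
X = rng.normal(size=(n, 4))
W = rng.normal(size=(4, 4))                    # one linear layer without bias

sing = np.linalg.svd(W, compute_uv=False)      # singular values of W
s_min, s_max = sing.min() ** 2, sing.max() ** 2
e_in, e_out = dirichlet_energy(A, X), dirichlet_energy(A, X @ W)
print(s_min * e_in, e_out, s_max * e_in)       # e_out lies between the two bounds
```

The bound holds because the output energy equals $\operatorname{trace}(X^\top \Delta X\, W W^\top)$ and the eigenvalues of $W W^\top$ are the squared singular values of $W$. A weight matrix whose singular values shrink layer after layer therefore drives the energy toward zero.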

3.3 Semantic Propagation

For inference, semantic propagation applies the boundary-conditioned explicit Euler scheme to missing modalities. Embeddings from the source and target graphs are propagated for each modality, with one index set for entities that have the modality and another for entities missing it. At each step:

  1. Propagate: multiply the current features by the normalized adjacency $\tilde{A}$.
  2. Reset: restore the rows of entities that have the modality to their original values.

After a fixed number of steps, pairwise cosine similarities over the propagated embeddings are averaged to produce the alignment scores.
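The two-step loop above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released implementation: it uses unit step size, the self-loop normalized adjacency, zero initialization for missing rows, and invented function names (`propagate`, `cosine_similarity`).

```python
import numpy as np

def normalized_adjacency(A: np.ndarray) -> np.ndarray:
    A_hat = A + np.eye(A.shape[0])             # self-loops (an assumption)
    inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * inv_sqrt[:, None] * inv_sqrt[None, :]

def propagate(A: np.ndarray, X: np.ndarray, known: np.ndarray, steps: int = 200) -> np.ndarray:
    """Boundary-conditioned explicit Euler propagation with unit step size.
    known: boolean mask of rows whose features are observed and held fixed."""
    A_tilde = normalized_adjacency(A)
    X0 = np.where(known[:, None], X, 0.0)      # missing rows start at zero
    out = X0.copy()
    for _ in range(steps):
        out = A_tilde @ out                    # step 1: diffuse along edges
        out[known] = X0[known]                 # step 2: reset observed rows
    return out

def cosine_similarity(P: np.ndarray, Q: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    P = P / (np.linalg.norm(P, axis=1, keepdims=True) + eps)
    Q = Q / (np.linalg.norm(Q, axis=1, keepdims=True) + eps)
    return P @ Q.T

# 5-node path graph: features observed at both ends, missing in the middle.
A = np.diag(np.ones(4), 1) + np.diag(np.ones(4), -1)
known = np.array([True, False, False, False, True])
X = np.zeros((5, 2))
X[0], X[4] = [1.0, 0.0], [0.0, 1.0]
X_filled = propagate(A, X, known)
sim = cosine_similarity(X_filled, X_filled)    # in practice: source rows vs. target rows
print(np.round(X_filled, 3))
```

On this toy graph the filled-in rows blend the two observed endpoints according to graph distance, and the iterate matches the closed-form minimizer of the energy to numerical precision. Averaging such similarity matrices across modalities would give a single score matrix.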

3.4 Loss Functions and Optimization

  • Task Losses: Contrastive cross-entropy losses on the early-fusion and late-fusion embeddings.
  • Intra-modal Losses: Contrastive losses computed separately for each modality.
  • Confidence Weighting: For each entity pair, a confidence weight lowers the impact of noisy or uncertain modalities.
  • Optimization: AdamW with learning-rate warmup (15%), batch size 3500, early stopping, and 1000 total epochs (split between normal and iterative training).
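To make the loss terms concrete, here is a hedged sketch of a confidence-weighted contrastive cross-entropy loss with in-batch negatives. It is a generic InfoNCE-style formulation, not necessarily the exact loss used by DESAlign: the temperature `tau`, the one-directional form, and the way confidences enter as per-pair weights are all assumptions.

```python
import numpy as np

def weighted_info_nce(src: np.ndarray, tgt: np.ndarray, conf: np.ndarray, tau: float = 0.1) -> float:
    """src, tgt: (B, d) embeddings of B aligned seed pairs (row i matches row i).
    conf: (B,) hypothetical per-pair confidence weights in [0, 1]."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    logits = src @ tgt.T / tau                              # (B, B) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)     # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    per_pair = -np.diag(log_prob)                           # cross-entropy, target = own index
    return float((conf * per_pair).sum() / conf.sum())      # confidence-weighted mean

rng = np.random.default_rng(0)
src = rng.normal(size=(8, 16))
tgt_aligned = src + 0.01 * rng.normal(size=(8, 16))         # near-perfect matches
tgt_shuffled = np.roll(tgt_aligned, 1, axis=0)              # every pair mismatched
conf = np.ones(8)
loss_aligned = weighted_info_nce(src, tgt_aligned, conf)
loss_shuffled = weighted_info_nce(src, tgt_shuffled, conf)
print(loss_aligned, loss_shuffled)
```

Lowering a pair's confidence shrinks its share of the weighted mean, which is how an uncertain modality can be kept from dominating the gradient.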

4. Empirical Evaluation

4.1 Datasets and Experimental Settings

  • Monolingual: FB15K–DB15K and FB15K–YAGO15K, evaluated at several seed-alignment ratios.
  • Bilingual: DBP15K FR–EN, JA–EN, and ZH–EN, each with its own seed-alignment ratio.
  • Simulated Missing Modalities: Text and image missing ratios varied from 5% to 60%.

4.2 Metrics
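
The results below are reported in Hits@1 and MRR (mean reciprocal rank). As a minimal sketch of how these ranking metrics are conventionally computed from a source-by-target similarity matrix, assuming that the true match of source entity i is target entity i (the function name is illustrative):

```python
import numpy as np

def hits_at_k_and_mrr(sim: np.ndarray, k: int = 1):
    """sim[i, j]: similarity of source entity i to target entity j.
    The true match of source i is assumed to be target i."""
    true_scores = np.diag(sim)[:, None]
    ranks = 1 + (sim > true_scores).sum(axis=1)   # 1-based rank of the true match
    return float((ranks <= k).mean()), float((1.0 / ranks).mean())

sim = np.array([[0.9, 0.1, 0.0],
                [0.8, 0.5, 0.1],
                [0.2, 0.3, 0.7]])
hits1, mrr = hits_at_k_and_mrr(sim, k=1)
print(hits1, mrr)
```

In this toy matrix the first and third entities rank their true match first and the second ranks it second, so Hits@1 is 2/3 and MRR is (1 + 1/2 + 1)/3 = 5/6.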

4.3 Results and Comparative Analysis

DESAlign outperforms 18 non-iterative and several iterative baselines:

  • On DBP15K_FR-EN (non-iterative): DESAlign Hits@1 = 82.6%, MEAformer = 77.0%.
  • Across all splits, DESAlign improves Hits@1 by 4–12 points and MRR by 2–8 points over non-iterative baselines, and by 1–4 points (Hits@1) and 1–3 points (MRR) over iterative baselines.
  • Under weak supervision (low seed-alignment ratios), DESAlign consistently exceeds the baselines in Hits@1 on DBP15K_FR-EN.

4.4 Robustness and Ablation

  • As the missing ratio rises from 5% to 60%, the baselines' MRR declines while DESAlign's remains stable.
  • In a further missing-modality setting, baselines reach 75–79% Hits@1 and DESAlign 80–88%, remaining stable even at 95% missing.
  • Removing the text modality causes the largest drop (–7 Hits@1).
  • Eliminating either of two further ablated components reduces Hits@1 by 3–5 points.
  • Skipping Semantic Propagation results in performance loss rivaling the absence of an entire modality.

4.5 Efficiency

Semantic Propagation requires 7 seconds on DBP15K and 9 seconds on FB-DB. The computation involves only sparse matrix multiplications, one per iteration, and is suitable for CPU pre-processing. Encoder resource use is comparable to MEAformer.

5. Over-Smoothing, Noise, and Model Stability

DESAlign’s Dirichlet energy constraints across GNN layers mitigate eigendirection collapse, preventing over-smoothing even in scenarios with extreme modality missingness. Cross-modal attention and confidence-based weighting further insulate the model from noisy alignments. Because Semantic Propagation depends solely on graph structure and high-confidence observed modalities, it introduces no additional trainable parameters and does not cause over-fitting.

6. Limitations, Extensions, and Future Work

While robust defaults for the hyperparameters (the energy-constraint bounds and the number of propagation steps) are effective, auto-tuning could improve flexibility. The explicit Euler propagation may require numerous iterations for large graphs; possible acceleration techniques include Chebyshev polynomials or conjugate-gradient solvers. Potential extensions include adaptation to streaming or dynamic graphs with time-varying Laplacians and incorporation of new modalities (e.g., audio, video) or advanced pretrained encoders like CLIP within the Dirichlet energy optimization framework (Wang et al., 2024).

Overall, DESAlign provides a principled methodology that unifies Dirichlet energy-constrained learning with explicit, theoretically grounded propagation for entity alignment in MMKGs, yielding robust, consistent performance in real-world settings where modality coverage is uneven.
