DESAlign: Dirichlet Energy-Driven Alignment
- Dirichlet Energy-Driven Semantic Alignment (DESAlign) is a framework for aligning entities in multi-modal knowledge graphs by minimizing Dirichlet energy to handle missing modalities.
- It employs explicit Euler propagation and energy constraints to prevent over-smoothing and ensure stable, semantically consistent feature propagation.
- Empirical results demonstrate DESAlign’s superiority over existing methods, showing significant improvements in Hits@1 and MRR under varied modality missingness conditions.
Dirichlet Energy-Driven Semantic Alignment (DESAlign) is a framework for robust multi-modal entity alignment in multi-modal knowledge graphs (MMKGs) that addresses the core challenge of semantic inconsistency due to missing modal attributes. By unifying the learning process and inference under a Dirichlet energy principle, DESAlign minimizes the distortion associated with missing modalities and effectively prevents over-smoothing and performance collapse, thereby advancing the state of the art in multi-modal entity alignment (Wang et al., 2024).
1. Motivation and Problem Context
Multi-Modal Entity Alignment (MMEA) in MMKGs seeks to identify semantically identical entities across knowledge graphs using structural, textual, visual, and other attributes. In practice, entities frequently lack one or more modalities (e.g., missing images or sparse text), which leads to semantic inconsistency: the representations of the same entity are misaligned or incomplete across knowledge graphs. Conventional solutions impute missing attributes with simple heuristics, such as the sample mean or Gaussian noise, thereby injecting "modality noise" that distorts semantics and triggers over-smoothing (embedding collapse) or unstable performance as missingness increases.
DESAlign proposes a unifying theoretical foundation: semantic smoothness is quantified as Dirichlet energy on the graph, and interpolation of missing modalities is cast as Dirichlet energy minimization. This yields provably optimal, semantically consistent feature propagation while constraining representation degeneration, and it replaces ad hoc imputation with principled energy-constrained learning and propagation.
2. Theoretical Foundations
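Let $\mathcal{G}$ denote an undirected multi-modal KG, where $A$ is the adjacency matrix, $\tilde{A}$ its normalized form, and $\Delta = I - \tilde{A}$ the Laplacian. An entity feature matrix $X$ (one embedding per row) defines the Dirichlet energy $\mathcal{L}(X) = \mathrm{tr}(X^\top \Delta X)$, where the normalization of $A$ uses the degree matrix $D$.
Partition the entities into those with semantically consistent features ($c$) and those whose features are inconsistent or missing ($o$). Interpolating $X_o$ by minimizing $\mathcal{L}(X)$ subject to fixed $X_c$ yields the Euler–Lagrange solution $X_o = -\Delta_{oo}^{-1} \Delta_{oc} X_c$. Direct inversion is cubic in cost. Instead, DESAlign uses explicit Euler propagation: $X^{(k+1)} = X^{(k)} - h \Delta X^{(k)}$. For $h = 1$, this reduces to $X^{(k+1)} = \tilde{A} X^{(k)}$. After each step, $X_c$ is reset to its original values to enforce the boundary condition. As $k \to \infty$, the process converges to the optimum.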
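The propagation scheme described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the graph is a toy three-node path, and the normalization with added self-loops is an assumption made so that the example is fully specified.

```python
import math

def normalized_adjacency(n, edges):
    """Dense symmetric-normalized adjacency for an undirected graph.
    Assumption: self-loops are added before normalization."""
    a = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    for i in range(n):
        a[i][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(m, x):
    """Multiply an n-by-n matrix with an n-by-d feature matrix."""
    return [[sum(m[i][k] * x[k][c] for k in range(len(x)))
             for c in range(len(x[0]))] for i in range(len(m))]

def dirichlet_energy(a_tilde, x):
    """tr(X^T (I - A_tilde) X), computed as sum_i <x_i, x_i - (A_tilde X)_i>."""
    ax = matmul(a_tilde, x)
    return sum(v * (v - av) for row, arow in zip(x, ax)
               for v, av in zip(row, arow))

def propagate(a_tilde, x, known, steps):
    """Explicit Euler step with unit step size (X <- A_tilde X),
    resetting the known rows after every step (boundary condition)."""
    original = [row[:] for row in x]
    cur = [row[:] for row in x]
    for _ in range(steps):
        cur = matmul(a_tilde, cur)
        for i in known:
            cur[i] = original[i][:]
    return cur

# Toy path graph 0 - 1 - 2; node 1 has a missing (zero-initialized) feature.
a_tilde = normalized_adjacency(3, [(0, 1), (1, 2)])
x_init = [[1.0], [0.0], [1.0]]
x_prop = propagate(a_tilde, x_init, known=[0, 2], steps=40)
```

On this toy graph the propagated value of node 1 matches the closed-form solution obtained by solving the linear system directly, and the Dirichlet energy of the propagated features is lower than that of the zero-filled input.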
3. Algorithmic Framework
3.1 Multi-Modal Semantic Learning
The encoder jointly integrates modality-specific and graph-structural embeddings:
- Structure: 2-layer GAT (2 heads), output dimension 300.
- Modalities: FC layers with input/output sizes—relations (BoW 1000→300), text (BoW 1000→300), vision (ResNet-152 2048→300).
- Cross-modal Attention Weighted (CAW) Transformer: Computes an attention weight and a confidence score for each modality of each entity, then forms early-fusion and late-fusion embeddings by concatenating attention-weighted features.
3.2 Dirichlet Energy Constraints
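Hidden representations $X^{(k)}$ pass through linear layers with weights $W^{(k)}$. Writing $\mathcal{L}(\cdot)$ for the Dirichlet energy, each layer satisfies $s_{\min}^{(k)} \mathcal{L}(X^{(k-1)}) \le \mathcal{L}(X^{(k)}) \le s_{\max}^{(k)} \mathcal{L}(X^{(k-1)})$, where $s_{\min}^{(k)}$ and $s_{\max}^{(k)}$ are the squared minimal and maximal singular values of $W^{(k)}$; an energy that collapses to zero corresponds to over-smoothing. DESAlign enforces the constraints $c_{\min} \mathcal{L}(X^{(k-1)}) \le \mathcal{L}(X^{(k)}) \le c_{\max} \mathcal{L}(X^{(0)})$, with $c_{\min}$ and $c_{\max}$ as hyperparameters, limiting collapse or over-separation.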
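As a rough illustration of how such a constraint could be monitored, the sketch below checks a sequence of per-layer Dirichlet energies against a lower and an upper bound. The exact form of the bounds is an assumption here (lower bound relative to the previous layer, upper bound relative to the input layer), and the function name and the numbers are hypothetical.

```python
def violated_layers(energies, c_min, c_max):
    """Return the indices k >= 1 whose energy falls outside
    [c_min * energies[k - 1], c_max * energies[0]].

    Assumption: this bound form is illustrative, not taken
    verbatim from the source."""
    input_energy = energies[0]
    bad = []
    for k in range(1, len(energies)):
        lower = c_min * energies[k - 1]
        upper = c_max * input_energy
        if not (lower <= energies[k] <= upper):
            bad.append(k)
    return bad

# Hypothetical energies for the input and three hidden layers.
layer_energies = [10.0, 6.0, 2.0, 0.5]
flagged = violated_layers(layer_energies, c_min=0.3, c_max=1.0)
```

In this example only the last layer is flagged, because its energy drops below the lower bound set by the preceding layer, which is the pattern associated with over-smoothing.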
3.3 Semantic Propagation
For inference, semantic propagation applies the boundary-conditioned explicit Euler scheme to missing modalities. Embeddings from the source and target graphs are propagated, distinguishing entities that have a given modality from entities that lack it. At each step:
- Propagate: multiply the embeddings by the normalized adjacency matrix.
- Reset: restore the embeddings of entities that have the modality to their original values.
After the final step, pairwise cosine similarities are averaged to produce the alignment scores.
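The scoring step can be sketched as follows. What exactly the average runs over is not recoverable from the text, so this example assumes a list of (source, target) embedding pairs, for instance one pair per modality or per propagation step; the function names and the toy vectors are hypothetical.

```python
import math

def cosine_matrix(src, tgt):
    """Pairwise cosine similarity between rows of src and rows of tgt."""
    def unit(v):
        norm = math.sqrt(sum(c * c for c in v)) or 1.0  # guard zero vectors
        return [c / norm for c in v]
    s = [unit(v) for v in src]
    t = [unit(v) for v in tgt]
    return [[sum(a * b for a, b in zip(u, w)) for w in t] for u in s]

def average_similarity(pairs):
    """Average the cosine-similarity matrices of several embedding pairs.
    Assumption: all pairs describe the same entities in the same order."""
    mats = [cosine_matrix(s, t) for s, t in pairs]
    rows, cols = len(mats[0]), len(mats[0][0])
    return [[sum(m[i][j] for m in mats) / len(mats) for j in range(cols)]
            for i in range(rows)]

# Two toy views of two source and two target entities.
view_a = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
view_b = ([[1.0, 1.0], [0.0, 2.0]], [[2.0, 2.0], [0.0, 1.0]])
scores = average_similarity([view_a, view_b])

# Predicted alignment: the best-scoring target for each source entity.
predicted = [max(range(len(row)), key=row.__getitem__) for row in scores]
```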
3.4 Loss Functions and Optimization
- Task Losses: Cross-entropy (contrastive) losses on the early-fusion and late-fusion embeddings.
- Intra-modal Losses: One contrastive loss per modality.
- Confidence Weighting: For each entity pair, a confidence weight lowers the impact of noisy or uncertain modalities.
- Optimization: AdamW optimizer, learning-rate warmup (15%), batch size 3500, early stopping, 1000 total epochs (split between normal and iterative training).
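To make the interplay of the contrastive objective and the confidence weighting concrete, here is a minimal sketch of a confidence-weighted, cross-entropy-style contrastive loss over a similarity matrix. It is an illustration under assumptions, not the paper's exact loss: the temperature value, the function name, and the choice to average over pairs are all hypothetical.

```python
import math

def weighted_contrastive_loss(sim, weights, temperature=0.1):
    """Cross-entropy over each row of sim, where the diagonal entry is the
    positive pair, scaled by a per-pair confidence weight.

    Assumptions: sim[i][i] is the aligned pair for row i; temperature and
    averaging over rows are illustrative choices."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        peak = max(logits)  # subtract the max for numerical stability
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += weights[i] * (log_z - logits[i])
    return total / len(sim)

sim_good = [[1.0, 0.0], [0.0, 1.0]]  # aligned pairs score highest
sim_bad = [[0.0, 1.0], [1.0, 0.0]]   # aligned pairs score lowest

loss_good = weighted_contrastive_loss(sim_good, [1.0, 1.0])
loss_bad = weighted_contrastive_loss(sim_bad, [1.0, 1.0])
loss_muted = weighted_contrastive_loss(sim_bad, [0.0, 0.0])
```

A low confidence weight shrinks a pair's contribution to the loss, so a pair judged unreliable has little effect on the gradient even when its similarity is poor.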
4. Empirical Evaluation
4.1 Datasets and Experimental Settings
- Monolingual: FB15K–DB15K and FB15K–YAGO15K, evaluated at several seed-alignment ratios.
- Bilingual: DBP15K FR–EN, JA–EN, ZH–EN.
- Simulated Missing Modalities: Text and image ratios varied from 5% to 60%.
4.2 Metrics
- Hits@k
- Mean reciprocal rank (MRR)
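Both metrics are computed from the rank of the correct target among all candidates. The sketch below shows one common way to compute them from a similarity matrix; the function names and the toy scores are hypothetical, and ties are resolved in favor of the correct target, which is an assumption.

```python
def rank_of_gold(sim_row, gold):
    """1-based rank of the gold target; only strictly higher scores count,
    so ties are resolved in favor of the gold target."""
    return 1 + sum(1 for s in sim_row if s > sim_row[gold])

def hits_at_k(sim, gold, k):
    """Fraction of source entities whose gold target ranks in the top k."""
    return sum(rank_of_gold(row, g) <= k for row, g in zip(sim, gold)) / len(sim)

def mean_reciprocal_rank(sim, gold):
    """Average of 1 / rank of the gold target over all source entities."""
    return sum(1.0 / rank_of_gold(row, g) for row, g in zip(sim, gold)) / len(sim)

# Toy similarity matrix: rows are source entities, columns are targets.
sim = [
    [0.9, 0.1, 0.0],
    [0.8, 0.5, 0.1],
    [0.2, 0.3, 0.7],
]
gold = [0, 1, 2]  # index of the correct target for each source entity

h1 = hits_at_k(sim, gold, 1)
h2 = hits_at_k(sim, gold, 2)
mrr = mean_reciprocal_rank(sim, gold)
```

Here the second source entity ranks its correct target second, so Hits@1 is 2/3, Hits@2 is 1, and MRR is (1 + 1/2 + 1)/3.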
4.3 Results and Comparative Analysis
DESAlign outperforms 18 non-iterative and several iterative baselines:
- On DBP15K_FR-EN (non-iterative): DESAlign Hits@1 = 82.6%, MEAformer = 77.0%.
- Across all splits, DESAlign improves Hits@1 by 4–12 points and MRR by 2–8 points over non-iterative baselines, and by 1–4 points (Hits@1) and 1–3 points (MRR) over iterative baselines.
- Under weak supervision, DESAlign consistently exceeds the baselines in Hits@1 on DBP15K_FR-EN.
4.4 Robustness and Ablation
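- As the modality ratio varies from 5% to 60%, the baselines’ MRR declines while DESAlign’s remains stable.
- In the accompanying Hits@1 comparison, baselines reach 75–79% while DESAlign reaches 80–88% and stays stable even at 95% missing.
- Removing the text modality causes the largest drop (–7 points Hits@1).
- Removing either of two further ablated components reduces Hits@1 by 3–5 points.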
- Skipping Semantic Propagation results in performance loss rivaling the absence of an entire modality.
4.5 Efficiency
Semantic Propagation takes 7 seconds on DBP15K and 9 seconds on FB-DB. The computation involves only sparse matrix multiplications, which makes it suitable for CPU pre-processing. Encoder resource use is comparable to MEAformer.
5. Over-Smoothing, Noise, and Model Stability
DESAlign’s Dirichlet energy constraints across GNN layers mitigate eigendirection collapse, preventing over-smoothing even in scenarios with extreme modality missingness. Cross-modal attention and confidence-based weighting further insulate the model from noisy alignments. Because Semantic Propagation depends solely on graph structure and high-confidence observed modalities, it introduces no additional trainable parameters and does not cause over-fitting.
6. Limitations, Extensions, and Future Work
While robust defaults for the hyperparameters (the energy-constraint bounds and the number of propagation steps) are effective, auto-tuning could improve flexibility. The explicit Euler propagation may require numerous iterations for large graphs; possible acceleration techniques include Chebyshev polynomials or conjugate-gradient solvers. Potential extensions include adaptation to streaming or dynamic graphs with time-varying Laplacians and incorporation of new modalities (e.g., audio, video) or advanced pretrained encoders like CLIP within the Dirichlet energy optimization framework (Wang et al., 2024).
Overall, DESAlign provides a principled methodology unifying Dirichlet energy-constrained learning and explicit, theoretically grounded propagation for entity alignment in MMKGs, yielding robust, consistent performance under real-world, modality-heterogeneous settings.