DKGH-MoE: Domain-Knowledge Guided Hybrid MoE
- DKGH-MoE is a hybrid architecture that fuses neural experts with explicit domain priors, enhancing performance in structured tasks.
- It employs parallel expert modules and sparse routing to balance data-driven signals with expert inputs such as clinical gaze maps and analytic models.
- DKGH-MoE improves predictive accuracy and interpretability in applications like medical imaging and hybrid dynamical systems, enabling efficient expert specialization.
Domain-Knowledge-Guided Hybrid Mixture-of-Experts (DKGH-MoE) is an architectural paradigm in neural modeling that systematically fuses data-driven representation learning with explicit domain priors. By combining the statistical discovery strengths of neural experts with the inductive constraints provided by expert knowledge (such as clinical gaze maps, physical analytic models, or established heuristics), DKGH-MoE addresses the limitations of conventional Mixture-of-Experts (MoE) approaches in domains characterized by either limited data or structured, expert-annotated information. This synthesis yields enhanced predictive robustness, interpretability, and efficiency across tasks including medical imaging and hybrid dynamical systems (Gu et al., 25 Jan 2026, Ahn et al., 2020).
1. Core Architectural Principles
At the abstract level, DKGH-MoE is built from parallel expert ensembles instantiated as neural networks and domain-guided modules. In medical image classification, for instance, the architecture operates on a feature map sourced from a pretrained backbone. The input is routed to:
- Data-Driven MoE (DD-MoE): experts trained solely on latent features.
- Domain-Expert MoE (DE-MoE): experts conditioned on features extracted from structured priors such as clinician-generated heatmaps obtained via gaze tracking.
Sparse router networks, realized as Top-K index selectors and sparse-softmax normalizers, compute expert activations $w^{DD}$ (DD-MoE) and $w^{DE}$ (DE-MoE). A fusion module, parameterized by a learned gate $g \in [0,1]$, combines DD and DE outputs according to

$$y = g\, y_{DE} + (1 - g)\, y_{DD},$$

where $y_{DD}$ and $y_{DE}$ are the respective branch outputs. This modular construct replaces standard backbone blocks and is both plug-and-play and interpretable (Gu et al., 25 Jan 2026).
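The routing-and-fusion flow above can be sketched in a few lines. The following is a minimal NumPy illustration; the linear experts, random router weights, and sigmoid gate parameterization are stand-ins for illustration, not the paper's implementation:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sparse_moe(x, experts, router_w, k=1):
    """Route input x to its Top-K experts and mix their outputs.

    experts: list of callables (one per expert module);
    router_w: router weights mapping features to expert logits.
    """
    logits = router_w @ x
    topk = np.argsort(logits)[-k:]          # Top-K index selection
    gates = softmax(logits[topk])           # sparse softmax over kept experts
    return sum(g * experts[i](x) for g, i in zip(gates, topk))

rng = np.random.default_rng(0)
d = 8
# Hypothetical linear experts standing in for neural expert modules.
dd_experts = [lambda x, W=rng.normal(size=(d, d)): W @ x for _ in range(4)]
de_experts = [lambda x, W=rng.normal(size=(d, d)): W @ x for _ in range(4)]

x = rng.normal(size=d)   # latent feature from the pretrained backbone
p = rng.normal(size=d)   # embedded clinical prior (e.g., gaze heatmap)

y_dd = sparse_moe(x, dd_experts, rng.normal(size=(4, d)), k=2)
y_de = sparse_moe(p, de_experts, rng.normal(size=(4, d)), k=2)

# Learned fusion gate blends the two branches.
g = 1.0 / (1.0 + np.exp(-(rng.normal(size=2 * d) @ np.concatenate([x, p]))))
y = g * y_de + (1.0 - g) * y_dd
```

Because only K experts run per input, the forward cost of each branch stays roughly constant as the expert pool grows.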
An analogous topology emerges in hybrid dynamical system modeling, where a two-level mixture scheme encompasses:
- Top (competitive) layer: $K$ mode experts gated by $\pi_k(x)$ (a softmax over contact/operational regimes).
- Bottom (cooperative) layer: within each mode $k$, two sub-experts $f^{NN}_k$ (black-box neural) and $f^{phys}_k$ (white-box analytic), blended via a gate $\alpha_k(x) \in [0,1]$ (Ahn et al., 2020).
2. Mathematical Formulation
Let $x$ denote the primary input. In the medical context:
- Data-Driven Experts: $E^{DD}_i(x)$, $i = 1, \dots, N$, operating on latent backbone features.
- Domain-Expert Experts: $E^{DE}_j(x, p)$, $j = 1, \dots, M$, conditioned on prior features $p$.

Routers utilize pooled features $\bar{x}$ and clinical priors $p$, typically the output of a small CNN applied to gaze heatmaps.
- Data-driven routing produces logits $s^{DD} = W^{DD}_r \bar{x}$. Top-K indices $\mathcal{T} = \mathrm{TopK}(s^{DD})$ are selected, and activations are normalized over the retained experts:

$$w^{DD}_i = \frac{\exp(s^{DD}_i)}{\sum_{i' \in \mathcal{T}} \exp(s^{DD}_{i'})}, \quad i \in \mathcal{T}.$$

The domain-expert branch is routed analogously from $p$.
- Outputs are aggregated per branch: $y_{DD} = \sum_{i \in \mathcal{T}} w^{DD}_i E^{DD}_i(x)$, and likewise $y_{DE}$.
- The fusion gate is $g = \sigma(W_g [\bar{x}; p])$, and the final prediction combines branches as $y = g\, y_{DE} + (1 - g)\, y_{DD}$.

In physics-informed dynamical systems, the nested MoE formula is:

$$f(x) = \sum_{k=1}^{K} \pi_k(x)\left[\alpha_k(x)\, f^{phys}_k(x) + (1 - \alpha_k(x))\, f^{NN}_k(x)\right],$$

with $\sum_{k} \pi_k(x) = 1$ and $\alpha_k(x) \in [0, 1]$.
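As a toy instantiation of this nested mixture, the sketch below models a 1-D hybrid system with two contact regimes. All gates and sub-experts are hand-set illustrative functions rather than learned models:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def nested_moe(x, modes):
    """Two-level mixture: competitive softmax over discrete modes,
    cooperative blend of white-box (physics) and black-box (neural)
    sub-experts within each mode."""
    logits = np.array([m["gate_logit"](x) for m in modes])
    pi = softmax(logits)                                   # sum_k pi_k = 1
    y = 0.0
    for pi_k, m in zip(pi, modes):
        a = 1.0 / (1.0 + np.exp(-m["alpha_logit"](x)))     # alpha_k in [0, 1]
        y += pi_k * (a * m["phys"](x) + (1.0 - a) * m["nn"](x))
    return y

# Toy example: free flight vs. ground contact, state x = height.
modes = [
    {"gate_logit": lambda x: -5.0 * x,      # active when x < 0 (free flight)
     "alpha_logit": lambda x: 2.0,          # trust the physics prior here
     "phys": lambda x: -9.81,               # analytic gravity model
     "nn": lambda x: -9.5},                 # learned residual correction
    {"gate_logit": lambda x: 5.0 * x,       # active when x > 0 (contact)
     "alpha_logit": lambda x: -2.0,         # lean on the neural sub-expert
     "phys": lambda x: 100.0 * (0.0 - x),   # stiff contact spring
     "nn": lambda x: 95.0 * (0.0 - x) - 3.0},
]

y_free = nested_moe(-1.0, modes)     # dominated by the gravity expert
y_contact = nested_moe(0.5, modes)   # dominated by the contact-mode blend
```

The top-level gate carves the state space into regimes, while the bottom-level gate decides, per regime, how far the prediction should drift from the analytic model.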
3. Injection and Utilization of Domain Knowledge
DKGH-MoE operationalizes domain priors in two paradigms:
- Medical Imaging: DE-MoEs utilize clinical gaze maps, encoding expert attention via a convolutional embedding of the heatmap; routers then bin spatial regions by gaze intensity, enforcing specialization on clinically salient areas. The fusion gate adaptively decides the contribution of expert insight versus raw features, attending more to priors when gaze is sharp and reliable (Gu et al., 25 Jan 2026).
- Hybrid Dynamical Systems: Top-level mode gating homes in on known discrete operating regimes (e.g., contact modes). Within each, analytic white-box models implement first-principles dynamics (Euler–Lagrange, contact Jacobians) with trainable parameters initialized from physical heuristics, while black-box MLP sub-experts absorb residual effects. Bottom-level gating adjusts reliance on physical priors versus learned correction as dictated by context (Ahn et al., 2020).
Domain knowledge is thus not simply regularizing; it explicitly shapes expert specialization, routing, and fusion.
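The gaze-driven specialization described above can be illustrated with a simple binning scheme. The four-bin design mirrors the attention categories mentioned later in the text, but the thresholds and implementation below are assumptions for illustration only:

```python
import numpy as np

def bin_regions_by_gaze(heatmap, thresholds=(0.1, 0.4, 0.7)):
    """Assign each spatial location to one of four domain experts by gaze
    intensity: minimal, diffuse, moderate, or high attention.
    (Thresholds are illustrative, not taken from the paper.)"""
    return np.digitize(heatmap, thresholds)   # expert id 0..3 per location

rng = np.random.default_rng(1)
gaze = rng.random((8, 8))                     # normalized gaze heatmap
assignment = bin_regions_by_gaze(gaze)

# Each DE expert processes only its assigned regions, enforcing
# specialization on clinically salient areas.
masks = [(assignment == e) for e in range(4)]
```

Routing by prior intensity rather than by learned logits alone is what ties each DE expert to a human-interpretable attention regime.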
4. Training Objectives and Regularization
DKGH-MoE models employ compound objectives:
- Classification loss ($\mathcal{L}_{task}$): Cross-entropy (in medical imaging) or mean squared error/residual (in dynamical models).
- Routing load-balance loss ($\mathcal{L}_{LB}$): Enforces even expert utilization, discouraging collapse. For the medical blocks:

$$\mathcal{L}_{LB} = N \sum_{i=1}^{N} \bar{w}_i \, f_i,$$

for average routing weight $\bar{w}_i$ and input fraction $f_i$ per expert; the DE-MoE branch is penalized analogously.
- Regularizers: weight penalties, optional entropy/diversity bonuses to ensure gating discriminates modes and avoids collapse.
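One plausible reading of the load-balance term, sketched in NumPy (Top-1 assignment is assumed for the input-fraction count; the random logits are illustrative):

```python
import numpy as np

def load_balance_loss(router_probs, expert_assignment, n_experts):
    """Auxiliary loss of the form N * sum_i (avg router weight for expert i)
    * (fraction of inputs routed to expert i). It is minimized when routing
    is uniform across experts, discouraging expert collapse."""
    avg_w = np.array([router_probs[:, i].mean() for i in range(n_experts)])
    frac = np.array([(expert_assignment == i).mean() for i in range(n_experts)])
    return n_experts * float(avg_w @ frac)

rng = np.random.default_rng(2)
logits = rng.normal(size=(16, 4))                         # batch of router logits
probs = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
assign = probs.argmax(1)                                  # Top-1 routing
loss = load_balance_loss(probs, assign, n_experts=4)
```

Under perfectly uniform routing, $\bar{w}_i = f_i = 1/N$ for every expert, so the loss evaluates to exactly 1, which serves as a useful sanity check during training.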
Optimization details include Adam with StepLR decay, along with batch-size and data-augmentation choices (Gu et al., 25 Jan 2026, Ahn et al., 2020). For model-based control, the trained DKGH-MoE model is integrated into MPC or MPPI to supply predictive rollouts.
5. Empirical Evaluations and Benchmarks
DKGH-MoE has been empirically assessed in representative domains, notably:
Medical Image Classification (INBreast Mammography) (Gu et al., 25 Jan 2026)
| Model | ACC % (Dense / Sparse) | AUC % (Dense / Sparse) |
|---|---|---|
| Baseline | 62.9 / – | 70.5 / – |
| DD-MoE | 70.5 / 62.6 | 72.5 / 68.1 |
| DE-MoE | 65.0 / 72.8 | 72.0 / 68.9 |
| DKGH-MoE | 77.8 / 79.1 | 82.9 / 80.8 |
Sparse routing (K=1) with DKGH-MoE matches or exceeds dense performance, demonstrating efficiency.
Hybrid Dynamical Systems (e.g., Hopper, Cart+Wall) (Ahn et al., 2020)
- DKGH-MoE achieves lower RMSE than pure white-box (low variance, high bias) or pure black-box (high variance, low bias) models across various sample sizes.
- Top-level gating divides phase-space by contact mode; bottom-level gating increases reliance on black-box models in regions of high unmodeled friction or damping.
- Data-efficiency: DKGH-MoE attains ground-truth-level error with far fewer samples than pure neural networks.
6. Interpretability and Specialization
A notable feature is interpretability. In medical imaging, DE-MoE experts bin spatial regions based on gaze heatmap intensity, mapping directly to human-interpretable diagnostic attention patterns (e.g., distinguishing diffuse, moderate, high-intensity, and minimal gaze regions). Fusion gate weights provide explicit quantification of relative reliance on clinical priors versus data-driven cues.
In dynamical systems, mode gating and intra-mode blending allow detailed analysis of when and why model predictions shift between physical laws and data-driven corrections, fostering transparency in real-time control tasks.
7. Implementation and Practical Considerations
DKGH-MoE modules are designed for modular integration into mainstream deep learning workflows. For medical imaging, backbone architectures (ResNet-18/50) are replaced block-wise by DKGH-MoE blocks. Each expert network typically mirrors the base architecture's blocks.
In RL/model-based control, mode selection mirrors known hybrid structure; expert sizes are chosen based on complexity ($1$–$2$ hidden layers, $32$–$128$ units for sub-MLPs), and entropy regularization guards against expert collapse. Sufficient convergence is reported after $100$–$200$ epochs on moderate datasets.
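The entropy regularization mentioned above might look like the following generic bonus on the gating distributions (the coefficient and any annealing schedule are omitted; this is a sketch, not the papers' exact term):

```python
import numpy as np

def gate_entropy_bonus(pi, eps=1e-12):
    """Mean entropy of the gating distributions over a batch; added with a
    positive coefficient to the objective so that gates keep spreading
    probability over experts early in training instead of collapsing."""
    pi = np.clip(pi, eps, 1.0)
    return float(-(pi * np.log(pi)).sum(axis=-1).mean())

collapsed = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])  # one expert hoards
uniform = np.full((2, 3), 1.0 / 3.0)                      # balanced gating
assert gate_entropy_bonus(collapsed) < gate_entropy_bonus(uniform)
```

Maximizing this bonus pushes the gate toward the uniform distribution, whose entropy ($\ln K$ for $K$ experts) upper-bounds the achievable value.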
Empirical guidelines emphasize matching the number of DE/physics mode experts to the known discrete regimes, and balancing under- against over-splitting to control the bias–variance trade-off.
DKGH-MoE represents a formal paradigm for integrating structured expert knowledge into neural expert ensembles, enhancing data efficiency and interpretability in domains demanding both statistical generalization and expert-aligned priors (Gu et al., 25 Jan 2026, Ahn et al., 2020).