
DKGH-MoE: Domain-Knowledge Guided Hybrid MoE

Updated 1 February 2026
  • DKGH-MoE is a hybrid architecture that fuses neural experts with explicit domain priors, enhancing performance in structured tasks.
  • It employs parallel expert modules and sparse routing to balance data-driven signals with expert inputs such as clinical gaze maps and analytic models.
  • DKGH-MoE improves predictive accuracy and interpretability in applications like medical imaging and hybrid dynamical systems, enabling efficient expert specialization.

Domain-Knowledge-Guided Hybrid Mixture-of-Experts (DKGH-MoE) is an architectural paradigm in neural modeling that systematically fuses data-driven representation learning with explicit domain priors. By combining the statistical discovery strengths of neural experts with the inductive constraints provided by expert knowledge (such as clinical gaze maps, physical analytic models, or established heuristics), DKGH-MoE addresses the limitations of conventional Mixture-of-Experts (MoE) approaches in domains characterized by either limited data or structured, expert-annotated information. This synthesis yields enhanced predictive robustness, interpretability, and efficiency across tasks including medical imaging and hybrid dynamical systems (Gu et al., 25 Jan 2026, Ahn et al., 2020).

1. Core Architectural Principles

At the abstract level, DKGH-MoE is built from parallel expert ensembles instantiated as neural networks and domain-guided modules. In medical image classification, for instance, the architecture operates on a feature map $x \in \mathbb{R}^{C\times H\times W}$ sourced from a pretrained backbone. The input is routed to:

  • Data-Driven MoE (DD-MoE): $N$ experts $\{f_1,\dots,f_N\}$ trained solely on latent features.
  • Domain-Expert MoE (DE-MoE): $M$ experts $\{g_1,\dots,g_M\}$ conditioned on features extracted from structured priors such as clinician-generated heatmaps obtained via gaze tracking.

Sparse router networks, realized as Top-K index selectors and sparse-softmax normalizers, compute expert activations $\alpha_i(x)$ (DD-MoE) and $\beta_j(x)$ (DE-MoE). A fusion module, parameterized by a learned gate $p(x)$, combines DD and DE outputs according to

$$\hat{y}(x) = p(x)\, h^{(\text{DE})}(x) + (1-p(x))\, h^{(\text{DD})}(x)$$

where $h^{(\text{DD})}(x)$ and $h^{(\text{DE})}(x)$ are the respective branch outputs. This modular construct replaces standard backbone blocks and is both plug-and-play and interpretable (Gu et al., 25 Jan 2026).
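The fusion step can be sketched in a few lines. This is an illustrative NumPy stand-in under assumed shapes (a scalar gate and vector branch outputs), not the authors' implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# h_dd, h_de: aggregated outputs of the DD-MoE and DE-MoE branches.
# gate_logit stands in for w_p^T [x_f; x_exp] + b_p (names are assumptions).
def fuse(h_dd, h_de, gate_logit):
    p = sigmoid(gate_logit)             # learned gate p(x) in (0, 1)
    return p * h_de + (1.0 - p) * h_dd  # y_hat = p*h_DE + (1 - p)*h_DD

h_dd = np.array([0.2, 0.8])
h_de = np.array([0.6, 0.4])
y = fuse(h_dd, h_de, gate_logit=0.0)    # p = 0.5 -> elementwise average
```

With a zero gate logit, $p(x)=0.5$ and both branches contribute equally; training shifts $p(x)$ toward whichever branch is more reliable for a given input.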

An analogous topology emerges in hybrid dynamical system modeling, where a two-level mixture scheme encompasses:

  • Top (competitive) layer: $M$ mode experts $o^{(i)}$ gated by $h(x)\in\Delta^M$ (softmax over contact/operational regimes).
  • Bottom (cooperative) layer: Within each mode $i$, two sub-experts $f^{(i)}_1$ (black-box neural) and $f^{(i)}_2$ (white-box analytic), blended via $g^{(i)}(x)\in\Delta^2$ (Ahn et al., 2020).

2. Mathematical Formulation

Let $x$ denote the primary input. In the medical context:

  • Data-driven experts: $f_i : \mathbb{R}^{C\times H\times W} \rightarrow \mathbb{R}^D$
  • Domain experts: $g_j : \mathbb{R}^{C\times H\times W} \rightarrow \mathbb{R}^D$

Routers utilize pooled features $x_f = \text{GAP}(x)$ and clinical priors $x_\text{exp}$, typically the output of a small CNN applied to gaze heatmaps.

  • Data-driven routing:

$$r_i^{(\text{DD})}(x_f) = w_i^T x_f + b_i$$

Top-K indices $S^{(\text{DD})}$ are selected, and the activations are:

$$\alpha_i(x) = \begin{cases} \dfrac{\exp(r_i^{(\text{DD})}(x_f))}{\sum_{k\in S^{(\text{DD})}} \exp(r_k^{(\text{DD})}(x_f))}, & i \in S^{(\text{DD})} \\ 0, & \text{otherwise} \end{cases}$$

Analogously for the domain-expert branch.
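The Top-K sparse-softmax routing above admits a compact sketch (a NumPy stand-in with illustrative variable names; the logits would come from the linear router on $x_f$):

```python
import numpy as np

# Select the Top-K logits, normalize with a softmax over that set only,
# and zero out all unselected experts.
def topk_sparse_softmax(logits, k):
    idx = np.argsort(logits)[-k:]                # Top-K index set S
    alpha = np.zeros_like(logits)
    e = np.exp(logits[idx] - logits[idx].max())  # numerically stable softmax
    alpha[idx] = e / e.sum()
    return alpha                                 # sparse activations alpha_i(x)

alpha = topk_sparse_softmax(np.array([2.0, 0.5, 1.0, -1.0]), k=2)
```

Only the two selected experts (indices 0 and 2 here) receive nonzero weight, which is what makes inference with $K=1$ or $K=2$ cheap relative to dense mixing.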

  • Outputs are aggregated:

$$h^{(\text{DD})}(x) = \sum_{i\in S^{(\text{DD})}} \alpha_i(x)\, f_i(x)$$

$$h^{(\text{DE})}(x) = \sum_{j\in S^{(\text{DE})}} \beta_j(x)\, g_j(x)$$

  • The fusion gate is:

$$p(x) = \sigma(w_p^T [x_f; x_\text{exp}] + b_p)$$

Final prediction combines branches accordingly.

In physics-informed dynamical systems, the nested MoE formula is:

$$y(x) = \sum_{i=1}^M h_i(x) \Big( \sum_{j=1}^2 g^{(i)}_j(x)\, f^{(i)}_j(x) \Big)$$

with $h(x)=\text{softmax}(W_h x)$ and $g^{(i)}(x)=\text{softmax}(W_g^{(i)} x)$.
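The nested two-level mixture can be sketched directly from this formula. The gate weights, expert functions, and dimensions below are toy assumptions (zero-initialized gates give uniform mixing), not values from the paper:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# y(x) = sum_i h_i(x) * sum_j g_j^(i)(x) * f_j^(i)(x), with M modes and
# two sub-experts per mode (white-box analytic + black-box neural residual).
def nested_moe(x, W_h, W_g, experts):
    h = softmax(W_h @ x)                 # top-level (competitive) mode gate
    y = 0.0
    for i, (f_white, f_black) in enumerate(experts):
        g = softmax(W_g[i] @ x)          # bottom-level (cooperative) blend
        y += h[i] * (g[0] * f_white(x) + g[1] * f_black(x))
    return y

x = np.ones(2)
W_h = np.zeros((2, 2))                   # zero weights -> uniform gates (toy choice)
W_g = [np.zeros((2, 2)), np.zeros((2, 2))]
experts = [(lambda v: v.sum(), lambda v: 2 * v.sum())] * 2  # stand-in sub-experts
y = nested_moe(x, W_h, W_g, experts)     # 0.5*2 + 0.5*4 = 3 in every mode
```

In practice $W_h$ and each $W_g^{(i)}$ are trained, so the top gate carves the state space into modes while the bottom gate shifts weight toward the neural sub-expert where the analytic model is inaccurate.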

3. Injection and Utilization of Domain Knowledge

DKGH-MoE operationalizes domain priors in two paradigms:

  • Medical Imaging: DE-MoEs utilize clinical gaze maps, encoding expert attention via a convolutional embedding $x_\text{exp}$; routers then bin spatial regions by gaze intensity, enforcing specialization on clinically salient areas. The fusion gate $p(x)$ adaptively decides the contribution of expert insight versus raw features, attending more to priors when gaze is sharp and reliable (Gu et al., 25 Jan 2026).
  • Hybrid Dynamical Systems: Top-level mode gating homes in on known discrete operating regimes (e.g., contact modes). Within each, analytic white-box models implement first-principles dynamics (Euler–Lagrange equations, contact Jacobians) with trainable parameters initialized from physical heuristics, while black-box MLP sub-experts absorb residual effects. Bottom-level gating adjusts reliance on physical priors versus learned correction as dictated by context (Ahn et al., 2020).

Domain knowledge is thus not simply regularizing; it explicitly shapes expert specialization, routing, and fusion.
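As a toy illustration of the intensity-based binning used to specialize DE-MoE experts on gaze regions (the thresholds and bin labels here are assumptions for illustration, not values from the paper):

```python
import numpy as np

# Assign each spatial location of a normalized gaze heatmap to an intensity
# bin, so that each DE expert can specialize on one band of clinical salience.
def bin_gaze_regions(heatmap, thresholds=(0.1, 0.4, 0.7)):
    # 0 = minimal, 1 = diffuse, 2 = moderate, 3 = high-intensity gaze
    return np.digitize(heatmap, list(thresholds))

hm = np.array([[0.05, 0.2],
               [0.5,  0.9]])
b = bin_gaze_regions(hm)   # one bin index per spatial location
```

Each bin index can then condition which domain expert handles that region, which is what ties expert specialization to human-interpretable attention patterns.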

4. Training Objectives and Regularization

DKGH-MoE models employ compound objectives:

  • Classification loss (LclsL_\text{cls}): Cross-entropy (in medical imaging) or mean squared error/residual (in dynamical models).
  • Routing load-balance loss (LlbL_\text{lb}): Enforces expert utilization, discouraging collapse. For medical blocks:

$$L_\text{lb} = \sum_{i=1}^N \bar{p}_i \cdot f_i$$

for average routing weight $\bar{p}_i$ and the fraction $f_i$ of inputs assigned to expert $i$ (overloading $f_i$, which elsewhere denotes the expert network itself); the DE-MoE branch is regularized analogously.
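A minimal sketch of this load-balance term, computing $\bar{p}_i$ and $f_i$ from a batch of sparse routing weights (array shapes and names are assumptions):

```python
import numpy as np

# alpha: (batch, n_experts) sparse routing weights from the Top-K router.
def load_balance_loss(alpha):
    p_bar = alpha.mean(axis=0)         # average routing weight per expert
    frac = (alpha > 0).mean(axis=0)    # fraction of inputs hitting each expert
    return float((p_bar * frac).sum()) # L_lb = sum_i p_bar_i * f_i

# A collapsed router (all mass on expert 0) scores worse than a balanced one.
collapsed = np.array([[1.0, 0.0], [1.0, 0.0]])
balanced  = np.array([[1.0, 0.0], [0.0, 1.0]])
```

Minimizing this term pushes the router away from routing everything to one expert, since concentrated mass maximizes the product $\bar{p}_i f_i$.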

  • Regularizers: $\ell_2$ weight penalties, plus optional entropy/diversity bonuses to ensure gating discriminates modes and avoids collapse.

Optimization hyperparameters include Adam (lr $= 5\times10^{-4}$), StepLR decay, batch size, and data augmentation strategies (Gu et al., 25 Jan 2026, Ahn et al., 2020). For model-based control, the trained DKGH-MoE serves as the dynamics model for predictive rollouts within MPC or MPPI.

5. Empirical Evaluations and Benchmarks

DKGH-MoE has been empirically assessed in representative domains, notably:

| Model | ACC (Dense/Sparse) | AUC (Dense/Sparse) |
|---|---|---|
| Baseline | 62.9 / – | 70.5 / – |
| DD-MoE | 70.5 / 62.6 | 72.5 / 68.1 |
| DE-MoE | 65.0 / 72.8 | 72.0 / 68.9 |
| DKGH-MoE | 77.8 / 79.1 | 82.9 / 80.8 |

Sparse routing ($K=1$) with DKGH-MoE matches or exceeds dense performance, demonstrating efficiency.

  • DKGH-MoE achieves lower RMSE than pure white-box (low variance, high bias) or pure black-box (high variance, low bias) models across various sample sizes.
  • Top-level gating divides phase-space by contact mode; bottom-level gating increases reliance on black-box models in regions of high unmodeled friction or damping.
  • Data efficiency: DKGH-MoE attains less than $3\times$ the ground-truth error with $2^{10}$–$2^{12}$ samples, outperforming pure neural nets.

6. Interpretability and Specialization

A notable feature is interpretability. In medical imaging, DE-MoE experts bin spatial regions based on gaze heatmap intensity, mapping directly to human-interpretable diagnostic attention patterns (e.g., distinguishing diffuse, moderate, high-intensity, and minimal gaze regions). Fusion gate weights provide explicit quantification of relative reliance on clinical priors versus data-driven cues.

In dynamical systems, mode gating and intra-mode blending allow detailed analysis of when and why model predictions shift between physical laws and data-driven corrections, fostering transparency in real-time control tasks.

7. Implementation and Practical Considerations

DKGH-MoE modules are designed for modular integration into mainstream deep learning workflows. For medical imaging, backbone architectures (ResNet-18/50) are replaced block-wise by DKGH-MoE blocks. Each expert network typically mirrors the base architecture's blocks.

In RL/model-based control, mode selection mirrors known hybrid structure; expert sizes are chosen based on complexity ($1$–$2$ hidden layers, $32$–$128$ units for sub-MLPs), and entropy regularization guards against expert collapse. Sufficient convergence is reported after $100$–$200$ epochs on moderate datasets.

Empirical guidelines emphasize matching the number of DE/physics mode experts to known discrete regimes and balance between under- and over-splitting for variance/bias control.


DKGH-MoE represents a formal paradigm for integrating structured expert knowledge into neural expert ensembles, enhancing data efficiency and interpretability in domains demanding both statistical generalization and expert-aligned priors (Gu et al., 25 Jan 2026, Ahn et al., 2020).
