
Drug-Disease Conditioned SMoE in MMCTOP

Updated 2 January 2026
  • The paper presents a novel SMoE that leverages drug and disease embeddings through efficient top-k expert routing to significantly improve clinical trial AUC scores.
  • It integrates modality-specific embeddings from ChemBERTa and ClinicalBERT to fuse molecular, protocol, and disease information in a unified framework.
  • Empirical results demonstrate that sparse expert selection reduces computation by 6–8× while achieving substantial performance gains across multiple clinical phases.

A drug–disease-conditioned sparse Mixture-of-Experts (SMoE) is a specialized architecture in multimodal biomedical informatics for tasks such as clinical trial outcome prediction. Within the MMCTOP (MultiModal Clinical-Trial Outcome Prediction) framework, the SMoE operates as a "plug-in" module atop a multi-encoder fusion backbone, providing expert-driven specialization guided by the specific drug and disease context of each data instance. Unlike a traditional dense mixture-of-experts, the SMoE applies top-$k$ routing for computational efficiency while conditioning expert selection explicitly on salient molecular and indication features. Empirically, this mechanism yields material improvements in metrics such as AUC, precision, and F1, with only a modest increase in computational cost (Aparício et al., 26 Dec 2025).

1. Fusion Backbone and Schematic Overview

The MMCTOP workflow integrates three principal biomedical modalities per trial:

  • Molecular structure: The drug's SMILES string is encoded via ChemBERTa, yielding a 768-dimensional vector representing the drug.
  • Protocol and eligibility information: Encoded using ClinicalBERT, again producing 768-dimensional vectors for both long-form and schema-narrativized textual metadata.
  • Disease ontology labels: Encoded as projected ClinicalBERT embeddings of the indication label.

These modality-specific embeddings $\{z_{\mathrm{mol}}, z_{\mathrm{proto}}, z_{\mathrm{onto}}\}$ are concatenated to form a 2304-dimensional fusion vector $h$. The molecular and disease projections $d$ and $s$ serve both as fused input and as conditioning signals for the SMoE gating.

A condensed pass through the architecture proceeds as follows (a minimal code sketch follows the list):

  • Each modality is independently encoded and linearly projected.
  • SMoE computes a context vector $c = f_{\mathrm{cond}}(d, s)$.
  • The gating network computes logits over $E$ experts, which are softmaxed (with temperature scaling) to produce expert assignment scores.
  • The top-$k$ experts are selected and executed in parallel; their outputs are weighted and summed to yield $y$.
  • The output is concatenated with $d$ and $s$ and passed to a final classification head, which outputs the calibrated probability $\hat{p}$.
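
Below is a minimal PyTorch sketch of this condensed pass under the dimensions stated in the text (768-dimensional modality embeddings, $d_c = 512$, $E = 16$, $k = 2$). It is illustrative only: the gate's hidden width, the temperature value, the single-linear classification head, and all names are assumptions, not the authors' implementation.

```python
# Minimal sketch of the conditioned SMoE pass (illustrative, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConditionedSMoE(nn.Module):
    def __init__(self, d_mod=768, d_cond=512, n_experts=16, k=2, tau=0.7):
        super().__init__()
        self.k, self.tau = k, tau                          # tau in (0, 1]; value assumed
        d_fused = 3 * d_mod                                # [z_mol || z_proto || z_onto]
        # Conditioning network c = f_cond(d, s)
        self.cond = nn.Sequential(
            nn.Linear(2 * d_mod, d_cond), nn.ReLU(),
            nn.Linear(d_cond, d_cond), nn.ReLU(),
        )
        # Gating network over [c || h]; hidden width is an assumption
        self.gate = nn.Sequential(
            nn.Linear(d_cond + d_fused, d_cond), nn.ReLU(),
            nn.Linear(d_cond, n_experts),
        )
        # E structurally identical expert MLPs: 2304 -> 1024 -> 768
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_fused, 1024), nn.GELU(),
                          nn.Dropout(0.1), nn.Linear(1024, d_mod))
            for _ in range(n_experts)
        ])
        # Classification head over [y || d || s]; a single linear layer is assumed
        self.head = nn.Linear(3 * d_mod, 1)

    def forward(self, z_mol, z_proto, z_onto):
        d, s = z_mol, z_onto                               # drug / disease conditioning signals
        h = torch.cat([z_mol, z_proto, z_onto], dim=-1)    # 2304-d fused vector
        c = self.cond(torch.cat([d, s], dim=-1))           # context vector
        g = F.softmax(self.gate(torch.cat([c, h], dim=-1)) / self.tau, dim=-1)
        top_w, top_idx = g.topk(self.k, dim=-1)            # sparse top-k routing
        outs = []
        for b in range(h.size(0)):                         # execute only the selected experts
            outs.append(sum(top_w[b, j] * self.experts[int(top_idx[b, j])](h[b])
                            for j in range(self.k)))
        y = torch.stack(outs)
        return torch.sigmoid(self.head(torch.cat([y, d, s], dim=-1))).squeeze(-1)


# Usage with random stand-in embeddings:
model = ConditionedSMoE()
p_hat = model(torch.randn(4, 768), torch.randn(4, 768), torch.randn(4, 768))  # shape (4,)
```

The per-sample loop makes the sparse execution explicit; a production implementation would batch all samples routed to the same expert.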

2. Mathematical Formulation

The SMoE block is mathematically defined as:

  • Conditioning Network:

$$c = f_{\mathrm{cond}}(d, s) = \mathrm{ReLU}\big(W_2\,\mathrm{ReLU}(W_1 [d \,\|\, s] + b_1) + b_2\big) \in \mathbb{R}^{d_c}$$

where $d, s \in \mathbb{R}^{768}$ and typically $d_c = 512$.

  • Gating with Temperature Scaling:

$$\phi(c, h) = W_g^{(2)}\,\mathrm{ReLU}\big(W_g^{(1)} [c \,\|\, h] + b_g^{(1)}\big) + b_g^{(2)} \in \mathbb{R}^{E}$$

$$g_i = \frac{\exp\big(\phi(c, h)_i / \tau\big)}{\sum_{j=1}^{E} \exp\big(\phi(c, h)_j / \tau\big)}$$

for $i = 1, \dots, E$, with $\tau \in (0, 1]$.

  • Top-$k$ Expert Selection and Sparse Fusion:

$$\mathcal{S}(h, c) = \mathrm{TopK}(\{g_i\}, k) \subset \{1, \dots, E\}$$

$$y = \sum_{i \in \mathcal{S}(h, c)} g_i\, E_i(h) \in \mathbb{R}^{768}$$
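
As a small numerical illustration of these routing equations, the snippet below applies the temperature-scaled softmax and top-$k$ weighted sum to made-up logits and random stand-ins for the expert outputs $E_i(h)$.

```python
# Numerical illustration of the gating / top-k equations; all values are synthetic.
import torch
import torch.nn.functional as F

phi = torch.tensor([2.0, 0.5, 1.8, -1.0])    # gating logits phi(c, h) for E = 4 experts
tau = 0.5                                     # temperature in (0, 1]; smaller tau sharpens g
g = F.softmax(phi / tau, dim=-1)              # expert assignment scores g_i

top_w, top_idx = g.topk(2)                    # S(h, c): the k = 2 largest scores
expert_out = torch.randn(4, 768)              # stand-in for E_i(h), one row per expert
y = (top_w.unsqueeze(-1) * expert_out[top_idx]).sum(dim=0)   # y = sum_{i in S} g_i E_i(h)
print(top_idx.tolist(), [round(w, 3) for w in top_w.tolist()], y.shape)
```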

3. Expert Architecture and Specialization

Each expert $E_i$ is a two-layer MLP:

$$E_i(h) = W_i^{(2)}\,\mathrm{GELU}\big(W_i^{(1)} h + b_i^{(1)}\big) + b_i^{(2)},$$

with $W_i^{(1)} \in \mathbb{R}^{1024 \times 2304}$ and $W_i^{(2)} \in \mathbb{R}^{768 \times 1024}$. Dropout (0.1) is applied between the layers. Experts are structurally identical but become functionally specialized through training; for example, some may focus on molecular–disease interactions while others capture protocol–disease relationships.

Expert specialization is driven by the conditioning network such that, for instance, different expert subsets may dominate for oncology versus metabolic disease, tailoring representation and prediction to clinical subdomains.
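
A simple way to examine such specialization after training is to tally which experts the gate selects for different disease groups. The sketch below does this with synthetic gate weights and group labels; the three-way grouping is an assumption chosen for illustration.

```python
# Sketch: inspect expert specialization by aggregating top-k routing counts per disease group.
# Gate weights and group labels are synthetic stand-ins for validation-set statistics.
import torch

E, k = 16, 2
gate_weights = torch.rand(1000, E)                     # g for 1000 validation trials
groups = torch.randint(0, 3, (1000,))                  # e.g. 0 = oncology, 1 = metabolic, 2 = other

top_idx = gate_weights.topk(k, dim=-1).indices         # experts chosen for each trial
for grp in range(3):
    chosen = top_idx[groups == grp].flatten()
    usage = torch.bincount(chosen, minlength=E).float()
    usage = usage / usage.sum()                        # fraction of routings per expert
    print(f"group {grp}: dominant experts {usage.topk(3).indices.tolist()}")
```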

4. Integration with Modality Encoders and Fusion

Upstream, each modality passes through a dedicated Transformer and projection:

  • ChemBERTa (drug): the final [CLS] embedding $e_{\mathrm{mol}}$ is projected as $z_{\mathrm{mol}} = P_{\mathrm{mol}}\, e_{\mathrm{mol}}$.
  • ClinicalBERT (protocol, eligibility, ontology): analogous projections yield $z_{\mathrm{proto}}$ for the textual metadata and $z_{\mathrm{onto}}$ for the disease ontology label.
  • All are aligned to 768-dimensional vectors and concatenated into $h \in \mathbb{R}^{2304}$.
  • $d$ and $s$ (the drug and disease embeddings) are the corresponding projections.

The SMoE module then conditions its routing and expert selection on the clinically germane $d$ and $s$, enabling context-aware fusion on top of standard modality encoding.
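
For concreteness, a sketch of the upstream encoding and projection step is shown below. The Hugging Face checkpoint names and the shared text projection are assumptions chosen for illustration and may differ from the paper's exact setup.

```python
# Upstream encoding sketch: [CLS] embeddings from ChemBERTa / ClinicalBERT, projected to 768-d.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

chem_tok = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")   # assumed checkpoint
chem_enc = AutoModel.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")
clin_tok = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")  # assumed checkpoint
clin_enc = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

proj_mol = nn.Linear(chem_enc.config.hidden_size, 768)   # P_mol
proj_txt = nn.Linear(clin_enc.config.hidden_size, 768)   # shared text/ontology projection (assumed)

def cls_embedding(tokenizer, encoder, text):
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state[:, 0]                   # [CLS] token embedding

z_mol = proj_mol(cls_embedding(chem_tok, chem_enc, "CC(=O)OC1=CC=CC=C1C(=O)O"))   # aspirin SMILES
z_proto = proj_txt(cls_embedding(clin_tok, clin_enc, "Phase II, randomized, adult participants"))
z_onto = proj_txt(cls_embedding(clin_tok, clin_enc, "type 2 diabetes mellitus"))
h = torch.cat([z_mol, z_proto, z_onto], dim=-1)          # 2304-d fusion vector
```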

5. Training Objectives and Regularization

The primary loss is binary cross-entropy (BCE) for trial success prediction:

$$L_{\mathrm{bce}} = -\big[\,y_{\mathrm{true}} \log \hat{p} + (1 - y_{\mathrm{true}}) \log(1 - \hat{p})\,\big].$$

To avoid expert collapse, a load-balancing penalty (following Shazeer et al., 2017) is applied:

$$L_{\mathrm{imp}} = E \sum_{i=1}^{E} f_i\, P_i$$

with $f_i$ the fraction of samples routed to expert $i$ and $P_i$ the mean gating weight assigned to it.

The total loss is:

$$L = L_{\mathrm{bce}} + \lambda_{\mathrm{imp}} L_{\mathrm{imp}}$$

with $\lambda_{\mathrm{imp}} \approx 0.01$.
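
A sketch of how this combined objective could be computed from the gate outputs is given below; the exact estimator for $f_i$ (here, the fraction of top-$k$ assignments received by each expert) is an assumption.

```python
# Sketch of the training objective: BCE plus the load-balancing (importance) penalty.
import torch
import torch.nn.functional as F

def importance_loss(gate_probs: torch.Tensor, top_idx: torch.Tensor) -> torch.Tensor:
    """gate_probs: (batch, E) softmax gate weights; top_idx: (batch, k) selected experts."""
    E = gate_probs.size(-1)
    # f_i: fraction of top-k routing assignments that went to expert i (one possible estimator)
    f = torch.bincount(top_idx.flatten(), minlength=E).float()
    f = f / f.sum()
    # P_i: mean gating weight assigned to expert i over the batch
    P = gate_probs.mean(dim=0)
    return E * (f * P).sum()

def total_loss(p_hat, y_true, gate_probs, top_idx, lambda_imp=0.01):
    bce = F.binary_cross_entropy(p_hat, y_true.float())
    return bce + lambda_imp * importance_loss(gate_probs, top_idx)
```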

After training, a post-hoc scalar temperature calibration step adjusts the predicted probabilities so that $\hat{p}$ is reliable for downstream risk estimation.
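
A minimal temperature-scaling sketch is shown below, assuming access to the head's pre-sigmoid logits on a held-out validation split; the optimizer, learning rate, and step count are arbitrary choices.

```python
# Post-hoc temperature scaling sketch: fit a scalar T so that sigmoid(logit / T) is calibrated.
import torch
import torch.nn.functional as F

def fit_temperature(val_logits: torch.Tensor, val_labels: torch.Tensor, steps=200):
    log_T = torch.zeros(1, requires_grad=True)            # optimize log T so T stays positive
    opt = torch.optim.Adam([log_T], lr=0.05)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.binary_cross_entropy_with_logits(val_logits / log_T.exp(), val_labels.float())
        loss.backward()
        opt.step()
    return log_T.exp().item()

# T = fit_temperature(val_logits, val_labels); p_hat = torch.sigmoid(test_logits / T)
```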

6. Empirical Performance and Ablation Studies

Empirically, the introduction of drug–disease–conditioned SMoE into the fusion backbone yields marked improvements:

Dataset            Baseline AUC    SMoE AUC    ΔAUC (pp)
TOP (Phase I)      54.7%           69.1%       +14.4
TOP (Phase II)     52.7%           58.7%       +6.0
TOP (Phase III)    56.1%           56.3%       +0.2
CTOD (Phase I)     66.4%           77.5%       +11.1
CTOD (Phase II)    58.4%           63.3%       +4.9
CTOD (Phase III)   67.5%           72.9%       +5.4

Ablation studies indicate that removal of drug–disease subspace conditioning from gating degrades AUC/PR by 2–5 percentage points, and substituting the SMoE with a single shared MLP reduces early-phase AUC by up to 10 points. Gating over all modalities increases compute without consistent metric improvements.

7. Computational Cost and Scalability

Sparse top-$k$ routing fundamentally reduces computational requirements (a quick arithmetic check follows the list):

  • Instead of evaluating all $E$ expert MLPs ($E = 16$ in MMCTOP) per instance, only $k = 2$ are executed for a given input, i.e. 12.5% of the dense MoE compute.
  • Effective FLOPs are decreased 6–8× in the MoE layer, and GPU memory usage is comparably reduced.
  • Inference latency is approximately 5 ms per example (on NVIDIA RTX 4080), compared to ~40 ms if all experts were evaluated.
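
A back-of-the-envelope check of the 12.5% figure, using the layer sizes from Section 3 (multiply-accumulates in the expert MLPs only). The measured 6–8× effective reduction is smaller than the raw 8× ratio because gating, conditioning, and memory traffic are not free.

```python
# Expert-layer cost arithmetic with E = 16, k = 2 and MLPs of size 2304 -> 1024 -> 768.
flops_per_expert = 2304 * 1024 + 1024 * 768        # multiply-adds in the two linear layers
dense = 16 * flops_per_expert                       # all experts evaluated
sparse = 2 * flops_per_expert                       # top-2 routing
print(f"per-expert MACs: {flops_per_expert:,}")
print(f"dense / sparse ratio: {dense / sparse:.1f}x, fraction executed: {sparse / dense:.1%}")
```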

This cost structure supports both scalability for large-scale clinical trial risk scoring and practical deployment in resource-constrained environments.


The drug–disease–conditioned SMoE in MMCTOP thus delivers context-driven expert routing, specialization, and superior empirical performance with minimal overhead, explicitly leveraging the two most clinically salient dimensions (drug and disease) for dynamic multimodal fusion (Aparício et al., 26 Dec 2025).
