
Drug-Disease Conditioned SMoE in MMCTOP

Updated 2 January 2026
  • The paper presents a novel SMoE that leverages drug and disease embeddings through efficient top-k expert routing to significantly improve clinical trial AUC scores.
  • It integrates modality-specific embeddings from ChemBERTa and ClinicalBERT to fuse molecular, protocol, and disease information in a unified framework.
  • Empirical results demonstrate that sparse expert selection reduces computation by 6–8× while achieving substantial performance gains across multiple clinical phases.

A drug–disease-conditioned sparse Mixture-of-Experts (SMoE) is a specialized architecture in multimodal biomedical informatics for tasks such as clinical trial outcome prediction. Within the MMCTOP (MultiModal Clinical-Trial Outcome Prediction) framework, the SMoE operates as a "plug-in" module atop a multi-encoder fusion backbone, providing expert-driven specialization guided by the specific drug and disease context of each data instance. Unlike a traditional dense mixture-of-experts, the SMoE applies top-$k$ routing for computational efficiency while conditioning expert selection explicitly on salient molecular and indication features. Empirically, this mechanism yields material improvements in metrics such as AUC, precision, and F1, with only a modest increase in computational cost (Aparício et al., 26 Dec 2025).

1. Fusion Backbone and Schematic Overview

The MMCTOP workflow integrates three principal biomedical modalities per trial:

  • Molecular structure: The drug's SMILES string is encoded via ChemBERTa, yielding a 768-dimensional vector representing the drug.
  • Protocol and eligibility information: Encoded using ClinicalBERT, again producing 768-dimensional vectors for both long-form and schema-narrativized textual metadata.
  • Disease ontology labels: Encoded as projected ClinicalBERT embeddings of the indication label.

These modality-specific embeddings $\{z_{\mathrm{mol}}, z_{\mathrm{proto}}, z_{\mathrm{onto}}\}$ are concatenated to form a 2304-dimensional fusion vector $h$. The molecular and disease projections $d$ and $s$ serve both as fused input and as conditioning signals for the SMoE gating.

A condensed pass through the architecture proceeds as follows (a minimal code sketch follows the list):

  • Each modality is independently encoded and linearly projected.
  • SMoE computes a context vector $c = f_{\mathrm{cond}}(d, s)$.
  • The gating network computes logits over $E$ experts, which are softmaxed (with temperature scaling) to produce expert assignment scores.
  • The top-$k$ experts are selected and executed in parallel; their outputs are weighted and summed to yield $y$.
  • The output is concatenated with $d$ and $s$ and passed to a final classification head, which outputs the calibrated probability $\hat{p}$.
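
Below is a minimal PyTorch sketch of this condensed pass under the dimensions stated in the text (768-dimensional modality embeddings, $d_c = 512$, $E = 16$, $k = 2$). It is illustrative only: the gate's hidden width, the temperature value, the single-linear classification head, and all names are assumptions, not the authors' implementation.

```python
# Minimal sketch of the conditioned SMoE pass (illustrative, not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConditionedSMoE(nn.Module):
    def __init__(self, d_mod=768, d_cond=512, n_experts=16, k=2, tau=0.7):
        super().__init__()
        self.k, self.tau = k, tau                          # tau in (0, 1]; value assumed
        d_fused = 3 * d_mod                                # [z_mol || z_proto || z_onto]
        # Conditioning network c = f_cond(d, s)
        self.cond = nn.Sequential(
            nn.Linear(2 * d_mod, d_cond), nn.ReLU(),
            nn.Linear(d_cond, d_cond), nn.ReLU(),
        )
        # Gating network over [c || h]; hidden width is an assumption
        self.gate = nn.Sequential(
            nn.Linear(d_cond + d_fused, d_cond), nn.ReLU(),
            nn.Linear(d_cond, n_experts),
        )
        # E structurally identical expert MLPs: 2304 -> 1024 -> 768
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_fused, 1024), nn.GELU(),
                          nn.Dropout(0.1), nn.Linear(1024, d_mod))
            for _ in range(n_experts)
        ])
        # Classification head over [y || d || s]; a single linear layer is assumed
        self.head = nn.Linear(3 * d_mod, 1)

    def forward(self, z_mol, z_proto, z_onto):
        d, s = z_mol, z_onto                               # drug / disease conditioning signals
        h = torch.cat([z_mol, z_proto, z_onto], dim=-1)    # 2304-d fused vector
        c = self.cond(torch.cat([d, s], dim=-1))           # context vector
        g = F.softmax(self.gate(torch.cat([c, h], dim=-1)) / self.tau, dim=-1)
        top_w, top_idx = g.topk(self.k, dim=-1)            # sparse top-k routing
        outs = []
        for b in range(h.size(0)):                         # execute only the selected experts
            outs.append(sum(top_w[b, j] * self.experts[int(top_idx[b, j])](h[b])
                            for j in range(self.k)))
        y = torch.stack(outs)
        return torch.sigmoid(self.head(torch.cat([y, d, s], dim=-1))).squeeze(-1)


# Usage with random stand-in embeddings:
model = ConditionedSMoE()
p_hat = model(torch.randn(4, 768), torch.randn(4, 768), torch.randn(4, 768))  # shape (4,)
```

The per-sample loop makes the sparse execution explicit; a production implementation would batch all samples routed to the same expert.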

2. Mathematical Formulation

The SMoE block is mathematically defined as:

  • Conditioning Network:

$$c = f_{\mathrm{cond}}(d, s) = \mathrm{ReLU}\big(W_2\,\mathrm{ReLU}(W_1 [d \,\|\, s] + b_1) + b_2\big) \in \mathbb{R}^{d_c}$$

where $d, s \in \mathbb{R}^{768}$ and typically $d_c = 512$.

  • Gating with Temperature Scaling:

$$\phi(c, h) = W_g^{(2)}\,\mathrm{ReLU}\big(W_g^{(1)} [c \,\|\, h] + b_g^{(1)}\big) + b_g^{(2)} \in \mathbb{R}^{E}$$

$$g_i = \frac{\exp\big(\phi(c, h)_i / \tau\big)}{\sum_{j=1}^{E} \exp\big(\phi(c, h)_j / \tau\big)}$$

for $i = 1, \dots, E$, with $\tau \in (0, 1]$.

  • Top-$k$ Expert Selection and Sparse Fusion:

$$\mathcal{S}(h, c) = \mathrm{TopK}(\{g_i\}, k) \subset \{1, \dots, E\}$$

$$y = \sum_{i \in \mathcal{S}(h, c)} g_i\, E_i(h) \in \mathbb{R}^{768}$$
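
As a small numerical illustration of these routing equations, the snippet below applies the temperature-scaled softmax and top-$k$ weighted sum to made-up logits and random stand-ins for the expert outputs $E_i(h)$.

```python
# Numerical illustration of the gating / top-k equations; all values are synthetic.
import torch
import torch.nn.functional as F

phi = torch.tensor([2.0, 0.5, 1.8, -1.0])    # gating logits phi(c, h) for E = 4 experts
tau = 0.5                                     # temperature in (0, 1]; smaller tau sharpens g
g = F.softmax(phi / tau, dim=-1)              # expert assignment scores g_i

top_w, top_idx = g.topk(2)                    # S(h, c): the k = 2 largest scores
expert_out = torch.randn(4, 768)              # stand-in for E_i(h), one row per expert
y = (top_w.unsqueeze(-1) * expert_out[top_idx]).sum(dim=0)   # y = sum_{i in S} g_i E_i(h)
print(top_idx.tolist(), [round(w, 3) for w in top_w.tolist()], y.shape)
```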

3. Expert Architecture and Specialization

Each expert $E_i$ is a two-layer MLP:

$$E_i(h) = W_i^{(2)}\,\mathrm{GELU}\big(W_i^{(1)} h + b_i^{(1)}\big) + b_i^{(2)},$$

with $W_i^{(1)} \in \mathbb{R}^{1024 \times 2304}$ and $W_i^{(2)} \in \mathbb{R}^{768 \times 1024}$. Dropout (0.1) is applied between the layers. Experts are structurally identical but become functionally specialized through training; for example, some may focus on molecular–disease interactions while others capture protocol–disease relationships.

Expert specialization is driven by the conditioning network such that, for instance, different expert subsets may dominate for oncology versus metabolic disease, tailoring representation and prediction to clinical subdomains.
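
A simple way to examine such specialization after training is to tally which experts the gate selects for different disease groups. The sketch below does this with synthetic gate weights and group labels; the three-way grouping is an assumption chosen for illustration.

```python
# Sketch: inspect expert specialization by aggregating top-k routing counts per disease group.
# Gate weights and group labels are synthetic stand-ins for validation-set statistics.
import torch

E, k = 16, 2
gate_weights = torch.rand(1000, E)                     # g for 1000 validation trials
groups = torch.randint(0, 3, (1000,))                  # e.g. 0 = oncology, 1 = metabolic, 2 = other

top_idx = gate_weights.topk(k, dim=-1).indices         # experts chosen for each trial
for grp in range(3):
    chosen = top_idx[groups == grp].flatten()
    usage = torch.bincount(chosen, minlength=E).float()
    usage = usage / usage.sum()                        # fraction of routings per expert
    print(f"group {grp}: dominant experts {usage.topk(3).indices.tolist()}")
```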

4. Integration with Modality Encoders and Fusion

Upstream, each modality passes through a dedicated Transformer and projection:

  • ChemBERTa (drug): the final [CLS] embedding $e_{\mathrm{mol}}$ is projected as $z_{\mathrm{mol}} = P_{\mathrm{mol}}\, e_{\mathrm{mol}}$.
  • ClinicalBERT (protocol, eligibility, ontology): analogous projections yield $z_{\mathrm{proto}}$ for the textual metadata and $z_{\mathrm{onto}}$ for the disease ontology label.
  • All are aligned to 768-dimensional vectors and concatenated into $h \in \mathbb{R}^{2304}$.
  • $d$ and $s$ (the drug and disease embeddings) are the corresponding projections.

The SMoE module then conditions its routing and expert selection on the clinically germane $d$ and $s$, enabling context-aware fusion on top of standard modality encoding.
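
For concreteness, a sketch of the upstream encoding and projection step is shown below. The Hugging Face checkpoint names and the shared text projection are assumptions chosen for illustration and may differ from the paper's exact setup.

```python
# Upstream encoding sketch: [CLS] embeddings from ChemBERTa / ClinicalBERT, projected to 768-d.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

chem_tok = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")   # assumed checkpoint
chem_enc = AutoModel.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")
clin_tok = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")  # assumed checkpoint
clin_enc = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")

proj_mol = nn.Linear(chem_enc.config.hidden_size, 768)   # P_mol
proj_txt = nn.Linear(clin_enc.config.hidden_size, 768)   # shared text/ontology projection (assumed)

def cls_embedding(tokenizer, encoder, text):
    batch = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = encoder(**batch)
    return out.last_hidden_state[:, 0]                   # [CLS] token embedding

z_mol = proj_mol(cls_embedding(chem_tok, chem_enc, "CC(=O)OC1=CC=CC=C1C(=O)O"))   # aspirin SMILES
z_proto = proj_txt(cls_embedding(clin_tok, clin_enc, "Phase II, randomized, adult participants"))
z_onto = proj_txt(cls_embedding(clin_tok, clin_enc, "type 2 diabetes mellitus"))
h = torch.cat([z_mol, z_proto, z_onto], dim=-1)          # 2304-d fusion vector
```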

5. Training Objectives and Regularization

The primary loss is binary cross-entropy (BCE) for trial success prediction:

$$L_{\mathrm{bce}} = -\big[\,y_{\mathrm{true}} \log \hat{p} + (1 - y_{\mathrm{true}}) \log(1 - \hat{p})\,\big].$$

To avoid expert collapse, a load-balancing penalty (following Shazeer et al., 2017) is applied:

$$L_{\mathrm{imp}} = E \sum_{i=1}^{E} f_i\, P_i$$

with $f_i$ the fraction of samples routed to expert $i$ and $P_i$ the mean gating weight assigned to it.

The total loss is:

$$L = L_{\mathrm{bce}} + \lambda_{\mathrm{imp}} L_{\mathrm{imp}}$$

with $\lambda_{\mathrm{imp}} \approx 0.01$.
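
A sketch of how this combined objective could be computed from the gate outputs is given below; the exact estimator for $f_i$ (here, the fraction of top-$k$ assignments received by each expert) is an assumption.

```python
# Sketch of the training objective: BCE plus the load-balancing (importance) penalty.
import torch
import torch.nn.functional as F

def importance_loss(gate_probs: torch.Tensor, top_idx: torch.Tensor) -> torch.Tensor:
    """gate_probs: (batch, E) softmax gate weights; top_idx: (batch, k) selected experts."""
    E = gate_probs.size(-1)
    # f_i: fraction of top-k routing assignments that went to expert i (one possible estimator)
    f = torch.bincount(top_idx.flatten(), minlength=E).float()
    f = f / f.sum()
    # P_i: mean gating weight assigned to expert i over the batch
    P = gate_probs.mean(dim=0)
    return E * (f * P).sum()

def total_loss(p_hat, y_true, gate_probs, top_idx, lambda_imp=0.01):
    bce = F.binary_cross_entropy(p_hat, y_true.float())
    return bce + lambda_imp * importance_loss(gate_probs, top_idx)
```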

After training, a post-hoc scalar temperature calibration step adjusts the predicted probabilities so that $\hat{p}$ is reliable for downstream risk estimation.
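
A minimal temperature-scaling sketch is shown below, assuming access to the head's pre-sigmoid logits on a held-out validation split; the optimizer, learning rate, and step count are arbitrary choices.

```python
# Post-hoc temperature scaling sketch: fit a scalar T so that sigmoid(logit / T) is calibrated.
import torch
import torch.nn.functional as F

def fit_temperature(val_logits: torch.Tensor, val_labels: torch.Tensor, steps=200):
    log_T = torch.zeros(1, requires_grad=True)            # optimize log T so T stays positive
    opt = torch.optim.Adam([log_T], lr=0.05)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.binary_cross_entropy_with_logits(val_logits / log_T.exp(), val_labels.float())
        loss.backward()
        opt.step()
    return log_T.exp().item()

# T = fit_temperature(val_logits, val_labels); p_hat = torch.sigmoid(test_logits / T)
```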

6. Empirical Performance and Ablation Studies

Empirically, the introduction of drug–disease–conditioned SMoE into the fusion backbone yields marked improvements:

Dataset            Baseline AUC    SMoE AUC    ΔAUC (pp)
TOP (Phase I)      54.7%           69.1%       +14.4
TOP (Phase II)     52.7%           58.7%       +6.0
TOP (Phase III)    56.1%           56.3%       +0.2
CTOD (Phase I)     66.4%           77.5%       +11.1
CTOD (Phase II)    58.4%           63.3%       +4.9
CTOD (Phase III)   67.5%           72.9%       +5.4

Ablation studies indicate that removal of drug–disease subspace conditioning from gating degrades AUC/PR by 2–5 percentage points, and substituting the SMoE with a single shared MLP reduces early-phase AUC by up to 10 points. Gating over all modalities increases compute without consistent metric improvements.

7. Computational Cost and Scalability

Sparse top-$k$ routing fundamentally reduces computational requirements (a quick arithmetic check follows the list):

  • Instead of evaluating all $E$ expert MLPs ($E = 16$ in MMCTOP) per instance, only $k = 2$ are executed for a given input, i.e. 12.5% of the dense MoE compute.
  • Effective FLOPs are decreased 6–8× in the MoE layer, and GPU memory usage is comparably reduced.
  • Inference latency is approximately 5 ms per example (on NVIDIA RTX 4080), compared to ~40 ms if all experts were evaluated.
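
A back-of-the-envelope check of the 12.5% figure, using the layer sizes from Section 3 (multiply-accumulates in the expert MLPs only). The measured 6–8× effective reduction is smaller than the raw 8× ratio because gating, conditioning, and memory traffic are not free.

```python
# Expert-layer cost arithmetic with E = 16, k = 2 and MLPs of size 2304 -> 1024 -> 768.
flops_per_expert = 2304 * 1024 + 1024 * 768        # multiply-adds in the two linear layers
dense = 16 * flops_per_expert                       # all experts evaluated
sparse = 2 * flops_per_expert                       # top-2 routing
print(f"per-expert MACs: {flops_per_expert:,}")
print(f"dense / sparse ratio: {dense / sparse:.1f}x, fraction executed: {sparse / dense:.1%}")
```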

This cost structure supports both scalability for large-scale clinical trial risk scoring and practical deployment in resource-constrained environments.


The drug–disease–conditioned SMoE in MMCTOP thus delivers context-driven expert routing, specialization, and superior empirical performance with minimal overhead, explicitly leveraging the two most clinically salient dimensions (drug and disease) for dynamic multimodal fusion (Aparício et al., 26 Dec 2025).
