Drug-Disease Conditioned SMoE in MMCTOP
- The paper presents a novel SMoE that leverages drug and disease embeddings through efficient top-k expert routing to significantly improve clinical trial AUC scores.
- It integrates modality-specific embeddings from ChemBERTa and ClinicalBERT to fuse molecular, protocol, and disease information in a unified framework.
- Empirical results demonstrate that sparse expert selection reduces computation by 6–8× while achieving substantial performance gains across multiple clinical phases.
A drug–disease–conditioned sparse Mixture-of-Experts (SMoE) is a specialized architecture in multimodal biomedical informatics for tasks such as clinical trial outcome prediction. Within the MMCTOP (MultiModal Clinical-Trial Outcome Prediction) framework, SMoE operates as a “plug-in” module atop multi-encoder fusion backbones to provide expert-driven specialization guided by the specific drug and disease context of each data instance. Unlike a traditional dense mixture-of-experts, SMoE applies top-$k$ routing to ensure computational efficiency, while conditioning expert selection explicitly on salient molecular and indication features. Empirically, this mechanism yields material improvements in metrics such as AUC, precision, and F1, with only a modest increase in computational cost (Aparício et al., 26 Dec 2025).
1. Fusion Backbone and Schematic Overview
The MMCTOP workflow integrates three principal biomedical modalities per trial:
- Molecular structure: Encoded via ChemBERTa from the drug's SMILES string, yielding a 768-dimensional vector $z_{\text{mol}}$ representing the drug.
- Protocol and eligibility information: Encoded using ClinicalBERT, again producing 768-dimensional representations of the long-form and schema-narrativized textual metadata, yielding the protocol embedding $z_{\text{prot}}$.
- Disease ontology labels: Encoded as projected ClinicalBERT embeddings $z_{\text{dis}}$ of the indication label.
These modality-specific embeddings are concatenated to form the 2304-dimensional fusion vector $z = [z_{\text{mol}};\, z_{\text{prot}};\, z_{\text{dis}}]$. The molecular and disease projections $z_{\text{mol}}$ and $z_{\text{dis}}$ serve both as part of the fused input and as the conditioning signal for the SMoE gating.
A condensed pass through the architecture proceeds as follows (a minimal sketch follows this list):
- Each modality is independently encoded and linearly projected.
- SMoE computes a context vector $c$ from $z_{\text{mol}}$ and $z_{\text{dis}}$.
- The gating network computes logits over the $E$ experts, softmaxed (with temperature scaling) to produce expert assignment scores.
- The top-$k$ experts are selected and executed in parallel; their outputs are weighted and summed to yield $h_{\text{moe}}$.
- The output $h_{\text{moe}}$ is concatenated with $z_{\text{mol}}$ and $z_{\text{dis}}$, and passed to a final classification head, outputting the calibrated probability $\hat{p}$.
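The following PyTorch sketch mirrors this pass end to end. All concrete choices (module names, $E = 16$ experts, $k = 2$, hidden sizes) are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal PyTorch sketch of the MMCTOP SMoE pass described above.
# Dimensions, expert count E = 16, and k = 2 are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DrugDiseaseSMoE(nn.Module):
    def __init__(self, d_modal=768, d_cond=256, n_experts=16, k=2,
                 d_hidden=1024, d_out=512, tau=1.0):
        super().__init__()
        self.k, self.tau, self.d_out = k, tau, d_out
        # Conditioning network: [z_mol; z_dis] (1536-d) -> context vector c.
        self.cond = nn.Sequential(nn.Linear(2 * d_modal, d_cond), nn.ReLU())
        # Gating network: logits over the E experts, computed from c.
        self.gate = nn.Linear(d_cond, n_experts)
        # Structurally identical two-layer expert MLPs over the fused vector z.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(3 * d_modal, d_hidden), nn.ReLU(),
                          nn.Dropout(0.1), nn.Linear(d_hidden, d_out))
            for _ in range(n_experts)])
        # Classification head over [h_moe; z_mol; z_dis].
        self.head = nn.Linear(d_out + 2 * d_modal, 1)

    def forward(self, z_mol, z_prot, z_dis):
        z = torch.cat([z_mol, z_prot, z_dis], dim=-1)     # fused 2304-d vector
        c = self.cond(torch.cat([z_mol, z_dis], dim=-1))  # drug-disease context
        g = F.softmax(self.gate(c) / self.tau, dim=-1)    # temperature-scaled gates
        top_g, top_i = g.topk(self.k, dim=-1)             # sparse top-k selection
        # Execute only the selected experts; weight and sum their outputs.
        h = torch.stack([
            sum(top_g[b, s] * self.experts[int(top_i[b, s])](z[b])
                for s in range(self.k))
            for b in range(z.size(0))])
        logit = self.head(torch.cat([h, z_mol, z_dis], dim=-1))
        return torch.sigmoid(logit).squeeze(-1)           # trial-success probability

model = DrugDiseaseSMoE()
z_mol, z_prot, z_dis = (torch.randn(4, 768) for _ in range(3))
print(model(z_mol, z_prot, z_dis).shape)  # torch.Size([4])
```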
2. Mathematical Formulation
The SMoE block is mathematically defined as:
- Conditioning Network:
$$c = \mathrm{ReLU}\!\left(W_c\,[z_{\text{mol}};\, z_{\text{dis}}] + b_c\right),$$
where $W_c \in \mathbb{R}^{d_c \times 1536}$ and $b_c \in \mathbb{R}^{d_c}$ map the concatenated drug and disease projections to a compact context vector $c$.
- Gating with Temperature Scaling:
$$g_i = \frac{\exp\!\left(w_i^{\top} c / \tau\right)}{\sum_{j=1}^{E} \exp\!\left(w_j^{\top} c / \tau\right)}$$
for $i = 1, \dots, E$, with temperature $\tau > 0$.
- Top-$k$ Expert Selection and Sparse Fusion:
$$h_{\text{moe}} = \sum_{i \in \mathcal{T}_k} g_i \, E_i(z), \qquad \mathcal{T}_k = \operatorname{TopK}\!\left(\{g_i\}_{i=1}^{E},\, k\right),$$
so that only the $k$ highest-scoring experts are evaluated and their outputs combined.
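As a quick numeric illustration of the gating formula (the logits and $E = 8$ here are made-up values), lower temperatures concentrate softmax mass on fewer experts:

```python
# Numeric illustration of temperature-scaled gating and top-k selection.
import torch
import torch.nn.functional as F

logits = torch.tensor([1.2, -0.3, 0.8, 2.0, -1.1, 0.1, 0.5, -0.7])  # w_i^T c
for tau in (0.5, 1.0, 2.0):            # lower tau -> sharper expert assignment
    g = F.softmax(logits / tau, dim=-1)
    top_g, top_i = g.topk(2)           # T_k for k = 2
    print(f"tau={tau}: experts {top_i.tolist()}, weights {top_g.tolist()}")
```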
3. Expert Architecture and Specialization
Each expert $E_i$ is a two-layer MLP:
$$E_i(z) = W_2^{(i)}\,\mathrm{ReLU}\!\left(W_1^{(i)} z + b_1^{(i)}\right) + b_2^{(i)},$$
with $W_1^{(i)} \in \mathbb{R}^{d_h \times 2304}$ and $W_2^{(i)} \in \mathbb{R}^{d_{\text{out}} \times d_h}$. Dropout (rate 0.1) is applied between layers. Experts are structurally identical but become functionally specialized via training; e.g., some may focus on molecular–disease interactions, others on protocol–disease relationships.
Expert specialization is driven by the conditioning network such that, for instance, different expert subsets may dominate for oncology versus metabolic disease, tailoring representation and prediction to clinical subdomains.
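Since specialization emerges during training, one plausible diagnostic is to tally routing decisions per disease group. The sketch below assumes the hypothetical `DrugDiseaseSMoE` module from the earlier sketch and illustrative group labels.

```python
# Sketch: inspect emergent expert specialization by counting, per disease
# group, how often each expert appears in the top-k set (names hypothetical).
import collections
import torch
import torch.nn.functional as F

def expert_usage_by_group(model, batches):
    """batches yields (z_mol, z_prot, z_dis, group_name) tuples."""
    usage = collections.defaultdict(collections.Counter)
    with torch.no_grad():
        for z_mol, z_prot, z_dis, group in batches:
            c = model.cond(torch.cat([z_mol, z_dis], dim=-1))
            g = F.softmax(model.gate(c) / model.tau, dim=-1)
            for idx in g.topk(model.k, dim=-1).indices.flatten().tolist():
                usage[group][idx] += 1
    return usage  # e.g. {"oncology": Counter({3: 41, ...}), "metabolic": ...}
```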
4. Integration with Modality Encoders and Fusion
Upstream, each modality passes through a dedicated Transformer and projection:
- ChemBERTa (drug): Final [CLS] embedding is projected to $z_{\text{mol}}$.
- ClinicalBERT (protocol, eligibility, ontology): Analogous projections yield $z_{\text{prot}}$ for the textual metadata and $z_{\text{dis}}$ for the disease ontology label.
- All are aligned to 768-dimensional vectors and concatenated to $z \in \mathbb{R}^{2304}$.
- $z_{\text{mol}}$ and $z_{\text{dis}}$ (the drug and disease embeddings) are the corresponding projections used for conditioning.
The SMoE module then conditions its routing and expert selection on the clinically germane $z_{\text{mol}}$ and $z_{\text{dis}}$, enabling context-aware fusion on top of standard modality encoding (a sketch of the encoding step follows).
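A minimal sketch of this step, assuming publicly available Hugging Face checkpoints (the exact checkpoints used in MMCTOP are not specified here):

```python
# Sketch of the upstream encoding + projection step; checkpoint names are
# illustrative assumptions, not confirmed to be the ones used in MMCTOP.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

chem_tok = AutoTokenizer.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")
chem_enc = AutoModel.from_pretrained("seyonec/ChemBERTa-zinc-base-v1")
clin_tok = AutoTokenizer.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
clin_enc = AutoModel.from_pretrained("emilyalsentzer/Bio_ClinicalBERT")
proj_mol = nn.Linear(768, 768)   # learned per-modality projections
proj_dis = nn.Linear(768, 768)

def encode(text, tok, enc, proj):
    batch = tok(text, return_tensors="pt", truncation=True)
    cls = enc(**batch).last_hidden_state[:, 0]   # final [CLS] embedding
    return proj(cls)                             # aligned 768-d modality vector

z_mol = encode("CC(=O)Oc1ccccc1C(=O)O", chem_tok, chem_enc, proj_mol)  # aspirin SMILES
z_dis = encode("type 2 diabetes mellitus", clin_tok, clin_enc, proj_dis)
```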
5. Training Objectives and Regularization
The primary loss is binary cross-entropy (BCE) for trial success prediction:
$$\mathcal{L}_{\text{BCE}} = -\left[\, y \log \hat{p} + (1 - y)\log(1 - \hat{p}) \,\right].$$
To avoid expert collapse, a load-balancing penalty (following Shazeer et al., 2017) is applied:
$$\mathcal{L}_{\text{bal}} = E \sum_{i=1}^{E} f_i\, \bar{g}_i,$$
with $f_i$ the fraction of samples routed to expert $i$, and $\bar{g}_i$ the mean gating weight of expert $i$.
The total loss is:
$$\mathcal{L} = \mathcal{L}_{\text{BCE}} + \lambda\, \mathcal{L}_{\text{bal}},$$
with $\lambda > 0$ a small balancing coefficient (a minimal sketch of this objective follows).
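A minimal sketch of the combined objective, assuming the model returns gate scores and top-$k$ indices and using an illustrative $\lambda$:

```python
# Sketch of the BCE + load-balancing objective defined above; lam is illustrative.
import torch
import torch.nn.functional as F

def smoe_loss(p_hat, y, gates, top_idx, n_experts, lam=0.01):
    bce = F.binary_cross_entropy(p_hat, y.float())
    # f_i: fraction of samples whose top-k set includes expert i.
    routed = torch.zeros_like(gates).scatter_(1, top_idx, 1.0)
    f = routed.mean(dim=0)
    g_bar = gates.mean(dim=0)                  # mean gating weight per expert
    balance = n_experts * torch.sum(f * g_bar)
    return bce + lam * balance
```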
After training, scalar temperature calibration is performed for post-hoc probability adjustment, optimizing the reliability of $\hat{p}$ for downstream risk estimation.
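One standard way to implement this (a sketch in the style of Guo et al., 2017; not necessarily the authors' exact procedure) is to fit a single temperature $T$ on validation logits by minimizing NLL:

```python
# Post-hoc scalar temperature calibration: fit T > 0 on held-out logits.
import torch

def calibrate_temperature(val_logits, val_labels, steps=200, lr=0.01):
    log_t = torch.zeros(1, requires_grad=True)   # optimize log T so T stays positive
    opt = torch.optim.Adam([log_t], lr=lr)
    bce = torch.nn.BCEWithLogitsLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = bce(val_logits / log_t.exp(), val_labels.float())
        loss.backward()
        opt.step()
    return log_t.exp().item()  # at test time: sigmoid(logits / T)
```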
6. Empirical Performance and Ablation Studies
Empirically, the introduction of drug–disease–conditioned SMoE into the fusion backbone yields marked improvements:
| Dataset | Baseline (AUC) | SMoE (AUC) | ΔAUC (pp) |
|---|---|---|---|
| TOP (Phase I) | 54.7% | 69.1% | +14.4 |
| TOP (Phase II) | 52.7% | 58.7% | +6.0 |
| TOP (Phase III) | 56.1% | 56.3% | +0.2 |
| CTOD (Phase I) | 66.4% | 77.5% | +11.1 |
| CTOD (Phase II) | 58.4% | 63.3% | +4.9 |
| CTOD (Phase III) | 67.5% | 72.9% | +5.4 |
Ablation studies indicate that removal of drug–disease subspace conditioning from gating degrades AUC/PR by 2–5 percentage points, and substituting the SMoE with a single shared MLP reduces early-phase AUC by up to 10 points. Gating over all modalities increases compute without consistent metric improvements.
7. Computational Cost and Scalability
Sparse top-$k$ routing fundamentally reduces computational requirements:
- Instead of evaluating all $E$ expert MLPs per instance, only the top-$k$ are executed for a given input ($k/E = 1/8$, i.e. 12.5% of the dense MoE compute); see the back-of-envelope check after this list.
- Effective FLOPs are decreased 6–8× in the MoE layer, and GPU memory usage is comparably reduced.
- Inference latency is approximately 5 ms per example (on NVIDIA RTX 4080), compared to ~40 ms if all experts were evaluated.
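A back-of-envelope check of these figures, using illustrative values consistent with $k/E = 1/8$ and the assumed expert dimensions from the earlier sketch:

```python
# Dense vs. sparse expert compute; E, k, and layer sizes are illustrative.
E, k = 16, 2                                         # k/E = 1/8
flops_per_expert = 2 * 2304 * 1024 + 2 * 1024 * 512  # two-layer MLP (assumed dims)
dense, sparse = E * flops_per_expert, k * flops_per_expert
print(f"fraction of dense compute: {sparse / dense:.1%}, speedup: {dense / sparse:.0f}x")
# -> 12.5% of dense compute, 8x speedup; gating overhead narrows this to ~6-8x
```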
This cost structure supports both scalability for large-scale clinical trial risk scoring and practical deployment in resource-constrained environments.
The drug–disease–conditioned SMoE in MMCTOP thus delivers context-driven expert routing, specialization, and superior empirical performance with minimal overhead, explicitly leveraging the two most clinically salient dimensions (drug and disease) for dynamic multimodal fusion (Aparício et al., 26 Dec 2025).