Multimodal Doctor-in-the-Loop: A Clinically-Guided Explainable Framework for Predicting Pathological Response in Non-Small Cell Lung Cancer (2505.01390v1)

Published 2 May 2025 in cs.CV, cs.AI, and cs.LG

Abstract: This study proposes a novel approach combining Multimodal Deep Learning with intrinsic eXplainable Artificial Intelligence techniques to predict pathological response in non-small cell lung cancer patients undergoing neoadjuvant therapy. Due to the limitations of existing radiomics and unimodal deep learning approaches, we introduce an intermediate fusion strategy that integrates imaging and clinical data, enabling efficient interaction between data modalities. The proposed Multimodal Doctor-in-the-Loop method further enhances clinical relevance by embedding clinicians' domain knowledge directly into the training process, guiding the model's focus gradually from broader lung regions to specific lesions. Results demonstrate improved predictive accuracy and explainability, providing insights into optimal data integration strategies for clinical applications.

Summary

  • The paper demonstrates that integrating CT images and clinical data using an intermediate fusion strategy improves prediction of pathological response in NSCLC.
  • It employs a gradual learning process with expert-provided XAI loss to refine model focus on critical lesion regions.
  • Experimental results from 5-fold cross-validation show superior accuracy, AUC, and MCC compared to unimodal and alternative fusion approaches.

This paper, "Multimodal Doctor-in-the-Loop: A Clinically-Guided Explainable Framework for Predicting Pathological Response in Non-Small Cell Lung Cancer" (2505.01390), introduces a novel approach to predict pathological response (pR) in NSCLC patients undergoing neoadjuvant therapy (NAT) by integrating multimodal data (CT images and clinical features) using deep learning and explainable AI (XAI). The core idea is to develop a robust, explainable, and clinically relevant predictive model for a task where traditional unimodal or radiomics-based methods have limitations, especially on small medical datasets.

The proposed framework, called Multimodal Doctor-in-the-Loop, extends a unimodal Doctor-in-the-Loop approach by incorporating an intermediate fusion strategy for imaging and clinical data. The unimodal CT model is trained using a Gradual Learning (GL) process. This process starts with training on the global lung region (bounding box) using only a classification loss. As training progresses, it introduces expert-provided segmentation masks (lung, then lesion) and an additional XAI loss alongside the classification loss. The XAI loss encourages the model's attention (measured via Grad-CAM heatmaps) to align with the expert-defined regions, progressively guiding the model's focus from broader areas to the specific lesion. A separate unimodal clinical model is trained on structured clinical data using a standard classification loss.
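
To make the schedule concrete, the following PyTorch sketch shows one possible GL training step. The stage cutoffs, the `masks` dictionary, and the assumption that the model returns a Grad-CAM heatmap alongside its logits are all illustrative; the paper does not specify these implementation details.

```python
import torch
import torch.nn.functional as F

def gl_stage(epoch):
    """Map the current epoch to a supervision target (hypothetical cutoffs)."""
    if epoch < 50:
        return None        # stage 1: lung bounding box, classification loss only
    if epoch < 150:
        return "lung"      # stage 2: align attention with the lung mask
    return "lesion"        # stage 3: align attention with the lesion mask

def training_step(model, ct_volume, label, masks, epoch, lam=1.0):
    # Assumed model interface: returns class logits plus a Grad-CAM heatmap
    # already resized to the resolution of the expert segmentation masks.
    logits, heatmap = model(ct_volume)
    loss = F.cross_entropy(logits, label)            # classification loss
    stage = gl_stage(epoch)
    if stage is not None:                            # XAI loss kicks in with masks
        loss = loss + lam * F.mse_loss(heatmap, masks[stage])
    return loss
```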

The key technical contribution is the multimodal integration. After independently training the unimodal CT (following the GL+XAI process) and clinical models, the framework fuses their modality-specific feature representations at an intermediate point in the network architecture. This intermediate fusion is implemented by concatenating the feature vectors from the imaging and clinical paths and feeding them into an MLP module. The combined multimodal model is then trained end-to-end using a composite loss function that includes the classification loss and the XAI loss (applied to the imaging pathway's heatmaps), maintaining explainability throughout the multimodal training.
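
The fusion point itself is straightforward to express. Below is a minimal PyTorch sketch of such a module; the layer widths and dropout rate are assumptions (1664 matches a DenseNet169 feature vector, but the paper does not report exact sizes).

```python
import torch
import torch.nn as nn

class IntermediateFusion(nn.Module):
    """Concatenate imaging and clinical feature vectors, then classify via an MLP."""
    def __init__(self, img_dim=1664, clin_dim=64, hidden=256, n_classes=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(img_dim + clin_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, n_classes),
        )

    def forward(self, img_feat, clin_feat):
        fused = torch.cat([img_feat, clin_feat], dim=1)  # intermediate fusion point
        return self.head(fused)
```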

The dataset used for evaluation is an in-house collection of 100 NSCLC patients who received NAT. It includes pre-NAT CT scans with expert segmentations of lung and lesion regions, and comprehensive clinical features covering patient characteristics, tumor information, biopsy details, molecular biomarkers, and treatment details.

Implementation details for the models include:

  • Imaging Model: A 3D DenseNet169 is used as the backbone for processing CT scans.
  • Clinical Model: A Multilayer Perceptron (MLP) is used for processing clinical features.
  • Intermediate Fusion Module: The feature vectors from the two unimodal models are concatenated and passed through an additional MLP.
  • Training: Experiments are conducted using 5-fold stratified cross-validation. The Adam optimizer is used with an initial learning rate of 0.001 and weight decay. A warm-up period is applied, followed by training for up to 300 epochs with early stopping based on validation loss (see the sketch after this list).
  • Loss Functions:
    • Classification Loss ($\mathscr{L}_{cls}$): Cross-entropy loss.
    • XAI Loss ($\mathscr{L}_{xai}$): Mean Squared Error (MSE) between Grad-CAM heatmaps and expert masks. The composite loss is $\mathscr{L} = \mathscr{L}_{cls} + \lambda \mathscr{L}_{xai}$, where $\lambda = 1$ was empirically chosen. Grad-CAM is computed on the first convolutional layer to focus on low-level features comparable to segmentation masks.
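
A minimal sketch of this training protocol follows, assuming hypothetical helpers `build_model`, `train_one_epoch`, and `evaluate` and dataset arrays `X`, `y`; the weight-decay value and early-stopping patience are assumptions, and the warm-up phase is omitted for brevity.

```python
import torch
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    model = build_model()                              # hypothetical factory
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
    best_val, patience, stale = float("inf"), 20, 0
    for epoch in range(300):                           # up to 300 epochs
        train_one_epoch(model, optimizer, train_idx)   # composite loss inside
        val_loss = evaluate(model, val_idx)
        if val_loss < best_val:
            best_val, stale = val_loss, 0              # new best: reset counter
        else:
            stale += 1
            if stale >= patience:                      # early stopping on val loss
                break
```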

Pre-processing steps are crucial for handling multimodal data. CT images are resampled, windowed, normalized, and cropped to a uniform size. Data augmentation (spatial shifts, flips) is applied to enhance robustness. Clinical data undergo one-hot encoding for categorical variables, ordinal encoding for ranked features, and z-score normalization for numerical variables.
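
For the clinical branch, these encodings map naturally onto scikit-learn transformers. A minimal sketch follows; the column names are invented placeholders, since the paper does not enumerate the exact features.

```python
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

preprocess = ColumnTransformer([
    ("categorical", OneHotEncoder(handle_unknown="ignore"),
     ["histology", "smoking_status"]),   # unordered categories -> one-hot
    ("ordinal", OrdinalEncoder(),
     ["tumor_stage"]),                   # ranked features -> ordinal codes
    ("numeric", StandardScaler(),
     ["age", "chemo_cycles"]),           # numerical features -> z-scores
])
# X_clinical = preprocess.fit_transform(clinical_df)  # pandas DataFrame assumed
```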

The experimental results demonstrate the practical value of the proposed framework. The Multimodal Doctor-in-the-Loop with intermediate fusion achieved the best performance across Accuracy, AUC, and MCC metrics compared to unimodal models and multimodal models using early or late fusion strategies. This highlights the benefit of integrating both data modalities and the effectiveness of intermediate fusion in capturing complementary information and interactions between them. Ablation studies further support the importance of the full Doctor-in-the-Loop methodology (combining GL and XAI guidance) over approaches relying solely on XAI guidance or just segmented inputs.

From an explainability perspective, the framework provides both imaging and clinical insights. Grad-CAM heatmaps generated by the multimodal model show a more refined and precise focus on the lesion area compared to the unimodal CT model. SHAP values applied to the clinical features reveal the relative importance of variables like the number of induction chemotherapy cycles, diagnosis type, treatment interruption days, and radiotherapy technique in influencing predictions, offering actionable insights for clinicians.
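
For reference, a minimal sketch of producing such SHAP attributions for the clinical features. The paper does not state which explainer was used; `predict_fn` (a wrapper mapping a feature matrix to predicted probabilities), `background` (a small reference sample), and `X_clinical` are hypothetical.

```python
import shap

# KernelExplainer is model-agnostic: it only needs a prediction function
# and a background dataset to estimate feature attributions.
explainer = shap.KernelExplainer(predict_fn, background)
shap_values = explainer.shap_values(X_clinical)
shap.summary_plot(shap_values, X_clinical)  # global ranking of clinical features
```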

The paper notes limitations related to the small size of the single-center dataset and suggests future work on larger, multicentric validation, incorporating longitudinal data, and extending the approach to other clinical tasks or cancer types. The source code is available at https://github.com/cosbidev/Doctor-in-the-Loop.
