Brain Hematoma Marker Recognition Using Multitask Learning: SwinTransformer and Swin-Unet (2505.06185v2)

Published 9 May 2025 in cs.LG and cs.CV

Abstract: This paper proposes MTL-Swin-Unet, a multi-task learning method that uses transformers for classification and semantic segmentation. To address spurious-correlation problems, the method enhances the image representation with two additional representations: one obtained by semantic segmentation and one obtained by image reconstruction. In our experiments, the proposed method achieved a higher F-value than other classifiers when the test data included slices from the same patient (no covariate shift). Similarly, when the test data did not include slices from the same patient (covariate shift setting), the proposed method achieved a higher AUC.

Summary

The paper introduces MTL-Swin-Unet, a multitask learning approach using Swin Transformer and Swin-Unet for brain hematoma marker recognition, designed to mitigate spurious correlations.
The method integrates classification, semantic segmentation, and image reconstruction tasks within a unified framework to enhance image feature representation for robust recognition.
Experiments show MTL-Swin-Unet achieves superior accuracy and AUC over baselines, particularly on datasets with covariate shift, highlighting the benefit of integrating the segmentation task.

Exploration of MTL-Swin-Unet for Brain Hematoma Marker Recognition

The paper by Hirata and Okita introduces MTL-Swin-Unet, a method that applies transformer-based multitask learning to classification and semantic segmentation for brain hematoma marker recognition. The method targets spurious correlation, a common problem in image recognition in which a model learns to associate the target label with non-target objects or features that happen to co-occur with it in the image.

Methodology and Architecture

The core innovation lies in enhancing the image representation through three tasks: semantic segmentation, image reconstruction, and classification. These tasks are integrated within a unified framework built on the Swin-Unet architecture. The Swin Transformer encoder serves as the backbone: it progressively downsamples the input to produce hierarchical representations, which are used for the segmentation and reconstruction tasks and also provide compact features for classification via a linear layer.
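To make the "gradual downsampling" concrete, the following sketch computes the feature-map shape after each encoder stage. It assumes the standard Swin Transformer layout (a patch-partition step followed by patch-merging steps that halve the resolution and double the channels) and the common Swin-T defaults of a 224x224 input, patch size 4, and embedding dimension 96. The summary does not state which input size or model variant the authors used, so these values and the function name `swin_stage_shapes` are illustrative assumptions.

```python
def swin_stage_shapes(image_size=224, patch_size=4, embed_dim=96, num_stages=4):
    """Return (height, width, channels) of the feature map after each encoder stage.

    Defaults follow the usual Swin-T configuration; they are assumptions,
    not values confirmed by the paper summary.
    """
    total_factor = patch_size * 2 ** (num_stages - 1)
    if image_size % total_factor != 0:
        raise ValueError("image_size must be divisible by the total downsampling factor")

    shapes = []
    side = image_size // patch_size  # patch partition: one token per patch
    channels = embed_dim
    for stage in range(num_stages):
        if stage > 0:
            side //= 2       # patch merging halves the spatial resolution
            channels *= 2    # and doubles the channel dimension
        shapes.append((side, side, channels))
    return shapes


shapes = swin_stage_shapes()
for stage_index, shape in enumerate(shapes, start=1):
    print(f"stage {stage_index}: {shape}")
```

In a U-shaped design of this kind, the shallower stages typically feed the decoder through skip connections, while the deepest, lowest-resolution stage is pooled into a vector for the linear classification head.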

The paper introduces two training schemes: multi-task learning and joint learning. In the multi-task setup, the network is trained concurrently on classification, semantic segmentation, and image reconstruction; the tasks share parameters, and the training objective is a weighted sum of the task losses. In joint learning, by contrast, an encoder is first trained on the segmentation task, its parameters are then frozen, and the frozen encoder is used for the subsequent classification task.
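The two training schemes can be written down compactly. The notation below is an assumption introduced for illustration, not the authors' own: $\theta$ denotes the shared encoder parameters, $\phi_{\mathrm{cls}}$ and $\phi_{\mathrm{seg}}$ denote the task-specific head parameters, and $\lambda_{\mathrm{cls}}, \lambda_{\mathrm{seg}}, \lambda_{\mathrm{rec}}$ are the task weights, whose values the summary does not report. The first line is the multi-task objective; the last two lines are the two stages of joint learning.

```latex
\begin{align}
\mathcal{L}_{\mathrm{MTL}}
  &= \lambda_{\mathrm{cls}}\,\mathcal{L}_{\mathrm{cls}}
   + \lambda_{\mathrm{seg}}\,\mathcal{L}_{\mathrm{seg}}
   + \lambda_{\mathrm{rec}}\,\mathcal{L}_{\mathrm{rec}} \\
(\theta^{*}, \phi_{\mathrm{seg}}^{*})
  &= \arg\min_{\theta,\,\phi_{\mathrm{seg}}}
     \mathcal{L}_{\mathrm{seg}}(\theta, \phi_{\mathrm{seg}}) \\
\phi_{\mathrm{cls}}^{*}
  &= \arg\min_{\phi_{\mathrm{cls}}}
     \mathcal{L}_{\mathrm{cls}}(\theta^{*}, \phi_{\mathrm{cls}})
     \quad \text{with } \theta^{*} \text{ held fixed}
\end{align}
```

The practical difference is that in multi-task learning the classification gradient also updates $\theta$, whereas in joint learning the classifier can only use whatever features the segmentation task has already produced.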

Experimental Validation

Experiments were conducted using CT images from 11 institutions to assess how well the proposed model distinguishes hypodensity markers from other brain lesion types. The findings show that MTL-Swin-Unet (cls + seg + rec) achieved notable accuracy and AUC improvements across different datasets, demonstrating its efficacy both with and without covariate shift.
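The abstract distinguishes the two evaluation settings by whether the test data contain slices from patients who also appear in training. A minimal sketch of the two kinds of split is shown below, using only the standard library. The record layout (`patient_id`, `slice_index`), the 20% test fraction, and the function names are hypothetical choices for illustration; the summary does not describe the authors' actual data-splitting procedure.

```python
import random


def split_by_slice(slices, test_fraction=0.2, seed=0):
    """Random slice-level split: a patient's slices may land on both sides."""
    rng = random.Random(seed)
    shuffled = list(slices)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def split_by_patient(slices, test_fraction=0.2, seed=0):
    """Patient-level split: every slice of a patient goes to exactly one side."""
    rng = random.Random(seed)
    patients = sorted({s["patient_id"] for s in slices})
    rng.shuffle(patients)
    n_test = max(1, round(len(patients) * test_fraction))
    test_patients = set(patients[:n_test])
    train = [s for s in slices if s["patient_id"] not in test_patients]
    test = [s for s in slices if s["patient_id"] in test_patients]
    return train, test


def shared_patients(train, test):
    """Patients that appear in both the training and the test set."""
    return {s["patient_id"] for s in train} & {s["patient_id"] for s in test}


# Toy data: 10 hypothetical patients with 8 CT slices each.
slices = [
    {"patient_id": f"P{p:02d}", "slice_index": i}
    for p in range(10)
    for i in range(8)
]

train_s, test_s = split_by_slice(slices)
train_p, test_p = split_by_patient(slices)

print("slice-level split, shared patients:", len(shared_patients(train_s, test_s)))
print("patient-level split, shared patients:", len(shared_patients(train_p, test_p)))
```

A patient-level split guarantees that no patient is shared between the two sets, which corresponds to the covariate shift setting; a slice-level split gives no such guarantee and corresponds to the setting without covariate shift.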

Results Analysis

MTL-Swin-Unet exhibited superior performance compared to baseline models such as ResNet152, Swin Transformer, and joint-CNN architectures, particularly in settings involving covariate shift. The segmentation task significantly contributed to refining representations beneficial for the classification task, whereas image reconstruction alone appeared to provide excess or conflicting information.
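The abstract reports results in terms of F-value and AUC. For readers less familiar with these measures, the sketch below computes both for a binary classifier using plain Python. The labels and scores are made-up toy values, not results from the paper, and the 0.5 decision threshold is an assumption.

```python
def f_value(y_true, y_pred):
    """F1 score: harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true, scores):
    """Area under the ROC curve, computed as the fraction of (positive, negative)
    pairs in which the positive example gets the higher score (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: 1 = marker present, 0 = marker absent (hypothetical values).
y_true = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.7, 0.2, 0.8, 0.4]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

print(f"F-value: {f_value(y_true, y_pred):.3f}")
print(f"AUC:     {auc(y_true, scores):.3f}")
```

The F-value depends on the chosen threshold, while AUC summarizes ranking quality across all thresholds, which is one reason a method can lead on one measure in one setting and on the other measure in another.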

The experiments also showed that increasing the model size yielded only slight improvements, while the default model configurations generally achieved satisfactory results. This suggests that model architecture and task integration have more influence on performance than model scale alone.

Implications and Future Directions

This research is relevant to medical image analysis, specifically to improving the accuracy of lesion detection in brain CT scans. The approach shows how multitask learning can reduce reliance on spurious correlations by enriching image features through simultaneous training on complementary tasks. The findings point toward more robust image recognition models for medical diagnostics.

Future research may explore extending the multitask learning framework to other medical imaging modalities or further optimizing the balance between task weights to enhance model adaptability. Additionally, continued development in AI explainability tools, like Grad-CAM, remains crucial to ensure model transparency in settings where predictive reasoning must align with clinical understanding.

In summary, MTL-Swin-Unet embodies an effective strategy in addressing challenges inherent to medical image classification, with its methodical use of transformer-based multitask learning proving beneficial for complex recognition tasks.