Multimodal Contrastive Learning with Tabular and Imaging Data
The paper "Best of Both Worlds: Multimodal Contrastive Learning with Tabular and Imaging Data" presents a self-supervised contrastive learning framework designed to integrate both tabular and imaging data, a typically underexplored synergy in the domain of deep learning. The authors illustrate their methodology in the context of medical datasets of biobanks that possess rich tabular data complemented with imaging data. The paper addresses the compelling need for unsupervised methods that exploit the availability of multimodal data for effective unimodal predictions, especially in scenarios where smaller, less diverse datasets are the norm, rather than exceptions.
The cornerstone of the proposed framework is its combination of two established contrastive learning methods: SimCLR for the imaging modality and SCARF for the tabular modality. By combining these techniques, the framework trains unimodal encoders within a multimodal pretraining setup. The approach is validated through rigorous experimentation, notably on predicting the risk of myocardial infarction and coronary artery disease (CAD) from cardiac MRI images and clinical features sourced from the UK Biobank. The dataset comprises about 40,000 subjects, underscoring the framework's applicability to large, diverse medical datasets.
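To make the setup concrete, the sketch below shows the core cross-modal objective in PyTorch: each subject's image embedding (from a SimCLR-style image encoder) and tabular embedding (from a SCARF-style tabular encoder) are projected into a shared space and contrasted with an InfoNCE loss, where the matching image–tabular pair is the positive. This is a minimal illustration under assumed architectures and names (`TabularEncoder`, `multimodal_info_nce`), not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of the pretraining idea (not the authors' exact code):
# each modality gets its own encoder, and a CLIP-style InfoNCE loss pulls
# together the image and tabular views of the same subject.

class TabularEncoder(nn.Module):
    """Simple MLP encoder for tabular features (stand-in for SCARF's encoder)."""
    def __init__(self, num_features: int, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_features, 256), nn.ReLU(),
            nn.Linear(256, dim),
        )

    def forward(self, x):
        return self.net(x)

def multimodal_info_nce(z_img, z_tab, temperature: float = 0.1):
    """Contrast image and tabular projections of the same subjects.

    Matching (image, tabular) pairs lie on the diagonal and act as
    positives; all other pairs in the batch are negatives.
    """
    z_img = F.normalize(z_img, dim=1)
    z_tab = F.normalize(z_tab, dim=1)
    logits = z_img @ z_tab.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(z_img.size(0))         # positives on the diagonal
    # Symmetric loss: image-to-tabular and tabular-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: 8 subjects, 32 tabular features, 128-d image embeddings.
img_embeddings = torch.randn(8, 128)   # would come from a SimCLR-style image encoder
tab_encoder = TabularEncoder(num_features=32)
tab_embeddings = tab_encoder(torch.randn(8, 32))
loss = multimodal_info_nce(img_embeddings, tab_embeddings)
```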
One of the paper's significant findings is that morphometric tabular features (attributes describing size and shape) carry disproportionate importance in the learning process. This was established through attribution analysis with integrated gradients, coupled with ablation studies. Moreover, the research introduces a novel supervised contrastive learning method termed "Label as a Feature" (LaaF), which enhances multimodal pretraining by appending the ground-truth label as an additional tabular feature. This strategy outperformed existing supervised contrastive baselines.
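The mechanics of LaaF are simple enough to show directly: the label becomes one more column of the tabular input before encoding. The snippet below is a hedged illustration of that idea; the function name and shapes are assumptions, not the paper's code.

```python
import torch

# Illustration of "Label as a Feature" (LaaF): the ground-truth label is
# concatenated to the tabular feature vector during supervised contrastive
# pretraining, so samples sharing a label become more similar in tabular
# space. Names and shapes here are illustrative placeholders.

def append_label_as_feature(x_tab: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Concatenate the class label as one extra tabular column."""
    return torch.cat([x_tab, y.float().unsqueeze(1)], dim=1)

x_tab = torch.randn(8, 32)               # 8 subjects, 32 clinical features
y = torch.randint(0, 2, (8,))            # binary labels, e.g. CAD / no CAD
x_tab_laaf = append_label_as_feature(x_tab, y)   # shape: (8, 33)
# x_tab_laaf is then encoded and contrasted exactly as before.
```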
Results and Implications
The research demonstrates the framework's robustness across data regimes, including low-data settings, highlighting its utility in medical contexts where data scarcity is common. In comparisons with established contrastive learning approaches such as SimCLR, BYOL, and others, the multimodal architecture achieved better downstream predictive performance, particularly when evaluated with frozen encoders (i.e., linear probing).
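For reference, frozen-encoder evaluation typically looks like the sketch below: the pretrained encoder's parameters are frozen and only a linear classifier is trained on its features. The stand-in encoder here is a placeholder, not the paper's backbone.

```python
import torch
import torch.nn as nn

# Sketch of the frozen-encoder (linear-probing) evaluation referenced above.
# `pretrained_encoder` is a placeholder for whichever image encoder the
# pretraining stage produced.

pretrained_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))  # stand-in

for p in pretrained_encoder.parameters():
    p.requires_grad = False               # freeze the encoder
pretrained_encoder.eval()

probe = nn.Linear(128, 2)                 # e.g. infarction risk: yes / no
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 64, 64)
labels = torch.randint(0, 2, (8,))
with torch.no_grad():
    features = pretrained_encoder(images)  # no gradients through the encoder
logits = probe(features)
loss = criterion(logits, labels)
loss.backward()                            # updates only the linear probe
optimizer.step()
```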
From a broader perspective, this work suggests several important implications for the future of AI in medical informatics:
- Enhanced Interpretability: Incorporating tabular data alongside imaging yields more interpretable models, since each tabular feature maps directly to a semantically meaningful clinical variable, a vital property in medical AI (see the attribution sketch after this list).
- Reduction in Label Dependency: By enabling effective training through unlabeled data, the framework reduces reliance on costly and time-consuming human annotation processes, a common bottleneck in medical imaging applications.
- Versatility Across Datasets: Though designed for medical datasets, the framework's successful application to a natural-image dataset (the DVM car advertisement dataset) suggests it can adapt to other domains where tabular and imaging data coexist, warranting further exploration.
- Potential for Further Integration: Given the promising results with two modalities, an interesting extension would be integrating text and genetic information, which, alongside tabular and imaging data, would offer a richer picture of patient history and diagnostics.
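As referenced in the interpretability point above, per-feature attributions of the kind the authors compute with integrated gradients can be produced with Captum. The sketch below uses a placeholder model and feature count rather than the paper's architecture.

```python
import torch
import torch.nn as nn
from captum.attr import IntegratedGradients

# Hedged sketch of per-feature attribution via integrated gradients.
# The model and feature count are placeholders, not the authors' setup.

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
model.eval()

ig = IntegratedGradients(model)
x = torch.randn(4, 32)                    # 4 subjects, 32 tabular features
baseline = torch.zeros_like(x)            # all-zero reference input
# Attribution of the positive class (index 1) to each tabular feature;
# large magnitudes flag features (e.g. morphometric ones) the model relies on.
attributions = ig.attribute(x, baselines=baseline, target=1)
```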
Future Considerations
The introduction of LaaF opens avenues for future work on its integration with other contrastive learning paradigms. Extending the framework beyond classification to regression and segmentation tasks would also be a promising direction. A remaining challenge is how best to leverage multimodal strategies not only to enhance predictive capability but also to ensure fairness and mitigate potential biases, especially in datasets reflecting demographic disparities such as ethnicity and socioeconomic status.
In conclusion, this work contributes effectively to the literature on contrastive learning by advancing the multimodal integration of tabular and imaging data. It sets a precedent for approaches that leverage the strengths of diverse data types to optimize unimodal prediction tasks, with potentially far-reaching implications across AI-driven domains.