Distributionally Robust Alignment for Medical Federated Vision-Language Pre-training Under Data Heterogeneity (2404.03854v3)
Abstract: Vision-language pre-training (VLP) has emerged as an effective scheme for multimodal representation learning, but its reliance on large-scale multimodal data poses significant challenges for medical applications. Federated learning (FL) offers a promising solution to scale up the dataset for medical VLP while preserving data privacy. However, we observe that client data heterogeneity in real-world scenarios could cause models to learn biased cross-modal alignment during local pre-training. This would limit the transferability of the federated representation model on downstream tasks. To address this challenge, we propose Federated Distributionally Robust Alignment (FedDRA), a framework for federated VLP that achieves robust vision-language alignment under heterogeneous conditions. Based on the client datasets, we construct a distribution family that encompasses potential test-time domains, and apply a distributionally robust optimization framework to optimize the pre-trained model's performance across this distribution space. This approach bridges the gap between pre-training samples and downstream applications. To avoid over-fitting to client-specific information, we use anchor representations from the global model to guide local training, and adopt a two-stage approach that first tunes the deeper layers before updating the entire network. Extensive experiments on real-world datasets demonstrate FedDRA's effectiveness in enhancing medical federated VLP under data heterogeneity. Our method also adapts well to various medical pre-training methods.
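The abstract's core idea — optimizing worst-case performance over a distribution family spanned by the client datasets — can be illustrated with a standard group-DRO-style reweighting of per-client losses. This is a minimal sketch under assumptions, not the paper's exact algorithm: the function names, the exponentiated-gradient update, and the step size `eta` are illustrative choices.

```python
import numpy as np

def dro_client_weights(client_losses, q, eta=0.1):
    """One exponentiated-gradient step on the client mixture weights.

    Clients with higher local loss receive more weight, so the
    aggregated objective leans toward the worst-case mixture over
    the distribution family spanned by the client datasets.
    (Illustrative group-DRO-style update; `eta` is a hypothetical
    step size, not a value from the paper.)
    """
    q = q * np.exp(eta * np.asarray(client_losses, dtype=float))
    return q / q.sum()  # project back onto the probability simplex

def dro_objective(client_losses, q):
    """Mixture-weighted average of per-client losses."""
    return float(np.dot(q, client_losses))

# Toy round: three clients, one with a much larger local loss.
q = np.ones(3) / 3
losses = [0.2, 0.3, 1.5]
for _ in range(10):
    q = dro_client_weights(losses, q)
# The weight vector q shifts toward the high-loss client, and the
# DRO objective exceeds the plain uniform average of the losses.
```

In a federated round, the server would use such weights when aggregating client updates, so clients whose local alignment loss is high pull the global model toward their distribution rather than being averaged away.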
- Ehrxqa: A multi-modal question answering dataset for electronic health records with chest x-ray images. arXiv preprint arXiv:2310.18652, 2023.
- Learning to exploit temporal structure for biomedical vision-language processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15016–15027, 2023.
- Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. Advances in Neural Information Processing Systems, 35:32897–32912, 2022.
- Reflacx, a dataset of reports and eye-tracking data for localization of abnormalities in chest x-rays. Scientific Data, 9(1):350, 2022.
- Latent dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
- Making the most of text semantics to improve biomedical vision–language processing. In European Conference on Computer Vision, pp. 1–21. Springer, 2022.
- Clusterfix: A cluster-based debiasing approach without protected-group supervision. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 4870–4879, 2024.
- Federated large language model: A position paper. arXiv preprint arXiv:2307.08925, 2023a.
- Feddat: An approach for foundation model finetuning in multi-modal heterogeneous federated learning. arXiv preprint arXiv:2308.12305, 2023b.
- Graph optimal transport for cross-domain alignment. In International Conference on Machine Learning, pp. 1542–1553. PMLR, 2020a.
- A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pp. 1597–1607. PMLR, 2020b.
- Exploiting shared representations for personalized federated learning. In International Conference on Machine Learning, pp. 2089–2099. PMLR, 2021.
- Optimization with non-differentiable constraints with applications to fairness, recall, churn, and other goals. Journal of Machine Learning Research, 20(172):1–59, 2019.
- Distributionally robust federated averaging. Advances in Neural Information Processing Systems, 33:15111–15122, 2020.
- Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Heterofl: Computation and communication efficient federated learning for heterogeneous clients. arXiv preprint arXiv:2010.01264, 2020.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Efficient projections onto the l1-ball for learning in high dimensions. In Proceedings of the 25th International Conference on Machine Learning, pp. 272–279, 2008.
- Learning models with uniform performance via distributionally robust optimization. The Annals of Statistics, 49(3):1378–1406, 2021.
- Robust federated learning in a heterogeneous environment. arXiv preprint arXiv:1906.06629, 2019.
- Distributionally robust unsupervised dense retrieval training on web graphs. arXiv preprint arXiv:2310.16605, 2023.
- Fedx: Unsupervised federated learning with cross knowledge distillation. In European Conference on Computer Vision, pp. 691–707. Springer, 2022.
- Healnet–hybrid multi-modal fusion for heterogeneous biomedical data. arXiv preprint arXiv:2311.09115, 2023.
- Gloria: A multimodal global-local representation learning framework for label-efficient medical image recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 3942–3951, 2021.
- Learn from others and be yourself in heterogeneous federated learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10143–10153, 2022.
- Harmofl: Harmonizing local and global drifts in federated learning on heterogeneous medical images. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp. 1087–1095, 2022.
- Heterogeneous graph learning for multi-modal medical data analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 5141–5150, 2023.
- Self-supervised learning in medicine and healthcare. Nature Biomedical Engineering, 6(12):1346–1352, 2022.
- Integration of artificial intelligence in lung cancer: Rise of the machine. Cell Reports Medicine, 2023.
- Fedmd: Heterogenous federated learning via model distillation. arXiv preprint arXiv:1910.03581, 2019.
- Sequential learning for domain generalization. In European Conference on Computer Vision, pp. 603–619. Springer, 2020a.
- Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pp. 12888–12900. PMLR, 2022a.
- Federated learning on non-iid data silos: An experimental study. In 2022 IEEE 38th International Conference on Data Engineering (ICDE), pp. 965–978. IEEE, 2022b.
- Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50–60, 2020b.
- Exploring plain vision transformer backbones for object detection. In European Conference on Computer Vision, pp. 280–296. Springer, 2022c.
- Distributionally robust learning with stable adversarial training. IEEE Transactions on Knowledge and Data Engineering, 2022.
- Scaling-up medical vision-and-language representation learning with federated learning. Engineering Applications of Artificial Intelligence, 126:107037, 2023a.
- Zoopfl: Exploring black-box foundation models for personalized federated learning. arXiv preprint arXiv:2310.05143, 2023b.
- Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pp. 1273–1282. PMLR, 2017.
- A comprehensive study of image classification model sensitivity to foregrounds, backgrounds, and visual attributes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 19087–19097, 2022.
- Vindr-cxr: An open dataset of chest x-rays with radiologist’s annotations. Scientific Data, 9(1):429, 2022.
- Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
- Multivariate prototype representation for domain-generalized incremental learning. arXiv preprint arXiv:2309.13563, 2023.
- On guiding visual attention with language specification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 18092–18102, 2022.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.
- The future of digital health with federated learning. NPJ Digital Medicine, 3(1):119, 2020.
- Reducing reliance on spurious features in medical image classification with spatial specificity. In Machine Learning for Healthcare Conference, pp. 760–784. PMLR, 2022.
- Distributionally robust optimization for deep kernel multiple instance learning. In International Conference on Artificial Intelligence and Statistics, pp. 2188–2196. PMLR, 2021.
- Augmenting the national institutes of health chest radiograph dataset with expert annotations of possible pneumonia. Radiology: Artificial Intelligence, 1(1):e180041, 2019.
- On generalizing beyond domains in cross-domain continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9265–9274, 2022.
- Cross-domain federated adaptive prompt tuning for clip. arXiv preprint arXiv:2211.07864, 2022.
- Continual adaptation of visual representations via domain randomization and meta-learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4443–4453, 2021.
- Multi-granularity cross-modal alignment for generalized medical visual representation learning. Advances in Neural Information Processing Systems, 35:33536–33549, 2022.
- Covid-net: A tailored deep convolutional neural network design for detection of covid-19 cases from chest x-ray images. Scientific Reports, 10(1):19549, 2020.
- Chest imagenome dataset (version 1.0.0). PhysioNet, 5:18, 2021.
- Cross-modal semantic alignment pre-training for vision-and-language navigation. In Proceedings of the 30th ACM International Conference on Multimedia, pp. 4233–4241, 2022.
- Label-efficient self-supervised federated learning for tackling data heterogeneity in medical imaging. IEEE Transactions on Medical Imaging, 2023.
- Vision-language pre-training with triple contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15671–15680, 2022.
- Multimodal federated learning via contrastive representation ensemble. arXiv preprint arXiv:2302.08888, 2023a.
- Federated foundation models: Privacy-preserving and collaborative learning for large models. arXiv preprint arXiv:2305.11414, 2023b.
- Federated unsupervised representation learning. Frontiers of Information Technology & Electronic Engineering, 24(8):1181–1193, 2023a.
- Unified fair federated learning for digital healthcare. Patterns, 2023b.
- Heterogeneous feature fusion and cross-modal alignment for composed image retrieval. In Proceedings of the 29th ACM International Conference on Multimedia, pp. 5353–5362, 2021.
- Robust self-supervised structural graph neural network for social network prediction. In Proceedings of the ACM Web Conference 2022, pp. 1352–1361, 2022a.
- Contrastive learning of medical visual representations from paired images and text. In Machine Learning for Healthcare Conference, pp. 2–25. PMLR, 2022b.
- Fedvln: Privacy-preserving federated vision-and-language navigation. In European Conference on Computer Vision, pp. 682–699. Springer, 2022.
- Divergence-aware federated self-supervised learning. arXiv preprint arXiv:2204.04385, 2022.
- When foundation model meets federated learning: Motivations, challenges, and future directions. arXiv preprint arXiv:2306.15546, 2023.