Progressive trajectory matching for medical dataset distillation (2403.13469v1)
Abstract: It is essential but challenging to share medical image datasets due to privacy issues, which prohibit building foundation models and knowledge transfer. In this paper, we propose a novel dataset distillation method to condense the original medical image datasets into a synthetic one that preserves useful information for building an analysis model without accessing the original datasets. Existing methods tackle only natural images by randomly matching parts of the training trajectories of the model parameters trained by the whole real datasets. However, through extensive experiments on medical image datasets, the training process is extremely unstable and achieves inferior distillation results. To solve these barriers, we propose to design a novel progressive trajectory matching strategy to improve the training stability for medical image dataset distillation. Additionally, it is observed that improved stability prevents the synthetic dataset diversity and final performance improvements. Therefore, we propose a dynamic overlap mitigation module that improves the synthetic dataset diversity by dynamically eliminating the overlap across different images and retraining parts of the synthetic images for better convergence. Finally, we propose a new medical image dataset distillation benchmark of various modalities and configurations to promote fair evaluations. It is validated that our proposed method achieves 8.33% improvement over previous state-of-the-art methods on average, and 11.7% improvement when ipc=2 (i.e., image per class is 2). Codes and benchmarks will be released.
- Dataset of breast ultrasound images. Data in brief, 28:104863, 2020.
- The history began from alexnet: A comprehensive survey on deep learning approaches. arXiv preprint arXiv:1803.01164, 2018.
- The liver tumor segmentation benchmark (lits). Medical Image Analysis, 84:102680, 2023.
- Dataset distillation by matching training trajectories. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4750–4759, 2022.
- Dc-bench: Dataset condensation benchmark. Advances in Neural Information Processing Systems, 35:810–822, 2022.
- Minimizing the accumulated trajectory error to improve dataset distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3749–3758, 2023.
- Sequential subset matching for dataset distillation. arXiv preprint arXiv:2311.01570, 2023.
- Loss surfaces, mode connectivity, and fast ensembling of dnns. Advances in neural information processing systems, 31, 2018.
- A survey on dataset distillation: Approaches, applications and future directions. arXiv preprint arXiv:2305.01975, 2023.
- A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012.
- Towards lossless dataset distillation via difficulty-aligned trajectory matching. arXiv preprint arXiv:2310.05773, 2023.
- Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- Predicting survival from colorectal cancer histology slides using deep learning: A retrospective multicenter study. PLoS medicine, 16(1):e1002730, 2019.
- Identifying medical diagnoses and treatable diseases by image-based deep learning. cell, 172(5):1122–1131, 2018.
- Yann LeCun et al. Lenet-5, convolutional neural networks. URL: http://yann. lecun. com/exdb/lenet, 20(5):14, 2015.
- A comprehensive survey to dataset distillation. arXiv preprint arXiv:2301.05603, 2023.
- Visualizing the loss landscape of neural nets. Advances in neural information processing systems, 31, 2018.
- Soft-label anonymous gastric x-ray image distillation. In 2020 IEEE International Conference on Image Processing (ICIP), pages 305–309. IEEE, 2020.
- Compressed gastric image generation based on soft-label dataset distillation for medical data sharing. Computer Methods and Programs in Biomedicine, 227:107189, 2022.
- Dataset distillation for medical dataset sharing. arXiv preprint arXiv:2209.14603, 2022.
- Dataset distillation via factorization. Advances in Neural Information Processing Systems, 35:1100–1113, 2022.
- A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11976–11986, 2022.
- Exploring the effect of image enhancement techniques on covid-19 detection using chest x-ray images. Computers in biology and medicine, 132:104319, 2021.
- Data distillation: A survey. arXiv preprint arXiv:2301.04272, 2023.
- Convnets match vision transformers at scale. arXiv preprint arXiv:2310.16764, 2023.
- The ham10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific data, 5(1):1–9, 2018.
- Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008.
- Dataset distillation. arXiv preprint arXiv:1811.10959, 2018.
- Cafe: Learning to condense dataset by aligning features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12196–12205, 2022.
- Medmnist v2-a large-scale lightweight benchmark for 2d and 3d biomedical image classification. Scientific Data, 10(1):41, 2023.
- Dataset distillation: A comprehensive review. arXiv preprint arXiv:2301.07014, 2023.
- Bo Zhao and Hakan Bilen. Dataset condensation with differentiable siamese augmentation. In International Conference on Machine Learning, pages 12674–12685. PMLR, 2021.
- Bo Zhao and Hakan Bilen. Dataset condensation with distribution matching. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 6514–6523, 2023.
- Dataset condensation with gradient matching. arXiv preprint arXiv:2006.05929, 2020.