Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning (2403.12416v3)
Abstract: In medical multi-modal frameworks, the alignment of cross-modality features presents a significant challenge. Existing works learn alignment implicitly from the data, without considering explicit relationships in the medical context, and this reliance on data alone can limit how well the learned alignment generalizes. In this work, we propose the Eye-gaze Guided Multi-modal Alignment (EGMA) framework, which harnesses eye-gaze data to better align medical visual and textual features. We exploit the natural auxiliary role of radiologists' eye-gaze in aligning medical images and text, introducing a novel approach that uses gaze data collected synchronously while radiologists perform diagnostic evaluations. On downstream image classification and image-text retrieval tasks across four medical datasets, EGMA achieves state-of-the-art performance and stronger cross-dataset generalization. We further study how the amount of eye-gaze data affects model performance, highlighting the feasibility and utility of integrating this auxiliary data into multi-modal alignment frameworks.
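The abstract does not spell out the training objective, but the idea of using radiologist gaze to guide image-text alignment can be sketched in a CLIP-style setup. The snippet below is a minimal, hypothetical illustration assuming patch-level image embeddings, sentence-level report embeddings, and per-sentence gaze heatmaps; the function name, tensor shapes, and weighting scheme are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of gaze-guided image-text alignment (illustrative, not the
# paper's exact method): radiologist gaze heatmaps reweight patch-to-sentence
# similarities before a standard CLIP-style contrastive loss.
import torch
import torch.nn.functional as F


def gaze_guided_alignment_loss(img_patches, txt_sents, gaze_maps, temperature=0.07):
    """
    img_patches: (B, P, D) patch embeddings from the image encoder
    txt_sents:   (B, S, D) sentence embeddings from the text encoder
    gaze_maps:   (B, S, P) gaze attention linking each sentence to image patches,
                 e.g. fixations accumulated while that sentence was dictated
    """
    img_patches = F.normalize(img_patches, dim=-1)
    txt_sents = F.normalize(txt_sents, dim=-1)

    # Fine-grained similarity between every sentence and every image patch.
    sim = torch.einsum("bsd,bpd->bsp", txt_sents, img_patches)        # (B, S, P)

    # Normalize gaze per sentence so regions the radiologist actually looked at
    # dominate the sentence-image agreement score.
    gaze_w = gaze_maps / gaze_maps.sum(dim=-1, keepdim=True).clamp_min(1e-6)
    sent_img_score = (sim * gaze_w).sum(dim=-1).mean(dim=-1)          # (B,)

    # Global image-text contrastive term over the batch (standard InfoNCE).
    img_g = F.normalize(img_patches.mean(dim=1), dim=-1)              # (B, D)
    txt_g = F.normalize(txt_sents.mean(dim=1), dim=-1)                # (B, D)
    logits = img_g @ txt_g.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_global = 0.5 * (F.cross_entropy(logits, labels) +
                         F.cross_entropy(logits.t(), labels))

    # Encourage gaze-weighted fine-grained agreement for matched pairs.
    loss_local = (1.0 - sent_img_score).mean()
    return loss_global + loss_local
```

The design intuition, under these assumptions, is that gaze acts as an explicit, expert-derived correspondence signal: instead of letting the model infer patch-sentence relationships purely from co-occurrence, the fine-grained term is biased toward the image regions a radiologist attended to while reporting each finding.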
Authors: Chong Ma, Hanqi Jiang, Wenting Chen, Zihao Wu, Xiaowei Yu, Lei Guo, Dajiang Zhu, Tuo Zhang, Dinggang Shen, Tianming Liu, Xiang Li, Yiwei Li, Zhengliang Liu