
Memory-based Cross-modal Semantic Alignment Network for Radiology Report Generation (2404.00588v1)

Published 31 Mar 2024 in cs.CV and cs.AI

Abstract: Generating radiology reports automatically reduces the workload of radiologists and aids the diagnosis of specific diseases. Many existing methods treat this task as a modality transfer process. However, since the key disease-related information accounts for only a small proportion of both the image and the report, it is hard for the model to learn the latent relation between a radiology image and its report, and it therefore fails to generate fluent and accurate radiology reports. To tackle this problem, we propose a memory-based cross-modal semantic alignment model (MCSAM) following an encoder-decoder paradigm. MCSAM includes a well-initialized long-term clinical memory bank that learns disease-related representations and prior knowledge for the different modalities to retrieve, and it uses the retrieved memory to perform feature consolidation. To ensure the semantic consistency of the retrieved cross-modal prior knowledge, a cross-modal semantic alignment module (SAM) is proposed. SAM can also generate semantic visual feature embeddings, which can be added to the decoder to benefit report generation. More importantly, to memorize the state and additional information while generating reports with the decoder, we use learnable memory tokens, which can be seen as prompts. Extensive experiments demonstrate the promising performance of our proposed method, which achieves state-of-the-art results on the MIMIC-CXR dataset.
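The abstract describes two mechanisms in broad strokes: attention-based retrieval from a shared long-term memory bank used for feature consolidation, and learnable memory tokens prepended to the decoder input as prompts. The snippet below is a minimal, hypothetical PyTorch sketch of those two ideas only; the module names, dimensions, and fusion choices are illustrative assumptions, not the authors' implementation, and the semantic alignment module (SAM) is omitted.

```python
import torch
import torch.nn as nn


class MemoryRetrieval(nn.Module):
    """Sketch: a shared memory bank queried via cross-attention.

    Illustrative reading of the abstract, not the authors' released code.
    """

    def __init__(self, num_slots: int = 512, dim: int = 256, heads: int = 8):
        super().__init__()
        # Learnable memory slots shared across modalities (assumed design).
        self.memory = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, seq, dim) -- visual patch or report token features.
        mem = self.memory.unsqueeze(0).expand(feats.size(0), -1, -1)
        retrieved, _ = self.attn(query=feats, key=mem, value=mem)
        # Feature consolidation: fuse retrieved prior knowledge with inputs.
        return self.norm(feats + retrieved)


class PromptedDecoderInput(nn.Module):
    """Sketch: learnable memory tokens prepended to the decoder input,
    acting as prompts that carry state during generation (an assumption)."""

    def __init__(self, num_tokens: int = 16, dim: int = 256):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)

    def forward(self, decoder_inputs: torch.Tensor) -> torch.Tensor:
        # decoder_inputs: (batch, seq, dim) -- embedded report tokens.
        prompts = self.prompts.unsqueeze(0).expand(decoder_inputs.size(0), -1, -1)
        return torch.cat([prompts, decoder_inputs], dim=1)


if __name__ == "__main__":
    retriever = MemoryRetrieval()
    visual = torch.randn(2, 49, 256)       # e.g. 7x7 grid of patch features
    consolidated = retriever(visual)       # (2, 49, 256)

    prompter = PromptedDecoderInput()
    report_emb = torch.randn(2, 60, 256)   # embedded report tokens
    decoder_in = prompter(report_emb)      # (2, 76, 256)
    print(consolidated.shape, decoder_in.shape)
```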

Authors (4)
  1. Yitian Tao (4 papers)
  2. Liyan Ma (10 papers)
  3. Jing Yu (99 papers)
  4. Han Zhang (338 papers)
Citations (3)