CAMANet: Class Activation Map Guided Attention Network for Radiology Report Generation (2211.01412v2)

Published 2 Nov 2022 in cs.CV

Abstract: Radiology report generation (RRG) has gained increasing research attention because of its potential to mitigate medical resource shortages and aid radiologists in disease decision making. Recent advancements in RRG are largely driven by improving a model's capability to encode single-modal feature representations, while few studies explicitly explore the cross-modal alignment between image regions and words. Radiologists typically focus first on abnormal image regions before composing the corresponding text descriptions, so cross-modal alignment is of great importance for learning an RRG model that is aware of abnormalities in the image. Motivated by this, we propose a Class Activation Map guided Attention Network (CAMANet), which explicitly promotes cross-modal alignment by employing aggregated class activation maps to supervise cross-modal attention learning, while simultaneously enriching the discriminative information. CAMANet contains three complementary modules: a Visual Discriminative Map Generation module to estimate the importance/contribution of each visual token; a Visual Discriminative Map Assisted Encoder to learn discriminative representations and enrich the discriminative information; and a Visual Textual Attention Consistency module to enforce attention consistency between visual and textual tokens and thereby achieve cross-modal alignment. Experimental results demonstrate that CAMANet outperforms previous SOTA methods on two commonly used RRG benchmarks.
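The core idea of the attention-consistency module can be illustrated with a small sketch: aggregate per-class activation maps over the visual tokens into a single token-importance distribution, then penalize the decoder's text-to-image cross-attention for deviating from it. The sketch below is a minimal PyTorch illustration under assumed shapes and a simple MSE objective; the function names, tensor layouts, and loss choice are our assumptions, not the authors' implementation.

```python
# Minimal sketch of a CAM-guided cross-modal attention consistency loss,
# in the spirit of CAMANet's Visual Textual Attention Consistency module.
# Shapes, names, and the MSE objective are illustrative assumptions.
import torch
import torch.nn.functional as F


def aggregated_cam(visual_tokens, classifier_weights, class_probs):
    """Aggregate per-class activation maps into one token-importance map.

    visual_tokens:      (B, N, D) patch/region features from the visual encoder
    classifier_weights: (C, D)    weights of a linear multi-label classifier
    class_probs:        (B, C)    predicted class probabilities (e.g. CheXpert-style labels)
    returns:            (B, N)    normalized importance of each visual token
    """
    # Per-class CAM: project each visual token onto each class weight vector.
    cams = torch.einsum("bnd,cd->bcn", visual_tokens, classifier_weights)  # (B, C, N)
    # Weight class-wise maps by predicted probabilities and sum over classes.
    agg = torch.einsum("bc,bcn->bn", class_probs, cams)                    # (B, N)
    # Normalize to a distribution over visual tokens.
    return F.softmax(agg, dim=-1)


def attention_consistency_loss(cross_attn, visual_tokens, classifier_weights, class_probs):
    """Encourage text-to-image attention to agree with the aggregated CAM.

    cross_attn: (B, T, N) decoder cross-attention from T text tokens to N visual tokens
    """
    target = aggregated_cam(visual_tokens, classifier_weights, class_probs)  # (B, N)
    # Average attention over text tokens to get one visual distribution per sample.
    pred = cross_attn.mean(dim=1)                                            # (B, N)
    return F.mse_loss(pred, target)
```

In practice such a consistency term would be added to the report-generation (language modeling) loss with a weighting coefficient, so the cross-attention learns to concentrate on the same regions the classifier deems discriminative.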
