
Dynamic Traceback Learning for Medical Report Generation (2401.13267v3)

Published 24 Jan 2024 in cs.CV

Abstract: Automated medical report generation has the potential to significantly reduce the workload associated with the time-consuming process of medical reporting. Recent generative representation learning methods have shown promise in integrating vision and language modalities for medical report generation. However, when trained end-to-end and applied directly to medical image-to-text generation, they face two significant challenges: i) difficulty in accurately capturing subtle yet crucial pathological details, and ii) reliance on both visual and textual inputs during inference, leading to performance degradation in zero-shot inference when only images are available. To address these challenges, this study proposes a novel multi-modal dynamic traceback learning framework (DTrace). Specifically, we introduce a traceback mechanism to supervise the semantic validity of generated content and a dynamic learning strategy to adapt to various proportions of image and text input, enabling text generation without strong reliance on the input from both modalities during inference. The learning of cross-modal knowledge is enhanced by supervising the model to recover masked semantic information from a complementary counterpart. Extensive experiments conducted on two benchmark datasets, IU-Xray and MIMIC-CXR, demonstrate that the proposed DTrace framework outperforms state-of-the-art methods for medical report generation.
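The dynamic learning strategy and the masked-recovery objective mentioned in the abstract can be pictured with a small sketch. The abstract does not give the actual sampling scheme, so everything here is an assumption for illustration: the function names (`sample_mask_ratios`, `mask_tokens`, `make_training_pair`) are hypothetical, and the choice of complementary ratios (the text mask ratio equals one minus the image mask ratio) is a guess at one way to vary the proportions of image and text input. It is not the authors' implementation.

```python
import random


def sample_mask_ratios(rng):
    """Draw an image mask ratio; use its complement for text (an assumption)."""
    image_ratio = rng.random()
    return image_ratio, 1.0 - image_ratio


def mask_tokens(tokens, ratio, rng, mask_value=None):
    """Replace round(ratio * len(tokens)) randomly chosen positions with mask_value."""
    n_mask = round(ratio * len(tokens))
    masked_idx = set(rng.sample(range(len(tokens)), n_mask))
    masked = [mask_value if i in masked_idx else t for i, t in enumerate(tokens)]
    return masked, sorted(masked_idx)


def make_training_pair(image_patches, report_tokens, rng):
    """Build one training example with a freshly sampled pair of mask ratios.

    The masked positions are returned so a model could be supervised to
    recover them from the visible tokens of the other modality.
    """
    image_ratio, text_ratio = sample_mask_ratios(rng)
    masked_img, img_idx = mask_tokens(image_patches, image_ratio, rng)
    masked_txt, txt_idx = mask_tokens(report_tokens, text_ratio, rng)
    return {
        "image_ratio": image_ratio,
        "text_ratio": text_ratio,
        "masked_image": masked_img,
        "masked_text": masked_txt,
        "image_targets": img_idx,
        "text_targets": txt_idx,
    }


if __name__ == "__main__":
    rng = random.Random(0)
    patches = [f"patch_{i}" for i in range(16)]
    report = "no acute cardiopulmonary abnormality is seen".split()
    pair = make_training_pair(patches, report, rng)
    print(pair["image_ratio"], pair["text_ratio"])
    print(pair["masked_text"])
```

Under this assumption, the image-only inference setting described in the abstract corresponds to the extreme case of a text mask ratio of 1.0, i.e. `mask_tokens(report, 1.0, rng)`, where every report token is hidden and must be generated from the image alone.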

Authors (6)
  1. Shuchang Ye
  2. Mingyuan Meng
  3. Mingjian Li
  4. Dagan Feng
  5. Jinman Kim
  6. Usman Naseem
