
LLM-RadJudge: Achieving Radiologist-Level Evaluation for X-Ray Report Generation (2404.00998v1)

Published 1 Apr 2024 in cs.CL and cs.AI

Abstract: Evaluating generated radiology reports is crucial for the development of radiology AI, but existing metrics fail to reflect the task's clinical requirements. This study proposes a novel evaluation framework that uses LLMs to compare radiology reports for assessment. We compare the performance of various LLMs and show that, with GPT-4, the proposed metric achieves evaluation consistency close to that of radiologists. Furthermore, to reduce cost and improve accessibility, we construct a dataset from the LLM evaluation results and use knowledge distillation to train a smaller model, which achieves evaluation capability comparable to GPT-4's. Our framework and distilled model offer an accessible and efficient evaluation method for radiology report generation, facilitating the development of more clinically relevant models. The distilled model will be open-sourced.
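The comparison-based evaluation the abstract describes can be sketched as a pairwise LLM-judge loop: the judge sees a reference report and two candidates, picks the better one, and its verdicts are checked for agreement with radiologists. The sketch below is a minimal illustration under assumed names and prompt wording; it is not the paper's actual prompt or scoring protocol, and `pairwise_agreement` is one simple way to operationalize "evaluation consistency".

```python
# Hypothetical sketch of an LLM-as-judge setup for radiology reports.
# The prompt text, function names, and the agreement metric are illustrative
# assumptions, not taken from the paper.

def build_judge_prompt(reference: str, report_a: str, report_b: str) -> str:
    """Build a pairwise-comparison prompt to send to an LLM judge (e.g. GPT-4)."""
    return (
        "You are a radiologist evaluating generated chest X-ray reports.\n\n"
        f"Reference report:\n{reference}\n\n"
        f"Report A:\n{report_a}\n\n"
        f"Report B:\n{report_b}\n\n"
        "Which report better covers the clinically important findings of the "
        "reference? Answer with a single letter: A or B."
    )

def pairwise_agreement(judge_verdicts: list[str],
                       expert_verdicts: list[str]) -> float:
    """Fraction of comparison pairs where the LLM judge agrees with
    radiologists -- a simple proxy for 'evaluation consistency'."""
    assert len(judge_verdicts) == len(expert_verdicts) and judge_verdicts
    matches = sum(j == e for j, e in zip(judge_verdicts, expert_verdicts))
    return matches / len(judge_verdicts)
```

In this framing, distillation amounts to collecting many (prompt, judge-verdict) pairs from the large model and fine-tuning a smaller model to reproduce the verdicts, so the cheaper model can serve as the judge at evaluation time.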

Authors (5)
  1. Zilong Wang (99 papers)
  2. Xufang Luo (25 papers)
  3. Xinyang Jiang (40 papers)
  4. Dongsheng Li (240 papers)
  5. Lili Qiu (50 papers)
Citations (7)
