LLM-RadJudge: Achieving Radiologist-Level Evaluation for X-Ray Report Generation (2404.00998v1)
Abstract: Evaluating generated radiology reports is crucial for the development of radiology AI, but existing metrics fail to reflect the task's clinical requirements. This study proposes a novel evaluation framework that uses LLMs to compare radiology reports for assessment. We compare the performance of various LLMs and demonstrate that, when using GPT-4, our proposed metric achieves evaluation consistency close to that of radiologists. Furthermore, to reduce costs and improve accessibility, we construct a dataset from LLM evaluation results and perform knowledge distillation to train a smaller model, which attains evaluation capability comparable to GPT-4. Our framework and distilled model offer an accessible and efficient evaluation method for radiology report generation, facilitating the development of more clinically relevant models. The distilled model will be open-sourced and made publicly accessible.
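The pairwise LLM-as-judge setup the abstract describes can be sketched as below. The prompt wording, the finding criteria, and the `call_llm` hook are illustrative assumptions, not the authors' exact protocol; in practice `call_llm` would wrap a judge model such as GPT-4 or the distilled model.

```python
# Sketch of an LLM-as-judge comparison for two candidate radiology reports.
# Prompt template and parsing logic are assumptions for illustration only.

def build_judge_prompt(reference: str, report_a: str, report_b: str) -> str:
    """Ask the judge model which candidate better matches the reference."""
    return (
        "You are an experienced radiologist.\n"
        f"Reference report:\n{reference}\n\n"
        f"Candidate A:\n{report_a}\n\n"
        f"Candidate B:\n{report_b}\n\n"
        "Compare the clinically important findings (presence, location, "
        "severity) of each candidate against the reference. "
        "Answer with exactly one of: A, B, TIE."
    )

def parse_verdict(llm_response: str) -> str:
    """Map the model's free-text answer to a verdict label."""
    answer = llm_response.strip().upper()
    for label in ("TIE", "A", "B"):  # check TIE first so it is not shadowed
        if answer.startswith(label):
            return label
    return "TIE"  # conservative fallback for unparseable output

def judge(reference: str, report_a: str, report_b: str, call_llm) -> str:
    """call_llm: a str -> str function wrapping the judge model."""
    return parse_verdict(call_llm(build_judge_prompt(reference, report_a, report_b)))
```

For distillation, verdicts collected this way from GPT-4 over many report pairs would serve as training labels for the smaller judge model.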
- Zilong Wang
- Xufang Luo
- Xinyang Jiang
- Dongsheng Li
- Lili Qiu