Benchmarking GPT-5 in Radiation Oncology: Measurable Gains, but Persistent Need for Expert Oversight (2508.21777v1)

Published 29 Aug 2025 in cs.CV, cs.AI, and cs.LG

Abstract: Introduction: Large language models (LLMs) have shown great potential in clinical decision support. GPT-5 is a novel LLM system that has been specifically marketed towards oncology use. Methods: Performance was assessed using two complementary benchmarks: (i) the ACR Radiation Oncology In-Training Examination (TXIT, 2021), comprising 300 multiple-choice items, and (ii) a curated set of 60 authentic radiation oncologic vignettes representing diverse disease sites and treatment indications. For the vignette evaluation, GPT-5 was instructed to generate concise therapeutic plans. Four board-certified radiation oncologists rated correctness, comprehensiveness, and hallucinations. Inter-rater reliability was quantified using Fleiss' κ. Results: On the TXIT benchmark, GPT-5 achieved a mean accuracy of 92.8%, outperforming GPT-4 (78.8%) and GPT-3.5 (62.1%). Domain-specific gains were most pronounced in Dose and Diagnosis. In the vignette evaluation, GPT-5's treatment recommendations were rated highly for correctness (mean 3.24/4, 95% CI: 3.11-3.38) and comprehensiveness (3.59/4, 95% CI: 3.49-3.69). Hallucinations were rare, with no case reaching majority consensus for their presence. Inter-rater agreement was low (Fleiss' κ 0.083 for correctness), reflecting inherent variability in clinical judgment. Errors clustered in complex scenarios requiring precise trial knowledge or detailed clinical adaptation. Discussion: GPT-5 clearly outperformed prior model variants on the radiation oncology multiple-choice benchmark. Although GPT-5 exhibited favorable performance in generating real-world radiation oncology treatment recommendations, correctness ratings indicate room for further improvement. While hallucinations were infrequent, the presence of substantive errors underscores that GPT-5-generated recommendations require rigorous expert oversight before clinical implementation.


Summary

  • The paper benchmarks GPT-5 clinically, reporting 92.8% accuracy on the TXIT and 100% accuracy in select treatment planning categories.
  • It employs domain-specific evaluations and real-world oncologic vignettes to assess improvements in diagnosis, dosage, and treatment recommendation comprehensiveness.
  • Results reveal marked gains over GPT-3.5 and GPT-4, yet persistent challenges such as hallucination necessitate ongoing expert oversight.

Benchmarking GPT-5 in Radiation Oncology: Measurable Gains, but Persistent Need for Expert Oversight

Introduction

The paper titled "Benchmarking GPT-5 in Radiation Oncology: Measurable Gains, but Persistent Need for Expert Oversight" evaluates the performance and applicability of the GPT-5 LLM in the field of radiation oncology (2508.21777). GPT-5 is assessed using two benchmarks: the American College of Radiology Radiation Oncology In-Training Examination (TXIT) and a set of real-world oncologic vignettes. The paper's objective is to quantify GPT-5's accuracy while identifying the limitations that bear on its clinical application.

TXIT Performance Evaluation

The paper reports that GPT-5 substantially outperforms its predecessors, GPT-3.5 and GPT-4, on the standardized TXIT examination, which encompasses both multiple-choice and visual interpretation items. GPT-5 achieved a mean accuracy of 92.8%, compared with 62.1% for GPT-3.5 and 78.8% for GPT-4.

Figure 1: TXIT accuracy by model. Symbols show mean accuracy and error bars indicate the standard deviation (SD) across five runs for GPT-3.5, GPT-4, and GPT-5.
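The summary statistics in Figure 1 are simple run-level aggregates. A minimal sketch of the computation, using illustrative per-run accuracies (not the paper's raw data) chosen to average the reported 92.8%:

```python
from statistics import mean, stdev

# Hypothetical per-run TXIT accuracies for GPT-5 across five runs
# (illustrative values only; the paper reports the mean, not the runs).
runs = [0.930, 0.925, 0.931, 0.927, 0.927]

acc_mean = mean(runs)  # mean accuracy across runs, here 0.928
acc_sd = stdev(runs)   # sample SD across runs (the error bars in Figure 1)
```

The same aggregation applies per model, yielding the three symbols and error bars in the figure.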

Domain-Specific Accuracy

In domain-specific evaluations, GPT-5 exhibited superior performance in several critical areas, such as dose specification and diagnosis. It reached 100% accuracy in multiple categories, including Treatment Planning, Local Control, and Prognosis Assessment. Conversely, challenges persisted in more complex topics such as Gynecology and fine-grained dosimetry tasks.

Figure 2: Domain-wise accuracy across models. Symbols show mean accuracy and error bars indicate the SD across five runs for GPT-3.5, GPT-4, and GPT-5.

Real-World Oncologic Vignette Evaluation

In a set of 60 authentic oncological case vignettes, GPT-5's treatment recommendations were rigorously evaluated for correctness and comprehensiveness. Performance was favorable, with a mean correctness of 3.24/4 and a mean comprehensiveness of 3.59/4. The paper highlights GPT-5's ability to draft complex treatment recommendations, though low inter-rater agreement underscores that expert clinical judgment remains integral to interpreting such outputs.

Figure 3: Distribution of case-level mean expert ratings for correctness and comprehensiveness across 60 cases. Each box represents the inter-quartile range (IQR) with whiskers indicating outliers, summarizing the distribution of ratings. Case-level mean correctness ranged from 2.25 to 4.00, while case-level mean comprehensiveness ranged from 2.50 to 4.00.
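The 95% confidence intervals reported for the mean ratings can be reproduced from case-level means. A minimal sketch using a normal-approximation interval (the 1.96 critical value; with 60 cases the exact t-based interval is only slightly wider) and hypothetical ratings, not the paper's data:

```python
from math import sqrt
from statistics import mean, stdev

def ci95(values):
    """Approximate 95% CI for the mean, using the normal critical value 1.96."""
    m = mean(values)
    half = 1.96 * stdev(values) / sqrt(len(values))
    return m - half, m + half

# Hypothetical case-level mean correctness ratings on the paper's 1-4 scale.
ratings = [3.25, 3.5, 2.75, 3.0, 3.75, 3.5, 3.25, 2.5, 3.0, 3.5]
lo, hi = ci95(ratings)  # interval centered on the sample mean (3.2 here)
```

With the study's 60 cases, the same computation yields intervals of the width reported in the abstract (e.g. 3.11-3.38 for correctness).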

Hallucination and Variability

Hallucinations, while infrequent, remained a persistent concern. Their rates varied across tumor sites, emerging primarily in settings where accurate, explicit knowledge of clinical trials was necessary.

Figure 4: Hallucination consensus across cases. Bars show the number of cases with 0, 1, 2, 3, or 4 raters flagging hallucination. In this cohort, 36/60 cases had 0 flags and 24/60 had exactly 1 flag.
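Inter-rater agreement on such flags is what Fleiss' κ quantifies. A minimal sketch of the standard computation, applied to the binary hallucination flags in Figure 4 (36 cases with 0 of 4 raters flagging, 24 with exactly 1); note the paper's reported κ = 0.083 refers to the correctness ratings, not to these flags:

```python
def fleiss_kappa(counts):
    """Fleiss' kappa from per-subject category counts.

    counts[i][j] = number of raters assigning subject i to category j;
    every row must sum to the same number of raters n.
    """
    N = len(counts)                # number of subjects (cases)
    n = sum(counts[0])             # raters per subject
    k = len(counts[0])             # number of categories
    # Mean per-subject agreement P_i, then chance agreement from the
    # overall category proportions p_j.
    p_bar = sum((sum(c * c for c in row) - n) / (n * (n - 1))
                for row in counts) / N
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)

# Columns: (flagged hallucination, did not flag), 4 raters per case.
counts = [[0, 4]] * 36 + [[1, 3]] * 24
kappa = fleiss_kappa(counts)  # ≈ -0.11: below-chance agreement for this split
```

The near-zero κ illustrates how sparse, single-rater flags produce low chance-corrected agreement even when raw agreement looks high.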

Conclusion

GPT-5 provides notable improvements over its predecessors in the domain of radiation oncology, especially in accuracy and reasoning on structured exams and real-world vignettes. Nonetheless, a key limitation highlighted is the need for expert oversight, given hallucination instances and inter-rater variability. This research positions GPT-5 as well suited for augmentative decision support in oncological practice rather than for autonomous decision-making. As LLMs continue to develop, future studies must focus on incorporating real-time guideline updates, dose references, and tighter integration within a multi-modal ecosystem for enhanced clinical applicability.