GPT-5 in Radiation Oncology
- GPT-5 is a sophisticated multimodal AI system that combines language, structured data, and imaging to improve clinical education, treatment planning, and decision support in radiation oncology.
- It demonstrated state-of-the-art performance by achieving 92.8% accuracy on the TXIT exam and notable gains in dose evaluation and diagnosis over previous models.
- The model shows promise in automating treatment planning, tumor segmentation, and patient communication, though challenges in visual reasoning and hallucinations remain.
GPT-5 is a large multimodal LLM designed to perform advanced medical reasoning by integrating natural language, structured data, and imaging modalities. In radiation oncology, its application spans education, clinical decision support, workflow automation, incident analysis, and patient-clinician communication. Recent evaluations benchmark GPT-5 against established LLMs and domain-specialist systems, revealing both substantial gains and persistent challenges for clinical deployment.
1. Accuracy Benchmarks and Domain-Specific Performance
GPT-5 demonstrates marked advances in standardized assessments of radiation oncology knowledge. On the ACR Radiation Oncology In-Training Examination (TXIT), GPT-5 achieved a mean accuracy of 92.8%, significantly exceeding the rates of GPT-4 (78.8%) and GPT-3.5 (62.1%) (Dinc et al., 29 Aug 2025). On a Medical Physics Board Examination-style dataset of 150 questions, GPT-5 scored 90.7%, surpassing the recommended human passing threshold and outperforming GPT-4o by 12.7% (Hu et al., 15 Aug 2025).
Performance improvements are most pronounced in subdomains such as Dose (GPT-5: 87.5%, GPT-4: 59.4%) and Diagnosis (GPT-5: 91.2%, GPT-4: 76.5%) (Dinc et al., 29 Aug 2025). GPT-5 also exhibits high accuracy in Statistics, CNS/Eye, Biology, and Physics questions, with accuracy rates at or above 95%. However, limitations are still evident for image-based multiple-choice questions; in the TXIT evaluation, only 2 of 7 such questions were correctly answered.
In comparative multimodal tasks such as VQA-RAD and SLAKE (visual question answering in radiology and radiation oncology), GPT-5 attained 74.9% accuracy on VQA-RAD and 88.6% on SLAKE (aggregate), improvements of roughly +5.0 and +16.3 percentage points over GPT-4o, respectively (Hu et al., 15 Aug 2025). Gains are especially substantial for challenging regions such as chest-mediastinal (+20.0 points), lung (+13.6 points), and brain (+11.4 points). Across curated clinical vignettes, GPT-5-generated management plans were rated 3.24/4 for correctness and 3.59/4 for comprehensiveness by expert oncologists, yet inter-rater reliability remained low (Fleiss' κ = 0.083) (Dinc et al., 29 Aug 2025).
Benchmark | Metric | GPT-5 | GPT-4/4o | Human Threshold |
---|---|---|---|---|
ACR TXIT (2021) | Accuracy (%) | 92.8 | 78.8 (GPT-4) | — |
Medical Physics Board (150 Qs) | Accuracy (%) | 90.7 | 78.0 (GPT-4o) | ~70–75 (passing) |
VQA-RAD | Accuracy (%) | 74.9 | 69.91 (GPT-4o) | — |
SLAKE (aggregate) | Accuracy (%) | 88.6 | 72.3 (GPT-4o) | — |
Clinical Vignettes (Expert Rated) | Correctness (1–4) | 3.24 | — | — |
These benchmarks establish GPT-5 as state-of-the-art in general knowledge and reasoning tasks relevant to radiation oncology.
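The gains in the table reduce to simple percentage-point deltas; a minimal sketch using the figures reported above:

```python
# Benchmark accuracies (%) from the table above; "prior" is the comparison
# model named in each row (GPT-4 for TXIT, GPT-4o elsewhere).
results = {
    "ACR TXIT (2021)":       {"gpt5": 92.8, "prior": 78.8},
    "Medical Physics Board": {"gpt5": 90.7, "prior": 78.0},
    "VQA-RAD":               {"gpt5": 74.9, "prior": 69.91},
    "SLAKE (aggregate)":     {"gpt5": 88.6, "prior": 72.3},
}

# Absolute improvement in percentage points per benchmark.
deltas = {name: round(r["gpt5"] - r["prior"], 2) for name, r in results.items()}
```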
2. Multimodal and Workflow Integration
GPT-5's multimodal reasoning pipeline is structured to integrate natural language reports, structured clinical/dosimetric data, and cross-sectional imaging:
- Chain-of-thought (CoT) reasoning: Prompts explicitly trigger stepwise diagnostic and therapeutic inference, improving performance across complex, multi-modal cases (Wang et al., 11 Aug 2025).
- Algorithmic structure: Inputs include patient narratives (N), structured data (S), and images (I); these are encoded, concatenated, and jointly processed via cross-modal attention, schematically y = Decode(CrossAttn([E_N(N); E_S(S); E_I(I)])), where E_N, E_S, and E_I are modality-specific encoders and [;] denotes token concatenation. The output y is a stepwise reasoning chain leading to actionable recommendations.
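The encode-concatenate-attend structure can be sketched as follows. This is a toy illustration with hypothetical stub encoders, not the actual GPT-5 architecture; real systems use learned neural encoders and multi-head attention.

```python
import math

# Toy modality encoders (hypothetical): each maps raw input to a list of
# fixed-width embedding vectors ("tokens").
def encode_text(narrative: str) -> list[list[float]]:
    return [[float(len(w) % 5), 1.0] for w in narrative.split()]

def encode_structured(fields: dict) -> list[list[float]]:
    return [[float(v), 0.0] for v in fields.values()]

def encode_image(pixels: list[float]) -> list[list[float]]:
    return [[p, -1.0] for p in pixels]

def cross_modal_attention(query: list[float], tokens: list[list[float]]) -> list[float]:
    # Scaled dot-product attention of one query over the concatenated tokens.
    d = len(query)
    scores = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(d) for tok in tokens]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    weights = [w / z for w in weights]
    # Output is the attention-weighted average of the token embeddings.
    return [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(d)]

# Encode each modality, concatenate the token streams, then attend jointly.
tokens = (encode_text("T2 glioma left temporal lobe")
          + encode_structured({"age": 61, "kps": 90})
          + encode_image([0.2, 0.7, 0.4]))
fused = cross_modal_attention([1.0, 0.0], tokens)
```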
In decision-support scenarios, GPT-5 processes the following sequential workflow:
- Deidentifies and summarizes patient records;
- Generates or revises radiotherapy plans aligned with clinical protocols (e.g., dose-volume constraints, prescription targets);
- Evaluates image-grounded information (segmentation, morphology, dosimetry);
- Provides just-in-time rationales and explanations for its outputs.
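The four workflow stages above can be expressed as a simple pipeline. Every function below is a hypothetical stub standing in for a model call; the field names and dose values are illustrative only.

```python
# Hypothetical decision-support pipeline mirroring the four steps above.

def deidentify_and_summarize(record: dict) -> dict:
    phi_fields = {"name", "mrn", "dob"}  # protected identifiers to strip
    clinical = {k: v for k, v in record.items() if k not in phi_fields}
    return {"summary": clinical, "phi_removed": sorted(phi_fields & record.keys())}

def draft_plan(summary: dict) -> dict:
    # Placeholder plan with a prescription target and one OAR constraint.
    return {"prescription_gy": 60.0, "fractions": 30,
            "constraints": {"spinal_cord_dmax_gy": 45.0}}

def evaluate_images(plan: dict, image_findings: dict) -> dict:
    # Attach image-grounded information (here, a segmented target volume).
    plan["target_volume_cc"] = image_findings.get("gtv_cc")
    return plan

def explain(plan: dict) -> str:
    # Just-in-time rationale for clinician review.
    return (f"{plan['prescription_gy']} Gy in {plan['fractions']} fractions; "
            f"cord Dmax limited to {plan['constraints']['spinal_cord_dmax_gy']} Gy.")

record = {"name": "REDACTED", "mrn": "000", "site": "lung", "stage": "IIIA"}
plan = evaluate_images(draft_plan(deidentify_and_summarize(record)["summary"]),
                       {"gtv_cc": 42.0})
rationale = explain(plan)
```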
Such workflow design aligns with the NCI-recommended curriculum for practitioner training, emphasizing foundational data science, hands-on integration, and ethical, bias-aware model use (Kang et al., 2019).
3. Specialized Applications Across Radiation Oncology
a) Treatment Planning and Quality Assurance
Automated planning systems integrating vision-language models (LVMs) demonstrate that GPT-5, as embodied in platforms such as GPT-RadPlan (built on GPT-4Vision), can iteratively evaluate and refine radiotherapy plans. Using in-context optimization that blends clinical requirements, dose-volume histograms, and text/image prompts, GPT-RadPlan matched or outperformed expert plans in 17 prostate and 13 head-and-neck cancer cases, improving organ-at-risk sparing by up to 15% (Liu et al., 21 Jun 2024). The underlying optimization takes the standard fluence-map form, minimizing weighted quadratic penalties on structure doses, min_{x ≥ 0} Σ_s w_s ‖(Kx)_s − d_s‖², with K as the dose influence matrix mapping beamlet weights x to voxel doses, and constraints on D95 for targets.
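A toy projected-gradient sketch of this quadratic dose objective is below. The matrix sizes and dose values are illustrative, not clinical; real optimizers handle millions of voxels and explicit DVH constraints.

```python
# Toy projected gradient descent on min_{x >= 0} ||K x - d||^2, where K is a
# (voxels x beamlets) dose influence matrix and d the prescribed voxel doses.

def matvec(K, x):
    return [sum(kij * xj for kij, xj in zip(row, x)) for row in K]

def objective(K, x, d):
    return sum((p - t) ** 2 for p, t in zip(matvec(K, x), d))

def optimize(K, d, steps=200, lr=0.05):
    n = len(K[0])
    x = [0.0] * n
    for _ in range(steps):
        dose = matvec(K, x)
        r = [p - t for p, t in zip(dose, d)]  # residual K x - d
        # Gradient of ||K x - d||^2 is 2 K^T r.
        grad = [2 * sum(K[i][j] * r[i] for i in range(len(K))) for j in range(n)]
        # Gradient step, then projection onto the nonnegative orthant x >= 0.
        x = [max(0.0, xj - lr * gj) for xj, gj in zip(x, grad)]
    return x

K = [[1.0, 0.2],   # two voxels, two beamlets (illustrative)
     [0.3, 1.0]]
d = [2.0, 1.0]     # prescribed voxel doses (Gy, illustrative)
x = optimize(K, d)
```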
b) Tumor Segmentation and Contouring
Advanced LVMs, including GPT-4V (and extending to GPT-5), reduce false-positive rates in automated tumor contouring. The Oncology Contouring Copilot (OCC) system demonstrates a 35% reduction in false discovery rate and a 72.4% decrease in false positives per scan by merging text and image inputs. Segmentation metrics such as F1-score (0.652) show parity with or improvement over other LVMs such as ViLT and Claude 3 Sonnet (Luo et al., 19 Mar 2025). Prompt engineering techniques (single vision input, sequential guiding questions, vision instructions) are key to controlling hallucination rates and maximizing contour reliability.
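The detection metrics cited above follow directly from raw counts; a minimal sketch with arbitrary illustrative counts (not the cited study's data):

```python
# Relation between contouring detection metrics and raw counts.
tp, fp, fn = 50, 20, 30   # true positives, false positives, false negatives (illustrative)

precision = tp / (tp + fp)                        # fraction of detections that are real
recall    = tp / (tp + fn)                        # fraction of real lesions detected
fdr       = fp / (tp + fp)                        # false discovery rate = 1 - precision
f1        = 2 * precision * recall / (precision + recall)  # harmonic mean
```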
c) Clinical Documentation, Triage, and QA
GPT-5's natural language processing capabilities support entity recognition, summary generation, symptom extraction (via prompt refinement in teacher-student architectures), and incident learning system analysis. In incident root cause analysis, GPT-4o achieves cosine similarity scores of 0.831; a plausible projection is that GPT-5 could lower hallucination rates (from a current 29%) and further enhance causal mapping relevant for patient safety audits (Wang et al., 24 Aug 2025).
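The cosine-similarity score used to compare model-generated and reference root-cause statements is computed over embedding vectors; a minimal sketch with toy vectors:

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    # cos(theta) = (u . v) / (||u|| * ||v||); 1.0 means identical direction.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Toy embeddings of a model-generated vs. reference root-cause statement.
generated = [0.8, 0.1, 0.6]
reference = [0.7, 0.2, 0.7]
score = cosine_similarity(generated, reference)
```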
d) Patient Communication
Closed-domain variants such as RadOnc-GPT (based on GPT-4) demonstrate that LLMs, tightly integrated with EHR via retrieval-augmented generation, can draft patient responses with greater clarity and empathy than clinical teams, while matching on correctness and completeness (Hao et al., 26 Sep 2024). Estimated time savings are 5.2 minutes per message for nurses and 2.4 minutes for clinicians. A plausible implication is that GPT-5, by further tailoring outputs based on clinical context and extending to multimodal data, could amplify these workflow efficiencies.
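The retrieval-augmented pattern described above can be sketched as follows. A keyword-overlap retriever stands in for the embedding index a real RadOnc-GPT-style system would use, and all note text and function names are hypothetical.

```python
# Minimal retrieval-augmented generation sketch for patient-message drafting.

EHR_NOTES = [
    "Completed 30 fractions of IMRT to the prostate, 78 Gy total.",
    "Grade 1 fatigue reported at week 3; managed conservatively.",
    "Baseline PSA 8.2 ng/mL prior to treatment start.",
]

def retrieve(query: str, notes: list[str], k: int = 2) -> list[str]:
    # Score each note by word overlap with the query (toy retriever).
    q = set(query.lower().split())
    scored = sorted(notes, key=lambda n: -len(q & set(n.lower().split())))
    return scored[:k]

def build_prompt(patient_message: str) -> str:
    # Ground the draft in retrieved chart context before generation.
    context = "\n".join(retrieve(patient_message, EHR_NOTES))
    return (f"Context from chart:\n{context}\n\n"
            f"Patient message: {patient_message}\n"
            f"Draft an empathetic, accurate reply for clinician review.")

prompt = build_prompt("Is my fatigue normal at week 3 of treatment?")
```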
4. Model Limitations and Persistent Challenges
Despite notable gains, GPT-5 continues to exhibit critical limitations:
- Hallucinations: Although rare in expert rating (10% of cases flagged, none with majority consensus) (Dinc et al., 29 Aug 2025), hallucinated or factually inconsistent recommendations still surface, particularly in complex, trial-dependent cases or nuanced clinical scenarios.
- Visual Reasoning and Domain Specialization: In image-based MCQ tasks on TXIT, GPT-5 correctly answered only 2 of 7, underscoring limitations in unadapted visual input handling (Dinc et al., 29 Aug 2025). In brain tumor MRI VQA, top-tier GPT-5 models achieved macro-average accuracy of 44.19%, with inter-model differences under 2% and all models well below thresholds considered suitable for direct clinical use (Safari et al., 14 Aug 2025). In mammography VQA, GPT-5 lags behind both expert radiologists (sensitivity gap: –23.4%; specificity gap: –36.6%) and specialized, fine-tuned models (Li et al., 15 Aug 2025).
- Reliance on Prompting and Pattern Recognition: Performance declines when answer options are randomized or replaced with "None of the above" distractors, suggesting residual dependence on memorized answer cues and incomplete robustness to novel or adversarial question formats (Wang et al., 14 Dec 2024).
- Inter-Rater Variability and Oversight: Expert ratings of plan outputs and clinical summaries exhibit low inter-rater agreement (Fleiss’ κ 0.083 for correctness), reflecting both inherent clinical heterogeneity and ambiguity in AI-generated rationales. This necessitates ongoing expert review and validation before clinical deployment (Dinc et al., 29 Aug 2025).
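The Fleiss' κ statistic cited above is computed from a subjects-by-categories count matrix; a minimal sketch with a toy rating matrix (illustrative, not the cited study's data):

```python
# Fleiss' kappa from a subjects x categories count matrix, where entry [i][j]
# is the number of raters assigning subject i to category j.

def fleiss_kappa(counts: list[list[int]]) -> float:
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    total = n_subjects * n_raters

    # Per-subject agreement P_i and per-category proportions p_j.
    p_i = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
           for row in counts]
    p_j = [sum(row[j] for row in counts) / total for j in range(len(counts[0]))]

    p_bar = sum(p_i) / n_subjects   # observed agreement
    p_e = sum(p * p for p in p_j)   # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)

# Toy example: 4 vignettes, 3 raters, 2 categories (acceptable / not).
ratings = [[3, 0], [0, 3], [2, 1], [1, 2]]
kappa = fleiss_kappa(ratings)
```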
5. Comparative Positioning and Domain-Specific Models
Direct comparison between GPT-5 and purpose-built or domain-tuned models (e.g., RadOnc-GPT (Liu et al., 2023), CancerChat (Liu et al., 19 Jan 2024), RO-LMM (Kim et al., 2023)) reveals several trends:
- Generalist vs. Specialist: GPT-5 outperforms prior generalist LLMs and base models on most benchmarks, but domain-adapted LLMs and LVMs can surpass it on tightly defined tasks, e.g., breast cancer radiotherapy segmentation or clinical coding (Kim et al., 2023).
- Instruction Tuning: CancerChat, a 7B-parameter LLM, achieves near-parity with ChatGPT in blind preference tests after instruction tuning on radiation oncology–specific data (Liu et al., 19 Jan 2024). This suggests that further gains for GPT-5 could be realized through targeted fine-tuning and supervised calibration against high-quality domain datasets.
Model (Benchmark) | Task/Domain | Performance Metric | Result |
---|---|---|---|
GPT-5 | TXIT | Overall MCQ accuracy | 92.8% |
RadOnc-GPT | ICD Coding | ROUGE-1/-2/-L | 0.705/0.620/0.703 |
RO-LMM | Breast RT Segmentation | Dice, IoU, HD-95 | Outperforms baselines |
CancerChat | ROND Evaluations | Blind chat preference | ChatGPT: 26/50, CancerChat: 24/50 |
6. Education, Transparency, and Responsible Conduct
GPT-5’s reach in educational settings is substantial: expert-level performance on randomized radiation oncology physics exam questions suggests it is well-suited for trainee self-testing, personalized practice sets, and just-in-time explanation of complex physics (Wang et al., 14 Dec 2024). Prompt engineering techniques—“explain first” and “step-by-step”—substantially boost factual accuracy on challenging items (up to +44% gain for math-based questions) and could be codified in educational tools.
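The two prompting techniques named above can be codified as reusable templates. The wording below is hypothetical; the cited study's exact prompts may differ.

```python
# Hypothetical templates for the two prompting techniques named above.

def explain_first(question: str) -> str:
    # Ask the model to lay out the relevant physics before committing to an answer.
    return (f"{question}\n\nBefore selecting an answer, explain the relevant "
            f"physics concepts, then state your final choice.")

def step_by_step(question: str) -> str:
    # Ask the model to show each calculation, which aids math-heavy items.
    return (f"{question}\n\nWork through the problem step by step, showing "
            f"each calculation, then give the final answer.")

q = "What is the TAR for a 10x10 cm field at 10 cm depth for Co-60?"
prompt = step_by_step(q)
```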
All papers emphasize the need for responsible conduct, including transparent model reporting, rigorous evaluation for bias and fairness, provenance tracking for clinical recommendations, and safeguards against adversarial data use (Kang et al., 2019, Khanmohammadi et al., 2023). Quality assurance frameworks involving routine, case-specific performance monitoring and regulatory compliance (e.g., HIPAA, FDA guidelines) are prerequisites for safe clinical implementation.
7. Future Perspectives and Research Needs
While GPT-5 achieves or exceeds expert-level performance on many benchmarks, further developments are needed to translate these gains into robust, safe, and explainable clinical systems:
- Robust Multimodal Adaptation: Ongoing work targets fine-tuning for complex radiomic and imaging domains, with domain-specific optimization of visual encoders and alignment with clinical ground-truth.
- Causal Inference: Enhancements in causal reasoning frameworks (including explicit diagramming, chain-of-thought and tree-of-thought prompting) are likely to further improve error detection, incident analysis, and reasoning transparency (Wang et al., 24 Aug 2025).
- Clinical Validation: Prospective clinical trials and rigorous outcome monitoring remain the gold standard for assessing AI tool impact on patient care, safety, and operational efficiency.
- Expert Oversight: Sustained human-AI collaboration, supported by explicit error reporting and traceable decision chains, is critical. The prevailing consensus is that GPT-5 and its variants should be regarded as expert-augmenting, not expert-replacing, tools.
In summary, GPT-5 represents a significant advance in the integration of large multimodal LLMs into radiation oncology. It offers measurable improvements in reasoning, accuracy, and workflow efficiency across a spectrum of clinical and educational tasks, but its outputs—particularly in high-stakes and image-grounded contexts—require expert validation and careful oversight prior to clinical adoption. Continued research will be essential to ensure that the benefits of GPT-5’s heightened capabilities are realized safely and equitably in practice (Dinc et al., 29 Aug 2025).