Abstract
Radiology, radiation oncology, and medical physics require decision-making
that integrates medical images, textual reports, and quantitative data under
high-stakes conditions. With the introduction of GPT-5, it is critical to
assess whether recent advances in large multimodal models translate into
measurable gains in these safety-critical domains. We present a targeted
zero-shot evaluation of GPT-5 and its smaller variants (GPT-5-mini, GPT-5-nano)
against GPT-4o across three representative tasks: (1) VQA-RAD, a benchmark for
visual question answering in radiology; (2) SLAKE, a semantically annotated,
multilingual VQA dataset testing cross-modal grounding; and (3) a curated
Medical Physics Board Examination-style dataset of 150 multiple-choice
questions spanning treatment planning, dosimetry, imaging, and quality
assurance. Across all datasets, GPT-5 achieved the highest accuracy, with
substantial gains over GPT-4o: up to +20.00% in challenging anatomical regions
such as the chest-mediastinal region, +13.60% in lung-focused questions, and
+11.44% in brain-tissue interpretation. On the board-style physics questions, GPT-5
attained 90.7% accuracy (136/150), exceeding the estimated human passing
threshold, while GPT-4o trailed at 78.0%. These results demonstrate that GPT-5
delivers consistent and often pronounced performance improvements over GPT-4o
in both image-grounded reasoning and domain-specific numerical problem-solving,
highlighting its potential to augment expert workflows in medical imaging and
therapeutic physics.
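
The board-exam figures quoted in the abstract can be checked with simple arithmetic. The sketch below is illustrative only and is not taken from the paper: the helper name `accuracy_percent` is hypothetical, and the GPT-4o value is entered as the reported percentage because the abstract gives no raw count for it.

```python
def accuracy_percent(correct: int, total: int) -> float:
    """Return accuracy as the percentage of questions answered correctly."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


# GPT-5 result reported in the abstract: 136 correct out of 150 questions.
gpt5_acc = accuracy_percent(136, 150)

# GPT-4o is reported only as a percentage, so it is used as given.
gpt4o_acc = 78.0

# Difference between the two models, in percentage points.
gap_points = gpt5_acc - gpt4o_acc

print(f"GPT-5 accuracy: {gpt5_acc:.1f}%")
print(f"Gap over GPT-4o: {gap_points:.1f} percentage points")
```

Run as written, this reproduces the 90.7% figure for GPT-5 and gives a gap of roughly 12.7 percentage points over GPT-4o on the physics questions.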
Key Contributions
This paper presents a targeted zero-shot evaluation of GPT-5 and its smaller variants (GPT-5-mini, GPT-5-nano) against GPT-4o on three tasks spanning radiology, radiation oncology, and medical physics. It benchmarks the models on VQA-RAD, SLAKE, and a 150-question Medical Physics Board Examination-style dataset, assessing whether advances in large multimodal models translate into measurable gains in safety-critical medical domains.
Business Value
The evaluation shows where current multimodal models perform well on radiology visual question answering and board-style medical physics questions, which can guide the development and responsible deployment of AI tools in radiology and radiation oncology.