📄 Abstract
With the increasing integration of Multimodal Large Language Models (MLLMs)
into the medical field, comprehensive evaluation of their performance in
various medical domains becomes critical. However, existing benchmarks
primarily assess general medical tasks, inadequately capturing performance in
nuanced areas like the spine, which relies heavily on visual input. To address
this, we introduce SpineBench, a comprehensive Visual Question Answering (VQA)
benchmark designed for fine-grained analysis and evaluation of MLLMs in the
spinal domain. SpineBench comprises 64,878 QA pairs from 40,263 spine images,
covering 11 spinal diseases through two critical clinical tasks: spinal disease
diagnosis and spinal lesion localization, both in multiple-choice format.
SpineBench is built by integrating and standardizing image-label pairs from
open-source spinal disease datasets; for each VQA pair, it samples challenging
hard-negative options based on visual similarity (similar but distinct
diseases), simulating challenging real-world scenarios. We evaluate 12 leading
MLLMs on SpineBench. The results reveal that these models perform poorly on
spinal tasks, highlighting the limitations of current MLLMs in the spine
domain and guiding future improvements in spinal medicine applications.
SpineBench is publicly available at
https://zhangchenghanyu.github.io/SpineBench.github.io/.
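The hard-negative sampling described above can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the authors' released code: it assumes each disease is represented by a prototype image embedding and picks, as distractor options, the diseases whose prototypes are most visually similar to the query image (excluding the true label). The function names and the use of cosine similarity are assumptions for illustration.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def sample_hard_negatives(image_emb, disease_prototypes, true_label, k=3):
    """Return the k diseases whose prototype embeddings are most
    visually similar to the query image, excluding the true label.
    These serve as 'similar but not the same disease' distractors."""
    scores = {
        disease: cosine_sim(image_emb, proto)
        for disease, proto in disease_prototypes.items()
        if disease != true_label
    }
    # Highest similarity first: the hardest negatives.
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy example with 2-D embeddings standing in for real image features.
prototypes = {
    "disc_herniation": [1.0, 0.0],
    "spondylolisthesis": [0.9, 0.1],
    "scoliosis": [0.0, 1.0],
    "vertebral_fracture": [0.95, 0.05],
}
negs = sample_hard_negatives([1.0, 0.0], prototypes, "disc_herniation", k=2)
```

A design choice worth noting: sampling distractors by visual similarity rather than uniformly at random forces the evaluated MLLM to rely on fine-grained visual cues instead of eliminating implausible options by text alone.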
Key Contributions
This paper introduces SpineBench, a comprehensive Visual Question Answering (VQA) benchmark specifically designed for evaluating Multimodal Large Language Models (MLLMs) in the domain of spinal pathology. It comprises a large dataset of QA pairs derived from spine images, covering 11 diseases and two critical clinical tasks, enabling fine-grained assessment of MLLM performance.
Business Value
Developing robust evaluation benchmarks for medical MLLMs is crucial for ensuring their safety and efficacy in clinical settings, which in turn accelerates adoption and improves patient care.