Abstract
Medical image analysis is essential in modern healthcare. Deep learning has
shifted research focus toward complex medical multimodal tasks, including
report generation and visual question answering, where traditional
task-specific models often fall short. Large vision-language models (LVLMs)
offer new solutions for such tasks. In this study, we
build on the popular LLaVA framework to systematically explore model
architectures and training strategies for both 2D and 3D medical LVLMs. We
present extensive empirical findings and practical guidance. To support
reproducibility and future research, we release a modular codebase, MedM-VL,
and two pre-trained models: MedM-VL-2D for 2D medical image analysis and
MedM-VL-CT-Chest for 3D CT-based applications. The code is available at:
https://github.com/MSIIP/MedM-VL
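
For context on the LLaVA framework the abstract builds on, the sketch below shows the generic LLaVA-style composition: a vision encoder produces patch features, a small MLP connector projects them into the LLM embedding space, and the visual tokens are prepended to the text-token embeddings before the language model runs. All class, parameter, and module names here are illustrative assumptions, not the actual MedM-VL API; see the repository for the real interfaces.

```python
import torch
import torch.nn as nn


class LLaVAStyleVLM(nn.Module):
    """Minimal sketch of a LLaVA-style model: encoder -> connector -> LLM."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g., a 2D ViT or a 3D CT encoder
        # Two-layer MLP connector mapping vision features into the LLM space,
        # as in LLaVA-1.5-style architectures.
        self.connector = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm  # stand-in for a decoder-only language model

    def forward(self, pixel_values: torch.Tensor,
                text_embeds: torch.Tensor) -> torch.Tensor:
        feats = self.vision_encoder(pixel_values)   # (B, N_patches, vision_dim)
        visual_tokens = self.connector(feats)       # (B, N_patches, llm_dim)
        # Prepend visual tokens to the text-token embeddings and run the LLM.
        return self.llm(torch.cat([visual_tokens, text_embeds], dim=1))


if __name__ == "__main__":
    # Shape-only smoke test with dummy components.
    B, N, T = 2, 196, 8
    vision_dim, llm_dim = 1024, 4096
    enc = nn.Linear(3, vision_dim)  # dummy encoder: (B, N, 3) -> (B, N, vision_dim)
    llm = nn.Identity()             # dummy LLM that just returns its input
    model = LLaVAStyleVLM(enc, llm, vision_dim, llm_dim)
    out = model(torch.randn(B, N, 3), torch.randn(B, T, llm_dim))
    print(out.shape)                # torch.Size([2, 204, 4096])
```

In this paradigm, swapping the vision encoder is the main architectural change between 2D and 3D inputs, which plausibly reflects how MedM-VL-2D and MedM-VL-CT-Chest differ; the connector and LLM structure can remain largely unchanged.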