📄 Abstract
Medical visual question answering (Med-VQA) is a crucial multimodal task in
clinical decision support and telemedicine. Recent self-attention-based methods
struggle to handle cross-modal semantic alignment between vision and language
effectively. Moreover, classification-based methods rely on predefined answer
sets; treating the task as simple classification cannot adapt to the diversity
of free-form answers and overlooks their detailed semantics. To tackle these
challenges,
we introduce a Cross-Mamba Interaction based Multi-Task Learning (CMI-MTL)
framework that learns cross-modal feature representations from images and
texts. CMI-MTL comprises three key modules: fine-grained visual-text feature
alignment (FVTA), cross-modal interleaved feature representation (CIFR), and
free-form answer-enhanced multi-task learning (FFAE). FVTA extracts the most
relevant regions in image-text pairs through fine-grained visual-text feature
alignment. CIFR captures cross-modal sequential interactions via cross-modal
interleaved feature representation. FFAE leverages auxiliary knowledge from
open-ended questions through free-form answer-enhanced multi-task learning,
improving the model's capability for open-ended Med-VQA. Experimental results
show that CMI-MTL outperforms existing state-of-the-art methods on three
Med-VQA datasets: VQA-RAD, SLAKE, and OVQA. Furthermore, we conduct
interpretability experiments to demonstrate its effectiveness. The code is publicly
available at https://github.com/BioMedIA-repo/CMI-MTL.
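The cross-modal interleaved representation idea (CIFR) can be sketched in miniature: alternate visual and text tokens into one sequence, then run a sequential scan over it so each modality's tokens condition on the other's. The sketch below is an illustrative assumption, not the paper's implementation; the `interleave_tokens` and `ssm_scan` helpers are hypothetical, and the toy diagonal recurrence merely stands in for a Mamba block's selective scan.

```python
import numpy as np

rng = np.random.default_rng(0)

def interleave_tokens(visual, text):
    """Alternate visual and text tokens (CIFR-style sketch; the paper's
    exact interleaving scheme may differ)."""
    n = min(len(visual), len(text))
    mixed = np.empty((2 * n, visual.shape[1]), dtype=visual.dtype)
    mixed[0::2] = visual[:n]  # even positions: visual tokens
    mixed[1::2] = text[:n]    # odd positions: text tokens
    # append leftover tokens from the longer modality
    tail = visual[n:] if len(visual) > len(text) else text[n:]
    return np.concatenate([mixed, tail], axis=0)

def ssm_scan(x, decay=0.9):
    """Toy diagonal state-space recurrence h_t = decay*h_{t-1} + x_t,
    a stand-in for a Mamba block's selective scan."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t, xt in enumerate(x):
        h = decay * h + xt
        out[t] = h
    return out

visual = rng.normal(size=(5, 8))  # 5 visual tokens, dim 8
text = rng.normal(size=(7, 8))    # 7 text tokens, dim 8
mixed = interleave_tokens(visual, text)
fused = ssm_scan(mixed)
print(mixed.shape, fused.shape)   # (12, 8) (12, 8)
```

Because the scan's hidden state flows through the interleaved sequence, every text token is processed with the preceding visual token already folded into the state (and vice versa), which is the intuition behind cross-modal sequential interaction.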
Authors (10)
Qiangguo Jin
Xianyao Zheng
Hui Cui
Changming Sun
Yuqi Fang
Cong Cong
+4 more
Submitted
November 3, 2025
PG2025 Conference Papers, Posters, and Demos, 2025
Key Contributions
Introduces CMI-MTL, a Cross-Mamba Interaction based Multi-Task Learning framework for Med-VQA. It enhances cross-modal alignment using Mamba's capabilities, addresses limitations of classification-based methods by supporting free-form answers, and improves learning of cross-modal feature representations.
Business Value
Enhances AI-driven diagnostic support systems, potentially improving accuracy and efficiency in medical image interpretation and patient consultation.