📄 Abstract
Medical visual question answering (Med-VQA) is a crucial multimodal task in
clinical decision support and telemedicine. Recent self-attention-based methods
struggle to handle cross-modal semantic alignment between vision and language
effectively. Moreover, classification-based methods rely on predefined answer
sets; treating the task as simple classification cannot adapt to the diversity
of free-form answers and overlooks their detailed semantics. To tackle these
challenges,
we introduce a Cross-Mamba Interaction based Multi-Task Learning (CMI-MTL)
framework that learns cross-modal feature representations from images and
texts. CMI-MTL comprises three key modules: fine-grained visual-text feature
alignment (FVTA), cross-modal interleaved feature representation (CIFR), and
free-form answer-enhanced multi-task learning (FFAE). FVTA extracts the most
relevant regions in image-text pairs through fine-grained visual-text feature
alignment. CIFR captures cross-modal sequential interactions via cross-modal
interleaved feature representation. FFAE leverages auxiliary knowledge from
open-ended questions through free-form answer-enhanced multi-task learning,
improving the model's capability for open-ended Med-VQA. Experimental results
show that CMI-MTL outperforms existing state-of-the-art methods on three
Med-VQA datasets: VQA-RAD, SLAKE, and OVQA. Furthermore, we conduct
interpretability experiments to demonstrate its effectiveness. The code is publicly
available at https://github.com/BioMedIA-repo/CMI-MTL.
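The cross-modal interleaved representation idea (CIFR) can be sketched in miniature: alternate visual and text tokens into one sequence, then run a sequential scan over it so each modality's tokens condition on the other's. The sketch below is an illustrative assumption, not the paper's implementation; the `interleave_tokens` and `ssm_scan` helpers are hypothetical, and the toy diagonal recurrence merely stands in for a Mamba block's selective scan.

```python
import numpy as np

rng = np.random.default_rng(0)

def interleave_tokens(visual, text):
    """Alternate visual and text tokens (CIFR-style sketch; the paper's
    exact interleaving scheme may differ)."""
    n = min(len(visual), len(text))
    mixed = np.empty((2 * n, visual.shape[1]), dtype=visual.dtype)
    mixed[0::2] = visual[:n]  # even positions: visual tokens
    mixed[1::2] = text[:n]    # odd positions: text tokens
    # append leftover tokens from the longer modality
    tail = visual[n:] if len(visual) > len(text) else text[n:]
    return np.concatenate([mixed, tail], axis=0)

def ssm_scan(x, decay=0.9):
    """Toy diagonal state-space recurrence h_t = decay*h_{t-1} + x_t,
    a stand-in for a Mamba block's selective scan."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t, xt in enumerate(x):
        h = decay * h + xt
        out[t] = h
    return out

visual = rng.normal(size=(5, 8))  # 5 visual tokens, dim 8
text = rng.normal(size=(7, 8))    # 7 text tokens, dim 8
mixed = interleave_tokens(visual, text)
fused = ssm_scan(mixed)
print(mixed.shape, fused.shape)   # (12, 8) (12, 8)
```

Because the scan's hidden state flows through the interleaved sequence, every text token is processed with the preceding visual token already folded into the state (and vice versa), which is the intuition behind cross-modal sequential interaction.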
Authors (10)
Qiangguo Jin
Xianyao Zheng
Hui Cui
Changming Sun
Yuqi Fang
Cong Cong
+4 more
Submitted
November 3, 2025
PG2025 Conference Papers, Posters, and Demos, 2025
Key Contributions
Introduces CMI-MTL, a Cross-Mamba Interaction based Multi-Task Learning framework for Med-VQA. It enhances cross-modal alignment using Mamba's capabilities, addresses limitations of classification-based methods by supporting free-form answers, and improves learning of cross-modal feature representations.
Business Value
Enhances AI-driven diagnostic support systems, potentially improving accuracy and efficiency in medical image interpretation and patient consultation.