Abstract
Current Retrieval-Augmented Generation (RAG) systems primarily operate on
unimodal textual data, limiting their effectiveness on unstructured multimodal
documents. Such documents often combine text, images, tables, equations, and
graphs, each contributing unique information. In this work, we present a
Modality-Aware Hybrid retrieval Architecture (MAHA), designed specifically for
multimodal question answering with reasoning through a modality-aware knowledge
graph. MAHA integrates dense vector retrieval with structured graph traversal,
where the knowledge graph encodes cross-modal semantics and relationships. This
design enables both semantically rich and context-aware retrieval across
diverse modalities. Evaluations on multiple benchmark datasets demonstrate that
MAHA substantially outperforms baseline methods, achieving a ROUGE-L score of
0.486 while providing complete modality coverage. These results highlight MAHA's
ability to combine embeddings with explicit document structure, enabling
effective multimodal retrieval. Our work establishes a scalable and
interpretable retrieval framework that advances RAG systems by enabling
modality-aware reasoning over unstructured multimodal data.
Authors
Rashmi R
Vidyadhar Upadhya
Submitted
October 16, 2025
Key Contributions
This paper introduces MAHA, a Modality-Aware Hybrid retrieval Architecture for multimodal RAG on unstructured data. MAHA leverages a modality-aware knowledge graph and combines dense vector retrieval with graph traversal to achieve semantically rich and context-aware retrieval across text, images, tables, and graphs, significantly outperforming unimodal baselines.
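The summary describes MAHA's retrieval mechanism only at a high level, so the sketch below is an illustrative reconstruction rather than the paper's implementation. It fuses dense embedding similarity with evidence propagated along knowledge-graph edges via a simple linear combination; the `Node` structure, `hybrid_retrieve` function, and the `alpha` fusion weight are all assumptions introduced here for illustration.

```python
"""Illustrative sketch of modality-aware hybrid retrieval.

Assumption: scores are fused as
    score = alpha * dense_similarity + (1 - alpha) * neighbor_evidence,
which is one plausible reading of "dense vector retrieval combined with
structured graph traversal"; the paper may use a different fusion rule.
"""

from dataclasses import dataclass, field
import math


@dataclass
class Node:
    node_id: str
    modality: str            # e.g. "text", "image", "table", "equation"
    embedding: list[float]   # dense vector from a (multi)modal encoder
    neighbors: list[str] = field(default_factory=list)  # cross-modal KG edges


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def hybrid_retrieve(query_emb: list[float], graph: dict[str, Node],
                    k: int = 5, alpha: float = 0.7) -> list[str]:
    """Rank nodes by dense similarity, then boost nodes whose KG
    neighbors also scored well, so a table or figure linked to a highly
    relevant paragraph is pulled into the retrieved context."""
    dense = {nid: cosine(query_emb, n.embedding) for nid, n in graph.items()}
    scores = {}
    for nid, node in graph.items():
        neighbor_evidence = max(
            (dense[m] for m in node.neighbors if m in dense), default=0.0)
        scores[nid] = alpha * dense[nid] + (1 - alpha) * neighbor_evidence
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    # Toy graph: a text paragraph linked to a table by a cross-modal edge.
    g = {
        "p1": Node("p1", "text", [1.0, 0.0], neighbors=["t1"]),
        "t1": Node("t1", "table", [0.2, 0.9], neighbors=["p1"]),
    }
    print(hybrid_retrieve([1.0, 0.1], g, k=2))  # table surfaces via its edge
```

The intuition matches the stated contribution: embeddings supply semantic matching, while explicit cross-modal edges let a relevant passage pull its linked table, image, or equation into context even when that item alone would score poorly against the query.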
Business Value
Enables more comprehensive and accurate information retrieval and generation from complex, unstructured documents (e.g., reports, manuals, web pages), improving knowledge management and decision support.