Abstract
Document images encapsulate a wealth of knowledge, and the portability of
spoken queries enables broader, more flexible application scenarios. Yet no
prior work has explored knowledge-base question answering over visual document
images with queries provided directly in speech. We propose TextlessRAG, the
first end-to-end framework for speech-based question answering over large-scale
document images. Unlike prior methods, TextlessRAG eliminates ASR, TTS, and OCR,
directly interpreting speech, retrieving relevant visual knowledge, and
generating answers in a fully textless pipeline. To further boost performance,
we integrate a layout-aware reranking mechanism that refines retrieval.
Experiments demonstrate substantial improvements in both efficiency and
accuracy. To advance research in this direction, we also release the first
bilingual speech–document RAG dataset, featuring Chinese and English voice
queries paired with multimodal document content. Both the dataset and our
pipeline will be made available at
https://github.com/xiepeijinhit-hue/textlessrag
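As a rough illustration of the textless pipeline described in the abstract, the sketch below shows one plausible control flow: embed the spoken query directly (no ASR), retrieve candidate page images by cosine similarity (no OCR), rerank the candidates with a layout-aware score, and hand the winner to an answer generator. All encoders and region embeddings here are hypothetical placeholders introduced for illustration; this is not the released TextlessRAG implementation.

```python
# Minimal sketch of a textless speech-to-document RAG loop.
# All model components (embed_speech, embed_page, layout regions) are
# hypothetical stand-ins, not the authors' released code.
import numpy as np

rng = np.random.default_rng(0)
DIM = 512  # shared speech/vision embedding dimension (assumption)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def embed_speech(waveform: np.ndarray) -> np.ndarray:
    """Placeholder speech encoder: maps the raw query waveform straight
    into the shared embedding space, with no ASR transcript in between."""
    return rng.standard_normal(DIM)

def embed_page(page_image: np.ndarray) -> np.ndarray:
    """Placeholder visual document encoder: embeds a page image directly,
    with no OCR text extraction."""
    return rng.standard_normal(DIM)

def layout_score(query_vec: np.ndarray, region_vecs: list) -> float:
    """Placeholder layout-aware reranker: scores a page by its
    best-matching layout region (title, table, figure, ...) instead of
    the whole-page embedding alone."""
    return max(cosine(query_vec, r) for r in region_vecs)

# --- Offline indexing: embed every page image once. ---
pages = [rng.standard_normal((64, 64)) for _ in range(100)]  # dummy page images
index = np.stack([embed_page(p) for p in pages])

# --- Online query: speech in, answer out, fully textless. ---
query_wave = rng.standard_normal(16000)  # dummy 1-second waveform
q = embed_speech(query_wave)

# Stage 1: coarse retrieval by cosine similarity over the page index.
sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
top_k = np.argsort(sims)[::-1][:5]

# Stage 2: layout-aware reranking of the top-k candidates
# (dummy per-page region embeddings stand in for detected layout blocks).
regions = {i: [rng.standard_normal(DIM) for _ in range(4)] for i in top_k}
best = max(top_k, key=lambda i: layout_score(q, regions[i]))

print(f"retrieved page {best}; an answer generator would now condition on "
      f"the speech embedding and this page image.")
```

The two-stage structure is the point of the sketch: a cheap dense pass over the full page index narrows the search, and the layout-aware score is only computed for the few surviving candidates, which is where the abstract's reranking mechanism would plug in.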