📄 Abstract
Deep Research systems have revolutionized how LLMs solve complex questions
through iterative reasoning and evidence gathering. However, current systems
remain fundamentally constrained to textual web data, overlooking the vast
knowledge embedded in multimodal documents. Processing such documents demands
sophisticated parsing to preserve visual semantics (figures, tables, charts,
and equations), intelligent chunking to maintain structural coherence, and
adaptive retrieval across modalities, capabilities that existing systems lack.
In response, we present Doc-Researcher, a unified system that bridges
this gap through three integrated components: (i) deep multimodal parsing that
preserves layout structure and visual semantics while creating multi-granular
representations from chunk to document level, (ii) systematic retrieval
architecture supporting text-only, vision-only, and hybrid paradigms with
dynamic granularity selection, and (iii) iterative multi-agent workflows that
decompose complex queries, progressively accumulate evidence, and synthesize
comprehensive answers across documents and modalities. To enable rigorous
evaluation, we introduce M4DocBench, the first benchmark for Multi-modal,
Multi-hop, Multi-document, and Multi-turn deep research. Featuring 158
expert-annotated questions with complete evidence chains across 304 documents,
M4DocBench tests capabilities that existing benchmarks cannot assess.
Experiments demonstrate that Doc-Researcher achieves 50.6% accuracy, 3.4x
better than state-of-the-art baselines, validating that effective document
research requires not just better retrieval but, fundamentally, deep parsing
that preserves multimodal integrity and supports iterative research. Our work
establishes a new paradigm for conducting deep research on multimodal document
collections.
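The three components can be pictured as a small retrieval loop over multi-granular parsed units. The sketch below is illustrative only, not the paper's implementation: MultimodalChunk, select_granularity, and hybrid_score are hypothetical stand-ins for the multi-granular parsing outputs, the dynamic granularity selection, and the hybrid text/vision retrieval the abstract names; a real system would use text and vision embedding models rather than lexical overlap.

```python
from dataclasses import dataclass, field
from typing import List, Literal

Granularity = Literal["chunk", "section", "page", "document"]

@dataclass
class MultimodalChunk:
    # One parsed unit that keeps text and pointers to its visual elements
    # together, so figures/tables are not lost when the document is split.
    doc_id: str
    granularity: Granularity
    text: str
    visual_refs: List[str] = field(default_factory=list)  # e.g. paths to figure crops

def select_granularity(query: str) -> Granularity:
    # Toy stand-in for dynamic granularity selection: broad queries get
    # coarse units, specific queries get fine-grained chunks.
    broad = ("summarize", "overall", "across documents", "compare")
    return "document" if any(kw in query.lower() for kw in broad) else "chunk"

def hybrid_score(query: str, chunk: MultimodalChunk) -> float:
    # Placeholder hybrid score: lexical overlap for the text side, plus a
    # small bonus when the chunk carries visual evidence.
    q_terms = set(query.lower().split())
    c_terms = set(chunk.text.lower().split())
    text_score = len(q_terms & c_terms) / (len(q_terms) or 1)
    vision_bonus = 0.1 if chunk.visual_refs else 0.0
    return text_score + vision_bonus

def retrieve(query: str, corpus: List[MultimodalChunk], k: int = 5) -> List[MultimodalChunk]:
    # Filter the corpus to the selected granularity, then rank by hybrid score.
    gran = select_granularity(query)
    pool = [c for c in corpus if c.granularity == gran]
    return sorted(pool, key=lambda c: hybrid_score(query, c), reverse=True)[:k]
```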
Authors (12)
Kuicai Dong
Shurui Huang
Fangda Ye
Wei Han
Zhi Zhang
Dexun Li
+6 more
Submitted
October 24, 2025
Key Contributions
Presents Doc-Researcher, a unified system that extends deep research beyond text-only web data to multimodal documents. It features deep multimodal parsing that preserves visual semantics and layout, a retrieval architecture supporting text-only, vision-only, and hybrid paradigms with dynamic granularity, and iterative multi-agent workflows.
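The iterative multi-agent workflow can be read as a plan-retrieve-check loop that accumulates evidence until it suffices to answer. The sketch below assumes nothing about the paper's actual agents: plan, retrieve, enough, and synthesize are hypothetical callables standing in for the query-decomposition, evidence-gathering, sufficiency-checking, and answer-synthesis roles.

```python
from typing import Callable, List

def deep_research(question: str,
                  plan: Callable[[str, List[str]], List[str]],
                  retrieve: Callable[[str], List[str]],
                  enough: Callable[[str, List[str]], bool],
                  synthesize: Callable[[str, List[str]], str],
                  max_rounds: int = 4) -> str:
    # Each round, the planner emits sub-queries conditioned on the evidence
    # gathered so far; the retriever adds new evidence; a checker decides
    # whether enough has been accumulated to synthesize a final answer.
    evidence: List[str] = []
    for _ in range(max_rounds):
        for sub_q in plan(question, evidence):
            evidence.extend(retrieve(sub_q))
        if enough(question, evidence):
            break
    return synthesize(question, evidence)
```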
Business Value
Significantly enhances the ability of organizations to extract and leverage knowledge from diverse document types (reports, manuals, presentations), accelerating research and decision-making.