Model-Document Protocol for AI Search

Abstract

AI search depends on linking large language models (LLMs) with vast external knowledge sources. Yet web pages, PDF files, and other raw documents are not inherently LLM-ready: they are long, noisy, and unstructured. Conventional retrieval methods treat these documents as verbatim text and return raw passages, leaving the burden of fragment assembly and contextual reasoning to the LLM. This gap underscores the need for a new retrieval paradigm that redefines how models interact with documents. We introduce the Model-Document Protocol (MDP), a general framework that formalizes how raw text is bridged to LLMs through consumable knowledge representations. Rather than treating retrieval as passage fetching, MDP defines multiple pathways that transform unstructured documents into task-specific, LLM-ready inputs. These include agentic reasoning, which curates raw evidence into coherent context; memory grounding, which accumulates reusable notes to enrich reasoning; and structured leveraging, which encodes documents into formal representations such as graphs or key-value caches. All three pathways share the same goal: ensuring that what reaches the LLM is not raw fragments but compact, structured knowledge directly consumable for reasoning. As an instantiation, we present MDP-Agent, which realizes the protocol through an agentic process: constructing document-level gist memories for global coverage, performing diffusion-based exploration with vertical exploitation to uncover layered dependencies, and applying map-reduce style synthesis to integrate large-scale evidence into compact yet sufficient context. Experiments on information-seeking benchmarks demonstrate that MDP-Agent outperforms baselines, validating both the soundness of the MDP framework and the effectiveness of its agentic instantiation.
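The abstract names three stages for MDP-Agent but gives no implementation details, so the following is only a rough sketch of the overall shape: per-document gists, a selection step, and a map-reduce pass that produces compact context. Every name here (Document, build_gist, select_documents, map_step, reduce_step, build_context) is hypothetical, the LLM calls are replaced by simple keyword heuristics, and the diffusion-based exploration stage is reduced to a plain keyword filter over gists.

```python
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    text: str


def build_gist(doc: Document, max_words: int = 12) -> str:
    """Stand-in for an LLM-written gist: the first few words of the document."""
    return " ".join(doc.text.split()[:max_words])


def select_documents(query: str, gists: dict[str, str]) -> list[str]:
    """Keep documents whose gist shares at least one word with the query."""
    query_words = set(query.lower().split())
    return [doc_id for doc_id, gist in gists.items()
            if query_words & set(gist.lower().split())]


def map_step(query: str, doc: Document) -> list[str]:
    """Map: pull out the sentences of one document that mention a query word."""
    query_words = set(query.lower().split())
    sentences = [s.strip() for s in doc.text.split(".") if s.strip()]
    return [s for s in sentences if query_words & set(s.lower().split())]


def reduce_step(evidence: dict[str, list[str]]) -> str:
    """Reduce: merge per-document evidence into one attributed context string."""
    lines = []
    for doc_id, sentences in evidence.items():
        for sentence in sentences:
            lines.append(f"[{doc_id}] {sentence}.")
    return "\n".join(lines)


def build_context(query: str, docs: list[Document]) -> str:
    """Gists for coverage, keyword selection, then map-reduce synthesis."""
    gists = {d.doc_id: build_gist(d) for d in docs}
    selected = set(select_documents(query, gists))
    evidence = {d.doc_id: map_step(query, d) for d in docs if d.doc_id in selected}
    return reduce_step(evidence)
```

Calling build_context with a query and a list of Document objects returns only the sentences judged relevant, each tagged with its source document, instead of the raw passages. In a real system each heuristic would presumably be replaced by a model call.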

Authors (2)

Hongjin Qian

Zheng Liu

Submitted

October 29, 2025

arXiv Category

cs.CL

Key Contributions

Introduces the Model-Document Protocol (MDP), a framework that formalizes the interaction between LLMs and external documents. MDP defines multiple pathways (agentic reasoning, memory grounding, structured representations) to transform unstructured documents into task-specific, LLM-ready knowledge.
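One way to picture the "multiple pathways" idea is as interchangeable components behind a single interface, each turning raw documents into text the model can consume directly. The sketch below is an assumption made for illustration, not the paper's API: Pathway, MemoryGrounding, StructuredLeveraging, and prepare_input are hypothetical names, the transformations are toy heuristics, and the agentic reasoning pathway is left out.

```python
from typing import Protocol


class Pathway(Protocol):
    """Hypothetical interface: turn raw documents into LLM-ready text."""

    def transform(self, query: str, documents: list[str]) -> str: ...


class MemoryGrounding:
    """Toy memory pathway: accumulates deduplicated notes across calls."""

    def __init__(self) -> None:
        self.notes: list[str] = []

    def transform(self, query: str, documents: list[str]) -> str:
        for doc in documents:
            note = doc.split(".")[0].strip()  # first sentence stands in for a note
            if note and note not in self.notes:
                self.notes.append(note)
        return "\n".join(f"- {n}" for n in self.notes)


class StructuredLeveraging:
    """Toy structured pathway: encodes 'key: value' lines as a lookup table."""

    def transform(self, query: str, documents: list[str]) -> str:
        table: dict[str, str] = {}
        for doc in documents:
            for line in doc.splitlines():
                key, sep, value = line.partition(":")
                if sep:  # skip lines that have no key-value separator
                    table[key.strip()] = value.strip()
        return "\n".join(f"{k} = {v}" for k, v in sorted(table.items()))


def prepare_input(pathway: Pathway, query: str, documents: list[str]) -> str:
    """Build the model input from pathway output rather than raw passages."""
    knowledge = pathway.transform(query, documents)
    return f"Question: {query}\nKnowledge:\n{knowledge}"
```

Because both classes satisfy the same interface, prepare_input can swap one pathway for another without changing the calling code. MemoryGrounding keeps its notes between calls, which mirrors the "reusable notes" wording in the abstract.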

Business Value

Significantly improves the ability of AI search and Q&A systems to leverage vast amounts of unstructured information, leading to more accurate, comprehensive, and context-aware responses. Enhances knowledge discovery and accessibility.

Paper Metadata

Innovation Type

Framework/Protocol Design

Deployment Feasibility

High. The MDP is a conceptual framework that can guide the development of new retrieval and document processing pipelines for LLMs.

Limitations Addressed

Raw documents being LLM-unready (long, noisy, unstructured); conventional retrieval methods returning raw passages; burden of assembly and reasoning placed on the LLM

Technical Tags

AI search, large language models, retrieval-augmented generation, knowledge representation, document processing, model-document protocol, agentic reasoning, memory grounding

Research Topics

Information Retrieval, Large Language Models, Knowledge Management, Artificial Intelligence, Natural Language Processing

Methods & Architectures

Model-Document Protocol (MDP), Agentic Reasoning, Memory Grounding, Structured Knowledge Representation, Large Language Models (LLMs)

Applications & Tasks

Search Engines, Question Answering Systems, Knowledge Bases, Document Analysis, Bridging LLMs and external knowledge, Processing unstructured documents, Improving retrieval relevance, Enhancing LLM reasoning, Transforming raw documents into LLM-ready inputs, Curating evidence for LLM reasoning, Grounding LLM responses in external knowledge

Related Fields

Information Retrieval, Natural Language Processing, Knowledge Representation, Artificial Intelligence, Software Engineering

Keywords

AI Search, Large Language Models, LLMs, Model-Document Protocol, MDP, Information Retrieval, Knowledge Representation, Document Processing, Agentic Reasoning, Memory Grounding, Unstructured Data, Retrieval-Augmented Generation

Academic Context

#Information Retrieval, #Large Language Models, #Knowledge Management, #Artificial Intelligence, #Natural Language Processing

Commercial Potential

Potential Products

Next-generation AI search engines, intelligent document analysis platforms, knowledge management systems, advanced Q&A bots

Target Industries

Technology, Information Services, Legal, Finance, Research

Use Case Examples

Building a search engine that can synthesize information from multiple complex legal documents. Creating a Q&A system that can answer questions based on a large corpus of technical manuals. Developing tools for automated literature review and knowledge synthesis.

Competitive Edge

Proposes a fundamental shift in how LLMs interact with documents, moving beyond simple passage retrieval to a more structured and intelligent knowledge integration process.

Market Opportunity

Massive market for search, knowledge management, and AI-powered information access.

Revenue Models

SaaS for AI search platforms, API access, licensing of specialized document processing modules.

Resource Requirements

Compute Needs

High (for running LLMs and complex processing pipelines)

Data Requirements

Vast amounts of unstructured documents (web pages, PDFs, etc.).

Deployment Constraints

Computational cost, latency, complexity of implementing agentic reasoning and memory systems.

Scalability

The protocol is designed for large document collections, and MDP-Agent's map-reduce style synthesis is aimed at integrating large-scale evidence, which suggests good scalability.

Regulatory Considerations

Data privacy and copyright issues related to document processing.

Production Readiness

Maturity Level

Conceptual Framework/Research

Time to Market

1-3 years

Patent Potential

Moderate, for specific implementations of agentic reasoning or memory grounding within the MDP.
