MDP: Structuring Document Data for LLMs
- MDP is a framework that converts verbose, noisy document collections into structured, LLM-ready representations through multi-stage transformations.
- It integrates agentic reasoning, memory grounding, and structured leveraging to mitigate data chaos and enhance multi-step information retrieval.
- Empirical evaluations of MDP-Agent report 50.5% average Exact Match on GAIA and 53.1% accuracy on WebWalkerQA, ahead of WebThinker at 48.5% and 46.5%, while keeping reasoning tokens at roughly 8.9K.
The Model-Document Protocol (MDP) is a general framework for AI search that formalizes how raw documents are bridged to LLMs through consumable knowledge representations rather than verbatim passage retrieval (Qian et al., 29 Oct 2025). It addresses the setting in which a corpus of web pages, PDF files, and related sources is first reduced to a noisy subset by conventional retrieval, yet remains long, redundant, and unstructured. In MDP, this condition is treated as “Data Chaos,” and the central objective is to transform such raw material into compact, structured knowledge that is directly usable for reasoning. The framework is instantiated in MDP-Agent, an agentic system that combines document-level gist memories, diffusion-based exploration with vertical exploitation, and map-reduce style synthesis to improve information-seeking performance on multi-step benchmarks (Qian et al., 29 Oct 2025).
1. Formal definition and problem setting
MDP is defined over a raw document corpus $\mathcal{D}$. A simple retrieval step, whether keyword-based or embedding-based, produces a subset $\mathcal{D}_q \subseteq \mathcal{D}$, but the resulting material still embodies “Data Chaos”: long, noisy, and unstructured text (Qian et al., 29 Oct 2025). The framework introduces a multi-stage transformation $\Phi$ that converts $\mathcal{D}_q$ into a representation $\mathcal{K}$ consumable by an LLM $\mathcal{M}$:
$$\mathcal{K} = \Phi(\mathcal{D}_q \mid \pi)$$
Here $\pi$ denotes the “protocol specification,” described as an abstract policy that unifies multiple transformation pathways (Qian et al., 29 Oct 2025). The protocol therefore redefines retrieval not as passage fetching, but as the construction of task-specific, LLM-ready inputs.
The paper further specifies that conditioning the model $\mathcal{M}$ on a query $q$ under MDP takes the form
$$y \sim \mathcal{M}(\,\cdot \mid q, \mathcal{K})$$
and that the transformed representation $\mathcal{K}$ satisfies
$$H(\mathcal{K}) \ll H(\mathcal{D}_q)$$
with $H$ denoting entropy and $\mathcal{D}_q$ the raw retrieved subset, so that what reaches the model has substantially lower entropy and higher signal-to-noise than the raw retrieved subset (Qian et al., 29 Oct 2025). Within the paper’s terminology, this entropy reduction is the formal expression of the shift from unstructured evidence to compact knowledge.
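As a rough illustration of the entropy claim, the sketch below compares a noisy retrieved snippet with a compact restatement, using the empirical Shannon entropy of their word distributions. The proxy (token count multiplied by bits per token), the function names, and both sample strings are assumptions made for illustration; the source does not state how the paper measures entropy.

```python
import math
from collections import Counter


def unigram_entropy_bits(text: str) -> float:
    """Empirical Shannon entropy, in bits per token, of the word distribution."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def total_bits(text: str) -> float:
    """Token count times bits per token: a crude proxy, not the paper's measure."""
    return len(text.split()) * unigram_entropy_bits(text)


# Hypothetical samples: a scraped page fragment and a curated restatement.
raw_subset = (
    "Home Login Share Subscribe to our newsletter . The Eiffel Tower is "
    "330 metres tall . Related posts cookie policy Home Login Share"
)
compact = "Eiffel Tower height: 330 metres"

print(f"raw:     {total_bits(raw_subset):.1f} bits")
print(f"compact: {total_bits(compact):.1f} bits")
```

Under this proxy the compact string carries the same answer in far fewer bits, which is the direction of the inequality described above.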
A plausible implication is that MDP should be understood less as a single retrieval algorithm than as an interface layer between retrieval and reasoning. The paper supports this interpretation by presenting MDP as a “general framework” and a “multi-pathway interface” rather than a monolithic method (Qian et al., 29 Oct 2025).
2. The three transformation pathways
The transformation $\Phi$ factorizes into three complementary pathways:
$$\Phi = \{\Phi_{\text{agentic}},\ \Phi_{\text{memory}},\ \Phi_{\text{structured}}\}$$
The paper names these pathways agentic reasoning, memory grounding, and structured leveraging (Qian et al., 29 Oct 2025). Each pathway addresses a distinct failure mode of treating retrieval as raw text return.
Agentic reasoning
The agentic pathway, $\Phi_{\text{agentic}}$, is defined as iterative evidence curation (Qian et al., 29 Oct 2025). Complex tasks are modeled as a knowledge chain
$$\mathcal{K}_1 \rightarrow \mathcal{K}_2 \rightarrow \cdots \rightarrow \mathcal{K}_n$$
where each stage is itself a conjunction of atomic subspaces:
$$\mathcal{K}_i = k_{i,1} \wedge k_{i,2} \wedge \cdots \wedge k_{i,m_i}$$
Each atomic subspace is retrieved via a single query. MDP-Agent operates by issuing an information intent $I_i$ at each step, decomposing that intent into atomic queries $\{q_{i,1}, \dots, q_{i,m_i}\}$, checking sufficiency, and advancing the reasoning only after the current intent has been resolved (Qian et al., 29 Oct 2025). The paper formalizes this sequential composition through a chain of operators $T_1, \dots, T_n$:
$$T = T_n \circ T_{n-1} \circ \cdots \circ T_1$$
This formulation emphasizes that the system is intended to accumulate structured intermediate knowledge rather than repeatedly prompting the model over raw fragments.
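The control flow described above (issue an intent, decompose it, retrieve, check sufficiency, then advance) can be sketched as a small loop. Every hook in `AgentHooks`, along with the `max_steps` and `max_rounds` limits, is a hypothetical stand-in for an LLM or retriever call; this is a sketch under those assumptions and not the paper's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    intent: str
    knowledge: str


@dataclass
class AgentHooks:
    """Hypothetical hooks; in practice each would wrap an LLM or a retriever."""
    decompose: Callable[[str], List[str]]            # intent -> atomic queries
    retrieve: Callable[[str], List[str]]             # atomic query -> evidence
    is_sufficient: Callable[[str, List[str]], bool]  # intent, evidence -> done?
    refine: Callable[[str, List[str]], List[str]]    # follow-up queries
    synthesize: Callable[[str, List[str]], str]      # evidence -> compact knowledge
    next_intent: Callable[[List[Step]], Optional[str]]


def resolve_chain(first_intent: str, hooks: AgentHooks,
                  max_steps: int = 8, max_rounds: int = 3) -> List[Step]:
    """Resolve intents one at a time, advancing only after each is handled."""
    chain: List[Step] = []
    intent: Optional[str] = first_intent
    while intent is not None and len(chain) < max_steps:
        queries = hooks.decompose(intent)
        evidence: List[str] = []
        for _ in range(max_rounds):
            for query in queries:
                evidence.extend(hooks.retrieve(query))
            if hooks.is_sufficient(intent, evidence):
                break
            queries = hooks.refine(intent, evidence)
        chain.append(Step(intent, hooks.synthesize(intent, evidence)))
        intent = hooks.next_intent(chain)
    return chain


# Toy usage with an in-memory "corpus" keyed by query string.
corpus = {
    "who directed Film X": ["Film X was directed by Ann Lee."],
    "Ann Lee birthplace": ["Ann Lee was born in Oslo."],
}
hooks = AgentHooks(
    decompose=lambda intent: [intent],
    retrieve=lambda query: corpus.get(query, []),
    is_sufficient=lambda intent, evidence: len(evidence) > 0,
    refine=lambda intent, evidence: [intent],
    synthesize=lambda intent, evidence: " ".join(evidence),
    next_intent=lambda chain: "Ann Lee birthplace" if len(chain) == 1 else None,
)
chain = resolve_chain("who directed Film X", hooks)
```

The second intent is only planned once the first has produced knowledge, which mirrors the stage-by-stage ordering of the knowledge chain.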
Memory grounding
The memory pathway, $\Phi_{\text{memory}}$, creates a textual “gist memory” $m_i$ for each raw document $d_i$ using a long-context LLM (Qian et al., 29 Oct 2025). These gists summarize global themes and structural cues, including titles, topics, and section hierarchy. The paper then indexes $m_i$ densely as a vector $\mathbf{v}_i$ while indexing the raw document sparsely under BM25. Query-document relevance is defined by combining the dense similarity between the query and $\mathbf{v}_i$ with the sparse BM25 score of $d_i$.
The stated role of gist memories is persistent and reusable note-taking: they filter irrelevant documents cheaply and guide exploration with historical context (Qian et al., 29 Oct 2025). In the paper’s framing, this is not merely summarization; it is a retrieval substrate that supplies global document abstractions.
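One way to combine a dense gist score with a sparse BM25 score is a weighted sum, sketched below in pure Python. The weight `alpha`, the max-normalization of BM25, the Lucene-style IDF, and the hand-written two-dimensional vectors are all assumptions for illustration; the source does not give the exact combination rule, and a real system would obtain the vectors from an embedding model.

```python
import math
from collections import Counter
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def bm25_scores(query: str, docs: List[str],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Okapi BM25 over whitespace tokens, using the Lucene-style IDF."""
    tokenized = [d.lower().split() for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n if n else 0.0
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            denom = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / denom
        scores.append(score)
    return scores


def hybrid_relevance(query: str, query_vec: Sequence[float], docs: List[str],
                     gist_vecs: List[Sequence[float]],
                     alpha: float = 0.5) -> List[float]:
    """Weighted sum of gist cosine similarity and max-normalized BM25."""
    sparse = bm25_scores(query, docs)
    top = max(sparse, default=0.0) or 1.0  # avoid dividing by zero
    return [alpha * cosine(query_vec, g) + (1 - alpha) * s / top
            for g, s in zip(gist_vecs, sparse)]


docs = ["solar panel efficiency report 2024", "recipe for lemon cake"]
gist_vecs = [[1.0, 0.0], [0.0, 1.0]]  # toy gist embeddings
scores = hybrid_relevance("solar efficiency", [0.9, 0.1], docs, gist_vecs)
```

Keeping the dense signal on the gist and the sparse signal on the raw text lets the gist decide whether a document is on topic while BM25 rewards exact term matches.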
Structured leveraging
The structured pathway, $\Phi_{\text{structured}}$, encodes curated evidence from the other pathways into formal representations such as relational graphs, key-value caches, or symbolic schemas (Qian et al., 29 Oct 2025). These structures are intended to support direct tool-style consumption by LLMs and further algorithmic processing, while also enhancing interpretability and cross-task reuse.
This suggests that MDP is designed to be extensible beyond the specific realization described in the paper. The inclusion of graphs, caches, and schemas indicates that the protocol is intended to accommodate multiple downstream representations as long as they convert document evidence into compact, structured knowledge (Qian et al., 29 Oct 2025).
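To make the idea of structured leveraging concrete, the sketch below stores curated evidence as subject-relation-object facts and exposes them both as a small relational graph and as a key-value lookup. The `Fact` schema and all function names are hypothetical; the source names graphs, key-value caches, and symbolic schemas only as example targets.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Optional, Tuple


class Fact(NamedTuple):
    subject: str
    relation: str
    obj: str
    source: str  # id of the document the fact was curated from


def to_graph(facts: List[Fact]) -> Dict[str, List[Tuple[str, str]]]:
    """Adjacency map: subject -> list of (relation, object) edges."""
    graph: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for fact in facts:
        graph[fact.subject].append((fact.relation, fact.obj))
    return dict(graph)


def to_kv(facts: List[Fact]) -> Dict[Tuple[str, str], str]:
    """Key-value view: (subject, relation) -> object."""
    return {(f.subject, f.relation): f.obj for f in facts}


def follow(graph: Dict[str, List[Tuple[str, str]]], start: str,
           relations: List[str]) -> Optional[str]:
    """Walk a chain of relations from start; None if any hop is missing."""
    node = start
    for rel in relations:
        targets = [o for r, o in graph.get(node, []) if r == rel]
        if not targets:
            return None
        node = targets[0]
    return node


facts = [
    Fact("Film X", "directed_by", "Ann Lee", "doc-17"),
    Fact("Ann Lee", "born_in", "Oslo", "doc-42"),
]
graph = to_graph(facts)
answer = follow(graph, "Film X", ["directed_by", "born_in"])
```

A multi-hop question then becomes a short traversal rather than a second pass over raw text, and the `source` field keeps each fact traceable to its document.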
3. MDP-Agent as an instantiation of the protocol
MDP-Agent is presented as the agentic instantiation of MDP and is implemented in two stages: data indexing with gist memory and agentic knowledge discovery (Qian et al., 29 Oct 2025). The first stage constructs a searchable memory layer over the corpus; the second stage uses that layer to support multi-step reasoning.
In data indexing, each raw document $d_i$ is processed by a long-context LLM to produce a gist memory $m_i$, containing title information, structural cues, and high-level topics (Qian et al., 29 Oct 2025). The gist is encoded into a dense vector $\mathbf{v}_i$, while the original document is indexed under BM25. At query time, the relevance of each document is computed using the combined dense and sparse relevance score, and top candidates are selected for downstream exploration.
In agentic knowledge discovery, the input is a query $q$ and the output is a structured context $\mathcal{C}$ for the LLM $\mathcal{M}$ (Qian et al., 29 Oct 2025). The system initializes a context chain, plans an initial intent $I_1$ from the query, and iterates until the task is resolved. At each step it decomposes the current intent into atomic queries, retrieves documents, checks whether the evidence is sufficient, and if necessary generates additional conditioned queries. After evidence collection, the system performs filtering, parallel evidence extraction, synthesis into a compact subspace $\mathcal{K}_i$, formatting into a context component $c_i$, and transition to the next intent.
The paper’s pseudocode makes clear that MDP-Agent is organized around explicit state transitions rather than ad hoc prompting. The resulting context is a concatenation of structured intent-knowledge pairs, represented as a chain of the form $\mathcal{C} = [(I_1, \mathcal{K}_1), (I_2, \mathcal{K}_2), \dots, (I_n, \mathcal{K}_n)]$ (Qian et al., 29 Oct 2025). This organization is central to the claim that the final input to the LLM is compact yet sufficient.
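A chain of intent-knowledge pairs can be serialized into a compact prompt context in a few lines. The bracketed labels and the `format_context` name below are illustrative assumptions; the source does not specify the textual layout of the final context.

```python
from typing import List, Tuple


def format_context(chain: List[Tuple[str, str]]) -> str:
    """Render (intent, knowledge) pairs as numbered blocks, in chain order."""
    blocks = []
    for i, (intent, knowledge) in enumerate(chain, start=1):
        blocks.append(f"[Intent {i}] {intent}\n[Knowledge {i}] {knowledge}")
    return "\n\n".join(blocks)


pairs = [
    ("Who directed Film X?", "Ann Lee directed Film X."),
    ("Where was Ann Lee born?", "Ann Lee was born in Oslo."),
]
print(format_context(pairs))
```

Because each block pairs a question with its resolved answer, the downstream model sees only the distilled result of each step and none of the raw pages behind it.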
4. Exploration, exploitation, and synthesis mechanisms
Three mechanisms are singled out as key operational components of MDP-Agent: Diffusive Wide Exploration, Vertical Exploitation, and Memory-Guided Parallel Synthesis (Qian et al., 29 Oct 2025).
Diffusive Wide Exploration addresses intent misalignment. After the first round of queries, if the current knowledge subspace $\mathcal{K}_i$ is incomplete, the system generates conditioned queries to “diffuse” the frontier until coverage is sufficient (Qian et al., 29 Oct 2025). The term “wide” refers to broadening the search over potentially relevant evidence rather than committing prematurely to an incomplete interpretation of the intent.
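The diffusion step can be pictured as a bounded loop that keeps issuing conditioned queries for whatever is still missing. In the sketch below, `retrieve`, `missing`, and `condition` are hypothetical hooks, and the `max_depth` default is an assumed cap rather than a value prescribed by the source.

```python
from typing import Callable, List, Set, Tuple


def diffuse(initial_queries: List[str],
            retrieve: Callable[[str], List[str]],
            missing: Callable[[List[str]], List[str]],
            condition: Callable[[str, List[str]], str],
            max_depth: int = 5) -> Tuple[List[str], int]:
    """Widen the query frontier until no gaps remain or max_depth is hit.

    Returns the collected evidence and the number of rounds actually run.
    """
    evidence: List[str] = []
    seen: Set[str] = set()
    frontier = list(initial_queries)
    depth = 0
    for depth in range(1, max_depth + 1):
        for query in frontier:
            if query not in seen:
                seen.add(query)
                evidence.extend(retrieve(query))
        gaps = missing(evidence)
        if not gaps:
            break
        # Condition each follow-up query on the evidence gathered so far.
        frontier = [q for q in (condition(g, evidence) for g in gaps)
                    if q not in seen]
        if not frontier:  # nothing new to ask; stop instead of spinning
            break
    return evidence, depth


# Toy usage: the second query can only be written after the first answer.
corpus = {
    "Film X": ["director: Ann Lee"],
    "birthplace of Ann Lee": ["birthplace: Oslo"],
}
facets = ("director", "birthplace")
evidence, rounds = diffuse(
    ["Film X"],
    retrieve=lambda q: corpus.get(q, []),
    missing=lambda ev: [f for f in facets
                        if not any(e.startswith(f) for e in ev)],
    condition=lambda gap, ev: f"{gap} of {ev[-1].split(': ')[1]}",
)
```

The early exit on an empty frontier is a design choice for the sketch: it prevents the loop from repeating queries that have already failed to close a gap.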
Vertical Exploitation refers to the sequential resolution of intents in order to uncover layered dependencies (Qian et al., 29 Oct 2025). The paper’s description emphasizes that difficult information-seeking tasks are not flat retrieval problems; they are composed of dependencies that must be resolved stage by stage. The horizontal breadth of diffusion is thus paired with a vertical ordering over reasoning steps.
Memory-Guided Parallel Synthesis is formalized as
$$\mathcal{K}_i = \mathrm{Reduce}\big(\{\mathrm{Map}(d) : d \in \mathrm{Filter}(\mathcal{D}_i)\}\big)$$
where $\mathrm{Filter}$ filters the retrieved documents $\mathcal{D}_i$ by gist relevance, $\mathrm{Map}$ extracts fine-grained evidence from each retained document, and $\mathrm{Reduce}$ reduces the extracted evidence into a compact subspace $\mathcal{K}_i$ (Qian et al., 29 Oct 2025). In the accompanying explanation, this is described as a map-reduce style synthesis in which bulk text processing is handled in parallel, allowing the central LLM to focus on higher-value reasoning.
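The filter, map, and reduce stages can be sketched with Python's standard thread pool. The `threshold`, `workers`, `extract`, and `reduce_fn` parameters are assumptions for illustration; in a real system the extraction and reduction hooks would call auxiliary models, and the gist scores would come from the memory index.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def parallel_synthesis(docs: Dict[str, str],
                       gist_scores: Dict[str, float],
                       extract: Callable[[str], List[str]],
                       reduce_fn: Callable[[List[List[str]]], str],
                       threshold: float = 0.5,
                       workers: int = 4) -> str:
    """Filter by gist score, extract evidence in parallel, then reduce."""
    # Filter: drop documents whose gist relevance is below the threshold.
    kept = [doc_id for doc_id in docs
            if gist_scores.get(doc_id, 0.0) >= threshold]
    # Map: extract evidence from each kept document concurrently.
    # Executor.map returns results in the order of its inputs.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        extracted = list(pool.map(lambda doc_id: extract(docs[doc_id]), kept))
    # Reduce: merge the non-empty extractions into one compact result.
    return reduce_fn([part for part in extracted if part])


docs = {
    "a": "Intro.\nThe tower is 330 metres tall.\nFooter.",
    "b": "Cake recipe.\nAdd lemon.",
    "c": "History.\nThe tower opened in 1889.",
}
gist_scores = {"a": 0.9, "b": 0.1, "c": 0.7}
summary = parallel_synthesis(
    docs,
    gist_scores,
    extract=lambda text: [ln for ln in text.splitlines() if "tower" in ln],
    reduce_fn=lambda parts: " ".join(ln for part in parts for ln in part),
)
```

Only the reduced string reaches the central model, so its input grows with the amount of relevant evidence and not with the size of the retrieved pages.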
The paper interprets these mechanisms as a response to fragmentary retrieval. Diffusion seeks broad evidence coverage, vertical exploitation resolves dependency structure, and map-reduce synthesis integrates evidence at scale into a compact context (Qian et al., 29 Oct 2025). A plausible implication is that MDP-Agent is designed to distribute work across retrieval, filtering, extraction, and synthesis so that the final reasoning stage is not overloaded by raw textual volume.
5. Empirical evaluation
The empirical study evaluates MDP-Agent on two information-seeking benchmarks: GAIA and WebWalkerQA (Qian et al., 29 Oct 2025). GAIA is described as comprising 103 text-only multi-step queries and is reported with Exact Match at three difficulty levels. WebWalkerQA contains 680 long-horizon queries requiring web traversal and is reported with LLM Equivalence Accuracy (Qian et al., 29 Oct 2025).
The comparison set includes three baseline categories: direct reasoning with standalone LLMs, retrieval-augmented generation, and tool-integrated reasoning (Qian et al., 29 Oct 2025). The named systems include Qwen2.5-32B, QwQ-32B, GPT-4o, Gemini-2.5-Flash, DeepSeek-R1-671B, Vanilla RAG, query planning, iterative RAG, ReAct, Search-o1, and WebThinker.
On GAIA, MDP-Agent using QwQ-32B achieves Exact Match of 61.5, 46.2, and 33.3 on Levels 1, 2, and 3 respectively, for an average of 50.5%; the next best reported system is WebThinker-32B at 48.5% (Qian et al., 29 Oct 2025). On WebWalkerQA, MDP-Agent attains 53.1, 55.0, and 50.8 on Easy, Medium, and Hard, with an average of 53.1%; this is reported as a 6.6 percentage point gain over WebThinker at 46.5% (Qian et al., 29 Oct 2025). The paper states that these gains hold across baselines and therefore validate both the protocol and the agentic instantiation.
| Benchmark | MDP-Agent result | Comparison stated in the paper |
|---|---|---|
| GAIA | EM = 61.5/46.2/33.3, Avg 50.5% | Next best: WebThinker-32B, 48.5% |
| WebWalkerQA | 53.1/55.0/50.8, Avg 53.1% | 6.6 pp over WebThinker, 46.5% |
The ablation and analysis section reports several further findings (Qian et al., 29 Oct 2025). Stronger central reasoning models such as QwQ-32B and Qwen3-30B-A3B yield higher accuracy, while MDP-Agent consistently outperforms a tool-integrated reasoning baseline, Search-o1. The paper also reports transferability: the LLM-ready context produced by MDP-Agent improves downstream models ranging from Qwen3-8B to GPT-5. With respect to exploration depth, performance increases from diffusion depth 1 to 5 before saturating, and deeper diffusion reduces search intents while converging more reliably. In efficiency terms, gist filtering discards approximately 90% of irrelevant pages, and parallel map-reduce keeps reasoning tokens at approximately 8.9K compared with approximately 227K for bulk text processing (Qian et al., 29 Oct 2025).
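The reported figures can be cross-checked with simple arithmetic. The variable names are illustrative; the numbers are the ones quoted above.

```python
# Reasoning-token budgets reported for MDP-Agent versus bulk text processing.
tokens_mdp = 8_900
tokens_bulk = 227_000

reduction_factor = tokens_bulk / tokens_mdp       # about 25.5x fewer tokens
tokens_saved_pct = (1 - tokens_mdp / tokens_bulk) * 100  # about 96.1%

# Average-score gaps over WebThinker, in percentage points.
gaia_gap = round(50.5 - 48.5, 1)        # 2.0
webwalker_gap = round(53.1 - 46.5, 1)   # 6.6

print(f"{reduction_factor:.1f}x, {tokens_saved_pct:.1f}% saved")
print(f"GAIA +{gaia_gap} pp, WebWalkerQA +{webwalker_gap} pp")
```

The WebWalkerQA gap matches the 6.6 percentage points stated in the text, and the token figures imply that the central model reads roughly one twenty-fifth of the text it would see under bulk processing.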
6. Interpretation, significance, and boundaries
The paper characterizes the effectiveness of MDP and MDP-Agent through several linked claims (Qian et al., 29 Oct 2025). First, MDP transforms “Data Chaos” into “Knowledge Order” by reducing contextual entropy and producing minimal, structured evidence. Second, the separation into agentic reasoning, memory grounding, and structured leveraging creates an extensible interface between retrieval and LLM reasoning. Third, gist memory improves both coverage and precision in retrieval, while diffusive exploration and vertical exploitation jointly address broad fact coverage and layered dependency resolution. Fourth, map-reduce synthesis preserves scalability by delegating bulk text processing to lightweight auxiliary models in parallel.
These claims position MDP against a common assumption in retrieval-augmented systems: that the essential problem is to fetch better passages. The protocol instead treats the critical issue as the form in which document knowledge is exposed to the model (Qian et al., 29 Oct 2025). In that sense, MDP is not a rejection of retrieval but a reformulation of retrieval’s output contract.
A common misconception would be to equate MDP with a single agent pipeline or with gist summarization alone. The paper’s formalization argues against both interpretations. MDP is defined as a framework with three pathways, and gist memory occupies only one of them (Qian et al., 29 Oct 2025). Similarly, MDP-Agent is an instantiation rather than the entirety of the protocol. Another possible misconception would be to view structured leveraging as mandatory graph construction; the paper instead gives graphs, key-value caches, and symbolic schemas as examples, indicating that the structured pathway is broader than any single representation (Qian et al., 29 Oct 2025).
The paper’s scope is specifically AI search over external knowledge sources such as web pages and PDFs (Qian et al., 29 Oct 2025). A plausible implication is that the protocol is most relevant when the bottleneck lies in transforming large, noisy, heterogeneous document sets into compact reasoning substrates. Within that scope, the reported results suggest that the protocol’s principal contribution is architectural: it defines how models should interact with documents, not merely how documents should be ranked.