
DeepRead: Structure-Aware Document QA

Updated 6 February 2026
  • DeepRead is a structure-aware, multi-turn QA framework that leverages LLM-based OCR and precise indexing to emulate human locate-then-read practices.
  • It integrates two tools—Retrieve for targeted localization and ReadSection for sequential extraction—to overcome context fragmentation in long documents.
  • Experimental evaluations on benchmarks like ContextBench and SyllabusQA demonstrate substantial improvements in accuracy and search efficiency over traditional methods.

DeepRead is a structure-aware, multi-turn agentic framework for long-document question answering (QA), designed to operationalize the hierarchical and sequential priors native to documents within Retrieval-Augmented Generation (RAG) systems. Leveraging LLM-based optical character recognition (OCR), precise paragraph-level indexing, and a complementary toolset that mirrors human “locate-then-read” practices, DeepRead outperforms flat-chunk and traditional agentic search methods, with substantial improvements in both accuracy and search efficiency in long-context scenarios (Li et al., 4 Feb 2026).

1. Motivations and High-Level Foundations

Standard agentic RAG paradigms treat long documents—such as scientific papers, reports, or syllabi—as collections of flat, unstructured chunks, disregarding native properties such as heading hierarchies, sectioning, and sequential paragraph order. Human reading, in contrast, employs a “locate-then-read” workflow: readers first identify relevant sections using headings and then consume contiguous paragraphs in logical order. DeepRead’s central thesis is to expose these hierarchical and sequential cues to the LLM agent, enabling strategies that more closely resemble human document navigation. This mitigates context fragmentation, omission errors, and inefficient search behaviors commonly observed in flat-chunk agentic systems.

DeepRead’s core features are:

  • LLM-based OCR conversion of complex document formats (e.g., PDF) into structured Markdown preserving heading and paragraph boundaries.
  • Rigorous per-paragraph indexing with coordinate-style metadata encoding section identities and in-section ordinal positions.
  • Two structurally aware tools—Retrieve (for precise localization with lightweight context) and ReadSection (for sequential, order-preserving reading within specified section boundaries).

2. Document Transformation and Structural Representation

Each source document is first passed through an advanced LLM-based OCR/markup engine (e.g., PaddleOCR-VL, DeepSeek-OCR), trained to produce Markdown representations aligned with both the logical and sequential structure of academic and technical documents. This process emits:

  • Heading tokens at the correct hierarchical level (‘#’, ‘##’, etc.), faithfully encoding multi-level TOCs.
  • Paragraph blocks as indivisible units, explicitly maintaining original authors’ segmentation.
  • Inline structures (lists, tables) using Markdown conventions.

The resulting Markdown captures an explicit tree structure: each node corresponds to a section or subsection, and each leaf or interior node contains an ordered list of “atomic” paragraphs (no arbitrary sliding-window chunking is applied). For each paragraph $p^{(d)}_{i,j}$ in document $d$, DeepRead attaches metadata

$$\Gamma_{d,i,j} = \{\, \text{doc\_id}: d,\; \text{sec\_id}: i,\; \text{para\_idx}: j \,\}$$

which allows precise referencing and tool-chaining based on document structure.
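A minimal sketch of this indexing step, assuming the OCR stage has already emitted clean Markdown; the helper and field names are illustrative, not the paper's API, and the heading hierarchy is flattened into a linear section list for brevity:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ParaMeta:
    doc_id: str    # d
    sec_id: int    # i: section's position in the document
    para_idx: int  # j: paragraph's ordinal position within its section

def index_markdown(doc_id, markdown_text):
    """Split OCR-produced Markdown into sections and attach coordinate
    metadata Gamma_{d,i,j} to every paragraph.  Paragraphs are kept
    atomic: no sliding-window chunking is applied."""
    sections, current = [], None
    for block in markdown_text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("#"):           # any heading level opens a new section
            current = {"title": block.lstrip("# "), "paras": []}
            sections.append(current)
        elif current is not None:
            current["paras"].append(block)  # paragraph blocks stay indivisible
    index = [(ParaMeta(doc_id, i, j), para)
             for i, sec in enumerate(sections)
             for j, para in enumerate(sec["paras"])]
    return sections, index
```

The flat `index` list is what coordinate-addressed tools can consume: every entry carries enough metadata to be re-localized in the section tree.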

3. Retrieval and Reading Tools: API and Synergy

DeepRead’s orchestration protocol enables two agent tools, both anchored to the document’s structural coordinates.

Retrieve performs a “scanning” search:

  • Input: textual query string.
  • Mechanism: computes dense similarity between the query and all paragraphs; for each of the top-$K$ ranked paragraph hits, returns a window $(w^\uparrow, w^\downarrow)$ of surrounding paragraphs, maintaining semantic locality.
  • Output: ordered, deduplicated list of matching paragraphs, each returned with metadata $\Gamma_{d,i,j}$.
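A toy sketch of this behavior over a paragraph-level index (word overlap stands in for the dense-embedding similarity used in the paper; all names are illustrative):

```python
from collections import namedtuple

# Coordinate-style paragraph metadata Gamma_{d,i,j}
Meta = namedtuple("Meta", ["doc_id", "sec_id", "para_idx"])

def retrieve(query, index, k=2, w_up=1, w_down=1):
    """Toy Retrieve: rank every paragraph against the query, expand each
    of the top-k hits by (w_up, w_down) neighbouring paragraphs within
    the same section, and return the union deduplicated in document
    order, metadata attached."""
    def sim(a, b):
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / (len(wa | wb) or 1)

    ranked = sorted(range(len(index)),
                    key=lambda n: sim(query, index[n][1]), reverse=True)
    keep = set()
    for n in ranked[:k]:
        hit = index[n][0]
        for m, (meta, _) in enumerate(index):
            if (meta.doc_id == hit.doc_id and meta.sec_id == hit.sec_id
                    and hit.para_idx - w_up <= meta.para_idx <= hit.para_idx + w_down):
                keep.add(m)
    return [index[m] for m in sorted(keep)]  # ordered, deduplicated
```

Returning hits in document order rather than score order is what preserves the "semantic locality" property: neighbouring paragraphs arrive as a coherent span.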

ReadSection enables sequential, deep reading inside a bounded section range:

  • Input: section coordinate $(d, i)$ and paragraph index range $[j_s, j_e]$.
  • Output: all ordered paragraphs $p^{(d)}_{i,j}$ with $j_s \leq j \leq j_e$, together with their metadata, maintaining original discourse continuity.
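A minimal sketch of ReadSection over the same kind of paragraph-level index (a list of `(metadata, text)` pairs; names are illustrative):

```python
from collections import namedtuple

Meta = namedtuple("Meta", ["doc_id", "sec_id", "para_idx"])

def read_section(index, doc_id, sec_id, j_s, j_e):
    """Toy ReadSection: return every paragraph p^(d)_{i,j} of section
    (doc_id, sec_id) with j_s <= j <= j_e, in original order, so the
    extracted span preserves discourse continuity."""
    return [(meta, text) for meta, text in index
            if meta.doc_id == doc_id and meta.sec_id == sec_id
            and j_s <= meta.para_idx <= j_e]
```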

This dual-tool setup creates a synergistic workflow: Retrieve is used for rapid localization (anchoring to relevant section/paragraph), and ReadSection is used for extracting all contiguous evidence within a relevant section, closely emulating human selective reading.

4. Multi-Turn Agentic Orchestration and Locate-Then-Read Behavior

The agent is initialized with a system prompt containing a full TOC skeleton, including section titles, numbers of child sections, paragraph counts, and associated token metrics. During interaction:

  • The agent’s first action is almost always Retrieve, supporting initial evidence localization.
  • Upon identifying a structurally relevant section, the agent issues ReadSection over the range $[j_s, j_e]$, performing a deep, contiguous extraction.

A representative policy loop:

Construct system prompt with TOC(D)
State s ← [SystemPrompt, UserQuestion]
for t in 1…T:
  a ← LLM.policy(s)
  if a == FINAL: return a.answer
  if a.tool == Retrieve:
    o ← Retrieve(a.query)
  else if a.tool == ReadSection:
    o ← ReadSection(a.doc_id, a.sec_id, a.start, a.end)
  s ← s ⊕ (a, o)

This makes the locate (Retrieve)–then–read (ReadSection) pattern explicit. The agent may iterate multiple cycles when necessary, but the typical sequence is Retrieve → ReadSection → FINAL(answer).
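The policy loop can be sketched as runnable Python, with a scripted stand-in for the LLM policy (tool names match the paper; the dict-based action format and helper names are assumptions for illustration):

```python
def scripted(actions):
    """Stand-in for LLM.policy: replays a fixed action sequence."""
    it = iter(actions)
    return lambda state: next(it)

def run_agent(question, toc, tools, policy, max_turns=10):
    """Locate-then-read loop: the policy alternates tool calls until it
    emits FINAL; each (action, observation) pair is appended to state."""
    state = [("system", toc), ("user", question)]
    for _ in range(max_turns):
        action = policy(state)
        if action["tool"] == "FINAL":
            return action["answer"]
        if action["tool"] == "Retrieve":
            obs = tools["Retrieve"](action["query"])
        elif action["tool"] == "ReadSection":
            obs = tools["ReadSection"](*action["args"])
        state.append((action, obs))
    return None  # turn budget exhausted without an answer
```

A usage example with stub tools and the canonical three-step trajectory:

```python
tools = {"Retrieve": lambda q: ["hit in section 2"],
         "ReadSection": lambda d, i, js, je: ["p0", "p1"]}
plan = [{"tool": "Retrieve", "query": "grading policy"},
        {"tool": "ReadSection", "args": ("d0", 2, 0, 1)},
        {"tool": "FINAL", "answer": "42"}]
answer = run_agent("Q?", "TOC(D)...", tools, scripted(plan))
```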

5. Experimental Evidence and Performance Metrics

DeepRead’s evaluation spans challenging long-document QA settings:

  • FinanceBench (∼165K tokens/doc),
  • ContextBench (∼233K tokens/doc),
  • QASPER (multi-doc crowd QA),
  • SyllabusQA (multi-doc course QA).

Comparison baselines include Dense RAG, ITRG, RAPTOR, and Search-o1 agentic search methods, with both flat chunking and expand-window variants. The table below summarizes main quantitative findings (Li et al., 4 Feb 2026):

| Method | FinanceBench | ContextBench | SyllabusQA | Multi-Doc Avg | Overall |
|---|---|---|---|---|---|
| Search-o1 | 80.0 | 74.5 | 57.1 | 61.1 | 69.2 |
| DeepRead | 82.7 | 91.5 | 70.9 | 71.8 | 79.5 |
| DeepRead w/ expand | 84.0 | 88.3 | 72.5 | 74.4 | 80.3 |

Key empirical insights:

  • DeepRead achieves +17 points over Search-o1 on ContextBench and +13.8 points on SyllabusQA. These datasets demand precise, contiguous evidence collection, substantiating the value of structural reasoning.
  • Adding an expand window improves Search-o1 but has little or negative effect for DeepRead, since ReadSection natively guarantees in-order, coherent context.
  • Judgement using multiple LLM-based evaluators (DeepSeek V3.2, GLM-4.7, Qwen3-235B) corroborates the robustness of DeepRead’s relative gains (inter-judge agreement >0.88).
  • Retrieve→Read tool usage ratios in DeepRead approach unity on challenging datasets (e.g., 95.7% of Retrieve calls are followed by ReadSection on ContextBench), indicating highly disciplined adherence to the locate-then-read paradigm.

6. Behavioral Analysis and Efficiency

Fine-grained process analyses confirm that both DeepRead and agentic flat-chunk baselines begin with Retrieve in >95% of episodes. However, DeepRead systematically follows search calls with ReadSection, resulting in tightly scoped reading spans. While DeepRead executes marginally more tool calls (mean 8.7 vs. 7.7 for Search-o1) and incurs slight additional token consumption, its efficiency remains markedly superior to approaches based on knowledge-graph construction or iterative summarization.

Incorrect answers correlate robustly with increased tool calls and token usage, suggesting that pathologically long or looping search sequences may serve as behavioral signals of failure.

7. Limitations and Future Directions

Noted limitations include:

  • Dependency on high-fidelity OCR-to-Markdown transformation; errors in structural parsing can propagate downstream.
  • The approach assumes documents are sufficiently regular in their usage of headers and paragraph boundaries; highly nonstandard formats may degrade retrieval accuracy.
  • The current ReadSection tool enforces strict in-order reading within one section and cannot span complex multi-section evidence requirements in a single call.

Potential future improvements, as suggested, include introducing cross-sectional read primitives, learned policies for dynamic tool selection and window-length adjustment, and tighter OCR-integrated error correction.

8. Significance and Broader Impact

By instantiating a structure-aware, coordinate-indexed agent protocol for long-document QA, DeepRead bridges the gap between LLM-based RAG and human document analysis practices. This reduces omission and fragmentation errors and sharply improves retrieval accuracy in settings where evidence is highly localized and contiguous. The approach operationalizes a “locate-then-read” paradigm grounded in document structure, thus advancing the state of document-grounded AI reasoning at scale (Li et al., 4 Feb 2026).
