
BookRAG: Hierarchical RAG for Complex Docs

Updated 4 December 2025
  • BookRAG is a retrieval-augmented generation framework designed for answering questions on hierarchically and relationally structured documents.
  • It integrates a document-native indexing scheme called BookIndex, which unifies a logical tree structure with a fine-grained knowledge graph.
  • Its agent-based planner, inspired by Information Foraging Theory, enables dynamic query workflows that markedly improve retrieval recall and answer precision.

BookRAG is a Retrieval-Augmented Generation (RAG) framework constructed to address question answering (QA) over complex, hierarchically structured documents, including textbooks, handbooks, API manuals, and scientific papers. It introduces a document-native indexing scheme—BookIndex—that systematically incorporates both logical document hierarchy and fine-grained cross-entity relationships. BookRAG differs from standard text- or block-based RAG systems by leveraging a unified tree–graph structure, and it employs an agent-based planner inspired by Information Foraging Theory to customize retrieval plans per query (Wang et al., 3 Dec 2025). Its design achieves substantial gains in retrieval recall and answer generation accuracy compared to prior RAG baselines, particularly for documents with nested, interdependent content.

1. Motivations and Limitations of Existing RAG Systems

Complex documents are almost invariably organized via explicit multi-level hierarchies, such as chapters, sections, and subsections, and often embed tables, figures, and cross-references (Wang et al., 3 Dec 2025). Existing RAG paradigms, including both text-based chunking plus vector/knowledge-graph retrieval and layout-segmented retrieval, either ignore or inadequately model such structures. This omission leads to two principal deficiencies:

  • Structural–semantic mismatch: Text-driven or block-driven chunking fails to exploit nested logical hierarchies and the document-local entity network simultaneously.
  • Inflexible query workflows: Pipeline uniformity precludes responsiveness to query complexity, as QA tasks range from lookups to multi-hop reasoning and document-wide aggregation.

Real-world scenarios, such as technical support or AI-tutoring on STEM textbooks, demand high page- or section-precision, coupled with the capacity to assemble interrelated content spanning several hierarchical levels (Chen et al., 20 Sep 2025).

2. BookIndex: Hierarchical and Relational Document Representation

The core data structure of BookRAG is the BookIndex, formally defined as

B = (T, G, M)

where:

  • T = (N, E_T): a hierarchical tree (pseudo-Table of Contents) whose nodes N are logical blocks (chapters, sections, tables, figures, and plain text), and whose edges E_T encode parent–child nesting relationships.
  • G = (V, E_G): a knowledge graph with vertices V as fine-grained entities (extracted from each block) and relation triples E_G.
  • M: V → P(N): the GT-Link, a mapping from each entity to the set of tree nodes (logical blocks) in which it appears (Wang et al., 3 Dec 2025).
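The triple B = (T, G, M) can be captured in a few lines of code. A minimal sketch follows; the class and field names (`TreeNode`, `BookIndex`, `gt_link`, `blocks_for`) are illustrative choices, not identifiers from the paper:

```python
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    """A logical block in T = (N, E_T): chapter, section, table, figure, or text."""
    node_id: str
    block_type: str                                 # e.g. "chapter", "section", "table"
    content: str
    children: list = field(default_factory=list)    # edges E_T (parent-child nesting)

@dataclass
class BookIndex:
    """B = (T, G, M): logical tree, knowledge graph, and GT-Link mapping."""
    root: TreeNode                                  # T = (N, E_T)
    triples: list = field(default_factory=list)     # E_G: (head, relation, tail)
    gt_link: dict = field(default_factory=dict)     # M: entity -> set of node_ids

    def blocks_for(self, entity: str) -> set:
        """Resolve an entity to the logical blocks where it appears (the GT-Link)."""
        return self.gt_link.get(entity, set())
```

The `gt_link` dictionary is what lets retrieval jump from a matched entity in G directly to the tree nodes that ground it.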

Construction Workflow

Tree Construction proceeds via three stages:

  1. Layout Parsing: Tools such as MinerU segment the input PDF/image into primitive blocks, identifying types (Title/Text/Table/Image) and features.
  2. Section Filtering & Level Assignment: Title blocks are passed to an LLM for hierarchical level (ℓ_j) assignment and correction.
  3. Tree Assembly: Blocks are organized into N (all blocks with valid levels and content), and edges E_T reflect the hierarchical nesting.
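The assembly step can be realized with a stack of open sections. A sketch under the assumption that blocks arrive in reading order with LLM-assigned levels (the function name and dict keys are hypothetical):

```python
def assemble_tree(blocks):
    """Nest (level, title) blocks into a tree using a stack of open sections.

    Each block is a dict with keys "level" (1 = top level) and "title";
    returns a synthetic root whose nested children realize the edges E_T.
    """
    root = {"title": "ROOT", "level": 0, "children": []}
    stack = [root]
    for block in blocks:
        node = {**block, "children": []}
        # Pop until the top of the stack is a valid parent (strictly smaller level).
        while stack[-1]["level"] >= node["level"]:
            stack.pop()
        stack[-1]["children"].append(node)
        stack.append(node)
    return root
```

A level-1 chapter followed by level-2 sections thus becomes one subtree, matching the pseudo-Table-of-Contents structure described above.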

Graph Construction entails:

  • Entity & Relation Extraction: For text blocks, LLMs extract entities and relations; vision–LLMs handle tables and images.
  • Entity Resolution: A gradient-based scoring algorithm matches new entity candidates to canonical entries in the vector store, merging or adding as needed.
  • GT-Link Mapping: Each entity records its source logical blocks, supporting navigable cross-references and grounded retrieval.
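The entity-resolution step can be approximated by nearest-neighbor matching in embedding space. The sketch below substitutes a plain cosine-similarity threshold for the paper's gradient-based scoring algorithm, whose details are not reproduced here; the function name and threshold value are assumptions:

```python
import math

def resolve_entity(candidate_vec, canonical, threshold=0.85):
    """Match a new entity embedding against canonical entries in the store.

    `canonical` maps entity name -> embedding (list of floats). Returns the
    best-matching name if it clears the threshold (merge), else None (add new).
    """
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    best_name, best_score = None, 0.0
    for name, vec in canonical.items():
        score = cos(candidate_vec, vec)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```

A `None` result signals that the candidate is a genuinely new entity and should be added to the vector store with a fresh GT-Link entry.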

3. Agent-Based Query Method and Dynamic Workflow Planning

BookRAG employs an agent-based query planner, aligning QA execution with the Information Foraging Theory framework. The workflow incorporates three stages:

  1. Agent-Based Planning: An LLM first classifies queries into one of three categories:
    • Single-hop: direct local lookup.
    • Multi-hop: independent retrieval across multiple locations.
    • Global Aggregation: filtering and summarizing content across the document.
  2. Scent/Filter-Based Retrieval: Operators systematically prune the candidate node set N, using entity links (GT-Links), semantic selectors, and relevance criteria.
  3. Generation (Map–Reduce): The system generates partial answers on small, focused blocks (Map phase) and then aggregates them (Reduce phase).
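The Map–Reduce generation stage can be sketched as two rounds of model calls. In this sketch `llm` is any callable mapping a prompt string to an answer string (a stand-in for the actual model interface, which the paper does not specify):

```python
def map_reduce_answer(query, blocks, llm):
    """Map: answer the query against each focused block independently.
    Reduce: synthesize the partial answers into one final response.
    """
    # Map phase: one call per retrieved block, so no block's context is diluted.
    partials = [
        llm(f"Answer '{query}' using only this block:\n{block}") for block in blocks
    ]
    # Reduce phase: aggregate the partial answers in a single synthesis call.
    joined = "\n".join(f"- {p}" for p in partials)
    return llm(f"Combine these partial answers to '{query}':\n{joined}")
```

Because blocks are processed independently in the Map phase, the full retrieved context is never concatenated into one oversized prompt.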

The operator library includes:

  • Formulator: Entity extraction and query decomposition.
  • Selector: Filtering nodes by type, range, or entity association.
  • Reasoner: Local graph-based reasoning (e.g., seeded PageRank), text similarity reranking, and skyline ranking (Pareto-optimal selection over multiple scores).
  • Synthesizer: Map (block-level answer generation) and Reduce (aggregation/final synthesis) (Wang et al., 3 Dec 2025).
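The Reasoner's seeded PageRank can be illustrated with a small power-iteration implementation. This is a generic personalized-PageRank sketch over an adjacency dict, not the paper's exact operator; dangling-node handling is omitted for brevity:

```python
def seeded_pagerank(graph, seeds, damping=0.85, iters=50):
    """Personalized PageRank that restarts the random walk at seed entities.

    `graph` maps node -> list of neighbor nodes; `seeds` is the set of query
    entities anchoring the walk (the query's 'scent' in foraging terms).
    """
    nodes = list(graph)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        new = {n: (1 - damping) * restart[n] for n in nodes}
        for n in nodes:
            out = graph[n]
            if not out:
                continue
            share = damping * rank[n] / len(out)  # split rank among neighbors
            for m in out:
                new[m] += share
        rank = new
    return rank
```

Entities close to the seeds in the knowledge graph accumulate rank, which can then feed the reranking and skyline steps.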

4. Retrieval and Generation Pipeline

The retrieval pipeline maximizes both efficiency and retrieval accuracy by:

  • Pruning the working set to 7–10 nodes on average per query via selector operations.
  • Applying multi-dimensional reasoners, then retaining all non-dominated candidates through the Skyline operator instead of strict top-k selection, countering over-inclusion.
  • Using KG-guided retrieval to constrain context size while preserving inter-entity dependencies.
  • Segregating context input to the LLM through Map–Reduce separation, such that large queriable document contexts are never concatenated at once.
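The Skyline operator amounts to Pareto-optimal selection over the candidates' score vectors. A minimal sketch (the function name and candidate encoding are assumptions):

```python
def skyline(candidates):
    """Keep the Pareto-optimal candidates over multiple relevance scores.

    Each candidate is (item, scores); candidate A dominates candidate B if A
    is >= B on every score and strictly > on at least one. Unlike top-k, the
    result size adapts to how the scores trade off against each other.
    """
    def dominates(a, b):
        return all(x >= y for x, y in zip(a, b)) and any(
            x > y for x, y in zip(a, b)
        )
    return [
        (item, s)
        for item, s in candidates
        if not any(dominates(t, s) for _, t in candidates if t is not s)
    ]
```

A candidate that is best on graph centrality but weak on text similarity survives alongside one with the opposite profile, which is why this retains both high-coverage and high-precision contexts.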

This architecture sharply reduces token usage and query latency compared to alternatives such as DocETL, which exceeds 53M tokens on the MMLongBench corpus versus <5M for BookRAG (Wang et al., 3 Dec 2025).

5. Empirical Evaluation and Performance Characteristics

BookRAG was benchmarked on MMLongBench, M3DocVQA, and Qasper, representing a range of complex, deeply structured documents. In these evaluations:

| Method  | MMLongBench (EM/F1, %) | M3DocVQA (EM/F1, %) | Qasper (Acc/F1, %) |
|---------|------------------------|---------------------|--------------------|
| DocETL  | 27.5 / 28.6            | 40.9 / 43.3         | 42.3 / 50.4        |
| BookRAG | 43.8 / 44.9 (+16)      | 61.0 / 66.2 (+18)   | 55.2 / 61.1 (+9)   |

On retrieval recall for ground-truth blocks:

| Method      | MMLongBench  | M3DocVQA     | Qasper       |
|-------------|--------------|--------------|--------------|
| GraphRanker | 26.4 %       | 44.5 %       | 28.6 %       |
| BookRAG     | 57.6 % (+31) | 71.2 % (+27) | 63.5 % (+35) |

Average query time is 1–3 seconds, with BookRAG roughly 2× faster than DocETL (Wang et al., 3 Dec 2025). These results demonstrate substantial improvements in both end-to-end QA accuracy and retrieval faithfulness.

6. Design Insights, Limitations, and Future Directions

Critical lessons from the BookRAG and comparative GraphRAG experiments include the following:

  • Page- or block-level chunking is essential: granularity that is too fine (entity-level) or too coarse (chapter-level) undermines retrieval precision (Chen et al., 20 Sep 2025).
  • Context-window control (<4K tokens): larger contexts degrade answer quality and provoke hallucinations.
  • Hybridization: Embedding-based recall can be further enhanced by selectively augmenting with graph neighbors for central entities, and by graph pruning or dynamic context management (Chen et al., 20 Sep 2025).
  • Skyline ranking: Outperforms top-k for retaining both high-coverage and high-precision answer contexts.

Remaining limitations include retrieval errors and occasional over-decomposition of simple queries by the agent planner. Future research directions involve scaling BookIndex construction to multi-document or corporate settings, integrating document-native databases for live indexing and querying, and further parallelizing index and entity resolution pipelines (Wang et al., 3 Dec 2025).

7. Comparative Context: BookRAG within RAG Methodology

BookRAG is the first RAG approach built on a document-native index that unifies a logical hierarchy (pseudo-Table of Contents) with a fine-grained knowledge graph. Prior systems such as embedding-based RAG and GraphRAG have demonstrated complementary strengths—high retrieval accuracy versus high concept coverage—but also exhibit context-length sensitivity and structural limitations (Chen et al., 20 Sep 2025). BookRAG’s architecture generalizes these approaches by allowing query workflows adaptive to structure and content, and achieves state-of-the-art QA performance across several benchmarks, with efficiency suitable for enterprise-scale deployment (Wang et al., 3 Dec 2025).
