BookRAG Framework: Hierarchical Document QA
- BookRAG Framework is a retrieval-augmented generation system that integrates hierarchical document structure with interconnected entity graphs to enhance question answering.
- It builds a BookIndex by parsing documents into a logical tree and fine-grained subgraphs, enabling dynamic, agent-driven query planning and effective information extraction.
- Empirical results demonstrate state-of-the-art gains in retrieval recall and QA accuracy while reducing computational cost compared to flat RAG approaches.
BookRAG is a retrieval-augmented generation (RAG) framework designed to optimize question answering (QA) over complex documents exhibiting explicit hierarchical structure and intricate cross-references. The approach addresses the limitations of flat chunking and naive layout segmentation in standard RAG pipelines by building a document-native index that captures both the logical content hierarchy and the connectivity of entities across document sections. BookRAG introduces the BookIndex—an integrated structure encoding hierarchy and entity graph—and leverages an agent-based query method, inspired by Information Foraging Theory (IFT), to dynamically adapt retrieval workflows to the granularity and connectivity of queries. Empirical evidence demonstrates state-of-the-art gains in retrieval recall and QA accuracy on multi-section, multi-modal real-world documentation while maintaining competitive computational efficiency (Wang et al., 3 Dec 2025).
1. Motivation, Design Principles, and Architecture
BookRAG targets "book-like" documents that exhibit an explicit multi-level table-of-contents structure (chapters, sections, and subsections) as well as distributed, interconnected entities such as terms, figures, and tables. Standard RAG implementations fail to exploit this structure, losing both hierarchical context and entity linkage when they rely on flat chunking or simplistic page-based parses. BookRAG's design centers on constructing a BookIndex, which comprises:
- A hierarchical tree mirroring the document's logical decomposition (titles, sections, images, etc.).
- A fine-grained knowledge graph representing entities and their interrelations within and across tree nodes.
- A mapping (GT-Link) associating each entity in the knowledge graph $G$ with its provenance nodes in the tree $T$ (all three containers are sketched below).
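A minimal sketch of these containers, with hypothetical field names (the paper does not publish a reference implementation):

```python
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    """One node of the logical tree: a title, text, table, or image block."""
    node_id: str
    block_type: str                    # "Title" | "Text" | "Table" | "Image"
    content: str
    level: int                         # hierarchy depth assigned by the LLM
    children: list["TreeNode"] = field(default_factory=list)

@dataclass
class BookIndex:
    """Hierarchical tree + entity graph + GT-Link, as enumerated above."""
    tree: TreeNode                     # logical tree T
    graph: dict[str, set[str]]         # entity graph G: entity -> related entities
    gt_link: dict[str, set[str]]       # GT-Link: entity -> ids of tree nodes containing it
```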
The architecture consists of two phases:
Offline phase:
- Document parsing via MinerU to extract primitive layout blocks with associated content, type (Title, Text, Table, Image), and layout features (font size, bounding box).
- Section-level filtering and level assignment using LLM prompting to recover hierarchy.
- Assembly of blocks into a tree structure based on document order and detected hierarchy levels.
- Per-node entity and relation extraction via LLM (for text) or VLM (for images), producing subgraph fragments further resolved and merged by a gradient-based entity resolution algorithm.
- Embedding of textual and visual content nodes; storage in a vector database for similarity search.
Online phase:
- At query time, an agent classifies the query type and constructs an operator plan to traverse, filter, and rank relevant nodes in the BookIndex, culminating in LLM-based answer synthesis.
2. BookIndex Construction: Hierarchy and Representation
2.1 Hierarchical Tree Extraction
- Each document is layout-parsed into blocks $b_i = (c_i, t_i, \ell_i)$, where $c_i$ is the content, $t_i$ the block type, and $\ell_i$ the layout metadata.
- Candidate heading blocks (those with $t_i = \text{Title}$) are passed to an LLM, queried with the block content and a surrounding context window, to annotate the true hierarchical level and refine the type (e.g., to distinguish semantic headers from misclassified elements).
- Assembled nodes are arranged into a tree by anchoring each node at level $\ell$ to the closest prior node at level $\ell - 1$ in document order, as in the stack-based sketch below.
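The anchoring rule admits a standard stack-based pass over the blocks in document order; this sketch reuses the hypothetical TreeNode container from Section 1:

```python
def build_tree(blocks: list[TreeNode]) -> TreeNode:
    """Attach each block to the closest preceding node at a shallower level."""
    root = TreeNode(node_id="root", block_type="Title", content="", level=0)
    stack = [root]                         # stack[-1] = most recently opened node
    for b in blocks:                       # blocks arrive in document order
        # Close any open nodes at the same level or deeper than b.
        while stack[-1].level >= b.level:
            stack.pop()                    # root (level 0) is never popped
        stack[-1].children.append(b)
        stack.append(b)
    return root
```

Content blocks (Text, Table, Image) can be assigned a level one deeper than the deepest heading so that they attach as leaves under the nearest preceding section.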
2.2 Embedding and Node Mapping
- Content and title nodes (Text or Title blocks) are embedded using text models (e.g., Qwen3-Embedding-0.6B); nodes containing images or formulas are processed by multi-modal models (e.g., gme-Qwen2-VL-2B).
- Similarity searches use cosine similarity in the embedding space, $\text{sim}(q, n) = \frac{\mathbf{e}_q \cdot \mathbf{e}_n}{\lVert \mathbf{e}_q \rVert \, \lVert \mathbf{e}_n \rVert}$, as implemented in the sketch after this list.
- These embeddings support both semantic retrieval (content scent) and mapping of query entities to location(s) in the hierarchy for subsequent information extraction.
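A sketch of the similarity search over pre-computed node embeddings; the embedding-model calls themselves are abstracted away:

```python
import numpy as np

def top_k_nodes(query_vec: np.ndarray, node_vecs: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k nodes with highest cosine similarity to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    m = node_vecs / np.linalg.norm(node_vecs, axis=1, keepdims=True)
    sims = m @ q                       # row-normalized dot product = cosine
    return np.argsort(-sims)[:k]
```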
3. Fine-Grained Entity-Graph Construction
3.1 Per-node Entity and Relation Extraction
- Each tree node is processed (via LLM or VLM) to extract entities $V_i$ and intra-node relations $E_i$. For tables, special Table-type vertices and "ContainedIn" edges encode schema-level relationships.
- Every entity inherits a provisional mapping to its node of origin.
3.2 Gradient-based Entity Resolution (ER)
- To consolidate entity aliases, BookRAG performs gradient-based ER:
- A new entity $e$ retrieves its top-$k$ candidate matches from the vector DB via embedding similarity.
- Candidates are reranked; the process iteratively adds candidates to the merge set as long as the similarity drop remains below a threshold $\theta$.
- If all $k$ candidates end up in the merge set (i.e., no sharp similarity drop is found), $e$ is treated as a new concept. Otherwise, the LLM is consulted for canonical selection when the merge set holds more than one candidate, and the graph $G$ is updated accordingly; a sketch of the stopping rule follows.
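A sketch of the gradient stopping rule under the description above; the helper name and the exact interpretation of $\theta$ are assumptions:

```python
def gradient_merge_count(similarities: list[float], theta: float) -> int:
    """Given top-k candidate similarities in descending order, return how many
    leading candidates to merge: stop at the first drop exceeding theta."""
    if not similarities:
        return 0
    n = 1                                  # the best match is always considered
    for prev, cur in zip(similarities, similarities[1:]):
        if prev - cur > theta:             # sharp similarity "gradient": stop here
            break
        n += 1
    return n                               # n == k signals a new concept
```

If the returned count equals the number of candidates, the neighborhood is undifferentiated and the entity is registered as new; otherwise the leading candidates form the merge set passed to the LLM.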
3.3 Final Knowledge Graph
- The final knowledge graph $G$ represents resolved entities and their aggregated relations across all levels of document granularity.
- Edge weights are not explicitly assigned; PageRank is computed on query-relevant subgraphs for multi-hop importance estimation:

$$PR(v) = \frac{1 - d}{|V_q|} + d \sum_{u \in \text{In}(v)} \frac{PR(u)}{\text{outdeg}(u)}, \qquad v \in G_q,$$

where $G_q$ is a query-relevant induced subgraph and $d$ is the damping factor (a networkx-based sketch follows).
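A sketch of the subgraph computation using networkx; the two-hop induction radius and the damping factor are assumptions, not values from the paper:

```python
import networkx as nx

def subgraph_pagerank(G: nx.Graph, query_entities: set[str], hops: int = 2) -> dict[str, float]:
    """PageRank restricted to the neighborhood G_q of the query entities."""
    nodes = {e for e in query_entities if e in G}
    frontier = set(nodes)
    for _ in range(hops):                  # induce G_q: all entities within `hops`
        frontier = {v for u in frontier for v in G.neighbors(u)} - nodes
        nodes |= frontier
    G_q = G.subgraph(nodes)
    return nx.pagerank(G_q, alpha=0.85)    # importance score per entity in G_q
```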
4. Agent-Based Query Planning and Retrieval Workflow
BookRAG uses an IFT-inspired LLM agent for dynamic classification and workflow orchestration:
4.1 Query Categorization
- Each query is labeled as "single-hop," "multi-hop," or "global" by a dedicated LLM prompt; no probability scores are employed.
4.2 Operator Library and Retrieval Plan
The retrieval process is modularized into operator classes:
- Formulator: prepares the query for retrieval.
  - Decompose: splits a multi-hop query into single-hop sub-queries.
  - Extract: extracts the query entities to be matched against the knowledge graph.
- Selector: narrows the candidate node set.
  - Filter_Modal, Filter_Range: restrict candidates by modality (text, table, image) or by section range.
  - Select_by_Entity: maps extracted query entities to tree nodes via GT-Link.
  - Select_by_Section: relevant section titles and their descendant nodes chosen by LLM.
- Reasoner: scores and prunes candidates.
  - Graph Reasoning: graph score $S_G$ from PageRank over the query-induced subgraph, propagated to tree nodes via GT-Link.
  - Text Reasoning: text score $S_T$ assigned by LLM ranking of retrieved nodes.
  - Skyline_Ranker: retain the Pareto frontier under $(S_G, S_T)$, as in the sketch after this list.
- Synthesizer: aggregates and condenses the selected node content for LLM answer generation.
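The Pareto frontier over the two scores reduces to a single sort, as in this sketch:

```python
def skyline(scores: dict[str, tuple[float, float]]) -> list[str]:
    """Keep nodes not dominated in both the graph score S_G and text score S_T."""
    # Sort by S_G descending (ties broken by S_T descending); a node survives
    # iff its S_T exceeds that of every node ranked before it.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1][0], -kv[1][1]))
    frontier, best_t = [], float("-inf")
    for node, (s_g, s_t) in ranked:
        if s_t > best_t:                   # not dominated by any higher-S_G node
            frontier.append(node)
            best_t = s_t
    return frontier
```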
4.3 Retrieval Plans by Query Type
| Query Type | Plan Structure |
|---|---|
| Single-hop | Extract → Select_by_Entity or Select_by_Section → (Graph ∥ Text) → Skyline → Reduce |
| Multi-hop | Decompose → Single-hop plan per sub-query → Map → Reduce |
| Global | (Filter_Modal ∥ Filter_Range)* → Map → Reduce |
Empirical defaults are fixed for the gradient-ER threshold $\theta$ and the retrieval top-$k$.
4.4 Workflow Pseudocode
```python
def answer(q):
    c = LLM_classify_query(q)             # "single-hop" | "multi-hop" | "global"
    P = Agent_plan(q, c)                  # operator plan for this query type
    N_s = run_selectors(P.selectors)      # candidate tree nodes
    S_G = graph_scores(N_s, P.entities)   # PageRank over G_q, mapped via GT-Link
    S_T = rerank_by_text(N_s, q)          # LLM text-relevance scores
    N_R = skyline_frontier({n: (S_G[n], S_T[n]) for n in N_s})
    return Synthesizer(q, N_R)
```
5. Experimental Evaluation on Long-Form QA
5.1 Datasets
BookRAG is evaluated on challenging document QA benchmarks:
| Dataset | Doc Count | Avg Pages | Avg Images/Figures | QA Pairs |
|---|---|---|---|---|
| MMLongBench | 85 | 42 | 26 | 669 |
| M3DocVQA | 500 | 8.5 | 3.5 | 633 |
| Qasper | 192 | 11 | 3.4 | 640 |
5.2 Evaluation Metrics
- QA Accuracy (Inclusion-Accuracy): fraction of questions whose gold answer string is contained in the generated answer.
- Exact Match (EM): fraction of answers that match the gold answer exactly after normalization.
- F1 (token overlap): harmonic mean of token-level precision and recall between predicted and gold answers, $F_1 = \frac{2PR}{P + R}$.
- Retrieval Recall (block-level): fraction of gold evidence blocks present in the retrieved set; reference implementations are sketched after this list.
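Implementations of the overlap-based metrics (token F1 and block recall follow the usual SQuAD-style definitions; BookRAG's exact answer normalization may differ):

```python
from collections import Counter

def token_f1(pred: str, gold: str) -> float:
    """Token-overlap F1 between predicted and gold answers."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def block_recall(retrieved: set[str], gold: set[str]) -> float:
    """Fraction of gold evidence blocks present in the retrieved set."""
    return len(retrieved & gold) / len(gold) if gold else 1.0
```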
5.3 Results
| Dataset | BookRAG EM / F1 | Next Best Baseline EM / F1 | BookRAG Retrieval Recall | Next Best Retrieval Recall |
|---|---|---|---|---|
| MMLongBench | 43.8 / 44.9 | 27.5 / 28.6 | 57.6% | 26.4% |
| M3DocVQA | 61.0 / 66.2 | 43.0 / 47.8 | 71.2% | 44.5% |
| Qasper | 55.2 / 61.1 | 42.3 / 50.4 | 63.5% | 33.5% |
BookRAG also achieves substantial efficiency gains: average token consumption per query is roughly an order of magnitude below DocETL's, with correspondingly lower latency. These improvements are consistent across metrics, though no formal significance tests are reported (Wang et al., 3 Dec 2025).
6. Computational Cost and Scalability
BookRAG's computational profile is characterized by:
- Indexing: Tree construction and per-node entity and relation extraction scale linearly with the number of blocks, $O(|B|)$. Gradient ER for a new entity costs a top-$k$ lookup plus reranking, typically far less than the pairwise comparisons of graph-wide batch ER.
- Query-time: Operator filtering and tree traversal are at worst linear in the tree size, mitigated by early pruning. PageRank is computed only on small induced subgraphs ($|G_q| \ll |G|$), and skyline ranking over the candidate set is negligible (a single sort).
- Empirical Runtimes: Query response time grows with document size but stays on par with graph-based RAG and substantially below ETL-style systems; token usage is lower by an order of magnitude.
A plausible implication is that BookRAG's hybridization of structure (hierarchical tree) and connectivity (knowledge graph), in conjunction with IFT-driven agent workflows, enables scaling to increasingly complex long documents without incurring prohibitive inference or indexing costs.
7. Context, Limitations, and Significance
BookRAG advances retrieval-augmented QA in domains where document logic and entity linkage are prominent, e.g., technical books, manuals, and scientific proceedings, by unifying hierarchical and entity-centric views. This yields notable improvements in evidence selection and answer generation over previous flat or layout-based RAG paradigms. The KG carries no explicit edge weights and relies on PageRank for multi-hop importance, suggesting that further link analytics or document-type-aware heuristics may yield additional gains. The absence of explicit significance testing constrains the statistical interpretation of the improvements; however, the magnitude and consistency of the gains across datasets underscore BookRAG's practical value in structured-document QA (Wang et al., 3 Dec 2025).