BookRAG Framework: Hierarchical Document QA
- BookRAG Framework is a retrieval-augmented generation system that integrates hierarchical document structure with interconnected entity graphs to enhance question answering.
- It builds a BookIndex by parsing documents into a logical tree and fine-grained subgraphs, enabling dynamic, agent-driven query planning and effective information extraction.
- Empirical results demonstrate state-of-the-art gains in retrieval recall and QA accuracy while reducing computational cost compared to flat RAG approaches.
BookRAG is a retrieval-augmented generation (RAG) framework designed to optimize question answering (QA) over complex documents exhibiting explicit hierarchical structure and intricate cross-references. The approach addresses the limitations of flat chunking and naive layout segmentation in standard RAG pipelines by building a document-native index that captures both the logical content hierarchy and the connectivity of entities across document sections. BookRAG introduces the BookIndex—an integrated structure encoding hierarchy and entity graph—and leverages an agent-based query method, inspired by Information Foraging Theory (IFT), to dynamically adapt retrieval workflows to the granularity and connectivity of queries. Empirical evidence demonstrates state-of-the-art gains in retrieval recall and QA accuracy on multi-section, multi-modal real-world documentation while maintaining competitive computational efficiency (Wang et al., 3 Dec 2025).
1. Motivation, Design Principles, and Architecture
BookRAG targets "book-like" documents that exhibit an explicit multi-level table-of-contents structure (chapters, sections, and subsections) as well as distributed, interconnected entities such as terms, figures, and tables. Standard RAG implementations fail to exploit this structure, losing both hierarchical context and entity linkage when they rely on flat chunking or simplistic page-based parses. BookRAG's design centers on constructing a BookIndex, which comprises:
- A hierarchical tree mirroring the document's logical decomposition (titles, sections, images, etc.).
- A fine-grained knowledge graph representing entities and their interrelations within and across tree nodes.
- A mapping (GT-Link) associating each entity in the knowledge graph $G$ with its provenance nodes in the tree $T$ (all three containers are sketched below).
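A minimal sketch of these containers, with hypothetical field names (the paper does not publish a reference implementation):

```python
from dataclasses import dataclass, field

@dataclass
class TreeNode:
    """One node of the logical tree: a title, text, table, or image block."""
    node_id: str
    block_type: str                    # "Title" | "Text" | "Table" | "Image"
    content: str
    level: int                         # hierarchy depth assigned by the LLM
    children: list["TreeNode"] = field(default_factory=list)

@dataclass
class BookIndex:
    """Hierarchical tree + entity graph + GT-Link, as enumerated above."""
    tree: TreeNode                     # logical tree T
    graph: dict[str, set[str]]         # entity graph G: entity -> related entities
    gt_link: dict[str, set[str]]       # GT-Link: entity -> ids of tree nodes containing it
```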
The architecture consists of two phases:
Offline phase:
- Document parsing via MinerU to extract primitive layout blocks with associated content, type (Title, Text, Table, Image), and layout features (font size, bounding box).
- Section-level filtering and level assignment using LLM prompting to recover hierarchy.
- Assembly of blocks into a tree structure based on document order and detected hierarchy levels.
- Per-node entity and relation extraction via LLM (for text) or VLM (for images), producing subgraph fragments further resolved and merged by a gradient-based entity resolution algorithm.
- Embedding of textual and visual content nodes; storage in a vector database for similarity search.
Online phase:
- At query time, an agent classifies the query type and constructs an operator plan to traverse, filter, and rank relevant nodes in the BookIndex, culminating in LLM-based answer synthesis.
2. BookIndex Construction: Hierarchy and Representation
2.1 Hierarchical Tree Extraction
- Each document is layout-parsed into blocks $b_i = (c_i, t_i, \ell_i)$, where $c_i$ is the content, $t_i$ the block type, and $\ell_i$ the layout metadata.
- Candidate heading blocks (those with $t_i = \text{Title}$) are passed to an LLM, queried with the block content and a surrounding context window, to annotate the true hierarchical level and refine the type (e.g., to distinguish semantic headers from misclassified elements).
- Assembled nodes are arranged into a tree by anchoring each node at level $\ell$ to the closest prior node at level $\ell - 1$ in document order, as in the stack-based sketch below.
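The anchoring rule admits a standard stack-based pass over the blocks in document order; this sketch reuses the hypothetical TreeNode container from Section 1:

```python
def build_tree(blocks: list[TreeNode]) -> TreeNode:
    """Attach each block to the closest preceding node at a shallower level."""
    root = TreeNode(node_id="root", block_type="Title", content="", level=0)
    stack = [root]                         # stack[-1] = most recently opened node
    for b in blocks:                       # blocks arrive in document order
        # Close any open nodes at the same level or deeper than b.
        while stack[-1].level >= b.level:
            stack.pop()                    # root (level 0) is never popped
        stack[-1].children.append(b)
        stack.append(b)
    return root
```

Content blocks (Text, Table, Image) can be assigned a level one deeper than the deepest heading so that they attach as leaves under the nearest preceding section.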
2.2 Embedding and Node Mapping
- Content and title nodes (Text or Title blocks) are embedded using text models (e.g., Qwen3-Embedding-0.6B); nodes containing images or formulas are processed by multi-modal models (e.g., gme-Qwen2-VL-2B).
- Similarity searches use cosine similarity in the embedding space, $\text{sim}(q, n) = \frac{\mathbf{e}_q \cdot \mathbf{e}_n}{\lVert \mathbf{e}_q \rVert \, \lVert \mathbf{e}_n \rVert}$, as implemented in the sketch after this list.
- These embeddings support both semantic retrieval (content scent) and mapping of query entities to location(s) in the hierarchy for subsequent information extraction.
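A sketch of the similarity search over pre-computed node embeddings; the embedding-model calls themselves are abstracted away:

```python
import numpy as np

def top_k_nodes(query_vec: np.ndarray, node_vecs: np.ndarray, k: int = 5) -> np.ndarray:
    """Indices of the k nodes with highest cosine similarity to the query."""
    q = query_vec / np.linalg.norm(query_vec)
    m = node_vecs / np.linalg.norm(node_vecs, axis=1, keepdims=True)
    sims = m @ q                       # row-normalized dot product = cosine
    return np.argsort(-sims)[:k]
```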
3. Fine-Grained Entity-Graph Construction
3.1 Per-node Entity and Relation Extraction
- Each tree node is processed (via LLM or VLM) to extract entities $V_i$ and intra-node relations $E_i$. For tables, special Table-type vertices and "ContainedIn" edges encode schema-level relationships.
- Every entity inherits a provisional mapping to its node of origin.
3.2 Gradient-based Entity Resolution (ER)
- To consolidate entity aliases, BookRAG performs gradient-based ER:
- A new entity $e$ retrieves its top-$k$ candidate matches from the vector DB via embedding similarity.
- Candidates are reranked; the process iteratively adds candidates to the merge set as long as the similarity drop remains below a threshold $\theta$.
- If all $k$ candidates end up in the merge set (i.e., no sharp similarity drop is found), $e$ is treated as a new concept. Otherwise, the LLM is consulted for canonical selection when the merge set holds more than one candidate, and the graph $G$ is updated accordingly; a sketch of the stopping rule follows.
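A sketch of the gradient stopping rule under the description above; the helper name and the exact interpretation of $\theta$ are assumptions:

```python
def gradient_merge_count(similarities: list[float], theta: float) -> int:
    """Given top-k candidate similarities in descending order, return how many
    leading candidates to merge: stop at the first drop exceeding theta."""
    if not similarities:
        return 0
    n = 1                                  # the best match is always considered
    for prev, cur in zip(similarities, similarities[1:]):
        if prev - cur > theta:             # sharp similarity "gradient": stop here
            break
        n += 1
    return n                               # n == k signals a new concept
```

If the returned count equals the number of candidates, the neighborhood is undifferentiated and the entity is registered as new; otherwise the leading candidates form the merge set passed to the LLM.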
3.3 Final Knowledge Graph
- The final knowledge graph $G$ represents resolved entities and their aggregated relations across all levels of document granularity.
- Edge weights are not explicitly assigned; PageRank is computed on query-relevant subgraphs for multi-hop importance estimation:

$$PR(v) = \frac{1 - d}{|V_q|} + d \sum_{u \in \text{In}(v)} \frac{PR(u)}{\text{outdeg}(u)}, \qquad v \in G_q,$$

where $G_q$ is a query-relevant induced subgraph and $d$ is the damping factor (a networkx-based sketch follows).
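A sketch of the subgraph computation using networkx; the two-hop induction radius and the damping factor are assumptions, not values from the paper:

```python
import networkx as nx

def subgraph_pagerank(G: nx.Graph, query_entities: set[str], hops: int = 2) -> dict[str, float]:
    """PageRank restricted to the neighborhood G_q of the query entities."""
    nodes = {e for e in query_entities if e in G}
    frontier = set(nodes)
    for _ in range(hops):                  # induce G_q: all entities within `hops`
        frontier = {v for u in frontier for v in G.neighbors(u)} - nodes
        nodes |= frontier
    G_q = G.subgraph(nodes)
    return nx.pagerank(G_q, alpha=0.85)    # importance score per entity in G_q
```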
4. Agent-Based Query Planning and Retrieval Workflow
BookRAG uses an IFT-inspired LLM agent for dynamic classification and workflow orchestration:
4.1 Query Categorization
- Each query is labeled as "single-hop," "multi-hop," or "global" by a dedicated LLM prompt; no probability scores are employed.
4.2 Operator Library and Retrieval Plan
The retrieval process is modularized into operator classes:
- Formulator: prepares the query for retrieval.
  - Decompose: splits a multi-hop query into single-hop sub-queries.
  - Extract: extracts the query entities to be matched against the knowledge graph.
- Selector: narrows the candidate node set.
  - Filter_Modal, Filter_Range: restrict candidates by modality (text, table, image) or by section range.
  - Select_by_Entity: maps extracted query entities to tree nodes via GT-Link.
  - Select_by_Section: relevant section titles and their descendant nodes chosen by LLM.
- Reasoner: scores and prunes candidates.
  - Graph Reasoning: graph score $S_G$ from PageRank over the query-induced subgraph, propagated to tree nodes via GT-Link.
  - Text Reasoning: text score $S_T$ assigned by LLM ranking of retrieved nodes.
  - Skyline_Ranker: retain the Pareto frontier under $(S_G, S_T)$, as in the sketch after this list.
- Synthesizer: aggregates and condenses the selected node content for LLM answer generation.
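The Pareto frontier over the two scores reduces to a single sort, as in this sketch:

```python
def skyline(scores: dict[str, tuple[float, float]]) -> list[str]:
    """Keep nodes not dominated in both the graph score S_G and text score S_T."""
    # Sort by S_G descending (ties broken by S_T descending); a node survives
    # iff its S_T exceeds that of every node ranked before it.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1][0], -kv[1][1]))
    frontier, best_t = [], float("-inf")
    for node, (s_g, s_t) in ranked:
        if s_t > best_t:                   # not dominated by any higher-S_G node
            frontier.append(node)
            best_t = s_t
    return frontier
```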
4.3 Retrieval Plans by Query Type
| Query Type | Plan Structure |
|---|---|
| Single-hop | Extract → Select_by_Entity or Select_by_Section → (Graph ∥ Text) → Skyline → Reduce |
| Multi-hop | Decompose → Single-hop plan per sub-query → Map → Reduce |
| Global | (Filter_Modal ∥ Filter_Range)* → Map → Reduce |
Empirical defaults are fixed for the gradient-ER threshold $\theta$ and the retrieval top-$k$.
4.4 Workflow Pseudocode
```python
def answer(q):
    c = LLM_classify_query(q)             # "single-hop" | "multi-hop" | "global"
    P = Agent_plan(q, c)                  # operator plan for this query type
    N_s = run_selectors(P.selectors)      # candidate tree nodes
    S_G = graph_scores(N_s, P.entities)   # PageRank over G_q, mapped via GT-Link
    S_T = rerank_by_text(N_s, q)          # LLM text-relevance scores
    N_R = skyline_frontier({n: (S_G[n], S_T[n]) for n in N_s})
    return Synthesizer(q, N_R)
```
5. Experimental Evaluation on Long-Form QA
5.1 Datasets
BookRAG is evaluated on challenging document QA benchmarks:
| Dataset | Doc Count | Avg Pages | Avg Images/Figures | QA Pairs |
|---|---|---|---|---|
| MMLongBench | 85 | 42 | 26 | 669 |
| M3DocVQA | 500 | 8.5 | 3.5 | 633 |
| Qasper | 192 | 11 | 3.4 | 640 |
5.2 Evaluation Metrics
- QA Accuracy (Inclusion-Accuracy): fraction of questions whose gold answer string is contained in the generated answer.
- Exact Match (EM): fraction of answers that match the gold answer exactly after normalization.
- F1 (token overlap): harmonic mean of token-level precision and recall between predicted and gold answers, $F_1 = \frac{2PR}{P + R}$.
- Retrieval Recall (block-level): fraction of gold evidence blocks present in the retrieved set; reference implementations are sketched after this list.
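Implementations of the overlap-based metrics (token F1 and block recall follow the usual SQuAD-style definitions; BookRAG's exact answer normalization may differ):

```python
from collections import Counter

def token_f1(pred: str, gold: str) -> float:
    """Token-overlap F1 between predicted and gold answers."""
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def block_recall(retrieved: set[str], gold: set[str]) -> float:
    """Fraction of gold evidence blocks present in the retrieved set."""
    return len(retrieved & gold) / len(gold) if gold else 1.0
```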
5.3 Results
| Dataset | BookRAG EM / F1 | Next Best Baseline EM / F1 | BookRAG Retrieval Recall | Next Best Retrieval Recall |
|---|---|---|---|---|
| MMLongBench | 43.8 / 44.9 | 27.5 / 28.6 | 57.6% | 26.4% |
| M3DocVQA | 61.0 / 66.2 | 43.0 / 47.8 | 71.2% | 44.5% |
| Qasper | 55.2 / 61.1 | 42.3 / 50.4 | 63.5% | 33.5% |
BookRAG also achieves substantial efficiency gains: average token consumption per query is roughly an order of magnitude below DocETL's, with correspondingly lower latency. These improvements are consistent across metrics, though no formal significance tests are reported (Wang et al., 3 Dec 2025).
6. Computational Cost and Scalability
BookRAG's computational profile is characterized by:
- Indexing: Tree construction and per-node entity and relation extraction scale linearly with the number of blocks, $O(|B|)$. Gradient ER for a new entity costs a top-$k$ lookup plus reranking, typically far less than the pairwise comparisons of graph-wide batch ER.
- Query-time: Operator filtering and tree traversal are at worst linear in the tree size, mitigated by early pruning. PageRank is computed only on small induced subgraphs ($|G_q| \ll |G|$), and skyline ranking over the candidate set is negligible (a single sort).
- Empirical Runtimes: Query response time grows with document size but stays on par with graph-based RAG and substantially below ETL-style systems; token usage is lower by an order of magnitude.
A plausible implication is that BookRAG's hybridization of structure (hierarchical tree) and connectivity (knowledge graph), in conjunction with IFT-driven agent workflows, enables scaling to increasingly complex long documents without incurring prohibitive inference or indexing costs.
7. Context, Limitations, and Significance
BookRAG advances retrieval-augmented QA in domains where document logic and entity linkage are prominent, e.g., technical books, manuals, and scientific proceedings, by unifying hierarchical and entity-centric views. This yields notable improvements in evidence selection and answer generation over previous flat or layout-based RAG paradigms. The KG carries no explicit edge weights and relies on PageRank for multi-hop importance, suggesting that further link analytics or document-type-aware heuristics may yield additional gains. The absence of explicit significance testing constrains the statistical interpretation of the improvements; however, the magnitude and consistency of the gains across datasets underscore BookRAG's practical value in structured-document QA (Wang et al., 3 Dec 2025).