BookAsSumQA: Aspect-Based Summarization
- BookAsSumQA is a framework for aspect-based summarization that recasts evaluation as an automated QA task focused on thematic coverage.
- The method employs a three-phase workflow—knowledge graph extraction, aspect-specific QA generation, and QA-based summary evaluation using metrics like ROUGE, METEOR, and BERTScore.
- Comparative analysis shows that while LLM-based strategies perform better on shorter texts, RAG-based methods scale effectively for book-length documents.
BookAsSumQA is a framework for evaluating and constructing aspect-based book summarization systems via automated, precision-focused question answering. It advances evaluation methodology, system design, and practical workflows for summarizing massive narrative documents, systematically mining and testing coverage of specific thematic or topical aspects at scale.
1. Motivation and Historical Context
Aspect-Based Book Summarization (ABS) targets the extraction of summaries that foreground specific facets of a book—such as its treatment of romance, mystery, or character arcs—rather than compressing the entire narrative. Historically, ABS has been tractable in domains with short texts (e.g., product reviews) where human reference summaries can be curated for each aspect. For book-length documents (≥100k tokens), both generating multiple reference summaries and evaluating system outputs pose severe scalability and fidelity constraints. Standard metrics like ROUGE or METEOR require reference texts and ignore aspect-specific coverage, while reference-free alternatives do not directly check thematic completeness. BookAsSumQA addresses these by recasting evaluation as a focused Question Answering (QA) problem, bypassing human reference summaries and automating both question synthesis and summary validation (Miyazato et al., 9 Nov 2025).
2. Core Framework Components and Workflow
BookAsSumQA proceeds in three phases: knowledge graph extraction, aspect-specific QA pair generation, and evaluation via targeted QA metrics.
- Narrative Knowledge Graph Construction: The book is partitioned into overlapping character windows (typically 1,200 characters with a 100-character overlap). Using an LLM, entities (characters, events, concepts) and the relations among them are extracted to form a narrative knowledge graph G = (V, E), where each edge e is annotated with a description d_e, a keyword set K_e, and a chunk-based importance score s_e.
- Aspect-Specific QA Pair Synthesis: For each aspect of interest (e.g., "Mystery"), edges with high importance scores s_e are selected. The LLM generates QA pairs (q_i, a_i) from the edge descriptions and keywords. Relevance is scored via cosine similarity between the aspect embedding and the embedding of each edge's description and keywords, yielding a pool of aspect-aligned QA pairs per genre/theme (a minimal sketch of these first two phases follows this list).
- Summary Evaluation: Candidate aspect-based summaries (≤300 tokens) are evaluated by prompting an LLM to answer each question q_i using the candidate summary as context. The predicted answer is compared to the gold answer a_i using ROUGE-1, METEOR, and BERTScore F₁, providing a composite, fine-grained measure of how well the summary covers the intended aspect.
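The sketch below illustrates the first two phases. The chunking parameters match the values described above, while embed(), the importance cutoff, and the similarity threshold are assumptions made for illustration rather than the reported implementation.

```python
# Minimal sketch of phases 1-2: overlapping chunking and aspect-relevance
# scoring. `embed` stands in for any text-embedding model; the window size
# and overlap mirror the values above, but the helper names, importance
# cutoff, and similarity threshold are illustrative assumptions.
from typing import Callable, List, Tuple
import numpy as np


def chunk_text(book: str, window: int = 1200, overlap: int = 100) -> List[str]:
    """Split the book into overlapping character windows."""
    step = window - overlap
    return [book[i:i + window] for i in range(0, max(len(book) - overlap, 1), step)]


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))


def select_aspect_edges(
    edges: List[dict],                   # each: {"description": str, "keywords": [str], "score": float}
    aspect: str,
    embed: Callable[[str], np.ndarray],  # any text-embedding function
    min_importance: float = 0.5,         # assumed cutoff on the chunk-based score
    min_similarity: float = 0.35,        # assumed aspect-relevance threshold
) -> List[Tuple[dict, float]]:
    """Keep high-importance edges whose description/keywords align with the aspect."""
    aspect_vec = embed(aspect)
    selected = []
    for edge in edges:
        if edge["score"] < min_importance:
            continue
        text = edge["description"] + " " + " ".join(edge["keywords"])
        sim = cosine(aspect_vec, embed(text))
        if sim >= min_similarity:
            selected.append((edge, sim))
    return sorted(selected, key=lambda x: x[1], reverse=True)
```

The retained edges then serve as the source material from which the LLM writes the aspect-aligned QA pairs.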
The workflow is reference-free: both QA pair generation and answer validation are model-driven, avoiding expensive human annotation and facilitating scale across 10–100 aspects and thousands of books.
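The evaluation phase admits a similarly compact sketch. Here answer_with_llm() stands in for whichever chat model answers the questions, and the unigram-overlap ROUGE-1 F1 below is a simplified stand-in for the full ROUGE/METEOR/BERTScore suite used in the framework.

```python
# Minimal sketch of phase 3: QA-based summary evaluation. `answer_with_llm`
# is a placeholder for any chat model; the unigram-overlap ROUGE-1 F1 is a
# simplified stand-in for the full metric suite.
from collections import Counter
from typing import Callable, List, Tuple


def rouge1_f1(prediction: str, reference: str) -> float:
    """Unigram-overlap F1 between a predicted and a gold answer."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    if not pred or not ref:
        return 0.0
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate_summary(
    summary: str,
    qa_pairs: List[Tuple[str, str]],             # (question, gold_answer)
    answer_with_llm: Callable[[str, str], str],  # (question, context) -> predicted answer
) -> float:
    """Average aspect-QA score of a candidate summary."""
    scores = []
    for question, gold in qa_pairs:
        predicted = answer_with_llm(question, summary)  # summary is the answering context
        scores.append(rouge1_f1(predicted, gold))
    return sum(scores) / len(scores) if scores else 0.0
```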
3. System Architectures and Summarization Methodologies
BookAsSumQA supports multiple summarization backends, organized into two principal families (both are sketched in code after this list):
- LLM-Based Summarizers: Hierarchical merging ("Hier") summarizes each chunk independently and recursively merges intermediate summaries; incremental updating ("Inc") maintains a rolling summary updated with each chunk. Models such as GPT-4o-mini and Llama-3.1-8B-Instruct are instantiated for these workflows.
- Retrieval-Augmented Generation (RAG): NaiveRAG indexes book chunks and retrieves those most relevant to the aspect query for summarization. GraphRAG constructs a knowledge graph, partitions it into communities, summarizes each, and merges; LightRAG retrieves substructures most relevant to queries from the graph.
All aspect summaries are length-constrained (≤300 tokens), ensuring consistency across methods and aspects. RAG methods index the book once and synthesize multiple aspect summaries efficiently, while LLM-based strategies regenerate distinct summaries per aspect.
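As a rough illustration of how the two LLM-based strategies differ from a retrieval-based one, the sketch below assumes a generic summarize() call wrapping any chat model and a generic embed() function; the helper names, fan-in, top-k, and token budget are illustrative assumptions rather than the paper's implementation.

```python
# Illustrative sketch of "Hier", "Inc", and a NaiveRAG-style aspect summary.
# `summarize` and `embed` are placeholders for any chat/embedding models;
# fan_in, top_k, and the token budget are assumed values.
from typing import Callable, List
import numpy as np

Summarize = Callable[[List[str], str, int], str]  # (texts, aspect, budget) -> summary
Embed = Callable[[str], np.ndarray]


def hierarchical_merge(chunks: List[str], aspect: str, summarize: Summarize,
                       fan_in: int = 4, budget: int = 300) -> str:
    """'Hier': summarize each chunk independently, then recursively merge."""
    level = [summarize([c], aspect, budget) for c in chunks]
    while len(level) > 1:
        level = [summarize(level[i:i + fan_in], aspect, budget)
                 for i in range(0, len(level), fan_in)]
    return level[0]


def incremental_update(chunks: List[str], aspect: str, summarize: Summarize,
                       budget: int = 300) -> str:
    """'Inc': maintain a rolling summary, updated with each new chunk."""
    running = ""
    for chunk in chunks:
        running = summarize([running, chunk], aspect, budget)
    return running


def naive_rag_summary(chunks: List[str], aspect: str, summarize: Summarize,
                      embed: Embed, top_k: int = 8, budget: int = 300) -> str:
    """NaiveRAG-style: index chunks once, retrieve the most aspect-relevant ones."""
    query = embed(aspect)
    index = np.stack([embed(c) for c in chunks])  # built once per book
    sims = index @ query / (np.linalg.norm(index, axis=1) * np.linalg.norm(query) + 1e-9)
    retrieved = [chunks[i] for i in np.argsort(-sims)[:top_k]]
    return summarize(retrieved, aspect, budget)
```

Because the chunk index in naive_rag_summary is built once per book, each additional aspect costs only a retrieval plus one summarization call, whereas hierarchical_merge and incremental_update re-run a full pass over the book for every aspect.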
4. Evaluation Metrics and Empirical Results
The evaluation is performed over 30 books (BookSum corpus), partitioned by length, and across 14 literary aspects (Genres: Fantasy, Romance, Comedy, Paranormal, Young Adult, Horror, History, Action, Science Fiction, Mystery, Adventure, Crime, Thriller, Poetry).
Aspect-QA Accuracy by Method:
| Method | ROUGE-1 | METEOR | BERTScore |
|---|---|---|---|
| Llama + Hier | 22.43 | 19.23 | 85.66 |
| GPT + Hier | 22.49 | 19.49 | 85.82 |
| Llama + Inc | 21.91 | 18.23 | 85.48 |
| GPT + Inc | 21.90 | 18.76 | 85.47 |
| NaiveRAG | 21.43 | 18.66 | 85.44 |
| GraphRAG | 14.66 | 13.56 | 84.50 |
| LightRAG | 20.61 | 18.41 | 85.51 |
Performance trends:
- GPT-4o-mini with Hierarchical Merging achieves the highest average scores.
- Hierarchical merging outperforms incremental updating within LLM-based methods.
- RAG methods (NaiveRAG, LightRAG) score slightly lower on average but show less decline in accuracy as book length increases.
- GraphRAG underperforms in ABS-QA, indicating that blanket graph partitioning can dilute aspect coverage.
Length-based breakdown (reported for GPT+Hier vs. NaiveRAG only):
| Book Size | GPT+Hier (ROUGE-1) | NaiveRAG (ROUGE-1) |
|---|---|---|
| Small | 25.66 | 22.09 |
| Middle | 21.95 | 21.95 |
| Large | 20.50 | 20.55 |
As book length increases, the advantage of LLM-based summarizers diminishes; NaiveRAG becomes equally effective and more scalable due to its retrieval precision.
5. Comparative Analysis: Efficiency and Scalability
LLM-based approaches incur compression noise with increasing document length, leading to diminished aspect coverage. Hierarchical merging partially mitigates this via repeated local context fusion, but loss is cumulative. RAG pipelines, by re-querying the indexed book for aspect-specific relevance, maintain high fidelity regardless of length and efficiently generate multiple aspect summaries per indexed book. This suggests RAG is preferable for novels or datasets exceeding 100k words, where lossless aspect retention is essential (Miyazato et al., 9 Nov 2025).
Furthermore, the pipeline structure facilitates batch QA pair generation and answer checking, making BookAsSumQA attractive for large-scale evaluation campaigns. The reference-free nature allows rapid benchmarking of diverse summarization architectures without manual labor.
6. Limitations and Prospective Research
BookAsSumQA is highly reliant on LLMs for both QA-pair synthesis and answer validation, imparting possible bias and dependency on the models' parametric knowledge. In practice, LLMs may hallucinate answers or over-credit incomplete summaries, and strict isolation from external world knowledge remains a challenge. No direct human-aligned benchmark comparison (versus metrics such as Q2 or BERTScore alone) has been performed. Further, improved graph construction (relation scoring, aspect-aware pruning) and summary context control (sandboxed answering) are open research avenues. Expanding to chapter or subplot-based retrieval for even finer granularity is an important area for future development.
A plausible implication is that as LLMs increase in context length and retrieval capabilities, BookAsSumQA will support deeper, multi-aspect, multi-hop evaluation scenarios, approaching comprehensive narrative understanding.
7. Conclusions and Future Impact
BookAsSumQA establishes a robust, scalable paradigm for evaluating and guiding aspect-based book summarization systems. By leveraging knowledge graph construction, automated QA generation, and precision QA evaluation, it obviates the need for human reference summaries and enables fine-grained, aspect-focused benchmarking. Key findings demonstrate that LLM-based methods excel on short texts, RAG-based methods become superior on longer novels, and aspect-based QA offers a more precise evaluative lens than previous approaches.
This suggests future systems for book summarization will increasingly employ QA-centric evaluation pipelines, potentially integrating joint retrieval-generation architectures, entity tracking, and multi-hop reasoning for richer aspect fidelity and coverage. The framework also motivates further inquiry into correlating automated QA metrics with human satisfaction, optimizing narrative graph structure, and refining answer checking to minimize external knowledge interference.