Page-Level Question Answering
- Page-level question answering is defined as extracting precise answer spans from specific document segments using advanced IR methods and contextual reasoning.
- Systems employ hybrid retrieval strategies such as dense and sparse embeddings, neural rankers, and hierarchical indexing to efficiently locate relevant passages.
- Challenges include managing multi-modal content, cross-sentence dependencies, and noisy contexts while ensuring evidence alignment and high answer accuracy.
Page-level question answering (QA) encompasses information retrieval, passage ranking, and answer extraction within defined units of text—typically the scale of a page, paragraph, or bounded region in a document. Systems for this task must manage substantial contextual complexity, including cross-sentence dependencies, multi-modal content, and variable document structure. Advances in page-level QA integrate retrieval-augmented generation, dense and sparse indexing, neural ranking, and context-aware extraction to deliver direct answers aligned with the evidence on a specific page.
1. Foundational Principles and Definitions
Page-level QA systems address the task: given a natural language question and a structured or unstructured corpus, locate a concise answer within the correct segment (often a page or passage) without requiring the user to manually navigate large numbers of documents. Early systems such as Answer Finder (Derczynski et al., 2013) formalized the distinction between returning a ranked list of possibly relevant documents and extracting the answer span or value directly from the corpus. The QA function can be defined as $f_{\mathrm{QA}}(q, \mathcal{C}) = a$, where $q$ is the question, $\mathcal{C} = \{c_1, \ldots, c_n\}$ is the set of candidate contexts (pages, paragraphs), and $a$ is an answer span drawn from some $c_i \in \mathcal{C}$.
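A minimal sketch of this formulation, assuming hypothetical `score_context` (retriever) and `extract_span` (reader) callables that a concrete system would supply:

```python
from typing import Callable, Sequence, Tuple

def answer(
    question: str,
    contexts: Sequence[str],
    score_context: Callable[[str, str], float],  # hypothetical retriever score
    extract_span: Callable[[str, str], str],     # hypothetical reader
) -> Tuple[str, str]:
    """f_QA(q, C) -> a: pick the best-scoring page, then extract a span."""
    best = max(contexts, key=lambda c: score_context(question, c))
    return extract_span(question, best), best  # answer plus its evidence page
```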
Recent methodologies explicitly model hierarchical context, leveraging passage-level and page-level indexes, and disambiguate user intent using deep language understanding. Both knowledge graph QA and free-text QA pipelines are integrated to increase coverage and context alignment (Guo et al., 17 Jan 2024).
2. System Architectures and Retrieval Strategies
The evolution of page-level QA incorporates several retrieval paradigms:
- Plug-in IR Engines and Passage Retrieval: Early frameworks compared Lucene, Indri, and Terrier for passage and document-level indexing. Passage-level retrieval, leveraging delimiters such as paragraph tags, yielded higher QA accuracy than document-level approaches due to reduced noise (Derczynski et al., 2013).
- Neural Retriever-Reader and Reader-Retriever Paradigms: The reader-retriever model preprocesses the corpus offline to build question–answer bipartite graphs, enabling efficient online retrieval of both natural language and structured queries (Xiao et al., 2020).
- Dense and Sparse Embedding-based Retrieval: Dual-encoder dense retrievers (e.g., DPR) and cross-encoders (e.g., ELECTRA) are leveraged for robust context selection, augmented by BM25 for term-based ranking (McDonald et al., 2022); a hybrid scoring sketch follows the table below.
- Retrieval-Augmented Generation (RAG) and Hierarchical Indexing: RAG frameworks with hierarchical indexing structures enable retrieval at variable granularity—flattened chunks for in-page details and clustered blocks for cross-page dependencies (Gong et al., 1 Aug 2025). Graph-enhanced RAG (GraphRAG) models interconnected entities for complex reasoning, but may introduce excessive noise when aligning page-level references due to broad context retrieval (Chen et al., 20 Sep 2025).
| Retrieval Style | Index Level | Context Granularity |
|---|---|---|
| BM25 / Lucene (baseline) | Passage, Document | Token, Paragraph, Page |
| Dual-encoder DPR | Passage, Page | Dense embeddings, whole pages |
| Reader-Retriever | QA Space Graph | Question-answer bipartite nodes |
| RAG/GraphRAG | Page, Chunk, Graph | Embeddings or entity graph nodes |
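A minimal sketch of the dense-plus-sparse hybrid, using the `rank_bm25` package for term-based scoring and a hypothetical `embed` encoder standing in for DPR; the interpolation weight `alpha` is illustrative:

```python
import numpy as np
from rank_bm25 import BM25Okapi  # term-based (sparse) scorer

def hybrid_rank(question, pages, embed, alpha=0.5):
    """Interpolate dense cosine similarity with normalized BM25 scores."""
    # Sparse: BM25 over whitespace-tokenized pages, min-max normalized.
    bm25 = BM25Okapi([p.lower().split() for p in pages])
    sparse = bm25.get_scores(question.lower().split())
    sparse = (sparse - sparse.min()) / (np.ptp(sparse) + 1e-9)

    # Dense: cosine similarity between question and page embeddings.
    q = embed(question)
    P = np.stack([embed(p) for p in pages])
    dense = P @ q / (np.linalg.norm(P, axis=1) * np.linalg.norm(q) + 1e-9)

    return np.argsort(-(alpha * dense + (1 - alpha) * sparse))  # best first
```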
3. Ranking and Reformulation Modules
Ranking functions are pivotal in discerning relevant passages, especially when fine-grained answer extraction is required:
- Neural Rankers for Passage Selection: Semantic similarity and keyword relevance are modeled via InferSent and Relation-Networks rankers, respectively (Htut et al., 2018). The former computes joint embedding-level similarity; the latter performs exhaustive pairwise word-level scoring. Margin ranking loss encourages discrimination between answer-containing and negative passages: $\mathcal{L} = \max(0,\, \gamma - s(q, p^{+}) + s(q, p^{-}))$, where $s$ is the ranker's score, $p^{+}$ an answer-bearing passage, $p^{-}$ a negative one, and $\gamma$ the margin (a sketch follows this list).
- Gold Standard Reformulation and Evaluation: Manual reformulation of question series established guidelines for improving the independence and clarity of questions. Reformulation quality was assessed with weighted n-gram similarity between original and reformulated questions, to optimize preprocessing for later retrieval steps (Derczynski et al., 2013); a toy overlap metric is also sketched below.
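A minimal PyTorch sketch of the margin ranking objective; the scores stand in for any neural ranker's outputs:

```python
import torch
import torch.nn.functional as F

def margin_ranking_loss(pos_scores, neg_scores, margin=1.0):
    """Hinge loss pushing answer-bearing passages above negatives by `margin`.

    Equivalent to torch.nn.MarginRankingLoss with target = 1.
    """
    return F.relu(margin - pos_scores + neg_scores).mean()

# Toy usage: three (positive, negative) passage pairs for one question.
pos = torch.tensor([2.5, 0.8, 1.9])
neg = torch.tensor([1.0, 1.2, 0.3])
print(margin_ranking_loss(pos, neg))  # tensor(0.4667): only the middle pair violates the margin
```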
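And a toy weighted n-gram overlap for the reformulation bullet; the per-order weights and Dice-style normalization are illustrative, not the exact scheme of the cited gold standard:

```python
from collections import Counter

def weighted_ngram_similarity(a, b, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of n-gram Dice overlap for n = 1..len(weights)."""
    ta, tb = a.lower().split(), b.lower().split()
    total = 0.0
    for n, w in enumerate(weights, start=1):
        ga = Counter(zip(*[ta[i:] for i in range(n)]))  # n-grams of a
        gb = Counter(zip(*[tb[i:] for i in range(n)]))  # n-grams of b
        overlap = sum((ga & gb).values())
        denom = sum(ga.values()) + sum(gb.values())
        total += w * (2 * overlap / denom if denom else 0.0)
    return total

print(weighted_ngram_similarity(
    "who wrote the novel", "who wrote that novel"))  # 0.475
```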
4. Multi-Modal and Cross-Page Evidence Integration
Recent page-level QA systems address the challenge of connecting textual, tabular, and visual evidence, and aggregating information scattered through multi-page documents:
- Multi-modal RAG Architectures: MMRAG-DocQA (Gong et al., 1 Aug 2025) employs a hierarchical index federating in-page chunks and cross-page summaries via Gaussian mixture clustering and LLM-based summarization (a clustering sketch follows this list). Both page-level parent-page retrieval and document-level summary retrieval support integration of disparate modalities.
- Visual-Only Document Question Answering: Self-attention scoring over Pix2Struct features enables OCR-free retrieval on large multi-page PDF and scanned document sets, scaling to 800 pages without GPU bottlenecks. Each page is rendered into a feature sequence whose relevance is scored by a self-attention head and aggregated across pages (Kang et al., 29 Apr 2024); an attention-pooling sketch also follows this list.
- Coreference-aware Question Generation: Neural QG models incorporate refined coreference embeddings via gating mechanisms and position features to enable cross-sentence reasoning and more answerable question generation at the passage or page level (Du et al., 2018).
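A minimal sketch of grouping chunk embeddings into cross-page blocks with a Gaussian mixture, as in the hierarchical-index bullet; embeddings are assumed precomputed, and the LLM summarization of each block is omitted:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def cluster_chunks(chunk_embeddings, n_blocks=8, seed=0):
    """Group (num_chunks, dim) embeddings into cross-page blocks.

    Returns one array of chunk indices per block; a downstream LLM would
    summarize each block into a cross-page index entry.
    """
    gmm = GaussianMixture(n_components=n_blocks, random_state=seed)
    labels = gmm.fit_predict(chunk_embeddings)
    return [np.where(labels == k)[0] for k in range(n_blocks)]

# Toy usage with random embeddings for 100 chunks.
blocks = cluster_chunks(np.random.default_rng(0).normal(size=(100, 32)), n_blocks=4)
```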
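And a schematic attention-pooling scorer for the visual-only retrieval bullet; the Pix2Struct feature extractor is abstracted away, and the single learned query is a simplification of the cited self-attention head:

```python
import torch
import torch.nn as nn

class PageScorer(nn.Module):
    """Score each page's relevance from its visual feature sequence.

    A learned query attends over a page's features (attention pooling),
    and a linear head maps the pooled vector to a scalar relevance score.
    """
    def __init__(self, dim=128):
        super().__init__()
        self.query = nn.Parameter(torch.randn(dim))
        self.head = nn.Linear(dim, 1)

    def forward(self, feats):                              # (pages, seq, dim)
        attn = torch.softmax(feats @ self.query, dim=-1)   # (pages, seq)
        pooled = (attn.unsqueeze(-1) * feats).sum(dim=1)   # (pages, dim)
        return self.head(pooled).squeeze(-1)               # (pages,)

scores = PageScorer()(torch.randn(800, 64, 128))  # hundreds of pages at once
best_page = scores.argmax().item()
```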
5. Evaluation, Challenges, and Limitations
Evaluation frameworks for page-level QA encompass both accuracy and quality of context selection:
- Retrieval Metrics: Top-k retrieval accuracy measures whether the correct page or passage is selected, while token-level F1 quantifies answer overlap with ground truth (both are sketched after this list). For example, dense embedding-based RAG approaches attained top-1 accuracy of 0.686 on math textbooks, outperforming GraphRAG’s entity-based graph retrieval (Chen et al., 20 Sep 2025).
- Context Coverage and Hallucination: Metrics such as context coverage (CCov), grounded accuracy, and hallucination rate are implemented to measure whether sufficient supporting evidence is retrieved and answers are factually substantiated (Tangarajan et al., 4 Aug 2025).
- Aggregation and Fusion Strategies: Systems such as AnswerQuest (Roemmele et al., 2021) and reader-retriever hybrids (Xiao et al., 2020) combine candidate answers from multiple retrieval pathways via consistency checks or majority voting to maximize recall while minimizing spurious results (a toy voting sketch also follows below).
- Known Limitations: Performance plateaued at 60–64% accuracy with traditional IR engines, reflecting the difficulty of ambiguous or noisy queries and the labor-intensive construction of extensive gold standards (Derczynski et al., 2013). Blind relevance feedback based on term frequency (TF) often degraded retrieval coverage (Derczynski et al., 2013). Entity-graph retrieval may introduce excessive context and hallucinate references without page-aligned constraints (Chen et al., 20 Sep 2025).
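A compact sketch of the two retrieval metrics above, following their standard definitions:

```python
from collections import Counter

def top_k_accuracy(ranked_page_ids, gold_page_id, k=1):
    """1.0 if the gold page appears among the top-k retrieved pages."""
    return float(gold_page_id in ranked_page_ids[:k])

def answer_f1(prediction, gold):
    """Token-level F1 between a predicted answer span and the ground truth."""
    p, g = prediction.lower().split(), gold.lower().split()
    common = sum((Counter(p) & Counter(g)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

print(top_k_accuracy([12, 7, 3], gold_page_id=7, k=2))  # 1.0
print(answer_f1("the Riemann integral", "Riemann integral"))  # 0.8
```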
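And a toy majority-vote fusion over candidate answers from multiple pathways; lowercasing stands in for the richer consistency checks real aggregation modules apply:

```python
from collections import Counter

def fuse_answers(candidates):
    """Majority vote over candidate answers; ties break by first occurrence."""
    votes = Counter(c.strip().lower() for c in candidates)
    return votes.most_common(1)[0][0]

print(fuse_answers(["Paris", "paris", "Lyon"]))  # 'paris'
```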
6. Practical Applications and Domain Adaptation
Page-level QA engines support applications across numerous domains:
- Web and Enterprise Search: Integrated KG and free-text QA deliver direct answers with rich metadata and aggregation, suited to exclusively indexed corpora such as Wikipedia and Wikidata (Guo et al., 17 Jan 2024).
- E-commerce and Context-Aware PQA: RAG frameworks leveraging user profiles, conversational history, and product attributes deliver personalized, context-sensitive answers while identifying catalog information gaps. Objective, subjective, and multi-intent queries are resolved via unified intent modeling and entropy-guided retrieval (Tangarajan et al., 4 Aug 2025).
- Education and Tutoring Systems: Embedding-based retrieval enables reference to specific textbook pages, increasing explainability and alignment for AI tutoring, while graph-based retrieval (GraphRAG) enhances semantic linkage but may reduce page-level fidelity (Chen et al., 20 Sep 2025).
- Scientific and Technical QA: DRC frameworks combining layout-aware detection, neural retrieval, and multi-format QA models support robust text and evidence extraction in complex documents (McDonald et al., 2022).
7. Trends, Interactivity, and Future Directions
Emerging trends highlight the convergence of conversational interfaces and page-level QA, multimodal evidence integration, and context-sensitive interaction:
- Interactive QA Systems: The literature formalizes the interactive QA task, including disambiguation, exploration through iterative refinement, and multi-turn dialogue over structured knowledge graphs and unstructured documents (Biancofiore et al., 2022). Evaluation increasingly combines offline (benchmark) and online (user study) modalities.
- Scalability and Multimodality: Research suggests leveraging layout-aware, cross-modal, and hierarchical retrieval for more robust evidence aggregation. Adaptive zero-shot or weakly supervised learning methods are prioritized for domain extension (McDonald et al., 2022).
- Interpretability and Attributed Generation: Attributed sequence-to-sequence generation with interpretable search paths (as in 1-PAGER) is being explored for transparent QA where evidence can be traced directly to supporting passages (Jain et al., 2023).
In summary, page-level question answering spans modular IR enhancements, neural rankers, and complex RAG systems capable of context-aware, multi-modal, and scalable QA across diverse technical domains. Advances in hierarchical indexing, aggregation strategies, multimodal evidence retrieval, and interactive dialogue systems continue to improve both accuracy and utility, setting the agenda for future research in robust, interpretable, and domain-adaptive QA.