End-to-End Extractive Readers

Updated 4 June 2026

End-to-end extractive readers are agentic systems that combine LLMs with closed-loop retrieval to extract and verify factual evidence without intermediate supervision.
They employ iterative precision–recall strategies and index-time optimization to enhance discoverability, achieving up to 93.7% retrieval and improved citation rates.
Empirical results indicate these readers improve factual accuracy and reduce high-confidence errors, though challenges in evidence navigation and context management persist.

End-to-end extractive readers are a class of information-seeking systems that leverage LLMs and retrieval components in a closed-loop, agentic architecture to optimize precision and recall in evidence identification, synthesis, and factual correction. These readers operate without reliance on intermediate supervision or cascading task-specific stages, contrasting with modular or pipeline extractive methods. Recent advances center on rigorous agentic precision–recall iteration, index-time optimization, and explicit criteria for discoverability and reliability in complex application domains, including research assistance, enterprise knowledge bases, and clinical planning.

1. Formal Definition and Core Principles

An end-to-end extractive reader is instantiated as an agentic stack in which a query $q$ triggers a retrieval-augmented generation (RAG) loop over a static or dynamically updated index $\mathcal{I}$, returning a deterministic set of relevant documents or factual units. The canonical retrieval operation is defined as

$R(q, \mathcal{I}) = \{ d \in \mathcal{I} : s(q, d) \ge \tau \}$

where $s(q, d) \in [0, 1]$ denotes a calibrated cross-encoder or ranker score and $\tau$ is a fixed threshold. The extractive reader emits, in a single pass, the top-$k$ results exceeding $\tau$, often accompanied by fine-grained evidence chains, explicit confidence estimates, and/or self-contained "factual nuggets": atomic KB entries that encode a correction or fact, along with anchors and minimal disambiguation (Hazoom et al., 25 May 2026, Hsu et al., 11 May 2026).
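The thresholded retrieval operation can be sketched in a few lines. This is a minimal illustration, assuming the scores are already calibrated to $[0, 1]$; the function name `retrieve`, the dictionary input, and the tie-breaking rule are assumptions made here, not part of the cited systems.

```python
from typing import Dict, List, Optional, Tuple


def retrieve(
    scores: Dict[str, float], tau: float, k: Optional[int] = None
) -> List[Tuple[str, float]]:
    """Return documents with s(q, d) >= tau, best first, optionally cut to top-k."""
    # Keep only documents whose score clears the threshold.
    kept = [(doc_id, s) for doc_id, s in scores.items() if s >= tau]
    # Sort by descending score; ties are broken by doc id so the
    # returned set is deterministic (an assumption of this sketch).
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return kept if k is None else kept[:k]


# Hypothetical scores s(q, d) for a single query.
example_scores = {"d1": 0.91, "d2": 0.40, "d3": 0.77, "d4": 0.77}
top_two = retrieve(example_scores, tau=0.5, k=2)
```

Applying the threshold before the top-$k$ cut means the reader can return fewer than $k$ results, or none, when nothing clears $\tau$.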

Core to the end-to-end paradigm is agentic iteration: the reader, or its LLM-powered agentic controller, directly manages retrieval depth, evidence surfacing/browsing, and synthesis, adapting tool usage to optimize downstream objectives (e.g., discoverability, answer accuracy, guideline compliance) without human-in-the-loop postprocessing.

2. Agentic Precision–Recall Iteration

Precision–recall optimization is fundamental to all recent end-to-end extractive reader designs. Systems implement iterative procedures that refine the indexed units (e.g., factual nuggets, knowledge entries) or the agent's search policy (e.g., via regeneration, reformulation, or discrepancy buffering), directly using the production retrieval and generation stack as an evaluation harness. At each iteration $t$:

Discoverability (Recall):

$\mathrm{Disc}_t = \frac{1}{|\mathcal{Q}|}\sum_{q \in \mathcal{Q}} \mathbf{1}[n^\star \in R_t(q)]$

with $\mathcal{Q}$ a set of paraphrastic or semantically related probe queries.

Precision is 1 when the correct unit $n^\star$ is the lone relevant retrieval, otherwise 0.
Citation rate can serve as a practical proxy for answer-level precision, denoting whether the retrieved fact is cited by the generator (Hazoom et al., 25 May 2026).
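The two metrics above are simple to compute once a retrieval function is available. The sketch below is illustrative only: `retrieve_ids` is a hypothetical callable standing in for the production stack, and reading "lone relevant retrieval" as "the retrieved set contains exactly the target" is a simplifying assumption.

```python
from typing import Callable, Iterable, List


def discoverability(
    probes: Iterable[str],
    retrieve_ids: Callable[[str], List[str]],  # hypothetical: query -> retrieved ids
    target_id: str,
) -> float:
    """Fraction of probe queries whose retrieved set contains the target unit."""
    probe_list = list(probes)
    if not probe_list:
        raise ValueError("need at least one probe query")
    hits = sum(1 for q in probe_list if target_id in retrieve_ids(q))
    return hits / len(probe_list)


def binary_precision(retrieved: List[str], target_id: str) -> int:
    """1 if the target is the only unit retrieved, else 0 (simplified reading)."""
    return int(retrieved == [target_id])


# Toy retrieval results for three paraphrastic probes.
toy_results = {"q1": ["n_star"], "q2": ["n_star", "other"], "q3": ["other"]}
disc = discoverability(toy_results, toy_results.__getitem__, "n_star")
```

Here the target is found by two of the three probes, so discoverability is 2/3, while binary precision is 1 only for `q1`.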

Iterative frameworks such as Iterative Nugget Optimization (INO) (Hazoom et al., 25 May 2026) repeatedly expand and rewrite indexed facts until the discoverability objective converges, while ensuring that offline costs remain negligible at runtime. In planning settings, twin architectures (e.g., Planner–Auditor (Wu et al., 28 Jan 2026)) audit recall via deterministic criteria and control precision through structured confidence estimation and feedback-driven regeneration.

3. Index-Time Optimization and Factual Nugget Construction

Index-time optimization decouples retrieval improvements from any changes in query-time latency, retriever architecture, or generator prompt composition. For factual corrections generated from user feedback, the INO loop (Hazoom et al., 25 May 2026) proceeds as follows:

Extract and draft a factual nugget $n_0$ from feedback on a trigger query $q_0$.
Augment $n_0$ with paraphrase anchors to create $n_1$.
Index $n_t$ and probe with $q_0$ and held-out paraphrases.
If $n_t$ is not retrieved, LLM-based reflection rewrites anchors, terminology, or context in $n_t$ without introducing new facts.
Iterate up to $T$ cycles (empirically, a small number of cycles suffices in >95% of cases).
Permanently index $n^\star$ when discovered by all probes.
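The loop above can be sketched as a small control function. This is not the INO implementation: `is_retrieved` and `rewrite` are hypothetical callables standing in for the production retrieval stack and the LLM reflection step, and the toy versions below only match words.

```python
from typing import Callable, List, Optional


def optimize_nugget(
    draft: str,
    probes: List[str],
    is_retrieved: Callable[[str, str], bool],   # hypothetical: (nugget, probe) -> found?
    rewrite: Callable[[str, List[str]], str],   # hypothetical: (nugget, missed) -> new text
    max_cycles: int,
) -> Optional[str]:
    """Return a nugget version found by every probe, or None if the budget runs out."""
    nugget = draft
    for _ in range(max_cycles):
        missed = [q for q in probes if not is_retrieved(nugget, q)]
        if not missed:
            return nugget  # discovered by all probes: safe to index permanently
        # Reflection step: adjust anchors or terminology only, never add facts.
        nugget = rewrite(nugget, missed)
    # Check the result of the final rewrite before giving up.
    if all(is_retrieved(nugget, q) for q in probes):
        return nugget
    return None


def toy_is_retrieved(nugget: str, probe: str) -> bool:
    words = nugget.lower().split()
    return all(w in words for w in probe.lower().split())


def toy_rewrite(nugget: str, missed: List[str]) -> str:
    return nugget + " " + missed[0]  # append one missed probe as an anchor


result = optimize_nugget(
    "refund window is 30 days",
    ["refund window", "return period"],
    toy_is_retrieved,
    toy_rewrite,
    max_cycles=3,
)
```

Returning `None` when the budget is exhausted keeps undiscoverable nuggets out of the permanent index, which matches the final step of the procedure.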

This method significantly increases both retrieval and citation rates for corrections (e.g., up to 93.7% retrieval and 86.1% citation on paraphrastic queries; see Section 6 for further quantitative results). Negative-control experiments show minimal loss of retrieval precision (<1.5% false positive rate), ensuring factual tightness of indexed units (Hazoom et al., 25 May 2026).

4. Agentic Evidence Search and Retrieval Depth

Agentic extractive readers, exemplified by the Pi-Serini loop (Hsu et al., 11 May 2026), model the document search-and-read process as a sequence of tool invocations parameterized by a ReAct-style agent history. The agent π alternates between searching, paginating result sets, reading documents, and emitting final structured answers, within a prescribed time budget. Key tool actions include:

search: queries an Anserini BM25 backend and returns both top-5 snippets for evidence sufficiency and a cached ranking up to a fixed depth $k$ (a large $k$ is common for long-document corpora).
read_search_results, read_document: enable the agent to inspect deeper or specific evidentiary content without initiating new searches.
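The division of labor between these tools can be illustrated with a toy tool layer. Everything here is a sketch under stated assumptions: the class name, method signatures, and the `backend` callable are hypothetical and do not reproduce the Pi-Serini or Anserini interfaces.

```python
from typing import Callable, Dict, List


class SearchTools:
    """Toy tool layer: `search` caches a deep ranking; later reads page through it."""

    def __init__(
        self,
        backend: Callable[[str, int], List[str]],  # hypothetical ranker: (query, depth) -> doc ids
        corpus: Dict[str, str],
        depth: int,
        preview: int = 5,
    ) -> None:
        self.backend = backend
        self.corpus = corpus
        self.depth = depth
        self.preview = preview
        self.cache: Dict[str, List[str]] = {}

    def search(self, query: str) -> List[str]:
        # Rank once to full depth, but show the agent only the first few hits.
        self.cache[query] = self.backend(query, self.depth)
        return self.cache[query][: self.preview]

    def read_search_results(self, query: str, offset: int, count: int) -> List[str]:
        # Page deeper into the cached ranking without issuing a new search.
        return self.cache[query][offset : offset + count]

    def read_document(self, doc_id: str) -> str:
        return self.corpus[doc_id]


toy_corpus = {f"doc{i}": f"text of document {i}" for i in range(8)}


def toy_backend(query: str, depth: int) -> List[str]:
    return sorted(toy_corpus)[:depth]  # fixed order; a real ranker would score by query


tools = SearchTools(toy_backend, toy_corpus, depth=8)
first_page = tools.search("anything")
next_page = tools.read_search_results("anything", offset=5, count=3)
```

Caching the full ranking is what separates surfaced evidence (everything in the cache) from previewed evidence (what the agent actually pages through or reads).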

Empirical ablations confirm that increasing retrieval depth directly increases surfaced-evidence recall: from about 70.5% at $k = 5$ to 95.8% at $k = 1000$, with previewed recall plateauing at moderate depth (indicating agent context or navigation bottlenecks). BM25 tuning (e.g., of the $k_1$ and $b$ parameters) contributes up to 18 percentage points of additional answer accuracy (Hsu et al., 11 May 2026).
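To make the role of BM25 tuning concrete, the following sketch scores tokenized documents with the textbook Okapi BM25 formula. It is not Anserini's implementation, and the default values for `k1` and `b` are placeholders chosen for illustration rather than settings reported by the cited work.

```python
import math
from collections import Counter
from typing import List


def bm25_scores(
    query: List[str], docs: List[List[str]], k1: float = 0.9, b: float = 0.4
) -> List[float]:
    """Score each tokenized document against the query (textbook BM25)."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if tf[term] == 0:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # k1 controls term-frequency saturation; b controls length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


toy_docs = [["a", "b"], ["a", "a", "a", "c"], ["d"]]
no_length_norm = bm25_scores(["a"], toy_docs, k1=1.2, b=0.0)
```

With `b = 0.0` document length is ignored, so the document with three occurrences of the query term outranks the one with a single occurrence; raising `b` penalizes longer documents, which matters on long-document corpora.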

Table: Effect of Retrieval Depth on Evidence Recall (Pi-Serini, GPT-5.4)

Retrieval Depth $k$	Surfaced Recall (evidence)	Previewed Recall (evidence)
5	70.48%	70.48%
50	86.16%	74.67%
1000	95.78%	70.89%

Surfaced recall is maximized by exposing a broad candidate set for agentic reasoning, while previewed recall stays between roughly 70% and 75% at every depth, indicating that retrieval recall is not the limiting factor for extractive evidence navigation (Hsu et al., 11 May 2026).
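The widening gap between what is surfaced and what the agent actually previews can be read directly off the table. The short calculation below uses only the tabulated values; the variable names are illustrative.

```python
# (surfaced recall %, previewed recall %) per retrieval depth, from the table above.
recall_by_depth = {
    5: (70.48, 70.48),
    50: (86.16, 74.67),
    1000: (95.78, 70.89),
}

# Percentage points of evidence that was surfaced but never previewed by the agent.
gap_by_depth = {
    depth: round(surfaced - previewed, 2)
    for depth, (surfaced, previewed) in recall_by_depth.items()
}
```

The gap grows from 0 points at depth 5 to 11.49 points at depth 50 and 24.89 points at depth 1000, so most of the additional evidence exposed by deeper retrieval goes unread.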

5. Twin Validation and Self-Improvement in Structured Layouts

In structured extraction scenarios (e.g., clinical discharge planning), end-to-end extractive readers are paired with deterministic audit modules to enforce task coverage and calibrate confidence. The Planner–Auditor framework (Wu et al., 28 Jan 2026) instantiates an agentic loop:

The Planner (LLM with retrieval capabilities) emits a structured plan and attaches a confidence estimate $c$.
The Auditor deterministically checks for required content (e.g., coverage of four clinical plan categories).
If the confidence $c$ is low or coverage fails, the Planner regenerates; stubborn, high-confidence failures are buffered for asynchronous cross-episode replay.

This process raises coverage (recall) from 32% (baseline) to 86% (self-improvement), and to 100% with buffer replay. Calibration metrics such as Brier score and expected calibration error (ECE) fall sharply relative to the baseline, and high-confidence omissions drop to 0% in the best configuration. System parameters (the confidence threshold, buffer thresholds) steer the recall-precision trade-off (Wu et al., 28 Jan 2026).
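The plan, audit, and regenerate cycle can be sketched as follows. This is a hedged illustration rather than the Planner–Auditor implementation: the `planner` callable, the section names, the confidence threshold `min_conf`, and the list-based replay buffer are all assumptions introduced here.

```python
from typing import Callable, Dict, List, Set, Tuple

Plan = Dict[str, str]
Planner = Callable[[str, Set[str]], Tuple[Plan, float]]  # (case, feedback) -> (plan, confidence)


def audit(plan: Plan, required: Set[str]) -> Set[str]:
    """Deterministic check: return required sections that are missing or empty."""
    return {name for name in required if not plan.get(name, "").strip()}


def plan_with_audit(
    planner: Planner,
    case: str,
    required: Set[str],
    min_conf: float,
    max_tries: int,
    replay_buffer: List[str],
) -> Plan:
    plan: Plan = {}
    conf = 0.0
    missing = set(required)
    feedback: Set[str] = set()
    for _ in range(max_tries):
        plan, conf = planner(case, feedback)
        missing = audit(plan, required)
        if not missing and conf >= min_conf:
            return plan  # full coverage and sufficient confidence
        feedback = missing  # tell the planner what the auditor found missing
    if missing and conf >= min_conf:
        # Stubborn, high-confidence failure: defer to cross-episode replay.
        replay_buffer.append(case)
    return plan


REQUIRED = {"medications", "follow_up", "diet", "education"}  # hypothetical category names


def toy_planner(case: str, feedback: Set[str]) -> Tuple[Plan, float]:
    plan = {"medications": "continue", "follow_up": "2 weeks", "diet": "low salt"}
    if "education" in feedback:
        plan["education"] = "warning signs reviewed"
    return plan, 0.9


buffer: List[str] = []
final_plan = plan_with_audit(toy_planner, "case-1", REQUIRED, 0.8, 3, buffer)
```

In this toy run the first plan omits one section, the auditor reports it, and the second attempt passes, so nothing is buffered. A planner that keeps failing while reporting high confidence would instead land in `replay_buffer`.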

Table: Empirical Impact of Twin Loop and Caching (Planner–Auditor; Wu et al., 28 Jan 2026)

Configuration	Coverage	Brier	ECE	High-conf. errors
Baseline	0.32	0.544	0.564	66%
Self-Improve	0.86	0.126	0.062	14%
Buffer Replay	1.00	0.017	0.107	0%
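For readers unfamiliar with the two calibration columns, the sketch below computes the Brier score and a standard equal-width-bin ECE from confidence estimates and binary outcomes. The bin count and binning scheme are assumptions; the cited work's exact choices are not stated here.

```python
from typing import List, Sequence, Tuple


def brier_score(confidences: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, outcomes)) / len(confidences)


def expected_calibration_error(
    confidences: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    n = len(confidences)
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, outcomes):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(accuracy - avg_conf)
    return ece


toy_conf = [0.9, 0.9, 0.6, 0.6]
toy_outcomes = [1, 0, 1, 1]
toy_brier = brier_score(toy_conf, toy_outcomes)
toy_ece = expected_calibration_error(toy_conf, toy_outcomes)
```

On this toy input the Brier score is 0.285 and the ECE is 0.4. Lower is better for both, and the two can move in different directions because ECE depends on how predictions are binned.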

6. Empirical Results and Practical Outcomes

Quantitative results across extractive reader frameworks establish several key outcomes:

INO substantially increases both retrieval and citation of corrective facts in knowledge-assistance agents; e.g., 29-point increases in both discoverability and citation rate versus standard implementations, and up to 77.3% retrieval on real non-paraphrastic queries (Hazoom et al., 25 May 2026).
Agentic search with Pi-Serini achieves surfaced-evidence recall up to 95.8% and answer accuracy of 83.1% on BrowseComp-Plus, outperforming dense retrieval baselines (Hsu et al., 11 May 2026).
Agentic validation loops cut high-confidence errors from 66% to 14%, and to 0% when buffer replay is enabled, which also yields full task coverage at a moderate runtime cost (Wu et al., 28 Jan 2026).

Collectively, these results underline that end-to-end extractive readers, when equipped with iterative precision–recall optimization—especially at index time—can achieve high factual reliability and recall, even under substantial lexical and structural query variation.

7. Limitations and Future Research Directions

Despite significant empirical gains, extant end-to-end extractive readers face several limitations. In the Pi-Serini depth ablation, previewed evidence recall plateaus even as surfaced recall increases with retrieval depth: the bottleneck shifts from retrieval itself to evidence navigation, context allocation, and prompt consumption. This gap between surfaced and previewed recall suggests the need for further research in context management, tool-use policies, and cross-document reasoning. A plausible implication is that future systems will require optimized hybrid navigation strategies and adaptive memory allocation to bridge it (Hsu et al., 11 May 2026).

Another open area is robustness to distributional shift in feedback, user queries, and knowledge base evolution. INO and twin-loop architectures present frameworks for continual adaptation, but their long-term stability and sample efficiency under sustained adversarial input have yet to be fully characterized (Hazoom et al., 25 May 2026, Wu et al., 28 Jan 2026).

Finally, the translation of these frameworks from constrained B2B or clinical settings to open-domain scientific discovery and real-world research workflows remains an active research frontier, especially as LLM capabilities further evolve.