AnnoRetrieve: Structured Semantic Retrieval
- AnnoRetrieve is a structured retrieval paradigm that transforms unstructured documents into attribute–value pairs via automated, schema-driven annotation.
- It leverages components like SchemaBoot, an annotation module, and a Structured Semantic Retrieval engine to convert natural language queries into efficient, index-based filters.
- Empirical evaluations demonstrate its superior effectiveness with an F1 score of 0.87, reduced LLM token costs, and lower query latency compared to traditional approaches.
AnnoRetrieve is a structured retrieval paradigm developed for efficient and precise information extraction from large corpora of unstructured documents, particularly targeting enterprise and web-scale textual data. Departing from traditional embedding-based and graph-based retrieval methods, AnnoRetrieve orchestrates fully automated schema induction, schema-driven annotation, and hybrid structured-semantic query execution to dramatically reduce reliance on LLMs while delivering high accuracy and scalability (Lin et al., 3 Apr 2026).
1. Motivation and Background
Conventional retrieval systems for unstructured text commonly adopt either embedding-based vector search or LLM-intensive knowledge graph construction. The former (“Retrieve-then-Extract”) operates by dense vector representation and coarse-grained similarity, followed by attribute extraction via LLMs—a process that is computationally costly and susceptible to imprecise filtering, resulting in excessive LLM invocations. The latter (“Extract-All-then-Query”) front-loads large corpus-wide extraction with LLMs to build knowledge graphs, incurring substantial up-front cost and sensitivity to extraction errors.
The central insight of AnnoRetrieve is to effect a one-time, schema-driven annotation: each document is decomposed into attribute–value pairs according to a data-driven, automatically induced schema. Query execution then bypasses both vector similarity and graph traversal, leveraging this structured representation. This annotation-driven approach enables highly targeted, index-based filtering and efficient attribute-level extraction, simultaneously eliminating the cost bottlenecks and accuracy issues pervasive in previous paradigms (Lin et al., 3 Apr 2026).
2. System Architecture and Pipeline
AnnoRetrieve integrates three principal modules, each corresponding to a distinct phase in the analysis pipeline:
- SchemaBoot: An automated schema induction system that clusters the document corpus, discovers multi-granularity field patterns (document-level, semantic, and detailed/content fields), and formulates candidate annotation schemas. An explicit optimization criterion balances schema coverage, discriminativity, annotator consistency, and query matching, with constraints on schema complexity and annotation workload.
- Annotator: Utilizes a schema-driven information extraction (IE) model (specifically, GliNER2) to materialize structured annotations for each document. “Fast” fields (for filtering) are stored in relational tables, while semantic and detail fields are managed in a document (JSON) store.
- Structured Semantic Retrieval (SSR) Engine: Translates natural language queries to structured predicates constrained by the induced schema. Combines index-based filtering (on “fast” and semantic fields) and a deterministic EXTRACT operator (regex or text-scan) to satisfy any residual conditions, returning precise sets of document IDs and extracted attribute tuples without further LLM usage on the critical path.
The stepwise pipeline is as follows:
| Step | Description | Output |
|---|---|---|
| 1. Input Corpus | Text document collection 𝒟 = {d₁,…,d_N} | Raw document dataset |
| 2. SchemaBoot | Cluster to 𝒞, discover patterns, optimize schema S* | Optimal schema |
| 3. Annotator | IE model (GliNER2), per-field extraction, annotation tables/stores | Annotated store 𝒜 |
| 4. SSR Engine | NL query → structured predicate, indexed filter/EXTRACT/SQL plan | Structured retrieval |
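The three-stage pipeline above can be sketched end-to-end in a few lines. This is a minimal illustration, not the paper's implementation: SchemaBoot and the GliNER2-based annotator are stubbed with trivial stand-ins, and all field names and documents are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Schema:
    fast_fields: list      # indexed, used for filtering
    detail_fields: list    # stored in a document (JSON) store

def schema_boot(corpus):
    """Stand-in for SchemaBoot: returns a fixed illustrative schema."""
    return Schema(fast_fields=["doc_type", "year"], detail_fields=["summary"])

def annotate(doc, schema):
    """Stand-in for the GliNER2-based annotator: one attribute-value
    dict per document, keyed by the schema's fields."""
    return {f: doc.get(f) for f in schema.fast_fields + schema.detail_fields}

def ssr_query(store, predicate):
    """Index-style filtering over annotations; no LLM on the query path."""
    return [doc_id for doc_id, ann in store.items() if predicate(ann)]

corpus = {
    "d1": {"doc_type": "contract", "year": 2021, "summary": "lease terms"},
    "d2": {"doc_type": "memo",     "year": 2023, "summary": "policy note"},
}
schema = schema_boot(corpus)
store = {doc_id: annotate(d, schema) for doc_id, d in corpus.items()}
hits = ssr_query(store, lambda a: a["doc_type"] == "contract")
```

The key structural point is that annotation happens once, up front, and every subsequent query touches only the structured store.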
3. Schema Induction via SchemaBoot
SchemaBoot operates by multi-granularity pattern discovery and constraint-based multi-objective optimization, addressing schema quality in four dimensions:
- Coverage (Cov): Fraction of documents for which all schema fields are populated.
- Discriminativity (Disc): Average information gain for fields among document clusters.
- Consistency (Cons): Annotator agreement, quantified by Fleiss’ κ.
- Query Match (Match): Embedding similarity between historical query fragments and field semantics.
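<!-- not used -->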
The overall schema quality function is a weighted combination of these four criteria,

  Q(S) = α·Cov(S) + β·Disc(S) + γ·Cons(S) + δ·Match(S),

with constraints on schema tree depth, branching factor, annotator time per document, and storage overhead.
Optimization is performed with NSGA-II to approximate the Pareto front, which is then scalarized (primarily on the overall quality score Q(S)) to select S*. This process obviates the need for manual schema engineering, instead automatically balancing annotation efficiency, schema informativeness, and downstream retrieval quality (Lin et al., 3 Apr 2026).
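The scalarized selection step can be illustrated with toy numbers. The metric values and weights below are invented for the sketch; the actual system searches candidate schemas with NSGA-II rather than scoring a fixed list.

```python
# Illustrative scalarization of the four SchemaBoot criteria
# (coverage, discriminativity, consistency, query match), all in [0, 1].
WEIGHTS = {"cov": 0.4, "disc": 0.3, "cons": 0.2, "match": 0.1}

candidates = {
    "S1": {"cov": 0.95, "disc": 0.60, "cons": 0.80, "match": 0.70},
    "S2": {"cov": 0.85, "disc": 0.75, "cons": 0.85, "match": 0.65},
}

def quality(metrics):
    """Q(S) = weighted sum of the four criteria."""
    return sum(WEIGHTS[k] * metrics[k] for k in WEIGHTS)

best = max(candidates, key=lambda s: quality(candidates[s]))
```

Note how the weighted sum can prefer a schema with slightly lower coverage when it is more discriminative and consistent.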
4. Structured Semantic Retrieval (SSR) Engine
The SSR engine defines a formal retrieval function over (query, annotation store, schema),

  ℛ(q, 𝒜, S) = {(d, a_d) : d ∈ 𝒟, Φ_q(d) holds},

where the predicate set Φ_q, derived from q under the schema S, encompasses both direct, schema-indexed predicates and on-demand text extraction.
Key components:
- Semantic Parsing to Structured Query: NL queries are mapped to field-level constraints over the induced schema's fields.
- EXTRACT Operator: For conditions not indexable—either virtual/derived fields or residual text spans—EXTRACT applies regex/text-scan over raw text without invoking LLMs.
- Progressive Query Plan: Preliminary indexed filtering narrows the candidate document set, followed by EXTRACT-driven refinement and attribute–value projection.
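The progressive plan can be sketched as follows: an indexed filter on a "fast" field first, then a deterministic regex EXTRACT over raw text for the residual condition. Documents, field names, and the query are invented for this illustration.

```python
import re

DOCS = {
    "d1": "Lease signed 2021-03-02. Monthly rent: $1,800.",
    "d2": "Memo drafted 2023-07-11. No payment terms.",
}
# Inverted "fast" field index, as would be materialized by the annotator.
FAST_INDEX = {"doc_type": {"d1": "contract", "d2": "memo"}}

def indexed_filter(field, value):
    """Stage 1: narrow candidates via the fast-field index."""
    return {d for d, v in FAST_INDEX[field].items() if v == value}

def extract(doc_ids, pattern):
    """Stage 2: deterministic EXTRACT operator, a regex scan over raw
    text that returns (doc_id, captured span) tuples. No LLM calls."""
    rx = re.compile(pattern)
    out = []
    for d in sorted(doc_ids):
        m = rx.search(DOCS[d])
        if m:
            out.append((d, m.group(1)))
    return out

# "What rents appear in contracts?" → filter, then extract the amount.
candidates = indexed_filter("doc_type", "contract")
result = extract(candidates, r"rent:\s*\$([\d,]+)")
```

Because stage 1 already discards non-matching documents, the text scan in stage 2 runs over a small candidate set rather than the whole corpus.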
SSR delivers low-latency retrieval: per-query cost is dominated by index lookups and lightweight scans, several orders of magnitude cheaper than dense vector search or typical knowledge graph traversal (Lin et al., 3 Apr 2026).
5. Empirical Evaluation and Comparative Analysis
Experiments span legal (LCR), multi-domain (WikiText), and web-page (SWDE) datasets, each characterized by complex, noisy, or heterogeneous schema structure. Evaluation covers a spectrum of query types: attribute filters, joins, and SQL-style reasoning.
| Method | F1 | LLM Cost (K tokens) | Latency (s) |
|---|---|---|---|
| VectorDB | 0.41 | 15.2 | 1.8 |
| Graph RAG | 0.58 | 142.5+8.7 | 12.4 |
| ZenDB | 0.65 | 45.6 | 6.1 |
| Palimpsest | 0.70 | 38.9 | 5.3 |
| QUEST | 0.78 | 22.1 | 4.0 |
| ClosedIE | 0.49 | N/A | 0.5 |
| GPT-4 | 0.72 | 312.8 | 24.7 |
| AnnoRetrieve | 0.87 | 29.4 | 3.2 |
AnnoRetrieve achieves the highest F1 (0.87) of all compared methods, with significantly lower LLM token consumption and query latency. Scalability studies indicate sub-linear query-time growth and robust performance on complex join queries. SchemaBoot-derived schemas surpass LLM-generated ones in both annotation F1 and downstream retrieval effectiveness (Lin et al., 3 Apr 2026).
Ablation studies confirm the necessity of both SchemaBoot and SSR: manual schemas or exclusive reliance on vector retrieval markedly degrade F1 and efficiency.
6. The Retrieve-and-Verify Paradigm for Structure Annotation
For column annotation (CTA/CPA) in semi-structured data such as tables, AnnoRetrieve draws on the “Retrieve-and-Verify” approach (REVEAL and REVEAL+) (Ding et al., 24 Aug 2025):
- REVEAL: Utilizes maximal marginal relevance (MMR) to select a K-size, semantically diverse context set for each target column, leveraging pretrained semantic embeddings.
- Context-Aware Encoding: BERT-style role embeddings distinguish target and context columns. Models are trained on (target, context) pairs with softmax prediction heads.
- REVEAL+: A verification model (MLP atop same encoder) refines the context set, formulating a classification task over subsets to maximize downstream annotation accuracy. Top-down subset inference limits computational overhead.
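The MMR selection step in REVEAL can be sketched with toy embeddings. The 2-D vectors and column names below are fabricated for the example; real systems use pretrained semantic embeddings and larger K.

```python
import math

def cos(a, b):
    """Cosine similarity of two vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return num / den

def mmr_select(target, columns, k, lam=0.5):
    """Greedy maximal marginal relevance: each step picks the column
    that balances relevance to the target against similarity to the
    columns already selected (lam trades the two off)."""
    selected = []
    remaining = dict(columns)
    while remaining and len(selected) < k:
        def score(name):
            rel = cos(columns[name], target)
            red = max((cos(columns[name], columns[s]) for s in selected),
                      default=0.0)
            return lam * rel - (1 - lam) * red
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected

target = (1.0, 0.0)
columns = {"c1": (0.9, 0.1), "c2": (0.92, 0.05), "c3": (0.1, 0.9)}
picked = mmr_select(target, columns, k=2)
```

After the most relevant column (`c2`) is taken, the redundancy penalty decides between a near-duplicate (`c1`) and a diverse but weakly relevant column (`c3`); with lam=0.5 the near-duplicate still narrowly wins here, showing how lam controls that trade-off.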
On six real-world table annotation benchmarks, REVEAL+ achieves state-of-the-art accuracy and stability, even as table width increases—a regime where prior baselines collapse in performance (Ding et al., 24 Aug 2025).
7. Discussion, Limitations, and Future Directions
AnnoRetrieve establishes a paradigm shift away from embedding-first and graph-centric practices towards structure-first annotation and schema-driven retrieval. This yields substantial efficiency and accuracy gains by combining optimized schema construction with hybrid retrieval plans that confine LLM usage to the one-time annotation phase.
Current limitations include a focus on static, text-centric corpora; extension to multi-modal (image, table, document) domains and dynamic, incrementally evolving corpora remains open. Further, integrating adaptive extraction (e.g., lightweight learned EXTRACT modules) and scaling distributed indexing/execution frameworks are prospective research directions (Lin et al., 3 Apr 2026).
In the context of annotated semantic web data, AnnoRetrieve’s principles resonate with formal frameworks for annotated RDFS and compound annotation domains, as well as SPARQL-style query systems with annotation-aware semantics (Zimmermann et al., 2011). The unification of automated schema induction, annotation-driven reasoning, and structured query execution in AnnoRetrieve represents a significant contribution to cost-effective, precise, and scalable document analysis.