SINR Framework in RAG Systems
- SINR is a dual-layer architecture that separates fine-grained semantic search from coarse-grained retrieval to optimize both precision and context assembly in language models.
- The framework employs 150-token search chunks and 600–1,000-token retrieve chunks with a deterministic parent mapping for efficient and scalable information retrieval.
- Empirical results show improved recall and coherence as well as reduced query latency, demonstrating SINR’s practical advantages for modern RAG implementations.
The Search-Is-Not-Retrieve (SINR) framework is a dual-layer architecture for information retrieval within modern Retrieval-Augmented Generation (RAG) systems. SINR rigorously distinguishes between locating relevant information (“search”) and assembling contiguous, contextually sufficient evidence (“retrieval”). This separation lets retrieval architectures optimize both precision and contextuality when constructing LLM prompts, enhancing composability, scalability, and context fidelity without incurring additional processing costs (Nainwani et al., 7 Nov 2025).
1. Conceptual Foundations
Traditional dense-vector retrieval systems typically employ a monolithic chunking of source documents, embedding either entire documents or fixed-size windows as single vectors into a common space. This conflates two distinct tasks:
- Semantic search: pinpointing minimal spans of text closest to a query’s intent (high granularity, e.g., 100–200 tokens/chunk);
- Contextual retrieval: supplying contiguous text necessary for the downstream model’s reasoning (coarse granularity, e.g., 500–1,000 tokens/chunk).
Short-window chunking provides precision at the expense of context continuity, whereas long windows preserve context but dilute the search’s localization power. The SINR framework’s foundational insight is that these two objectives, finding and assembling, are optimally achieved at different representational granularities. Fine-grained “search chunks” enable accurate semantic matching, while coarse-grained “retrieve chunks” provide the narrative or topical completeness essential for model reasoning. By analogy with human reading, SINR mirrors the process of first skimming to locate a salient fact and then inspecting the surrounding text for full context.
2. Formal Structure
Let $S = \{s_1, \dots, s_n\}$ denote the set of fine-grained search chunks (∼150 tokens each), and let $R = \{r_1, \dots, r_m\}$ denote the set of coarse-grained retrieve chunks (∼600–1,000 tokens each), with $m < n$. The framework defines:
- A search embedding function $f: S \to \mathbb{R}^d$, mapping each search chunk $s_i$ to its vector representation $v_i = f(s_i)$;
- A query embedding $q \in \mathbb{R}^d$ in the same space;
- A deterministic parent map $p: S \to R$ such that every $s_i \in S$ belongs to a unique $r_j = p(s_i) \in R$.
Retrieval unfolds in two steps:
- Semantic Search: Select the top-$k$ search chunks most similar to the query, $S_q = \operatorname{top\text{-}k}_{\,s_i \in S}\; \mathrm{sim}(q, f(s_i))$.
- Context Assembly: For each $s_i \in S_q$, fetch its parent retrieve chunk, yielding $R_q = \{\, p(s_i) : s_i \in S_q \,\}$.
The final context presented to the LLM is the union of the chunks in $R_q$, usually ordered by document position or by descending search relevance.
This yields a clean decoupling of optimization priorities: minimize the distance between $q$ and $f(s_i)$ in embedding space for semantic match, and maximize a contextuality measure $C(r_j)$ for chunk sufficiency (e.g., length, syntactic cohesion, topic coverage).
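A minimal sketch of these two steps in NumPy, assuming cosine similarity for $\mathrm{sim}(\cdot,\cdot)$ and in-memory arrays; the function and variable names are illustrative rather than taken from the paper:

```python
import numpy as np

def sinr_two_step(q_vec, search_vecs, parent_of, retrieve_chunks, k=5):
    """Two-step SINR retrieval: semantic search over S, context assembly from R.

    q_vec:           (d,) query embedding
    search_vecs:     (n, d) embeddings of the fine-grained search chunks
    parent_of:       sequence mapping search-chunk index i -> retrieve-chunk index j
    retrieve_chunks: list of coarse-grained retrieve-chunk texts
    """
    # Step 1 (semantic search): cosine similarity against every search chunk,
    # then keep the indices of the top-k matches.
    sims = search_vecs @ q_vec / (
        np.linalg.norm(search_vecs, axis=1) * np.linalg.norm(q_vec) + 1e-12
    )
    top_k = np.argsort(-sims)[:k]

    # Step 2 (context assembly): map each hit to its unique parent retrieve chunk,
    # deduplicating while preserving descending search relevance.
    seen, context = set(), []
    for i in top_k:
        j = parent_of[i]
        if j not in seen:
            seen.add(j)
            context.append(retrieve_chunks[j])
    return context
```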
3. Architecture and Query Pipeline
The SINR query pipeline is explicitly dual-layer:
- Query Embedding: Map the incoming query to its vector $q = f(\text{query})$.
- ANN Search: Execute approximate nearest-neighbor search over the search-chunk embeddings $\{f(s_i)\}$ (e.g., via FAISS-HNSW or Milvus) to surface the top-$k$ chunks $s_i$.
- Parent Lookup: For each $s_i$, obtain its retrieve chunk $r_j = p(s_i)$ via a fast id-based mapping.
- Context Assembly: Deduplicate and concatenate the resulting retrieve chunks $r_j$.
Pseudocode for the retrieval pipeline:
```python
def SINR_Retrieve(query, index_S, map_parent, R, k):
    q_vec = f_embed(query)
    S_hits = ANN_Search(index_S, q_vec, top_k=k)
    parent_ids = {map_parent[s.id] for s in S_hits}
    R_hits = unique(R[j] for j in parent_ids)
    return concatenate(R_hits)
```
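Beyond the pseudocode, a hedged end-to-end sketch using FAISS-HNSW for the ANN layer is shown below; the toy corpus, the placeholder embed() function, and the parameter choices are assumptions for illustration, and Milvus or another vector store could be substituted:

```python
import numpy as np
import faiss  # assumed dependency for the ANN index

d = 768  # embedding dimensionality (illustrative)

def embed(texts):
    """Placeholder embedder; a real system would call a sentence-embedding model.
    Returns L2-normalized float32 vectors, so L2 distance ranks like cosine similarity."""
    vecs = np.random.default_rng(0).standard_normal((len(texts), d)).astype("float32")
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

# Toy corpus: each fine-grained search chunk points to its parent retrieve chunk.
retrieve_store = {0: "…full warranty policy section…", 1: "…full returns section…"}
search_chunks = [("warranty claims for product X", 0), ("returns accepted within 30 days", 1)]

index_S = faiss.IndexHNSWFlat(d, 32)                    # HNSW graph over search-chunk vectors only
index_S.add(embed([text for text, _ in search_chunks]))
map_parent = {i: parent for i, (_, parent) in enumerate(search_chunks)}

def sinr_retrieve(query, k=2):
    _, ids = index_S.search(embed([query]), k)           # step 1: ANN search over S
    parent_ids = {map_parent[int(i)] for i in ids[0] if i != -1}
    return [retrieve_store[j] for j in parent_ids]       # step 2: fetch parent retrieve chunks
```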
4. Scalability and Complexity
For a corpus partitioned into $n$ search chunks and $m$ retrieve chunks, SINR exhibits the following scaling behavior:
- Per-query cost: approximately $O(\log n + k)$, where $k$ is the number of search chunks initially retrieved.
- Storage: $O(n \cdot d)$ for the search embedding index, $O(n)$ for the chunk–parent mapping, and $O(m)$ for textual storage; mapping overhead is ≤2% of embedding storage at billion-scale.
- Update efficiency: Changes to one document require only re-embedding its local search chunks (~100 ms) and updating parent map pointers (~1 ms); no corpus-wide re-indexing is needed.
- End-to-end latency: Empirically 50–100 ms on 10 million chunk corpora for embedding, search, parent mapping, fetch, and assembly.
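As a back-of-envelope illustration of these storage terms, assuming 10 million search chunks, 768-dimensional float32 vectors, and a 4-byte parent pointer per chunk (all figures are assumptions for illustration):

```python
n = 10_000_000          # search chunks
d = 768                 # embedding dimensionality (assumption)
bytes_per_float = 4     # float32

embedding_bytes = n * d * bytes_per_float   # O(n·d) raw vectors ≈ 30.7 GB
parent_map_bytes = n * 4                    # O(n) id pointers   ≈ 0.04 GB
print(embedding_bytes / 1e9, parent_map_bytes / 1e9)
# The mapping adds roughly 0.1% on top of the embedding index, well under the 2% bound cited above.
```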
5. Practical Implementation Guidelines
Chunking Strategy:
- Retrieve chunks align with document logical boundaries (e.g., paragraphs or sections).
- Search chunks are generated using a sliding window of width ∼150 tokens and stride ∼100 tokens (≈33% overlap).
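A minimal sketch of this two-level chunking, assuming whitespace tokenization and one retrieve chunk per logical section; a production system would use the embedding model’s tokenizer and enforce the 600–1,000-token retrieve-chunk range:

```python
def make_search_chunks(tokens, width=150, stride=100):
    """Slide a fixed window over one retrieve chunk's tokens (~33% overlap)."""
    chunks = []
    for start in range(0, max(len(tokens), 1), stride):
        chunks.append(tokens[start:start + width])
        if start + width >= len(tokens):
            break
    return chunks

def chunk_document(sections):
    """Retrieve chunks follow logical boundaries; each spawns overlapping search chunks."""
    retrieve_chunks, search_chunks, parent_map = [], [], {}
    for section in sections:                  # one retrieve chunk per section/paragraph group
        parent_id = len(retrieve_chunks)
        retrieve_chunks.append(section)
        for window in make_search_chunks(section.split()):
            parent_map[len(search_chunks)] = parent_id   # deterministic parent pointer
            search_chunks.append(" ".join(window))
    return retrieve_chunks, search_chunks, parent_map
```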
Embedding and Indexing:
- Search vectors of dimensionality 768 or 1,024 are stored in fast vector indices (FAISS-HNSW, Milvus).
Parent Mapping:
- Implemented as a hash table realizing the parent map $p: s_i \mapsto r_j$, often resident in Redis or kept as metadata inline with the vector index.
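A sketch of the parent map served from a Redis hash (the key name and helper functions are illustrative; an in-process dict suffices for single-node deployments):

```python
import redis  # assumed optional dependency for a shared, distributed parent map

def store_parent_map(client: redis.Redis, parent_map: dict) -> None:
    # One hash mapping search_chunk_id -> retrieve_chunk_id; key name is illustrative.
    client.hset("sinr:parent_map", mapping={str(s): str(r) for s, r in parent_map.items()})

def lookup_parent(client: redis.Redis, search_chunk_id: int) -> int:
    # O(1) id-based lookup inside the query pipeline.
    return int(client.hget("sinr:parent_map", str(search_chunk_id)))
```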
Serving Pipeline:
- The operational flow is: embed → search → map → fetch text → assemble → prompt LLM.
SINR is directly compatible with established RAG stacks such as LangChain and LlamaIndex. Integration amounts to replacing the existing chunking and retrieval modules with their two-layer SINR equivalents.
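In LangChain, the closest built-in analogue to this pattern is the ParentDocumentRetriever, which pairs a child (search) splitter with a parent (retrieve) splitter; the sketch below is illustrative, import paths and splitter settings vary across LangChain versions, and the vectorstore object is assumed to exist:

```python
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Fine-grained "search" chunks vs. coarse-grained "retrieve" chunks, sized roughly
# to the SINR recommendations (character counts here only approximate token counts).
child_splitter = RecursiveCharacterTextSplitter(chunk_size=600)     # ~150 tokens
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=3200)   # ~800 tokens

retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,     # any LangChain vector store over the child chunks (assumed)
    docstore=InMemoryStore(),    # holds the parent retrieve chunks
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
```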
6. Empirical Outcomes
In comparative studies with conventional RAG architectures that use uniform 500-token chunks, SINR demonstrates the following qualitative and quantitative improvements:
| Metric | Improvement with SINR |
|---|---|
| recall@20 | +15–25% (semantic precision) |
| Measured coherence | +30% (contextual completeness) |
| Average context size | 2.5K → 8K tokens (richer reasoning) |
| Index storage | −40–60% (smaller chunk count) |
| Query latency | −20–30% |
Human evaluation indicates fewer fragmented answers, enhanced justification chains, and smoother narrative flow in LLM outputs.
7. Integration within Retrieval-Augmented Generation
SINR integrates into RAG stacks by decoupling “where to look” (search) from “what to read” (retrieve and assemble). The modular interface is:
User Query → Query Embedder → SINR Search Index → Parent Mapping → SINR Retrieve Store → Context Assembly → LLM Prompt → Model Answer
For example, answering a question such as “How are warranty claims handled for product X?” involves precisely locating the small search chunks containing the salient terms, then expanding to the parent policy sections that contain the requisite procedural and jurisdictional information. This yields contextually self-sufficient, citation-traceable answers.
By enforcing this duality, SINR realizes both precision and context without accruing additional system costs, scales linearly with corpus size, allows for instant updates, and supports transparent, inspectable retrieval chains for model debugging and interpretability.