
Curated RAG System Architecture

Updated 30 November 2025
  • A curated RAG system is a knowledge-centric architecture that integrates expert-curated documents, dynamic retrieval, and structured prompt engineering to deliver explainable LLM outputs.
  • It employs modular components, including a curated corpus, a retriever, and an orchestrator, to supply high-quality, updatable context for tasks such as policy enforcement and biomedical QA.
  • Because knowledge updates require only changes to the vector store rather than model retraining, the system remains scalable, transparent, and domain-adaptable, improving reliability in high-stakes environments.

A curated Retrieval-Augmented Generation (RAG) system is a knowledge-centric architecture that grounds the generation and/or classification capabilities of LLMs in a high-quality, explicitly maintained corpus of domain-specific, policy-vetted, or expert-authored documents. The approach shifts the paradigm from static, parameter-driven inference toward dynamic retrieval and contextual reasoning, enabling rigorous, updatable, and explainable outcomes across content moderation, biomedical QA, regulated finance, and analogous high-stakes classification or synthesis tasks (Willats et al., 8 Aug 2025).

1. Architectural Foundations

A curated RAG system comprises modular components, each designed to ensure that retrieved knowledge is both relevant and authoritative for the target domain.

  • Curated Knowledge Store: The corpus is composed of long-form, domain-specific documents segmented into overlapping chunks (e.g., 200–500 tokens, 50-token overlap) (Willats et al., 8 Aug 2025). Each chunk is uniquely identified and enriched with metadata (document ID, offsets, category tags). Curation—either manual or semi-automated—ensures quality, topicality, and completeness; updates require only manipulation of the vector store rather than model retraining (Willats et al., 8 Aug 2025).
  • Retriever: User input is transformed into a dense embedding $\vec{q}$, and each document chunk $d_i$ is similarly embedded into $\vec{d}_i$. Retrieval relies on cosine similarity:

s(i) = \frac{\vec{q} \cdot \vec{d}_i}{\|\vec{q}\| \, \|\vec{d}_i\|}

Top-$K$ candidates are optionally reranked using a trained cross-encoder or instruction-tuned model to further refine contextual relevance (Willats et al., 8 Aug 2025); a minimal sketch of this scoring step appears at the end of this section.

  • Orchestrator and Generator: The orchestrator assembles a composite prompt comprising system directives, the user query, and the retrieved passages. A grounded LLM (e.g., a preference-optimized Llama-3.3) is instructed to adhere to, and explicitly reference, the retrieved policy or domain fragments (Willats et al., 8 Aug 2025).
  • Classifier Wrapper (where applicable): For structured outputs (e.g., classification), the generator’s output is post-processed into discrete fields (label, category, target identity, rationale). Confidence thresholds or log-probability calibration can be enforced to guarantee certainty levels (Willats et al., 8 Aug 2025).

This explicit separation of retrieval, reasoning, and curation facilitates rapid updates, traceable outputs, and robust alignment with evolving domain requirements.
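
To make the retrieval step concrete, the following is a minimal Python sketch of the cosine-similarity scoring described above, assuming a sentence-transformers-style bi-encoder; the model name, function name, and default K are illustrative assumptions rather than details from the cited paper.

    # Minimal sketch of dense retrieval: embed query and chunks, score by
    # cosine similarity, return the top-K chunk indices. The encoder choice
    # is an illustrative assumption, not the paper's configuration.
    import numpy as np
    from sentence_transformers import SentenceTransformer

    encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed bi-encoder

    def top_k_chunks(query: str, chunk_texts: list[str], k: int = 20) -> list[int]:
        q = encoder.encode(query, normalize_embeddings=True)        # unit-norm query vector
        d = encoder.encode(chunk_texts, normalize_embeddings=True)  # unit-norm chunk matrix
        scores = d @ q                                              # cosine similarity s(i)
        return np.argsort(-scores)[:k].tolist()

Because the vectors are unit-normalized, the dot product equals the cosine similarity s(i) given above.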

2. Document and Policy Curation

Content curation in a curated RAG system hinges on source selection, document segmentation, and ongoing policy maintenance.

  • Initial Corpus Construction: Authoritative documents—such as regulatory policies, clinical practice guidelines, or peer-reviewed articles—are authored, edited, and vetted for domain alignment (Willats et al., 8 Aug 2025). These documents capture definitional boundaries, edge cases, and exemplars for both compliant and non-compliant cases.
  • Chunking and Embedding: Source texts are segmented into semantically coherent, overlapping spans to maximize retrieval granularity and minimize context fragmentation. Each chunk is embedded and indexed for efficient similarity search (Willats et al., 8 Aug 2025).
  • Metadata Management: Metadata assignment enables fine-grained tagging, supporting filtered retrieval by category, identity group, topical area, or recency.
  • Dynamic Updates: Policy or knowledge updates are applied at the chunk or document level. No retraining of the LLM or retriever is necessary—a key distinction of the curated RAG paradigm (Willats et al., 8 Aug 2025).

This curation discipline ensures the system’s knowledge base remains current, transparent, and actionable without incurring model redevelopment costs.
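
As a concrete illustration of the chunking step above, here is a minimal sketch that produces overlapping, metadata-enriched chunks. It uses whitespace "tokens" for simplicity (the source does not specify a tokenizer), and the helper name and default window of 400 tokens with a 50-token overlap are illustrative choices within the paper's stated ranges.

    # Sketch of overlapping chunking with metadata. Each chunk carries a
    # unique ID, source offsets, and category tags, as described above.
    def chunk_document(doc_id: str, text: str, size: int = 400, overlap: int = 50):
        tokens = text.split()
        chunks, start = [], 0
        while start < len(tokens):
            end = min(start + size, len(tokens))
            chunks.append({
                "chunk_id": f"{doc_id}:{start}",   # unique identifier
                "doc_id": doc_id,
                "offsets": (start, end),           # token offsets into the source
                "text": " ".join(tokens[start:end]),
                "tags": [],                        # category tags assigned during curation
            })
            if end == len(tokens):
                break
            start = end - overlap                  # 50-token overlap between chunks
        return chunks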

3. Retrieval, Reranking, and Context Assembly

Effective retrieval and context assembly are critical in aligning LLM output with domain policies.

  • Query Formation: The input $x$ is mapped to a retrieval query $q$ via a strict instruction template:

    "Given the text: ‘<x>’, retrieve the most relevant [domain policy] passages that would help classify whether this text violates policy."
    (Willats et al., 8 Aug 2025)
  • Initial Retrieval: Chunks are scored via cosine similarity and the top-$K$ retrieved. Typical pre-rerank values are $K = 20$ for recall and $M \approx 3$–$5$ for the final context (Willats et al., 8 Aug 2025).
  • Reranking: A cross-encoder or instruction-tuned model assigns $r(i) = P(\text{relevant} \mid q, d_i)$. The final set $R = \{d_j\}_{j=1}^{M}$ is determined by sorting on $r(i)$ (Willats et al., 8 Aug 2025).
  • Weighted Fusion: Top chunks can be weighted for priority within the prompt (see the sketch at the end of this section). For instance,

w_j = \mathrm{softmax}(\tau \cdot s(j))

with hyperparameter $\tau$ controlling the temperature; the highest-weighted chunk may be labelled “High priority” in the prompt (Willats et al., 8 Aug 2025).

This two-stage retrieval and controlled fusion maximize relevance and reduce distractor effects, supporting both model explainability and performance.
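
A minimal sketch of the rerank-and-fuse stage described above, assuming a sentence-transformers CrossEncoder as the reranker; the model name, function signature, and default tau are illustrative assumptions.

    # Sketch of two-stage retrieval: cross-encoder reranking r(i), then
    # softmax weighting w_j = softmax(tau * s(j)) over first-stage scores.
    import numpy as np
    from sentence_transformers import CrossEncoder

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # assumed reranker

    def rerank_and_weight(query: str, candidates: list[tuple[str, float]],
                          m: int = 5, tau: float = 2.0):
        """candidates: (chunk_text, cosine_score) pairs from initial retrieval."""
        r = reranker.predict([(query, text) for text, _ in candidates])  # r(i)
        keep = np.argsort(-r)[:m]                        # final set R, sorted by r(i)
        s = np.array([candidates[i][1] for i in keep])   # first-stage scores s(j)
        w = np.exp(tau * s) / np.exp(tau * s).sum()      # w_j = softmax(tau * s(j))
        return [(candidates[i][0], float(w_j)) for i, w_j in zip(keep, w)]

The highest-weighted chunk returned here can then be flagged as “High priority” when the prompt is assembled.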

4. Prompt Engineering and Output Structures

Curated RAG systems use meticulously crafted prompts to enforce domain-specific reasoning and answer completeness.

  • Prompt Templates: A system prompt enforces structured, policy-aware response formats:

    {
      "policy_chunks": [CHUNK₁,…,CHUNK_M],
      "user_content": "<x>",
      "instructions": "Based on the policy_chunks, classify the user_content as Within Policy or Out of Policy, specify the policy category (...), the target identity, and provide a short explanation grounded in the policy."
    }
    (Willats et al., 8 Aug 2025)
  • Chunk Interleaving: Retrieved chunks are introduced as a discrete section, preserving ordering and source integrity within the prompt.
  • Structured Output: The generator is required to emit JSON containing key fields (label, category, target, and policy-grounded explanation). This structure supports post-processing, auditability, and downstream system integration (Willats et al., 8 Aug 2025).
  • Confidence Handling: Downstream thresholding on label probabilities or log-likelihood can reject uncertain outputs, further increasing robustness.

Such prompt and output discipline enforces adherence to explicit policies, supports regulatory traceability, and ensures outputs are actionable and interpretable.
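
To illustrate the structured-output and confidence-handling steps above, the following is a minimal parsing sketch; the field names follow the output structure described in this section, while the confidence field and the 0.7 threshold are illustrative assumptions.

    # Sketch of post-processing: validate the generator's JSON output and
    # reject malformed, incomplete, or low-confidence decisions.
    import json

    REQUIRED_FIELDS = {"label", "category", "target", "explanation"}

    def parse_decision(raw_output: str, min_confidence: float = 0.7):
        try:
            decision = json.loads(raw_output)
        except json.JSONDecodeError:
            return None                                  # malformed output rejected
        if not REQUIRED_FIELDS.issubset(decision):
            return None                                  # incomplete output rejected
        if decision.get("confidence", 1.0) < min_confidence:
            return None                                  # uncertain output deferred
        return decision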

5. Adaptability, Explainability, and Maintenance

Curated RAG systems significantly enhance operational flexibility and model transparency compared to traditional, end-to-end supervised classifiers.

  • Dynamic Policy/Knowledge Adaptation: Modification or refinement of domain policies or permissible categories is achieved by updating the associated vector database. The LLM and retriever weights remain frozen (Willats et al., 8 Aug 2025).
  • Fine-grained Control: The system enables policy changes at both macro (entire document) and micro (chunk, identity group, policy edge case) levels without cascading effects on existing knowledge structures or necessitating revalidation of the LLM (Willats et al., 8 Aug 2025).
  • Intrinsic Explainability: Each decision is natively accompanied by the precise policy segment(s) used for grounding, as enforced by prompt and structured outputs, supporting both operational audit and model debugging (Willats et al., 8 Aug 2025).
  • Performance and Consistency: Empirically, curated RAG architectures deliver classification accuracy on par with commercial systems, including in granular, challenging settings such as differentiating protection levels for specific identity groups (Willats et al., 8 Aug 2025).

These properties establish curated RAG systems as uniquely suitable for high-stakes, regulated, or dynamic domains where trust, transparency, and change-management are paramount.
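
As a sketch of the update path, the snippet below swaps a single chunk's embedding in a FAISS index by ID while the encoder and LLM stay frozen; the index construction, dimensionality, and ID scheme are illustrative assumptions.

    # Sketch of a chunk-level knowledge update: remove the stale vector and
    # insert the re-embedded one. No model weights change.
    import faiss
    import numpy as np

    dim = 384                                            # assumed embedding size
    index = faiss.IndexIDMap(faiss.IndexFlatIP(dim))     # inner product = cosine on unit vectors

    def update_chunk(chunk_id: int, new_vector: np.ndarray) -> None:
        ids = np.array([chunk_id], dtype=np.int64)
        index.remove_ids(ids)                            # drop the outdated embedding
        index.add_with_ids(new_vector.reshape(1, dim).astype(np.float32), ids)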

6. Exemplar: Contextual Policy Engine (CPE)

The Contextual Policy Engine (CPE) (Willats et al., 8 Aug 2025) exemplifies a state-of-the-art curated RAG system applied to hate speech policy enforcement.

| Component | Implementation Detail |
| --- | --- |
| Corpus | Authored policy documents, chunked & embedded in FAISS vector store |
| Retriever | Fine-tuned encoder w/ cosine similarity + learned cross-encoder |
| Prompt Template | Structured JSON (policy_chunks, user_content, instructions) |
| Output Structure | JSON: label, policy category, target group, explanation |
| Policy Updates | Operate exclusively via changes to vector store content |
| Explainability | Each label justified with retrieved policy segments |

The CPE demonstrates robust baseline performance, explainable outputs, and fine-grained policy control, establishing the curated RAG paradigm as an advance in adaptable, reliable content moderation (Willats et al., 8 Aug 2025).


References

Willats et al., 8 Aug 2025.
