- The paper offers a systematic review of RAG methods, which integrate neural retrieval with generative LMs to enhance factuality and explainability.
- It categorizes retrieval architectures (sparse, dense, hybrid, graph-based, and active) and surveys how they are evaluated with metrics like Recall@k, BLEU, and ROUGE.
- The review identifies scalability, security, and performance tuning as major challenges while outlining a roadmap for adaptive, robust RAG systems.
Systematic Literature Review of Retrieval-Augmented Generation: Techniques, Metrics, and Challenges
Introduction
This systematic review synthesizes the post-2020 research landscape on Retrieval-Augmented Generation (RAG), focusing on highly cited work selected by a PRISMA-compliant protocol. RAG architectures interleave neural information retrieval with generative LMs, aiming to ground model outputs in non-parametric, up-to-date evidence while retaining the semantic generalization embedded in pretraining. The review establishes the dominant trends, innovative architectures, evaluation practices, empirical outcomes, and outstanding challenges in the field. Its findings elucidate the state of RAG and provide a roadmap for future research targeting robust, scalable, and controllable retrieval-augmented systems.
Methods: Systematic Review Protocol
A three-phase PRISMA methodology was instantiated using major scholarly databases and DBLP to identify relevant work published between 2020 and May 2025. Inclusion was citation-filtered (≥30 for pre-2025, ≥15 for 2025) and required explicit attention to retrieval-augmented generation where retrieval was central and output was text. Studies were screened, extracted, and cross-verified, with LLM support informing but not replacing human judgement. Extraction covered data sources, architectures (retriever, chunker, encoder, generator), evaluation metrics, and domain/task coverage.
Canonical RAG: Baseline Architectures and Variants
The foundational baseline is the dual-encoder Dense Passage Retrieval (DPR) retriever paired with a sequence-to-sequence generator, as exemplified by the original RAG pipeline. Variants are systematically characterized by their divergence from this pipeline along modular axes: retrieval, chunking, encoding, generation, and training/triggers.
Retrieval Mechanisms
- Sparse retrieval (BM25): high speed and interpretability; weak semantic matching.
- Dense retrieval (DPR, ANCE, Contriever): strong semantic recall using bi-encoder Transformer models with approximate nearest-neighbor (ANN) search. Practical for open-domain QA, code, and biomedicine.
- Hybrid retrievers: fuse sparse and dense signals via reciprocal rank fusion or learned mixtures; markedly improve recall and robustness across domains (see the fusion sketch after this list).
- Graph-based retrieval: retrieve subgraphs from structured sources (e.g., knowledge graphs, code graphs), supporting multi-hop, compositional, or explainable reasoning.
- Iterative/active retrieval: the LLM triggers new retrievals based on output uncertainty, enabling stepwise refinement and efficient context budget allocation.
- Domain/multimodal retrievers: extend to image, tabular, or code-based retrieval, typically requiring bespoke architectures and vector stores.
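As a concrete illustration of hybrid fusion, the sketch below implements reciprocal rank fusion (RRF) over two ranked lists. It is a minimal, dependency-free example; the retriever outputs and the constant k = 60 (a value commonly used in the RRF literature) are assumptions, not settings prescribed by the reviewed papers.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document's fused score is sum(1 / (k + rank)) over every
    list in which it appears (rank is 1-based).
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ranked outputs from a sparse and a dense retriever.
bm25_ids  = ["d3", "d1", "d7", "d2"]
dense_ids = ["d1", "d5", "d3", "d9"]
print(reciprocal_rank_fusion([bm25_ids, dense_ids]))
# ['d1', 'd3', ...]: documents ranked highly by both retrievers rise to the top.
```

Because RRF operates only on ranks, it needs no score calibration between the sparse and dense systems, which is one reason it is a common default fusion choice.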
Vector Stores and Chunking
- Vector databases (e.g., FAISS, Pinecone, Chroma): critical for large-scale, sub-millisecond retrieval. Resource–accuracy trade-offs remain acute in distributed/cloud and domain-specific settings.
- Chunking: evolves from static (fixed-length, e.g., 100 words/tokens) to semantic/syntactic (sentence, paragraph, section boundary), domain-specific (code, graphs), and dynamic/adaptive schemes. Empirically, semantic or domain-aware chunking improves retrieval accuracy, though at increased pipeline complexity (a combined chunking-and-indexing sketch follows this list).
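The sketch below wires static chunking to a FAISS inner-product index. It assumes a minimal demo setup: random unit vectors stand in for real encoder embeddings, and the eight-word chunk size is chosen only to keep the example short.

```python
import numpy as np
import faiss  # pip install faiss-cpu

def fixed_length_chunks(text, size=8):
    """Static chunking: split text into fixed-size word windows."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

doc = ("Retrieval-augmented generation grounds model outputs in external "
       "evidence retrieved at inference time from an indexed corpus of "
       "documents that is kept up to date independently of the model.")
chunks = fixed_length_chunks(doc)

# Random unit vectors stand in for real encoder embeddings.
dim = 64
rng = np.random.default_rng(0)
vecs = rng.standard_normal((len(chunks), dim)).astype("float32")
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

index = faiss.IndexFlatIP(dim)           # exact inner-product search
index.add(vecs)                          # index one vector per chunk
scores, ids = index.search(vecs[:1], 2)  # top-2 neighbors of the first chunk
print([chunks[i] for i in ids[0]])
```

In production, the exact `IndexFlatIP` would typically be swapped for an ANN index (e.g., HNSW or IVF variants), which is where the resource–accuracy trade-off noted above appears.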
Encoders
- Sparse encoders: e.g., TF-IDF, BM25. Often used to supplement dense retrieval as first-pass filters.
- Dense encoders: Dual-encoder Transformers (BERT, Sentence-BERT, domain variants for code/biomed), API-driven embeddings (e.g., OpenAI Ada), and multimodal/graph-based models (CLIP, GATs). A bi-encoder sketch follows this list.
- Hybrid and multimodal encoders: Fuse or align signals from heterogeneous sources for cross-domain or cross-modality retrieval.
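A minimal bi-encoder sketch using the sentence-transformers library; the checkpoint name all-MiniLM-L6-v2 is one common off-the-shelf choice, an assumption rather than a model singled out by the review.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example checkpoint

passages = [
    "BM25 ranks documents by term frequency and inverse document frequency.",
    "Dense retrievers embed queries and passages into a shared vector space.",
]
query = "How do dense retrieval models represent text?"

# Queries and passages are encoded independently (the dual-encoder
# property), so passage vectors can be precomputed and indexed offline.
p_vecs = model.encode(passages, convert_to_tensor=True)
q_vec = model.encode(query, convert_to_tensor=True)

print(util.cos_sim(q_vec, p_vecs))  # the second passage should score higher
```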
Generation Models
- Encoder–decoder (seq2seq, e.g., T5, BART): facilitate cross-attention for multi-passage fusion, robust in synthesis/explanation tasks.
- Decoder-only (e.g., GPT-3/4, Llama): leverage retrieval via prompt concatenation, adapters, or reflection tokens (see the prompt-building sketch after this list).
- Multimodal decoders: support RAG in vision-language with aligned text/image streams.
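To make the prompt-concatenation pattern concrete, the sketch below assembles retrieved passages ahead of the question with numbered provenance markers. The template wording and the `build_rag_prompt` helper are illustrative assumptions, not a canonical format from the literature.

```python
def build_rag_prompt(question, passages):
    """Concatenate retrieved evidence ahead of the question, with
    numbered markers so the model can cite passages by index."""
    evidence = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the passages below, citing "
        "passage numbers.\n\n"
        f"Passages:\n{evidence}\n\nQuestion: {question}\nAnswer:"
    )

passages = ["Paris is the capital of France.",
            "France is a country in Western Europe."]
print(build_rag_prompt("What is the capital of France?", passages))
```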
Training Paradigms
- Joint end-to-end: maximize log-likelihood over combined retrieval+generation; requires significant compute/memory.
- Two-stage modular: retrievers and generators trained/fine-tuned separately; facilitates stability, at some cost to global optimality.
- Parameter-efficient fine-tuning (PEFT) and few-shot: LoRA, adapters, and in-context learning substantially lower compute requirements (see the LoRA sketch after this list).
- Multi-objective and domain-adaptive: integrated contrastive, self-critical sequence training (SCST), or style-aware losses yield task-specific improvement but complicate tuning.
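A minimal LoRA sketch using the Hugging Face peft library, assuming a tiny GPT-2 demo checkpoint; the target module name `c_attn` is specific to GPT-2 (which fuses the query/key/value projections into one layer) and would differ for other architectures.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Tiny demo checkpoint; any causal LM works with suitable target modules.
model = AutoModelForCausalLM.from_pretrained("sshleifer/tiny-gpt2")

config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling applied to the LoRA update
    target_modules=["c_attn"],  # GPT-2 fuses q/k/v into this projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # only a small fraction is trainable
```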
Innovations Beyond Baseline RAG
Dataflow/Orchestration
- Pre-retrieval: Structure-, token-, and security-aware chunking; metadata enrichment; curated corpus filtering.
- Post-retrieval: Evidence reranking (reciprocal rank fusion), context filtering, adaptive passage selection, and utility-based inclusion to trade token budget against quality (see the selection sketch after this list).
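One simple reading of utility-based inclusion is a greedy selection of reranked passages under a token budget, as sketched below; the whitespace token count and the example scores are stand-ins for a real tokenizer and reranker.

```python
def select_passages(ranked, budget=64):
    """Greedily keep the highest-scoring passages that fit the budget.

    ranked: list of (score, passage) pairs, highest score first.
    """
    chosen, used = [], 0
    for score, passage in ranked:
        cost = len(passage.split())  # crude token estimate
        if used + cost <= budget:
            chosen.append(passage)
            used += cost
    return chosen

ranked = [(0.92, "High-utility evidence passage about the query topic."),
          (0.75, "Second passage with partially relevant context."),
          (0.40, "Marginal passage that may not justify its token cost.")]
print(select_passages(ranked, budget=16))  # keeps only the first two
```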
Prompting and Query Strategies
- Active prompting: Model triggers retrieval from token-level or entropy-based uncertainty signals (FLARE, RIND); supports early rejection or evidence expansion (see the uncertainty sketch after this list).
- Structural and schema-aware prompting: Enforce output regularity, evidence alignment, and provenance using schemas/wrappers.
- Exemplar and in-context augmentation: Dynamically select support questions/answers/examples to improve alignment.
- Deliberate/planned reasoning (ReAct/graph-of-thought): Interleaves reasoning, action, and observation; iteratively retrieves and refines context.
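The sketch below shows the uncertainty-triggered pattern in the spirit of FLARE: a fresh retrieval fires only when next-token entropy crosses a threshold. The threshold value, the `retrieve` callback, and the toy distributions are all assumptions for illustration.

```python
import math

def token_entropy(probs):
    """Shannon entropy of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def maybe_retrieve(next_token_probs, query, retrieve, threshold=1.0):
    """Trigger a fresh retrieval only when the model looks uncertain."""
    if token_entropy(next_token_probs) > threshold:
        return retrieve(query)  # expand evidence before continuing
    return None                 # keep decoding with the current context

fake_retrieve = lambda q: [f"passage about {q}"]
confident = [0.9, 0.05, 0.05]         # peaked: entropy ~ 0.39
uncertain = [0.25, 0.25, 0.25, 0.25]  # flat: entropy ~ 1.39
print(maybe_retrieve(confident, "topic", fake_retrieve))  # None
print(maybe_retrieve(uncertain, "topic", fake_retrieve))  # ['passage about topic']
```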
Hybrid, Structured, and Iterative Retrieval
- Hybrid retrievers: Demonstrably superior recall, especially in medical, finance, and software domains, achieved by combining indices and/or learning fusion weights.
- Graph- and structure-aware RAG: Entity/edge-centric retrieval, graph traversal, or soft projection into model context enhances multi-hop reasoning and explainability but raises new evaluation, update, and memory challenges (see the traversal sketch after this list).
- Iterative/active loops: Model reflects on knowledge gaps during decoding, issues on-demand retrieval, supports multi-turn/document reasoning, and can flag, revisit, or retry failed or uncertain outputs.
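A minimal sketch of multi-hop graph retrieval: starting from entities linked in the query, traverse a toy knowledge graph a bounded number of hops and collect the visited triples as evidence. The graph contents and the hop limit are illustrative assumptions.

```python
from collections import deque

# Toy knowledge graph: entity -> [(relation, neighbor), ...]
graph = {
    "Dune": [("written_by", "Frank Herbert"), ("published_in", "1965")],
    "Frank Herbert": [("born_in", "Tacoma")],
}

def retrieve_subgraph(seed_entities, hops=2):
    """Collect evidence triples reachable within `hops` edges of the seeds."""
    facts, seen = [], set()
    frontier = deque((e, 0) for e in seed_entities)
    while frontier:
        entity, depth = frontier.popleft()
        if entity in seen or depth >= hops:
            continue
        seen.add(entity)
        for relation, neighbor in graph.get(entity, []):
            facts.append((entity, relation, neighbor))
            frontier.append((neighbor, depth + 1))
    return facts

print(retrieve_subgraph(["Dune"]))  # includes the two-hop born_in fact
```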
Memory-Augmented RAG
- Buffers and persistent memory: From short-turn context maintenance to entity-centric/user-specific knowledge graphs with retention/privacy modules. Empirically reduces hallucinations, repetition, and context sprawl (see the buffer sketch after this list).
- Clustered/multi-level memory: Abstracts dynamic user/session/task history for long-horizon coherence or personalized workflows.
- Agent orchestration: Expose multi-capability toolboxes (retrievers, rerankers, memories, calculators, external APIs), with static, dynamic, or RL-based controllers. Empowers adaptive workflows but requires advances in credit assignment and memory/tool governance.
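The sketch below shows a bounded session memory buffer as one minimal instance of retention control; the evict-oldest policy and the turn serialization format are assumptions, not designs taken from the reviewed systems.

```python
from collections import deque

class SessionMemory:
    """Bounded turn buffer: the oldest turns are evicted past max_items."""

    def __init__(self, max_items=4):
        self.turns = deque(maxlen=max_items)

    def add(self, user, assistant):
        self.turns.append((user, assistant))

    def as_context(self):
        """Serialize retained turns for inclusion in the next prompt."""
        return "\n".join(f"User: {u}\nAssistant: {a}" for u, a in self.turns)

memory = SessionMemory(max_items=2)
memory.add("Who wrote Dune?", "Frank Herbert.")
memory.add("When was it published?", "1965.")
memory.add("Any sequels?", "Yes, five more by Herbert himself.")
print(memory.as_context())  # only the two most recent turns survive
```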
Efficiency and Compression
- Extreme context reduction: Techniques like xRAG (single-token projection per document), overlapping compute (PipeRAG), and proactive cache/warm-up (RAGCache) achieve major reductions in GFLOPs and latency while maintaining accuracy (a generic caching sketch follows this list).
- Budget-aware orchestration: Dynamic adaptation of context window, batching, compression, and retrieval cadence to user/system constraints; critical for production deployment.
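In the spirit of proactive caching systems such as RAGCache, the sketch below memoizes retrieval results keyed by query string; this generic LRU wrapper is an assumption for illustration, not RAGCache's actual design.

```python
from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_retrieve(query: str) -> tuple:
    # Placeholder for an expensive vector-store lookup.
    return (f"top passage for: {query}",)

cached_retrieve("rag caching")  # miss: would hit the vector store
cached_retrieve("rag caching")  # hit: served from cache, no store access
print(cached_retrieve.cache_info())  # hits=1, misses=1
```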
Multimodality
- Unified vector spaces: Vision (CLIP, ViT), text, audio, and tabular inputs aligned to shared representations, extending RAG to medical imaging, scientific graphs, and web/corporate data.
Evaluation Metrics and Datasets
Automated Metrics
- Generation-focused: EM, F1, BLEU, ROUGE, METEOR, BERTScore, perplexity.
- Retrieval-focused: Recall@k, Precision@k, MAP@k, MRR, nDCG, R-Precision, Hit@k, context relevance (two of these are sketched after this list).
- Specialized: Human/LLM-as-judge correctness/quality, hallucination/rejection/success rates, efficiency (latency, response time), and adversarial robustness.
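For concreteness, the sketch below implements Recall@k and reciprocal rank (averaging the latter over queries yields MRR); the document IDs and relevance judgments are made-up examples.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of gold passages found in the top-k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved, relevant):
    """1/rank of the first relevant hit, or 0.0 if none is retrieved;
    averaging this over queries gives MRR."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["d4", "d1", "d9", "d2"]  # ranked system output
relevant = {"d1", "d2"}               # gold passage IDs
print(recall_at_k(retrieved, relevant, k=3))  # 0.5: only d1 in the top-3
print(reciprocal_rank(retrieved, relevant))   # 0.5: first hit at rank 2
```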
Human and LLM-as-judge
- Human annotation is essential for groundedness, hallucination, usability, and satisfaction but faces cost, bias, and agreement challenges. LLM judges (typically GPT-3.5/4) can replicate human judgements but require careful calibration and transparency.
Benchmarks and Datasets
- Open-domain (Wikipedia, NQ, MS MARCO), domain-centric (legal, biomedical), multi-hop (HotpotQA, 2WikiMultihopQA), and multimodal (COCO, LAION, scientific charts, code).
- Evaluation is increasingly task- and resource-specific, with a strong trend toward comprehensive metric suites (quality, recall, latency, energy, safety, privacy).
Key Challenges and Limitations
Computational Resource and Scalability
Trade-offs between accuracy, latency, and cost are unresolved. ANN indices and GPU/CPU scheduling improve throughput but are sensitive to corpus size and deployment architectures.
Noisy/Heterogeneous/Multimodal Corpora
Hybrid signals, mismatched modalities, and complex formats expose the need for learnable fusion, noise-aware scoring, and efficient validation. Graph-based RAG remains fragile under entity-linking errors and structural drift.
Domain Shift/Generalization
Overfitting to a single domain, language, or corpus-preparation regime erodes cross-domain robustness. Chunk size, k-selection, and cache management can induce major performance swings. Evaluation remains Anglo- and Wikipedia-centric and slow to propagate fresh data and corpus updates.
Error Cascades and Modular Fragility
Multi-stage, modular workflows are subject to cascading errors—early retrieval mistakes bias downstream generations. Iterative and memory-augmented systems risk cache drift and increasingly complex debugging. Confidence/uncertainty mechanisms are necessary but not widely adopted.
Model Limitations and Safety
Both proprietary and open-source LLMs are constrained in token window, prompt format, and debiasing capability. Parameter-efficient tuning and privacy-aware memory are immature. Hallucination, prompt attacks, and subtle prompt structure errors remain hazards, especially in high-stakes domains.
Security Threats
RAG systems are vulnerable along the retrieval pipeline—vector store poisoning, prompt-based leakage, jailbreaks, and adversarial attacks can compromise outputs at scale, even when standard LLM safeguards are in place. Existing defences (perplexity filters, majority-vote reranking, refusal classifiers) are, at best, partial. Cryptographic provenance, anomaly detection, retrieval–generation joint training, and rigorous attack-aware benchmarks are vital research directions.
Implications and Outlook
The reviewed literature demonstrates that RAG offers measurable improvements in factuality, freshness, and explainability compared to purely generative models. The field has advanced beyond monolithic DPR-based pipelines towards configurable, policy-driven systems that dynamically allocate compute, select evidence sources, compress context, and orchestrate diverse tools in agentic workflows. This modularity yields significant gains in robustness, retrieval quality, adaptation, and efficiency, but introduces new issues: security/poisoning, memory management and governance, long-horizon consistency, and complex performance tuning under real-world constraints.
Theoretically, the research points toward unified RAG systems capable of cross-modal, cross-domain reasoning and adaptive retrieval. Real-world deployment is bottlenecked by incomplete evaluation practice (particularly cost, latency, and safety reporting), insufficient measurement of robustness or generalization, and immature countermeasures against corpus and pipeline attacks.
Future work should prioritize:
- Holistic, open-access benchmarks reporting accuracy, efficiency, and security;
- Policy-driven retrieval controllers budgeting time, tokens, and energy;
- Plug-and-play, provenance-aware orchestration frameworks supporting arbitrary retrieval/generation modules;
- Memory systems with privacy, audit, retention, and forgetting mechanisms.
Conclusion
By systematically cataloguing 128 significant studies, this review establishes a contemporary reference for RAG techniques, metrics, and limitations. While RAG has become the dominant pattern for knowledge-grounded LLMs, its future utility hinges on closing open challenges in efficiency, robustness, evaluation, and security. Only then can retrieval-augmented generation systems transition from research prototypes to dependable, scalable infrastructure for critical NLP applications.