CRUD-RAG Benchmark Evaluation

Updated 8 September 2025
  • The CRUD-RAG benchmark is a comprehensive framework for assessing retrieval-augmented generation systems, organized around Create, Read, Update, and Delete task categories.
  • It evaluates system components including retrieval algorithms, LLM prompting, and knowledge base construction, using metrics such as RAGQuestEval.
  • The benchmark emphasizes joint tuning of retrieval, chunking, and generative models to enhance both creative synthesis and factual accuracy in real-world applications.

Retrieval-Augmented Generation (RAG) systems are increasingly deployed to address the limitations of LLMs, particularly in scenarios requiring access to external, domain-specific, or up-to-date information. Conventional benchmarks for RAG systems typically focus on question answering and primarily evaluate the generative component alone, ignoring the impact of retrieval strategies and the construction of the underlying knowledge base. The CRUD-RAG benchmark (Lyu et al., 30 Jan 2024) was developed to fill these gaps by assessing end-to-end RAG pipelines across a broader suite of real-world applications and systematically varying key system parameters. It does so by organizing RAG tasks into four functional categories—Create, Read, Update, and Delete—which enables a comprehensive evaluation of both traditional and emerging RAG system designs, retrieval algorithms, prompting strategies, and generative models.

1. Motivation and Taxonomy

The CRUD-RAG benchmark was designed to overcome two principal limitations of prior RAG evaluation approaches: a restrictive focus on question answering and neglect of retrieval and knowledge-base effects. Traditional RAG evaluation pipelines virtually always measure generative output quality (e.g., answer correctness or semantic similarity) against fixed question-answer pairs, treating the retrieval and knowledge-source modules largely as fixed “helpers.” In real-world deployments, however, the effectiveness of a RAG system depends just as much on the interaction between retrieval (document indexing, chunking, candidate selection) and generation. CRUD-RAG adopts the CRUD taxonomy:

  • Create: Generation of new content by synthesizing external knowledge (e.g., text continuation and creative extensions).
  • Read: Information extraction and question answering, covering both single- and multi-document retrieval.
  • Update: Hallucination mitigation and factual correction by referencing external sources.
  • Delete: Summarization and distillation, removing redundancy and condensing multi-source input into concise outputs.

This taxonomy mirrors enterprise and research RAG workflows more realistically, allowing systematic analysis of performance and tradeoffs across diverse, knowledge-intensive use cases.
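For orientation, the taxonomy can be summarized as a simple mapping from category to representative task types. The dictionary below is purely descriptive, a sketch for organizing evaluation runs rather than code from the benchmark's release.

```python
# Descriptive summary of the CRUD taxonomy (illustrative; not from the official codebase).
CRUD_TASKS = {
    "Create": ["text continuation", "creative expansion"],
    "Read":   ["single-document QA", "multi-document QA"],
    "Update": ["hallucination correction"],
    "Delete": ["multi-document summarization"],
}
```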

2. System Components and Evaluation Pipeline

CRUD-RAG supports analysis and benchmarking of all major RAG system components:

Retriever and Knowledge Base Construction

  • Retrieval algorithms: CRUD-RAG evaluates BM25 (keyword-based), dense vector retrieval models, and hybrid approaches (which may include reranking modules such as bge-reranker).
  • Knowledge base construction: News articles are chunked (with tunable chunk size and overlap) and indexed into a vector database. Chunking parameters are varied to explore the tradeoff between context preservation and granularity, as in the sketch below.
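A minimal sketch of the chunking step whose parameters the benchmark varies, assuming a simple character-window splitter; the function name and default values are illustrative, not taken from the CRUD-RAG codebase.

```python
def chunk_document(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split a document into overlapping character windows.

    Larger chunks preserve more context (useful for Create tasks);
    smaller chunks improve granularity for targeted Read tasks.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks
```

The resulting chunks would then be embedded and indexed into whatever vector database the pipeline uses; that step is omitted here because it depends on the chosen retrieval backend.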

Prompt and Context Length

  • Top-k: The number of retrieved chunks supplied to the LLM prompt can be adjusted to test the effect of evidence density (supporting comprehensive generation in "Create" tasks or precision in "Read" tasks).
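To make the role of top-k concrete, the sketch below assembles a prompt from the first k retrieved chunks; the template and function name are assumptions for illustration, not the benchmark's actual prompts.

```python
def build_prompt(question: str, retrieved_chunks: list[str], top_k: int = 5) -> str:
    """Concatenate the top-k retrieved chunks into an LLM prompt.

    A larger top_k supplies denser evidence (helpful for Create tasks),
    while a smaller top_k keeps the context focused (helpful for Read tasks).
    """
    context = "\n\n".join(retrieved_chunks[:top_k])
    return (
        "Answer using only the following context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```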

Generative Model

  • LLM selection: Multiple LLMs (GPT-3.5, GPT-4, ChatGLM2-6B, Baichuan2-13B, Qwen models) are evaluated to capture the interaction of model scale and retrieval workflow. Notably, open-source models can perform competitively in select scenarios when retrieval hyperparameters are optimized.

Metrics

CRUD-RAG introduces RAGQuestEval, a metric built on question-generation-and-answering cycles over key facts. It is defined as follows:

Recall:

\text{Recall}(GT, GM) = \frac{1}{|Q_G(GT)|} \sum_{(q, r) \in Q_G(GT)} I\{ Q_A(GM, q) \neq \text{"<Unanswerable>"} \}

Precision:

\text{Precision}(GT, GM) = \frac{1}{|Q_G(GT)|} \sum_{(q, r) \in Q_G(GT)} \mathrm{F1}(Q_A(GM, q), r)

Here, Q_G is the question generator, Q_A is the answer-extraction function, and F1 measures answer overlap. This approach assesses key-information coverage and factual precision, supplementing standard metrics (BLEU, ROUGE-L, BERTScore).
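The two quantities can be computed with a short routine once a question generator and an answer extractor are available (for example, LLM-backed wrappers). The helper names and signatures below are assumptions for illustration, not the paper's released implementation; the arithmetic follows the formulas above.

```python
UNANSWERABLE = "<Unanswerable>"

def rag_questeval(ground_truth: str, generated: str,
                  question_generator, question_answerer, token_f1):
    """Compute RAGQuestEval (recall, precision) per the formulas above.

    question_generator(text) -> list of (question, reference_answer) pairs  # Q_G
    question_answerer(text, question) -> answer string or "<Unanswerable>"  # Q_A
    token_f1(prediction, reference) -> token-overlap F1 in [0, 1]
    """
    qa_pairs = question_generator(ground_truth)  # Q_G(GT)
    if not qa_pairs:
        return 0.0, 0.0
    answered, f1_sum = 0, 0.0
    for question, reference in qa_pairs:
        prediction = question_answerer(generated, question)  # Q_A(GM, q)
        if prediction != UNANSWERABLE:
            answered += 1                          # indicator term in Recall
        f1_sum += token_f1(prediction, reference)  # F1 term in Precision
    return answered / len(qa_pairs), f1_sum / len(qa_pairs)
```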

3. Datasets and Scenario Coverage

Each CRUD category is supported by dedicated datasets:

| Category | Description | Dataset Construction |
| --- | --- | --- |
| Create | Text continuation, creative expansion | Split news articles into initial/continuation parts; test RAG-driven continuation |
| Read | Single- and multi-document QA | Fact extraction (single-news); synthesis/reasoning (multi-news) |
| Update | Hallucination correction | Derived partly from the UHGEval dataset, with GPT-4-corrected hallucinated sentences |
| Delete | Multi-document summarization | Reverse construction: generate news events/summaries, then pair with related articles |

These datasets enable assessment of both fidelity and creativity across knowledge-intensive application modes.
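As one concrete example of dataset construction, the Create split pairs the opening of a news article with its held-out continuation. The sketch below assumes a naive sentence split on the Chinese full stop and an arbitrary 30% cut point; the benchmark's actual split rule may differ.

```python
def build_create_example(article: str, split_ratio: float = 0.3) -> dict:
    """Split one news article into an initial segment and its continuation.

    The initial segment is the writing prompt given to the RAG system;
    the continuation is the reference for BLEU/ROUGE-L/BERTScore/RAGQuestEval.
    split_ratio and the sentence splitter are illustrative assumptions.
    """
    sentences = [s for s in article.split("。") if s.strip()]  # naive split on the Chinese full stop
    cut = max(1, int(len(sentences) * split_ratio))
    return {
        "input": "。".join(sentences[:cut]) + "。",
        "reference_continuation": "。".join(sentences[cut:]) + "。",
    }
```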

4. Experimental Insights and Optimization Strategies

Benchmarking reveals a nuanced performance landscape:

  • Create tasks benefit from longer chunks and broader contexts (higher top-k) to maintain narrative structure.
  • Read (QA) tasks, especially for isolated factual extraction, often require smaller chunks to minimize irrelevant or noisy context.
  • Retriever selection: Dense retrievers excel in semantic-heavy, reasoning tasks (multi-document QA), but BM25 outperforms in precise, targeted tasks (hallucination correction).
  • Joint tuning of chunk size, overlap, and top-k is essential; a configuration-search sketch follows this list. Over-aggregation (large top-k) may dilute answer relevance, while strict filtering risks omitting key evidence.
  • Hybrid retrieval and reranking: Dense + BM25 with reranking enhances reasoning-heavy tasks.
  • LLM choice: For synthesis and creative tasks, larger models (GPT-4, Qwen-14B) are favored; factual QA can be served competitively by smaller, well-tuned models (Baichuan2-13B + optimized retrieval).
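One way to operationalize the joint tuning these results call for is a per-task grid search over chunk size, overlap, and top-k. The parameter ranges and the evaluate_pipeline callable below are illustrative assumptions; in practice such a function would rebuild the index and score the relevant CRUD split (e.g., RAGQuestEval recall for Read, ROUGE-L for Delete).

```python
from itertools import product

# Illustrative search space; useful ranges differ per CRUD task type.
CHUNK_SIZES = [128, 256, 512]
OVERLAPS = [0, 32, 64]
TOP_KS = [2, 4, 8]

def tune_rag(evaluate_pipeline, task: str) -> dict:
    """Grid-search retrieval hyperparameters for one task type.

    evaluate_pipeline(task, chunk_size=..., overlap=..., top_k=...) is assumed
    to return a scalar quality score for that configuration.
    """
    best = {"score": float("-inf")}
    for chunk_size, overlap, top_k in product(CHUNK_SIZES, OVERLAPS, TOP_KS):
        if overlap >= chunk_size:
            continue  # skip degenerate configurations
        score = evaluate_pipeline(task, chunk_size=chunk_size,
                                  overlap=overlap, top_k=top_k)
        if score > best["score"]:
            best = {"score": score, "chunk_size": chunk_size,
                    "overlap": overlap, "top_k": top_k}
    return best
```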

5. Evaluation Metrics and Algorithms

The benchmark combines both overlap-based and information-centric metrics:

| Metric | Measures | Usage |
| --- | --- | --- |
| BLEU | Word overlap, fluency | General semantic similarity |
| ROUGE-L | Longest common subsequence | Content preservation |
| BERTScore | Semantic similarity via embeddings | Distributional matching |
| RAGQuestEval | Fact coverage and precision (see formulas above) | Key-information fidelity |

RAGQuestEval enables decomposition of generative accuracy into recall (coverage of target facts) and precision (faithful expression).

6. Future Directions

CRUD-RAG paves the way for multidisciplinary RAG benchmarks and systems:

  • Domain expansion: extending from Chinese news to other languages and to legal, financial, and medical domains to address cross-domain generality.
  • Enhanced metrics: further development of automated, information-centric measures that approximate human judgment of factuality, coherence, and creative adequacy.
  • Adaptive and reinforcement-enhanced retrieval: algorithms that dynamically tune chunking and context parameters per query or task, potentially using chain-of-thought or RL-driven strategies.
  • Integrated (end-to-end) training: simultaneous optimization of the retriever, database construction, and generation components, as opposed to stage-wise pipeline tuning.
  • Scalability and deployment: addressing computational constraints, with a focus on retrieval quality and inference-cost boundaries in large-scale production.

7. Impact and Research Significance

CRUD-RAG establishes a multi-dimensional evaluation protocol for RAG, advancing beyond pure question answering by capturing the full spectrum of practical use cases. It exposes workflows and tradeoffs essential for system tuning and design, particularly in settings requiring dynamic external knowledge engagement (summarization, fact correction, creative expansion). Empirical results underscore that holistic system optimization—balancing retrieval, context management, model selection, and database engineering—is critical for robust, high-performance RAG deployments.

This benchmark has prompted further research in robust retrieval, chain-of-thought and adaptive prompting, domain expansion, and multi-stage end-to-end optimization strategies, with direct implications for the evolution of RAG technology in both academic and commercial sectors (Lyu et al., 30 Jan 2024).

References (1)

Lyu et al. (30 Jan 2024). CRUD-RAG: A Comprehensive Chinese Benchmark for Retrieval-Augmented Generation of Large Language Models. arXiv:2401.17043.
