DataMorgana: Synthetic Q&A Benchmark Tool

Updated 6 March 2026

DataMorgana is a lightweight, flexible tool that produces synthetic Q&A benchmarks tailored for evaluating Retrieval-Augmented Generation (RAG) systems.
It uses a two-stage, prompt-based generation process to enforce user persona and question type diversity, ensuring realistic query simulation.
By integrating explicit configuration and probabilistic sampling, the tool outperforms legacy methods in lexical, syntactic, and semantic diversity.

DataMorgana is a lightweight, highly configurable tool for generating synthetic Q&A benchmarks tailored to the evaluation of Retrieval-Augmented Generation (RAG) systems. It is designed to address two core deficiencies of prior synthetic data methods: low diversity in generated questions and the inability to model realistic user and question category distributions. By enabling explicit configuration of user personas, question types, and their probabilistic distributions, and by enforcing these during generation, DataMorgana produces Q&A sets that more closely mimic the diversity and variability found in authentic user queries to RAG systems. Its pipeline employs a two-stage generation process, operating efficiently through prompt-based interaction with LLMs, and is empirically demonstrated to surpass existing synthetic Q&A generators in lexical, syntactic, and semantic diversity (Filice et al., 22 Jan 2025).

1. Motivation and Design Objectives

The primary motivation for DataMorgana is the difficulty of collecting large, representative Q&A benchmarks for RAG system evaluation, particularly in domains lacking substantial historic question logs. Hand-authoring these datasets is prohibitively resource-intensive. Legacy synthetic Q&A methods—including vanilla one-shot prompt techniques, InPars, Promptagator, Know Your RAG, RAGAs, and DeepEval—typically prompt an LLM to generate a question and answer for a given source document. While these pipelines may yield fluent Q&A pairs, they consistently exhibit patterns of low diversity (favoring recurring syntactic forms and question types) and provide no mechanism for modeling the mixture of user intents and expertise that occurs in production scenarios.

DataMorgana addresses these limitations by offering a JSON-based configuration schema that allows benchmark creators to specify arbitrary numbers of user personas and question categorizations, each with mutually exclusive categories and individually assigned probability weights. This framework enables fine-grained control of both the content and traffic-like distributions in the resulting Q&A benchmark, closing the gap between general-purpose synthetic generators and the requirements of real-world RAG benchmarks.

2. System Architecture and Pipeline

The DataMorgana generation process consists of two principal stages:

Configuration stage:

A DataMorgana administrator prepares a JSON configuration defining user categorizations (e.g., expert vs. novice, or specific roles such as patient, clinical researcher, public health authority) and question categorizations (e.g., factuality, premise formulation, phrasing style), each described by:

A short name (“factoid”, “non-factoid-experience”, etc.)
A probability weight (e.g., 0.25 for “factoid”, 0.75 for “non-factoid-experience”)
A natural-language description to be inserted verbatim into the LLM prompt

Generation stage:

For each Q&A pair to be generated:

Sample one category from each user and question categorization according to configured probabilities.
Sample a source document $d_i$ from the RAG corpus.
Construct a prompt incorporating the sampled category descriptions, instructing the LLM to generate $k=3$ candidate Q&A pairs (in experiments, Claude-3.5 Sonnet v2 was used).
Apply lightweight filtering to enforce category compliance and answer fidelity to $d_i$; randomly select one passing candidate.
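The sampling and selection steps above can be sketched in a few lines of Python. This is a minimal illustration and not the tool's actual code: the function names, the shape of the `categorizations` list (mirroring the JSON fields `type`, `categories`, `name`, `probability`), and the `passes_filter` callback are all assumptions made for the example.

```python
import random

def sample_categories(categorizations, rng):
    """Draw one category per categorization, weighted by its probability."""
    chosen = {}
    for categorization in categorizations:
        categories = categorization["categories"]
        weights = [c["probability"] for c in categories]
        # random.choices performs weighted sampling with replacement.
        chosen[categorization["type"]] = rng.choices(categories, weights=weights, k=1)[0]
    return chosen

def sample_document(corpus, rng):
    """Pick a source document uniformly at random (assumed sampling scheme)."""
    return rng.choice(corpus)

def select_candidate(candidates, passes_filter, rng):
    """Keep candidates that pass the (hypothetical) filter and return one at random.

    Returns None when no candidate passes, so the caller can retry or skip.
    """
    passing = [c for c in candidates if passes_filter(c)]
    return rng.choice(passing) if passing else None
```

Passing an explicit `random.Random` instance keeps runs reproducible, which is convenient when comparing benchmarks generated from different configurations.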

A representative prompt template:

You are a user simulator. Generate [num_questions] candidate questions…
The generated questions should be about facts from the following document:
[document text]
Each question must reflect a user who:
– They must be [user category description]
…
Each question must have the following characteristics:
– It must be [question category description]
…
Return JSON lines:
{ "question": "...", "answer": "..." }
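As a rough illustration of how such a template might be filled and its output read back, the sketch below assembles a prompt from sampled category descriptions and parses a JSON-lines response. The wording is a simplified stand-in for the template above (the elided parts are not reconstructed), and `build_prompt` and `parse_json_lines` are hypothetical helper names, not part of DataMorgana.

```python
import json

def build_prompt(document_text, user_descriptions, question_descriptions, num_questions=3):
    """Assemble a prompt from the sampled category descriptions (simplified wording)."""
    lines = [
        f"You are a user simulator. Generate {num_questions} candidate questions.",
        "The generated questions should be about facts from the following document:",
        document_text,
        "Each question must reflect a user who:",
    ]
    lines += [f"- {d}" for d in user_descriptions]
    lines.append("Each question must have the following characteristics:")
    lines += [f"- {d}" for d in question_descriptions]
    lines.append("Return JSON lines:")
    lines.append('{ "question": "...", "answer": "..." }')
    return "\n".join(lines)

def parse_json_lines(response_text):
    """Parse one JSON object per line, skipping blank or malformed lines."""
    pairs = []
    for line in response_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate stray text around the JSON lines
        if isinstance(obj, dict) and "question" in obj and "answer" in obj:
            pairs.append(obj)
    return pairs
```

Skipping malformed lines rather than raising is a design choice for this sketch; a stricter pipeline might instead count such lines as filtering failures.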

3. Configuration Flexibility and Combinatorics

DataMorgana's configuration model supports an arbitrary number of user and question categorizations, each made up of mutually exclusive, probability-weighted categories. One category is drawn from every categorization, so each generated Q&A pair corresponds to one point in the Cartesian product of category selections, that is, one user persona × question-type combination. For instance, four question categorizations with two categories each (e.g., factuality, phrasing, premise, linguistic variation) yield up to $2^4 = 16$ distinct question styles; combining these with three user personas gives $16 \times 3 = 48$ distinct modes. The “Morgana” name evokes this combinatorial shapeshifting across Q&A forms.
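The count in this example can be checked by enumerating the Cartesian product directly. The category and persona names below are placeholders chosen for illustration; only the sizes (four binary question categorizations, three personas) come from the text.

```python
from itertools import product

# Placeholder names: four question categorizations with two categories each.
question_categorizations = {
    "factuality": ["factoid", "non-factoid"],
    "phrasing": ["phrasing_a", "phrasing_b"],
    "premise": ["premise_a", "premise_b"],
    "linguistic_variation": ["variation_a", "variation_b"],
}
# Placeholder names: three user personas.
user_personas = ["persona_1", "persona_2", "persona_3"]

# Each mode is one persona combined with one category per question categorization.
question_styles = list(product(*question_categorizations.values()))
modes = list(product(user_personas, question_styles))

print(len(question_styles))  # 16
print(len(modes))            # 48
```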

A sample configuration fragment (COVID domain):

{
  "categorizations": [
    {
      "type": "user_expertise",
      "categories": [
        {"name": "patient", "probability": 0.25, "description": "A regular patient…"},
        {"name": "medical_doctor", "probability": 0.25, ...},
        {"name": "clinical_researcher", "probability": 0.25, ...},
        {"name": "public_health_authority", "probability": 0.25, ...}
      ]
    },
    { "type": "phrasing", ... }
  ]
}

This structure enables precise modeling of anticipated traffic patterns in production RAG deployments.
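Before launching a generation run, it can help to sanity-check a configuration of this shape. The sketch below assumes that the probabilities within each categorization are meant to sum to 1 and that every category needs a description; both are assumptions inferred from the fragment above, and `validate_config` is a hypothetical helper rather than part of the tool.

```python
import math

def validate_config(config, tol=1e-9):
    """Return a list of human-readable problems found in a categorization config."""
    errors = []
    for categorization in config.get("categorizations", []):
        ctype = categorization.get("type", "<unnamed>")
        categories = categorization.get("categories", [])
        names = [c.get("name") for c in categories]
        if len(names) != len(set(names)):
            errors.append(f"{ctype}: duplicate category names")
        total = sum(c.get("probability", 0.0) for c in categories)
        # Assumption: weights are probabilities and should sum to 1.
        if not math.isclose(total, 1.0, abs_tol=tol):
            errors.append(f"{ctype}: probabilities sum to {total}, expected 1.0")
        for c in categories:
            if not c.get("description"):
                errors.append(f"{ctype}/{c.get('name')}: missing description")
    return errors
```

An empty list means the configuration passed these checks; any returned strings can be shown to the benchmark author before LLM calls are spent.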

4. Diversity Metrics and Quantitative Evaluation

DataMorgana benchmarks are evaluated with a multidimensional suite of diversity metrics:

Lexical Diversity:
- Type–Token Ratio (TTR): $\mathrm{TTR}(B) = \frac{|V(B)|}{N(B)}$, where $|V(B)|$ is the count of distinct word types and $N(B)$ the total token count.
- N-Gram Diversity (NDG): $\mathrm{NDG}(B) = \sum_{n=1}^{4} \frac{|\text{unique } n\text{-grams in } B|}{|\text{all } n\text{-grams in } B|}$

Self-Repetition Score (SRS):
- Fraction of questions sharing any 4-gram with another question: $\mathrm{SRS}(B) = \frac{ \#\{q \in B : \exists\, q' \neq q,\, \mathrm{4gram}(q) \cap \mathrm{4gram}(q') \neq \emptyset \}}{ |B| }$

Compression Ratio (CR):
- Word-CR: $\mathrm{CR}(B) = \frac{\text{size}(B)}{\text{size}(\mathrm{gzip}(B))}$
- PoS-CR: Same definition, computed on Part-of-Speech tag sequences

Homogenization Score (HS):
- Mean pairwise cosine similarity in embedding space: $\mathrm{HS}(B) = \frac{1}{|B|(|B|-1)} \sum_{q \neq q' \in B} \cos( e(q), e(q'))$

Parse-Tree Entropy:
- $H_{\mathrm{tree}}(B) = -\sum_{t \in \mathcal{T}} p(t) \log p(t)$, where $p(t)$ is the empirical frequency of parse-tree shape $t$ (used optionally; PoS-CR and distinct PoS templates are the core syntactic proxies in evaluation).
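The formulas above translate fairly directly into code. The following is a minimal sketch using only the Python standard library. It assumes lowercase whitespace tokenization, questions joined by newlines before compression, and embeddings supplied by the caller as non-zero numeric vectors; the paper's exact preprocessing may differ.

```python
import gzip
import math
from collections import Counter

def tokenize(question):
    return question.lower().split()  # assumption: whitespace tokenization

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ttr(questions):
    """Type-token ratio: distinct word types over total tokens."""
    tokens = [t for q in questions for t in tokenize(q)]
    return len(set(tokens)) / len(tokens)

def ndg(questions, max_n=4):
    """Sum over n = 1..max_n of unique n-grams divided by all n-grams."""
    total = 0.0
    for n in range(1, max_n + 1):
        grams = [g for q in questions for g in ngrams(tokenize(q), n)]
        if grams:
            total += len(set(grams)) / len(grams)
    return total

def srs(questions, n=4):
    """Fraction of questions sharing at least one n-gram with another question."""
    gram_sets = [set(ngrams(tokenize(q), n)) for q in questions]
    # Count, for each n-gram, how many questions contain it.
    counts = Counter(g for grams in gram_sets for g in grams)
    repeated = sum(1 for grams in gram_sets if any(counts[g] > 1 for g in grams))
    return repeated / len(questions)

def compression_ratio(sequences):
    """Raw size over gzip size; pass PoS tag strings instead of text for PoS-CR."""
    raw = "\n".join(sequences).encode("utf-8")
    return len(raw) / len(gzip.compress(raw))

def homogenization(embeddings):
    """Mean cosine similarity over all ordered pairs of distinct items."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))
    n = len(embeddings)
    total = sum(cos(embeddings[i], embeddings[j])
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

def shape_entropy(shapes):
    """Entropy (natural log) of the empirical distribution over parse-tree shapes."""
    counts = Counter(shapes)
    n = len(shapes)
    return -sum((c / n) * math.log(c / n) for c in counts.values())
```

Higher TTR and NDG indicate more lexical variety, while lower SRS, CR, and HS indicate less repetition, matching the arrows used in the result tables.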

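Empirical results on the COVID-QA and open-NQ corpora (first and second table below, respectively) show that DataMorgana is the most diverse of the synthetic generators on NDG, SRS, PoS-CR, and emb-HS, with word-CR slightly behind DeepEval on both corpora. Relative to human-authored questions, it is comparable or better on NDG, SRS, and PoS-CR, while human questions retain a lower embedding homogenization score: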

Model	NDG ↑	SRS ↓	word-CR ↓	PoS-CR ↓	emb-HS ↓
Vanilla	1.517	0.920	5.576	7.861	0.301
Know Your RAG	2.358	0.613	3.879	6.271	0.265
DeepEval	2.415	0.644	3.535	5.885	0.251
DataMorgana	2.536	0.372	3.701	5.583	0.249
Human (COVID-QA)	2.484	0.365	3.380	6.212	0.182

Model	NDG ↑	SRS ↓	word-CR ↓	PoS-CR ↓	emb-HS ↓
Vanilla	2.662	0.533	2.665	5.824	0.068
Know Your RAG	2.981	0.144	2.488	5.864	0.074
DeepEval	2.879	0.371	2.477	5.631	0.067
DataMorgana	3.016	0.140	2.502	5.397	0.052
Human (open-NQ)	2.585	0.357	2.775	5.753	0.016

Ablation studies attribute most diversity gains to the explicit question categorizations; removal of these axes reduces NDG by over 25% (Filice et al., 22 Jan 2025).

5. Experimental and Qualitative Illustration

DataMorgana generates a broad spectrum of Q&A types when compared to standard baselines. For example, with a user_expertise of "patient," phrasing "short-search-query," and factuality "non-factoid," the system outputs queries such as:

{ "question": "flu vaccine side effects", "answer": "Common side effects include soreness,…" }

In contrast, vanilla baselines generate repetitively structured, often verbose, questions:

What are the current limitations of seasonal influenza vaccines that make them less effective than desired?

This distinction is further accentuated in qualitative analyses, where DataMorgana's output includes web-query style keywords, open-ended user premise statements, expert-level comparative questions, and policy-oriented prompts. The combinatorial sampling of categories directly enforces a diversity more reflective of real user traffic.

6. Efficiency and Iterative Use

The runtime bottleneck of DataMorgana is LLM invocation (one call per Q&A pair, generating $k=3$ candidates per call). Empirical performance using Claude-3.5 Sonnet v2 and standard cloud infrastructure yields:

Median LLM latency: 3s per prompt
Generation of 2000 Q&A pairs in approx. 100 minutes
Per-candidate effective time: approx. 1s (due to three candidates per prompt)
Overhead from parsing and filtering: <5%

Editing the JSON configuration and re-running the generation loop enables rapid iteration cycles, facilitating agile exploration of benchmark diversity. Parallelization is supported up to the rate limits of the LLM API, lowering wall-clock time for large-scale generation.
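The figures above follow from simple arithmetic: 2000 prompts at roughly 3 s each is 6000 s, or 100 minutes, when run sequentially. The sketch below reproduces that estimate and shows how it would scale under an idealized concurrency model. The function names are hypothetical, and the model ignores the parsing and filtering overhead and any API rate limiting.

```python
import math

def estimate_minutes(num_pairs, latency_s=3.0, concurrency=1):
    """Idealized wall-clock estimate: one LLM call per pair, run in parallel batches."""
    batches = math.ceil(num_pairs / concurrency)
    return batches * latency_s / 60.0

def per_candidate_seconds(latency_s=3.0, k=3):
    """Effective time per candidate when each call returns k candidates."""
    return latency_s / k

print(estimate_minutes(2000))                 # 100.0 minutes, sequential
print(estimate_minutes(2000, concurrency=8))  # 12.5 minutes under the idealized model
print(per_candidate_seconds())                # 1.0 second per candidate
```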

7. Deployment Best Practices

Recommended usage involves:

Cloning the repository
Installing dependencies: pip install -r requirements.txt
Preparing the RAG corpus as a JSONL file
Writing a configuration JSON for user and question categorizations
Running generation via:

python run_datamorgana.py \
    --config config.json \
    --input documents.jsonl \
    --output generated_qas.jsonl \
    --LLM claude-3.5-sonnet-v2 \
    --k 3

Examining outputs and iteratively adjusting probabilities, descriptions, or number of categories as needed

Practical guidelines include beginning with 2–3 question categorizations (each with 2–4 categories), employing concise, clear category descriptions, assigning probability weights to match expected user traffic (using uniform distributions if uncertain), and monitoring filtering acceptance rates. Parallelization to the maximum allowable LLM API concurrency is recommended for large-scale scenarios.
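One way to combine bounded parallelism with acceptance-rate monitoring is sketched below. It assumes a caller-supplied function, here called `generate_one`, that takes a document and returns a Q&A pair or `None` when no candidate passes filtering. That function, and the choice of a thread pool, are illustrative assumptions and not a description of how the tool is implemented.

```python
from concurrent.futures import ThreadPoolExecutor

def generate_all(documents, generate_one, max_workers=4):
    """Run generate_one over documents with at most max_workers concurrent calls.

    Results are returned in the same order as the input documents.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(generate_one, documents))

def acceptance_rate(results):
    """Fraction of documents for which at least one candidate passed filtering."""
    if not results:
        return 0.0
    return sum(r is not None for r in results) / len(results)
```

Setting `max_workers` at or below the LLM API's permitted concurrency keeps the run within rate limits, and a falling acceptance rate after a configuration change can signal that a category description is too restrictive or unclear.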

DataMorgana represents an advance in synthetic benchmark generation for RAG system evaluation, combining prompt-based LLM generation with fine-grained configurability and empirical diversity assurance unavailable in previous toolkits or general-purpose pipelines (Filice et al., 22 Jan 2025).

References (1)

Filice et al. (2025). Generating Diverse Q&A Benchmarks for RAG Evaluation with DataMorgana.
