DataMorgana: Synthetic Q&A Benchmark Tool
- DataMorgana is a lightweight, flexible tool that produces synthetic Q&A benchmarks tailored for evaluating Retrieval-Augmented Generation (RAG) systems.
- It uses a two-stage, prompt-based generation process to enforce user persona and question type diversity, ensuring realistic query simulation.
- By integrating explicit configuration and probabilistic sampling, the tool outperforms legacy methods in lexical, syntactic, and semantic diversity.
DataMorgana is a lightweight, highly configurable tool for generating synthetic Q&A benchmarks tailored to the evaluation of Retrieval-Augmented Generation (RAG) systems. It is designed to address two core deficiencies of prior synthetic data methods: low diversity in generated questions and the inability to model realistic user and question category distributions. By enabling explicit configuration of user personas, question types, and their probabilistic distributions, and by enforcing these during generation, DataMorgana produces Q&A sets that more closely mimic the diversity and variability found in authentic user queries to RAG systems. Its pipeline employs a two-stage generation process, operating efficiently through prompt-based interaction with LLMs, and is empirically demonstrated to surpass existing synthetic Q&A generators in lexical, syntactic, and semantic diversity (Filice et al., 22 Jan 2025).
1. Motivation and Design Objectives
The primary motivation for DataMorgana is the difficulty of collecting large, representative Q&A benchmarks for RAG system evaluation, particularly in domains lacking substantial historic question logs. Hand-authoring these datasets is prohibitively resource-intensive. Legacy synthetic Q&A methods—including vanilla one-shot prompt techniques, InPars, Promptagator, Know Your RAG, RAGAs, and DeepEval—typically prompt an LLM to generate a question and answer for a given source document. While these pipelines may yield fluent Q&A pairs, they consistently exhibit patterns of low diversity (favoring recurring syntactic forms and question types) and provide no mechanism for modeling the mixture of user intents and expertise that occurs in production scenarios.
DataMorgana addresses these limitations by offering a JSON-based configuration schema that allows benchmark creators to specify arbitrary numbers of user personas and question categorizations, each with mutually exclusive categories and individually assigned probability weights. This framework enables fine-grained control of both the content and traffic-like distributions in the resulting Q&A benchmark, closing the gap between general-purpose synthetic generators and the requirements of real-world RAG benchmarks.
2. System Architecture and Pipeline
The DataMorgana generation process consists of two principal stages:
Configuration stage:
A DataMorgana administrator prepares a JSON configuration defining user categorizations (e.g., expert vs. novice, or specific roles such as patient, clinical researcher, public health authority) and question categorizations (e.g., factuality, premise formulation, phrasing style), each described by:
- A short name (“factoid”, “non-factoid-experience”, etc.)
- A probability weight (e.g., 0.25 for “factoid”, 0.75 for “non-factoid-experience”)
- A natural-language description to be inserted verbatim into the LLM prompt
Generation stage:
For each Q&A pair to be generated:
- Sample one category from each user and question categorization according to configured probabilities.
- Sample a source document from the RAG corpus.
- Construct a prompt incorporating the sampled category descriptions, instructing the LLM to generate candidate Q&A pairs (in experiments, Claude-3.5 Sonnet v2 was used).
- Apply lightweight filtering to enforce category compliance and answer fidelity to the source document, then randomly select one passing candidate.
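The sampling and selection steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not DataMorgana's actual code: the config layout mirrors the sample fragment shown in this article, and `sample_categories`, `pick_candidate`, and the `passes_filter` callback are hypothetical names.

```python
import random

def sample_categories(config, rng):
    """Draw one category per categorization, weighted by its configured probability."""
    chosen = {}
    for categorization in config["categorizations"]:
        categories = categorization["categories"]
        weights = [c["probability"] for c in categories]
        chosen[categorization["type"]] = rng.choices(categories, weights=weights, k=1)[0]
    return chosen

def pick_candidate(candidates, passes_filter, rng):
    """Keep the candidates that pass the filter and return one at random (None if all fail)."""
    passing = [c for c in candidates if passes_filter(c)]
    return rng.choice(passing) if passing else None

# Toy configuration using the weights quoted above; the descriptions are placeholders.
config = {
    "categorizations": [
        {"type": "factuality", "categories": [
            {"name": "factoid", "probability": 0.25, "description": "placeholder"},
            {"name": "non-factoid-experience", "probability": 0.75, "description": "placeholder"},
        ]},
    ]
}

rng = random.Random(0)
draws = [sample_categories(config, rng)["factuality"]["name"] for _ in range(10_000)]
factoid_share = draws.count("factoid") / len(draws)
print(f"factoid share: {factoid_share:.3f}")  # should land close to the configured 0.25
```

Over many draws the observed share of each category approaches its configured weight, which is what lets the benchmark approximate an expected traffic mix.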
A representative prompt template:
You are a user simulator. Generate [num_questions] candidate questions…
The generated questions should be about facts from the following document:
[document text]
Each question must reflect a user who:
– They must be [user category description]
…
Each question must have the following characteristics:
– It must be [question category description]
…
Return JSON lines:
{ "question": "...", "answer": "..." }
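As a rough sketch of how sampled descriptions might be substituted into a template like the one above, and how a JSON-lines reply might be parsed, consider the following. The function names `build_prompt` and `parse_json_lines` are hypothetical, the template wording is abbreviated from the text above, and no LLM is actually called.

```python
import json

def build_prompt(document, user_descriptions, question_descriptions, num_questions=3):
    """Assemble a prompt in the spirit of the template above from sampled descriptions."""
    user_lines = "\n".join(f"- They must be {d}" for d in user_descriptions)
    question_lines = "\n".join(f"- It must be {d}" for d in question_descriptions)
    return (
        f"You are a user simulator. Generate {num_questions} candidate questions.\n"
        "The generated questions should be about facts from the following document:\n"
        f"{document}\n"
        "Each question must reflect a user who:\n"
        f"{user_lines}\n"
        "Each question must have the following characteristics:\n"
        f"{question_lines}\n"
        "Return JSON lines:\n"
        '{ "question": "...", "answer": "..." }'
    )

def parse_json_lines(response_text):
    """Parse one JSON object per non-empty line, skipping malformed or incomplete lines."""
    pairs = []
    for line in response_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict) and record.keys() >= {"question", "answer"}:
            pairs.append(record)
    return pairs

prompt = build_prompt("Example document text.", ["a regular patient"], ["a short search query"])
reply = '{"question": "q1", "answer": "a1"}\nnot json\n{"question": "q2"}'
parsed = parse_json_lines(reply)
print(parsed)  # only the first line is a complete question/answer record
```

Skipping malformed lines rather than raising keeps one bad candidate from discarding the whole reply; whether the real tool behaves this way is not stated in the source.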
3. Configuration Flexibility and Combinatorics
DataMorgana's configuration model supports an arbitrary number of user and question categorizations, each consisting of mutually exclusive, probability-weighted categories. Because one category is drawn from every categorization, the space of possible outputs is the Cartesian product of the categories, and each generated Q&A pair reflects one user persona × question-type combination. For instance, four question categorizations with two categories each (e.g., factuality, phrasing, premise, linguistic variation) yield up to $2^4 = 16$ distinct question styles; combining these with three user personas gives $16 \times 3 = 48$ modes. The “Morgana” name evokes this combinatorial shapeshifting across Q&A forms.
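The counting argument can be checked with a few lines of Python. The category labels below are placeholders invented for illustration, and the probability calculation assumes each categorization is sampled independently with uniform weights.

```python
from itertools import product
from math import prod

# Four question categorizations with two (placeholder) categories each.
axis_names = ["factuality", "phrasing", "premise", "linguistic_variation"]
question_axes = {name: [f"{name}-A", f"{name}-B"] for name in axis_names}
personas = ["persona-1", "persona-2", "persona-3"]

question_styles = list(product(*question_axes.values()))
modes = list(product(personas, question_styles))
print(len(question_styles), len(modes))  # 16 48

# Under independent uniform sampling, every mode is equally likely.
mode_probability = (1 / len(personas)) * prod(1 / len(v) for v in question_axes.values())
print(mode_probability)  # 1/48
```

With non-uniform weights the same product rule applies, so rare combinations can be made arbitrarily infrequent without removing them from the benchmark.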
A sample configuration fragment (COVID domain):
{
  "categorizations": [
    {
      "type": "user_expertise",
      "categories": [
        {"name": "patient", "probability": 0.25, "description": "A regular patient…"},
        {"name": "medical_doctor", "probability": 0.25, ...},
        {"name": "clinical_researcher", "probability": 0.25, ...},
        {"name": "public_health_authority", "probability": 0.25, ...}
      ]
    },
    { "type": "phrasing", ... }
  ]
}
This structure enables precise modeling of anticipated traffic patterns in production RAG deployments.
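Before running generation it can help to sanity-check such a configuration. The sketch below assumes the schema of the fragment above and assumes that the probabilities within one categorization should sum to 1; `validate_config` is a hypothetical helper, and the descriptions are placeholders rather than text from the paper.

```python
import math

def validate_config(config, tol=1e-6):
    """Return a list of problems found; an empty list means the config looks consistent."""
    problems = []
    for categorization in config.get("categorizations", []):
        ctype = categorization.get("type", "<unnamed>")
        categories = categorization.get("categories", [])
        names = [c.get("name") for c in categories]
        if len(set(names)) != len(names):
            problems.append(f"{ctype}: duplicate category names")
        probs = [c.get("probability", 0.0) for c in categories]
        if any(p < 0 for p in probs):
            problems.append(f"{ctype}: negative probability")
        if not math.isclose(sum(probs), 1.0, abs_tol=tol):
            problems.append(f"{ctype}: probabilities sum to {sum(probs):.3f}, expected 1.0")
    return problems

good = {"categorizations": [{"type": "user_expertise", "categories": [
    {"name": "patient", "probability": 0.25, "description": "placeholder"},
    {"name": "medical_doctor", "probability": 0.25, "description": "placeholder"},
    {"name": "clinical_researcher", "probability": 0.25, "description": "placeholder"},
    {"name": "public_health_authority", "probability": 0.25, "description": "placeholder"},
]}]}
bad = {"categorizations": [{"type": "phrasing", "categories": [
    {"name": "short", "probability": 0.5, "description": "placeholder"},
    {"name": "long", "probability": 0.3, "description": "placeholder"},
]}]}

print(validate_config(good))  # no problems reported
print(validate_config(bad))   # reports that the weights sum to 0.800
```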
4. Diversity Metrics and Quantitative Evaluation
DataMorgana benchmarks are evaluated with a multidimensional suite of diversity metrics:
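- Lexical Diversity:
  - Type–Token Ratio (TTR): $\mathrm{TTR} = |V| / N$, where $|V|$ is the count of distinct word types and $N$ the total token count.
  - N-Gram Diversity (NDG): the ratio of distinct to total n-grams, summed over n-gram orders: $\mathrm{NDG} = \sum_{n=1}^{4} \frac{\#\,\text{distinct } n\text{-grams}}{\#\,\text{total } n\text{-grams}}$.
  - Self-Repetition Score (SRS): the fraction of questions sharing any 4-gram with another question: $\mathrm{SRS} = \frac{1}{|Q|} \left| \{ q \in Q : \exists\, q' \neq q,\ G_4(q) \cap G_4(q') \neq \emptyset \} \right|$, where $Q$ is the set of questions and $G_4(q)$ the set of 4-grams in $q$.
- Compression Ratio (CR):
  - Word-CR: $\mathrm{CR} = \frac{\text{size of original text}}{\text{size of compressed text}}$, computed on the question text.
  - PoS-CR: same definition, computed on Part-of-Speech tag sequences.
- Homogenization Score (HS): mean pairwise cosine similarity in embedding space: $\mathrm{HS} = \frac{2}{|Q|(|Q|-1)} \sum_{i<j} \cos(e_i, e_j)$, where $e_i$ is the embedding of question $i$.
- Parse-Tree Entropy: $H = -\sum_{s} p(s) \log p(s)$, where $p(s)$ is the empirical frequency of parse-tree shape $s$ (used optionally; PoS-CR and distinct PoS templates are the core syntactic proxies in evaluation).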
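Several of these metrics are simple enough to sketch in plain Python. The version below is an approximation under stated assumptions, not the evaluation code used in the paper: it tokenizes on whitespace, pools n-grams per question, uses gzip as the compressor, expects a non-empty question list, and takes precomputed embedding vectors as plain lists.

```python
import gzip
import math
from itertools import combinations

def tokenize(text):
    return text.lower().split()

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def type_token_ratio(questions):
    tokens = [t for q in questions for t in tokenize(q)]
    return len(set(tokens)) / len(tokens)

def ngram_diversity(questions, max_n=4):
    """Sum over n of (distinct n-grams / total n-grams), pooled over all questions."""
    score = 0.0
    for n in range(1, max_n + 1):
        grams = [g for q in questions for g in ngrams(tokenize(q), n)]
        if grams:
            score += len(set(grams)) / len(grams)
    return score

def self_repetition(questions, n=4):
    """Fraction of questions sharing at least one n-gram with another question."""
    gram_sets = [set(ngrams(tokenize(q), n)) for q in questions]
    repeated = sum(
        1 for i, grams in enumerate(gram_sets)
        if any(grams & other for j, other in enumerate(gram_sets) if j != i)
    )
    return repeated / len(questions)

def compression_ratio(questions):
    """Original size divided by gzip-compressed size; higher means more redundancy."""
    raw = "\n".join(questions).encode("utf-8")
    return len(raw) / len(gzip.compress(raw))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def homogenization(embeddings):
    """Mean pairwise cosine similarity over all unordered pairs of embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

questions = [
    "what are the side effects of the flu vaccine",
    "what are the side effects of the covid vaccine",
    "flu vaccine side effects",
]
print(type_token_ratio(questions))  # 9 distinct types over 22 tokens
print(self_repetition(questions))   # the first two questions share 4-grams, the third does not
print(ngram_diversity(questions), compression_ratio(questions))
```

Absolute values depend heavily on tokenization and corpus size, so scores from this sketch are not comparable with the tables that follow.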
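Empirical results on the COVID-QA and open-NQ corpora (first and second table below, respectively) show that DataMorgana obtains the best NDG, SRS, PoS-CR, and emb-HS scores among the synthetic generators on both corpora, trailing only on word-CR. It matches or exceeds the human-authored questions on several metrics, although human questions remain less homogeneous in embedding space (lower emb-HS) on both corpora: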
| Model | NDG ↑ | SRS ↓ | word-CR ↓ | PoS-CR ↓ | emb-HS ↓ |
|---|---|---|---|---|---|
| Vanilla | 1.517 | 0.920 | 5.576 | 7.861 | 0.301 |
| Know Your RAG | 2.358 | 0.613 | 3.879 | 6.271 | 0.265 |
| DeepEval | 2.415 | 0.644 | 3.535 | 5.885 | 0.251 |
| DataMorgana | 2.536 | 0.372 | 3.701 | 5.583 | 0.249 |
| Human (COVID-QA) | 2.484 | 0.365 | 3.380 | 6.212 | 0.182 |
| Model | NDG ↑ | SRS ↓ | word-CR ↓ | PoS-CR ↓ | emb-HS ↓ |
|---|---|---|---|---|---|
| Vanilla | 2.662 | 0.533 | 2.665 | 5.824 | 0.068 |
| Know Your RAG | 2.981 | 0.144 | 2.488 | 5.864 | 0.074 |
| DeepEval | 2.879 | 0.371 | 2.477 | 5.631 | 0.067 |
| DataMorgana | 3.016 | 0.140 | 2.502 | 5.397 | 0.052 |
| Human (open-NQ) | 2.585 | 0.357 | 2.775 | 5.753 | 0.016 |
Ablation studies attribute most diversity gains to the explicit question categorizations; removal of these axes reduces NDG by over 25% (Filice et al., 22 Jan 2025).
5. Experimental and Qualitative Illustration
DataMorgana generates a broad spectrum of Q&A types when compared to standard baselines. For example, with a user_expertise of "patient," phrasing "short-search-query," and factuality "non-factoid," the system outputs queries such as:
{ "question": "flu vaccine side effects", "answer": "Common side effects include soreness,…" }
In contrast, vanilla baselines generate repetitively structured, often verbose, questions:
What are the current limitations of seasonal influenza vaccines that make them less effective than desired?
This distinction is further accentuated in qualitative analyses, where DataMorgana's output includes web-query style keywords, open-ended user premise statements, expert-level comparative questions, and policy-oriented prompts. The combinatorial sampling of categories directly enforces a diversity more reflective of real user traffic.
6. Efficiency and Iterative Use
The runtime bottleneck of DataMorgana is LLM invocation (one call per Q&A pair, generating $k$ candidates per call, with $k = 3$ in the figures below). Empirical performance using Claude-3.5 Sonnet v2 and standard cloud infrastructure yields:
- Median LLM latency: 3s per prompt
- Generation of 2000 Q&A pairs in approx. 100 minutes
- Per-candidate effective time: approx. 1s (due to three candidates per prompt)
- Overhead from parsing and filtering: <5%
Editing the JSON configuration and re-running the generation loop enables rapid iteration cycles, facilitating agile exploration of benchmark diversity. Parallelization is supported up to the rate limits of the LLM API, lowering wall-clock time for large-scale generation.
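The reported figures are consistent with simple arithmetic: 2000 sequential calls at about 3 s each take 6000 s, or 100 minutes. The sketch below reproduces that estimate and shows one way calls could be parallelized with a thread pool. Both helpers are hypothetical, the estimate assumes ideal scaling with no rate limiting, and the concurrency of 8 is an arbitrary example value.

```python
from concurrent.futures import ThreadPoolExecutor

def estimated_minutes(num_pairs, latency_s=3.0, concurrency=1, overhead=0.0):
    """Wall-clock estimate: one LLM call per pair, calls spread evenly over workers."""
    calls_per_worker = -(-num_pairs // concurrency)  # ceiling division
    return calls_per_worker * latency_s * (1.0 + overhead) / 60.0

def generate_all(prompts, call_llm, max_workers=8):
    """Run one call per prompt in parallel; results keep the order of `prompts`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(call_llm, prompts))

print(estimated_minutes(2000))                 # 100.0
print(estimated_minutes(2000, concurrency=8))  # 12.5

# `str.upper` stands in for a real LLM client call here.
print(generate_all(["prompt a", "prompt b"], str.upper, max_workers=2))
```

In practice the achievable concurrency is bounded by the provider's rate limits, so the second estimate is a lower bound rather than a prediction.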
7. Deployment Best Practices
Recommended usage involves:
- Cloning the repository
- Installing dependencies:
  `pip install -r requirements.txt`
- Preparing the RAG corpus as a JSONL file
- Writing a configuration JSON for user and question categorizations
- Running generation via:
python run_datamorgana.py \
  --config config.json \
  --input documents.jsonl \
  --output generated_qas.jsonl \
  --LLM claude-3.5-sonnet-v2 \
  --k 3
- Examining outputs and iteratively adjusting probabilities, descriptions, or number of categories as needed
Practical guidelines include beginning with 2–3 question categorizations (each with 2–4 categories), employing concise, clear category descriptions, assigning probability weights to match expected user traffic (using uniform distributions if uncertain), and monitoring filtering acceptance rates. Parallelization to the maximum allowable LLM API concurrency is recommended for large-scale scenarios.
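Monitoring can be as simple as tracking how many candidates survive filtering and how far the generated category mix drifts from the configured weights. The sketch below assumes that each output record carries the label of its sampled category under a key such as `user_expertise`; that record layout and both helper names are hypothetical, since the source does not describe the output schema.

```python
from collections import Counter

def acceptance_rate(num_candidates, num_passing):
    """Share of generated candidates that passed filtering."""
    return num_passing / num_candidates if num_candidates else 0.0

def distribution_drift(records, categorization, expected):
    """Observed share minus configured weight for each category of one categorization."""
    counts = Counter(r[categorization] for r in records)
    total = sum(counts.values())
    return {name: counts.get(name, 0) / total - p for name, p in expected.items()}

records = [{"user_expertise": "patient"}] * 3 + [{"user_expertise": "medical_doctor"}]
drift = distribution_drift(records, "user_expertise", {"patient": 0.5, "medical_doctor": 0.5})
print(drift)                     # patient is over-represented by 0.25 in this toy sample
print(acceptance_rate(300, 240)) # 0.8
```

A persistently low acceptance rate for one category may indicate that its description is ambiguous or conflicts with another sampled category.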
DataMorgana represents an advance in synthetic benchmark generation for RAG system evaluation, combining prompt-based LLM generation with fine-grained configurability and empirical diversity assurance unavailable in previous toolkits or general-purpose pipelines (Filice et al., 22 Jan 2025).