DataMorgana Toolkit
- DataMorgana Toolkit is a configurable system that creates synthetic Q&A benchmarks by sampling explicit question and user categorizations with controlled probability distributions.
- It employs a two-stage pipeline that first configures category distributions and then generates and filters LLM-produced Q&A candidates for constraint satisfaction.
- Benchmark evaluations show improved diversity metrics, including higher NDG and lower SRS and homogenization scores, compared to fixed-prompt baselines.
DataMorgana is a configurable toolkit for generating highly diverse and customizable synthetic Q&A benchmarks, designed specifically for evaluating Retrieval-Augmented Generation (RAG) systems. It addresses the need for Q&A datasets that encapsulate the variability and complexity of real end-user queries, especially in domain-specific and low-data settings. By introducing explicit controls over question and user categorizations and leveraging a lightweight two-stage generation pipeline with LLMs, DataMorgana enables fine-grained manipulation of the distribution and diversity of generated questions, supporting robust, traffic-reflective benchmark construction (Filice et al., 22 Jan 2025).
1. System Overview and Generation Pipeline
DataMorgana operates through a two-stage pipeline: (1) configuration and (2) generation. In the configuration stage, user and question categorizations (each with explicit names, probability distributions, and prompt descriptions) are specified, typically via a JSON or YAML file. In the generation stage, the system iteratively (a) samples category combinations and documents, (b) constructs LLM prompts encoding these, (c) collects $k$ candidate Q&A pairs per turn, (d) filters for constraint satisfaction, and (e) assembles the benchmark.
Schematic Workflow
[Configuration File]
↓
┌─────────────────────────────┐
│ Stage 1: Configuration │
│ – Load JSON config │
│ – Build distributions │
└─────────────────────────────┘
↓
┌─────────────────────────────┐
│ Stage 2: Generation │
│ For i in 1…N: │
│ 1) Sample categories │
│ 2) Sample document │
│ 3) Build LLM prompt │
│ 4) LLM → k candidates │
│ 5) Filter to (qᵢ,aᵢ) │
│ 6) Append to output │
└─────────────────────────────┘
↓
[Q&A Benchmark]
The generation logic is formalized in the following pseudocode:
for m in 1…M:
    C^{(m)} = category set from config
    P^{(m)} = probability distribution from config
for u in 1…U:
    U^{(u)}, Q^{(u)} analogously

B ← ∅
for i in 1…N:
    # Sample one category per categorization
    for m in 1…M:
        cᵢ^{(m)} ~ Cat(C^{(m)}, P^{(m)})
    for u in 1…U:
        uᵢ^{(u)} ~ Cat(U^{(u)}, Q^{(u)})

    dᵢ ← uniform_random(D)

    Pᵢ = build_prompt({cᵢ^{(m)}, uᵢ^{(u)}, dᵢ})
    candidates ← LLM.generate(Pᵢ, num_return_sequences=k)

    valid ← FILTER(candidates, {cᵢ^{(m)}, uᵢ^{(u)}, dᵢ})
    if valid ≠ ∅:
        select (qᵢ, aᵢ) ∈ valid at random
        B ← B ∪ {(qᵢ, aᵢ)}
    else:
        repeat sampling or fallback

return B
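The loop above can be sketched in Python as follows. This is a minimal illustration rather than the toolkit's implementation: `build_prompt`, `fake_llm`, `non_empty`, the retry limit, and the skip-on-failure fallback are all assumptions introduced here, and the LLM is replaced by a stub so the example runs offline.

```python
import random

# Illustrative config in the shape described in this article.
CONFIG = {
    "question_categorizations": [
        {"name": "factuality", "categories": [
            {"name": "factoid", "probability": 0.25, "description": "..."},
            {"name": "non-factoid-experience", "probability": 0.75, "description": "..."},
        ]},
    ],
    "user_categorizations": [
        {"name": "expertise", "categories": [
            {"name": "expert", "probability": 0.5, "description": "..."},
            {"name": "novice", "probability": 0.5, "description": "..."},
        ]},
    ],
}

def sample_categories(categorizations, rng):
    """Draw one category per categorization according to its probabilities."""
    picks = {}
    for categorization in categorizations:
        names = [c["name"] for c in categorization["categories"]]
        weights = [c["probability"] for c in categorization["categories"]]
        picks[categorization["name"]] = rng.choices(names, weights=weights, k=1)[0]
    return picks

def build_prompt(q_cats, u_cats, document):
    """Hypothetical prompt template; the toolkit's real wording is not shown here."""
    return (f"User profile: {u_cats}\nQuestion type: {q_cats}\n"
            f"Document: {document}\nWrite one question and its answer.")

def fake_llm(prompt, k):
    """Stub standing in for an LLM call that returns k (question, answer) pairs."""
    return [(f"question {j}?", f"answer {j}") for j in range(k)]

def non_empty(candidate, q_cats, u_cats, document):
    """Placeholder filter; a real filter would check category constraints."""
    question, answer = candidate
    return bool(question.strip()) and bool(answer.strip())

def generate_benchmark(config, corpus, n, k, llm, is_valid, seed=0, max_retries=3):
    """Run the sample -> prompt -> generate -> filter loop for n turns."""
    rng = random.Random(seed)
    benchmark = []
    for _ in range(n):
        for _attempt in range(max_retries):
            q_cats = sample_categories(config["question_categorizations"], rng)
            u_cats = sample_categories(config["user_categorizations"], rng)
            document = rng.choice(corpus)  # uniform over the corpus
            candidates = llm(build_prompt(q_cats, u_cats, document), k)
            valid = [c for c in candidates if is_valid(c, q_cats, u_cats, document)]
            if valid:
                benchmark.append(rng.choice(valid))
                break
        # If every retry fails, the turn is skipped (one possible fallback).
    return benchmark

benchmark = generate_benchmark(CONFIG, ["doc a", "doc b"], n=5, k=3,
                               llm=fake_llm, is_valid=non_empty)
```

Passing the LLM and the filter as plain callables keeps the loop testable without network access; swapping in a real client only requires a function with the same `(prompt, k)` signature.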
2. Specification of Categorizations and Distributions
The core mechanism for guiding Q&A diversity in DataMorgana is explicit categorization control. Each question or user categorization consists of a list of categories, each defined by:
- name (string),
- description (prompt fragment),
- probability (non-negative float; per-categorization probabilities sum to 1).
At each iteration, one category is sampled from each categorization, independently of the others. The category sampling formalism is:
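Let $C^{(m)} = \{c^{(m)}_1, \dots, c^{(m)}_{K_m}\}$ be the categories of the $m$-th question categorization, with probabilities $P^{(m)} = (p^{(m)}_1, \dots, p^{(m)}_{K_m})$ such that $\sum_{j=1}^{K_m} p^{(m)}_j = 1$. Then for each turn $i$,

$$c_i^{(m)} \sim \mathrm{Cat}\big(C^{(m)}, P^{(m)}\big), \qquad m = 1, \dots, M.$$

For user categorizations, sampled analogously from $(U^{(u)}, Q^{(u)})$, the joint probability of a sample is the product of the marginals:

$$\Pr\big(u_i^{(1)}, \dots, u_i^{(U)}\big) = \prod_{u=1}^{U} Q^{(u)}\big(u_i^{(u)}\big),$$

where $Q^{(u)}(x)$ denotes the probability that $Q^{(u)}$ assigns to category $x$.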
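As a quick sanity check of the independence assumption, the sketch below draws many samples from two toy user categorizations and compares the empirical frequency of one combination with the product of its marginals. The `intent` categorization and all probabilities are made up for illustration.

```python
import random
from collections import Counter
from math import prod

# Toy user categorizations; "intent" is a hypothetical second axis.
USER_CATEGORIZATIONS = {
    "expertise": {"expert": 0.5, "novice": 0.5},
    "intent": {"research": 0.2, "casual": 0.8},
}

def sample_one(categorizations, rng):
    """Draw one category from each categorization independently."""
    return {
        name: rng.choices(list(dist), weights=list(dist.values()), k=1)[0]
        for name, dist in categorizations.items()
    }

def joint_probability(sample, categorizations):
    """Product of the marginal probabilities of the sampled categories."""
    return prod(categorizations[name][cat] for name, cat in sample.items())

rng = random.Random(42)
draws = [sample_one(USER_CATEGORIZATIONS, rng) for _ in range(10_000)]
counts = Counter((d["expertise"], d["intent"]) for d in draws)

target = {"expertise": "novice", "intent": "research"}
expected = joint_probability(target, USER_CATEGORIZATIONS)  # 0.5 * 0.2
empirical = counts[("novice", "research")] / len(draws)
```

With enough draws, `empirical` settles close to `expected`, which is what lets a configuration author reason about the final benchmark mix directly from the per-categorization probabilities.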
Example Configuration Snippet
{
"question_categorizations": [
{
"name": "factuality",
"categories": [
{
"name": "factoid",
"probability": 0.25,
"description": "A question seeking a specific, concise piece of information..."
},
{
"name": "non-factoid-experience",
"probability": 0.75,
"description": "A question to get advice or recommendations..."
}
]
}
],
"user_categorizations": [
{
"name": "expertise",
"categories": [
{
"name": "expert",
"probability": 0.5,
"description": "a specialized user with deep understanding of the corpus."
},
{
"name": "novice",
"probability": 0.5,
"description": "a regular user with no understanding of specialized terms."
}
]
}
]
}
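A configuration in this shape is easy to check before any LLM call is made. The following sketch is an assumption about what such a check could look like, not a documented DataMorgana feature; `validate_config` is a hypothetical helper that verifies the required fields and that each categorization's probabilities are non-negative and sum to 1.

```python
import json
import math

REQUIRED_FIELDS = {"name", "probability", "description"}

def validate_config(config, tol=1e-9):
    """Return a list of problems; an empty list means the config passed."""
    problems = []
    for key in ("question_categorizations", "user_categorizations"):
        for categorization in config.get(key, []):
            label = f"{key}/{categorization.get('name', '<unnamed>')}"
            categories = categorization.get("categories", [])
            if not categories:
                problems.append(f"{label}: no categories")
                continue
            for category in categories:
                missing = REQUIRED_FIELDS - set(category)
                if missing:
                    problems.append(f"{label}: category missing {sorted(missing)}")
            probs = [c.get("probability", 0.0) for c in categories]
            if any(p < 0 for p in probs):
                problems.append(f"{label}: negative probability")
            if not math.isclose(sum(probs), 1.0, abs_tol=tol):
                problems.append(f"{label}: probabilities sum to {sum(probs):.3f}, not 1")
    return problems

RAW = """
{
  "question_categorizations": [
    {"name": "factuality", "categories": [
      {"name": "factoid", "probability": 0.25, "description": "..."},
      {"name": "non-factoid-experience", "probability": 0.75, "description": "..."}
    ]}
  ],
  "user_categorizations": [
    {"name": "expertise", "categories": [
      {"name": "expert", "probability": 0.5, "description": "..."},
      {"name": "novice", "probability": 0.5, "description": "..."}
    ]}
  ]
}
"""
config = json.loads(RAW)
problems = validate_config(config)
```

Returning a list of messages instead of raising on the first error lets a user fix every miscalibrated categorization in one pass.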
3. Diversity Metrics and Objectives
DataMorgana is optimized to generate Q&A sets exhibiting maximal diversity across lexical, syntactic, and semantic axes.
Principal Metrics
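- N-Gram Diversity (NDG): $\mathrm{NDG}(Q) = \sum_{n=1}^{N} \frac{U_n(Q)}{T_n(Q)}$, summed over n-gram orders $n = 1, \dots, N$, where $U_n(Q)$ is the number of unique n-grams in $Q$, and $T_n(Q)$ is the total number of n-grams in $Q$.
- Self-Repetition Score (SRS): $\mathrm{SRS}(Q) = \frac{|R|}{|Q|}$, where $R \subseteq Q$ is the set of questions with at least one repeated 4-gram.
- Compression Ratio (CR): $\mathrm{CR} = \frac{S_{\text{raw}}}{S_{\text{gzip}}}$, where $S_{\text{raw}}$ and $S_{\text{gzip}}$ are the raw and gzipped file sizes; assessed for both the word text and the PoS-tag sequence ("word-CR" and "PoS-CR").
- Homogenization Score (HS): $\mathrm{HS}(Q) = \frac{1}{|Q|(|Q|-1)} \sum_{i \neq j} \mathrm{sim}(e_i, e_j)$, with $e_i$ as a question embedding and $\mathrm{sim}$ a pairwise similarity; lower is better.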
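The four metrics can be approximated with the standard library alone. The sketch below is one plausible reading of the definitions and not the evaluation code behind the reported numbers: whitespace tokenization, a maximum n-gram order of 4, treating a "repeated" 4-gram as one shared with another question, and cosine similarity for the embedding score are all assumptions made here.

```python
import gzip
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_diversity(questions, max_n=4):
    """Sum over n of (unique n-grams / total n-grams) across the whole set."""
    score = 0.0
    for n in range(1, max_n + 1):
        grams = [g for q in questions for g in ngrams(q.lower().split(), n)]
        if grams:  # orders longer than every question contribute nothing
            score += len(set(grams)) / len(grams)
    return score

def self_repetition(questions, n=4):
    """Fraction of questions sharing at least one n-gram with another question."""
    per_question = [set(ngrams(q.lower().split(), n)) for q in questions]
    doc_freq = Counter(g for grams in per_question for g in grams)
    repeated = sum(any(doc_freq[g] > 1 for g in grams) for grams in per_question)
    return repeated / len(questions)

def compression_ratio(questions):
    """Raw size divided by gzipped size; more redundancy gives a higher ratio."""
    raw = "\n".join(questions).encode("utf-8")
    return len(raw) / len(gzip.compress(raw))

def cosine(a, b):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def homogenization(embeddings):
    """Mean cosine similarity over all ordered pairs i != j."""
    n = len(embeddings)
    total = sum(cosine(embeddings[i], embeddings[j])
                for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))

questions = [
    "what is the incubation period of the virus",
    "what is the incubation period of influenza",
    "how do vaccines work",
]
srs = self_repetition(questions)  # first two questions share 4-grams
```

For PoS-CR, the same `compression_ratio` function would be applied to part-of-speech tag sequences produced by a tagger, which is omitted here to keep the example dependency-free.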
A comparative evaluation on the COVID-QA setting shows that DataMorgana produces question sets with higher NDG, lower SRS, and lower PoS-CR and emb-HS than the competing baselines; on word-CR it ranks second, behind DeepEval:
| Model | NDG ↑ | SRS ↓ | word-CR ↓ | PoS-CR ↓ | emb-HS ↓ |
|---|---|---|---|---|---|
| Vanilla | 1.517 | 0.920 | 5.576 | 7.861 | 0.301 |
| KnowYour | 2.358 | 0.613 | 3.879 | 6.271 | 0.265 |
| DeepEval | 2.415 | 0.644 | 3.535 | 5.885 | 0.251 |
| DataMorgana | 2.536 | 0.372 | 3.701 | 5.583 | 0.249 |
Ablations with categories disabled confirm that question categorizations account for the majority of the observed diversity gain (Filice et al., 22 Jan 2025).
4. Implementation, Efficiency, and Optimization
Dataset synthesis in DataMorgana is computationally dominated by LLM inference, as each Q&A pair requests $k$ completions per prompt. The number of LLM calls therefore scales linearly with the number of generated pairs $N$, with an additional per-turn filtering cost over the $k$ candidates.
Efficiency enhancements include:
- Prompt-template caching: Static prompt regions are compiled once; only category/document interpolations are refreshed per generation.
- Asynchronous batching: Multiple prompts are processed concurrently if supported by the LLM interface.
- Local result caching: Optional caching of results for previously seen (document, category-tuple) pairs.
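The three enhancements above can be combined in a few lines of `asyncio`. This is a hedged sketch, not the toolkit's code: `CachedGenerator`, `fake_llm`, the prompt wording, and the concurrency limit are hypothetical, and the LLM call is simulated with a short sleep.

```python
import asyncio
from string import Template

# Static prompt text is parsed once; only the fields change per call.
PROMPT = Template("Document: $doc\nCategories: $cats\nWrite one question and its answer.")

async def fake_llm(prompt):
    """Stand-in for an asynchronous LLM client call (hypothetical)."""
    await asyncio.sleep(0.01)
    return f"Q&A for: {prompt}"

class CachedGenerator:
    """Runs prompts concurrently and caches results per (document, category tuple)."""

    def __init__(self, llm, max_concurrency=8):
        self._llm = llm
        self._max_concurrency = max_concurrency
        self._cache = {}
        self.llm_calls = 0

    async def _generate_one(self, semaphore, doc_id, categories):
        key = (doc_id, tuple(sorted(categories.items())))
        if key in self._cache:
            return self._cache[key]
        async with semaphore:  # cap the number of in-flight requests
            self.llm_calls += 1
            result = await self._llm(PROMPT.substitute(doc=doc_id, cats=key[1]))
        self._cache[key] = result
        return result

    async def generate_batch(self, jobs):
        """jobs is a list of (doc_id, categories dict); results keep job order.

        Duplicate jobs inside one batch are not deduplicated; the cache only
        helps across batches.
        """
        semaphore = asyncio.Semaphore(self._max_concurrency)
        tasks = (self._generate_one(semaphore, d, c) for d, c in jobs)
        return await asyncio.gather(*tasks)

generator = CachedGenerator(fake_llm, max_concurrency=2)
jobs = [("doc1", {"factuality": "factoid"}), ("doc2", {"factuality": "factoid"})]
first = asyncio.run(generator.generate_batch(jobs))
second = asyncio.run(generator.generate_batch(jobs))  # served from the cache
```

Note that caching by (document, category tuple) trades variety for cost, since a repeated combination returns the same pair; that is presumably why the source describes it as optional.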
5. Configuration and Usage
Benchmarks can be tuned via YAML or JSON configuration files specifying categorization hierarchies. Example YAML:
question_categorizations:
  - name: factuality
    categories:
      - name: factoid
        probability: 0.3
        description: "A concise request for a specific fact..."
      - name: open-ended
        probability: 0.7
        description: "Invites detailed or exploratory responses..."
  - name: phrasing
    categories:
      - name: concise
        probability: 0.5
        description: "Under 10 words, natural question..."
      - name: search-query
        probability: 0.5
        description: "Keyword-style search query..."

user_categorizations:
  - name: expertise
    categories:
      - name: novice
        probability: 0.6
        description: "No specialized background."
      - name: expert
        probability: 0.4
        description: "Deep domain knowledge."
Sample command-line invocation:
datamorgana generate \
--config config.yaml \
--corpus covid_corpus.jsonl \
--num-per-doc 3 \
--LLM-model claude-3.5-sonnet \
--output qa_benchmark.jsonl
Hyperparameter guidelines advise generating $k$ candidates per turn and 1–4 questions per document; category probabilities should be tuned so that even low-probability but important categories have sufficient representation (a floor of 0.05 is suggested).
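One way to enforce such a floor is to pin every under-represented category at the floor value and rescale the remaining categories so the distribution still sums to 1. The helper names below (`apply_floor`, `expected_counts`) are hypothetical and the procedure is an assumption, since the source only states the suggested floor.

```python
def apply_floor(probabilities, floor=0.05):
    """Pin categories below `floor` at `floor` and rescale the rest to sum to 1.

    Assumes the input is a dict of non-negative probabilities summing to 1.
    """
    if floor * len(probabilities) > 1:
        raise ValueError("floor is too large for this many categories")
    floored = set()
    while True:
        free = {k: p for k, p in probabilities.items() if k not in floored}
        free_mass = 1.0 - floor * len(floored)  # mass left for unpinned categories
        total = sum(free.values())
        scaled = {k: p * free_mass / total for k, p in free.items()}
        newly_low = {k for k, p in scaled.items() if p < floor}
        if not newly_low:
            return {**{k: floor for k in floored}, **scaled}
        floored |= newly_low  # rescaling may push more categories under the floor

def expected_counts(probabilities, n):
    """Expected number of samples per category after n independent draws."""
    return {k: n * p for k, p in probabilities.items()}

adjusted = apply_floor({"rare-but-important": 0.01, "common": 0.99})
counts = expected_counts(adjusted, n=200)
```

Pinning and rescaling, rather than clipping and renormalizing in one step, guarantees that no category ends up below the floor after normalization.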
6. Comparative Performance and Effect in RAG Benchmarking
Empirical results demonstrate that the combinatorial mixing across question and user categorizations results in hundreds of distinct prompt modes, driving the LLM to generate syntactically, lexically, and semantically varied Q&A instances. Compared to fixed-prompt and shallow-taxonomy baselines, this yields demonstrably superior diversity metrics across both domain-specific and general-knowledge corpora. Ablation studies indicate that fine-grained question categorizations produce the most significant improvements, with user categorizations providing additional, though smaller, variety.
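The combinatorial growth in prompt modes is easy to quantify. The sketch below enumerates every category combination for a small three-categorization configuration and attaches the probability each combination receives under independent sampling; the configuration values are illustrative and `prompt_modes` is a hypothetical helper.

```python
from itertools import product
from math import prod

# Illustrative categorizations: two question axes and one user axis.
CATEGORIZATIONS = {
    "factuality": {"factoid": 0.3, "open-ended": 0.7},
    "phrasing": {"concise": 0.5, "search-query": 0.5},
    "expertise": {"novice": 0.6, "expert": 0.4},
}

def prompt_modes(categorizations):
    """Yield (labels, probability) for every combination of one category per axis."""
    axes = [dist.items() for dist in categorizations.values()]
    for combo in product(*axes):
        labels = tuple(category for category, _ in combo)
        yield labels, prod(p for _, p in combo)

modes = dict(prompt_modes(CATEGORIZATIONS))
num_modes = len(modes)                      # 2 * 2 * 2 combinations
total_mass = sum(modes.values())            # probabilities of all modes
rarest = min(modes, key=modes.get)          # least likely combination
```

This toy configuration yields only 8 modes, but the count is the product of the categorization sizes, so four categorizations with four categories each already give 256 distinct modes.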
As benchmark size scales (via more questions per document or increased corpus breadth), DataMorgana maintains advantageous diversity-to-scale ratios, providing greater topic and stylistic spread than the alternatives. This is particularly relevant for robust and realistic evaluation of RAG systems, ensuring that test Q&A pairs reflect the complex, high-variance nature of actual user queries (Filice et al., 22 Jan 2025).
7. Availability and Prospects
The toolkit will be released for controlled beta-testing by select research groups, particularly in conjunction with the SIGIR'2025 LiveRAG challenge. The lightweight, modular design is expected to facilitate rapid iterations and adaptation to diverse RAG benchmarking scenarios. Future directions may involve broader access, expanded category schema, and integration with additional LLM ecosystems—each aimed at fostering more realistic, coverage-complete RAG evaluation corpora.