DataMorgana Toolkit

Updated 15 February 2026

DataMorgana Toolkit is a configurable system that creates synthetic Q&A benchmarks by sampling explicit question and user categorizations with controlled probability distributions.
It employs a two-stage pipeline that first configures category distributions and then generates and filters LLM-produced Q&A candidates for constraint satisfaction.
Benchmark evaluations show improved diversity metrics, including higher NDG and lower SRS and homogenization scores, compared to fixed-prompt baselines.

DataMorgana is a configurable toolkit for generating highly diverse and customizable synthetic Q&A benchmarks, designed specifically for evaluating Retrieval-Augmented Generation (RAG) systems. It addresses the need for Q&A datasets that encapsulate the variability and complexity of real end-user queries, especially in domain-specific and low-data settings. By introducing explicit controls over question and user categorizations and leveraging a lightweight two-stage generation pipeline with LLMs, DataMorgana enables fine-grained manipulation of the distribution and diversity of generated questions, supporting robust, traffic-reflective benchmark construction (Filice et al., 22 Jan 2025).

1. System Overview and Generation Pipeline

DataMorgana operates through a two-stage pipeline: (1) configuration, and (2) generation. In the configuration stage, user and question categorizations—each with explicit names, probability distributions, and prompt descriptions—are specified, typically via a JSON or YAML file. In the generation stage, the system iteratively (a) samples category combinations and documents, (b) constructs LLM prompts encoding these, (c) collects $k$ candidate Q&A pairs per turn, (d) filters for constraint satisfaction, and (e) assembles the benchmark.

Schematic Workflow

[Configuration File]
        ↓
┌───────────────────────────┐
│  Stage 1: Configuration   │
│  – Load JSON config       │
│  – Build distributions    │
└───────────────────────────┘
        ↓
┌───────────────────────────┐
│  Stage 2: Generation      │
│  For i in 1…N:            │
│    1) Sample categories   │
│    2) Sample document     │
│    3) Build LLM prompt    │
│    4) LLM → k candidates  │
│    5) Filter to (qᵢ,aᵢ)   │
│    6) Append to output    │
└───────────────────────────┘
        ↓
   [Q&A Benchmark]

The generation logic is formalized in the following pseudocode:

# Stage 1: load categorizations from the configuration
for m in 1…M:                      # question categorizations
    C^{(m)} ← category set from config
    P^{(m)} ← probability distribution over C^{(m)} from config
for u in 1…U:                      # user categorizations
    U^{(u)}, Q^{(u)} ← category set and distribution, analogously

# Stage 2: generate N Q&A pairs from the corpus D
B ← ∅
for i in 1…N:
    # Sample one category per categorization
    for m in 1…M:
        cᵢ^{(m)} ~ Cat(C^{(m)}, P^{(m)})
    for u in 1…U:
        uᵢ^{(u)} ~ Cat(U^{(u)}, Q^{(u)})

    dᵢ ← uniform_random(D)
    promptᵢ ← build_prompt({cᵢ^{(m)}, uᵢ^{(u)}, dᵢ})
    candidates ← LLM.generate(promptᵢ, num_return_sequences=k)
    valid ← FILTER(candidates, {cᵢ^{(m)}, uᵢ^{(u)}, dᵢ})
    if valid ≠ ∅:
        select (qᵢ, aᵢ) ∈ valid at random
        B ← B ∪ {(qᵢ, aᵢ)}
    else:
        resample for turn i, or apply a fallback

return B
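
The loop above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the toolkit's actual code: `build_prompt`, `fake_llm`, `accept_all`, the `max_retries` fallback policy, and the prompt wording are all hypothetical stand-ins, since the source does not give the real template, LLM client, or filter.

```python
import random


def sample_categories(categorizations, rng):
    """Draw one category (a dict) from each categorization, using its probabilities."""
    chosen = {}
    for categorization in categorizations:
        categories = categorization["categories"]
        weights = [c["probability"] for c in categories]
        chosen[categorization["name"]] = rng.choices(categories, weights=weights, k=1)[0]
    return chosen


def build_prompt(question_cats, user_cats, document):
    """Hypothetical prompt layout; the toolkit's real template is not given in the source."""
    user_part = "; ".join(c["description"] for c in user_cats.values())
    question_part = "; ".join(c["description"] for c in question_cats.values())
    return (
        f"Document:\n{document}\n\n"
        f"User profile: {user_part}\n"
        f"Question type: {question_part}\n"
        "Write one question this user might ask and its answer."
    )


def generate_benchmark(config, corpus, llm_generate, is_valid, n, k=3, max_retries=5, seed=0):
    """Run n turns; a turn whose candidates are all rejected is resampled up to max_retries times."""
    rng = random.Random(seed)
    benchmark = []
    for _ in range(n):
        for _attempt in range(max_retries):
            question_cats = sample_categories(config["question_categorizations"], rng)
            user_cats = sample_categories(config["user_categorizations"], rng)
            document = rng.choice(corpus)
            prompt = build_prompt(question_cats, user_cats, document)
            candidates = llm_generate(prompt, k)
            valid = [c for c in candidates if is_valid(c, question_cats, user_cats, document)]
            if valid:
                benchmark.append(rng.choice(valid))
                break
    return benchmark


def fake_llm(prompt, k):
    """Stand-in for a real LLM client: returns k placeholder (question, answer) pairs."""
    return [(f"placeholder question {j}?", f"placeholder answer {j}") for j in range(k)]


def accept_all(candidate, question_cats, user_cats, document):
    """Stand-in filter; a real one would check the candidate against the sampled categories."""
    return True


demo_config = {
    "question_categorizations": [{"name": "factuality", "categories": [
        {"name": "factoid", "probability": 0.25, "description": "a specific, concise fact"},
        {"name": "non-factoid-experience", "probability": 0.75, "description": "advice or recommendations"},
    ]}],
    "user_categorizations": [{"name": "expertise", "categories": [
        {"name": "expert", "probability": 0.5, "description": "a specialized user"},
        {"name": "novice", "probability": 0.5, "description": "a regular user"},
    ]}],
}
benchmark = generate_benchmark(demo_config, ["doc one", "doc two"], fake_llm, accept_all, n=4, k=2)
print(len(benchmark), "pairs generated")
```

In this sketch a turn that exhausts its retries is simply skipped, so the result may contain fewer than `n` pairs; that is one possible reading of the "fallback" step, chosen here for brevity.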

2. Specification of Categorizations and Distributions

The core mechanism for guiding Q&A diversity in DataMorgana is explicit categorization control. Each question or user categorization consists of a list of categories, each defined by:

name (string),
description (prompt fragment),
probability (non-negative float; per-categorization probabilities sum to 1).

At each iteration, one category is sampled from every categorization, independently of the others. The category sampling formalism is:

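Let $C^{(m)} = \{c^{(m)}_1, \ldots, c^{(m)}_{K_m}\}$ denote the categories of the $m$-th question categorization and $P^{(m)} = (p^{(m)}_1, \ldots, p^{(m)}_{K_m})$ their probabilities, with $\sum_{k} p^{(m)}_k = 1$. Then for each turn $i$ (the subscript $i$ indexes the turn, $k$ the category within $C^{(m)}$),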

$c^{(m)}_i \sim \mathrm{Cat}(C^{(m)}, P^{(m)}), \quad \Pr[c^{(m)}_i = c^{(m)}_k] = p^{(m)}_k.$

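With $M$ question categorizations and $U$ user categorizations sampled independently, the joint probability of drawing question categories $k_1, \ldots, k_M$ and user categories $\ell_1, \ldots, \ell_U$ is the product of the marginals, where $q^{(u)}_{\ell}$ denotes the probability of the $\ell$-th category in the $u$-th user categorization: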

$P_{\mathrm{joint}} = \prod_{m=1}^M p^{(m)}_{k_m} \prod_{u=1}^U q^{(u)}_{\ell_u}.$
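
As a small worked illustration of the product formula, the sketch below multiplies marginals for one sampled combination and checks that the joint probabilities over all combinations sum to 1. The probabilities mirror the example configuration snippet in this section; the flattened `EXAMPLE` dictionary and the function names are assumptions made for brevity.

```python
import math
from itertools import product

# Marginal probabilities per categorization (values taken from the example snippet).
EXAMPLE = {
    "factuality": {"factoid": 0.25, "non-factoid-experience": 0.75},  # question categorization
    "expertise": {"expert": 0.5, "novice": 0.5},                      # user categorization
}


def joint_probability(marginals, chosen):
    """Multiply the marginal probability of the chosen category in every categorization."""
    return math.prod(marginals[name][chosen[name]] for name in marginals)


def all_combinations(marginals):
    """Yield every category combination together with its joint probability."""
    names = list(marginals)
    for combo in product(*(marginals[name] for name in names)):
        chosen = dict(zip(names, combo))
        yield chosen, joint_probability(marginals, chosen)


p = joint_probability(EXAMPLE, {"factuality": "factoid", "expertise": "novice"})
print(p)  # 0.25 * 0.5 = 0.125
print(sum(prob for _, prob in all_combinations(EXAMPLE)))
```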

Example Configuration Snippet

{
  "question_categorizations": [
    {
      "name": "factuality",
      "categories": [
        {
          "name": "factoid",
          "probability": 0.25,
          "description": "A question seeking a specific, concise piece of information..."
        },
        {
          "name": "non-factoid-experience",
          "probability": 0.75,
          "description": "A question to get advice or recommendations..."
        }
      ]
    }
  ],
  "user_categorizations": [
    {
      "name": "expertise",
      "categories": [
        {
          "name": "expert",
          "probability": 0.5,
          "description": "a specialized user with deep understanding of the corpus."
        },
        {
          "name": "novice",
          "probability": 0.5,
          "description": "a regular user with no understanding of specialized terms."
        }
      ]
    }
  ]
}
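
Before sampling, it can help to check that a configuration of this shape is well-formed. The following is a minimal sketch of such a check; `validate_config` is a hypothetical helper, not a documented part of the toolkit, and it only enforces the three fields and the sum-to-1 constraint described above.

```python
import json
import math


def validate_config(config, tol=1e-9):
    """Return a list of human-readable problems; an empty list means the config passed."""
    problems = []
    for key in ("question_categorizations", "user_categorizations"):
        for categorization in config.get(key, []):
            label = f"{key}/{categorization.get('name', '?')}"
            categories = categorization.get("categories", [])
            if not categories:
                problems.append(f"{label}: no categories")
                continue
            probs = []
            for category in categories:
                missing = {"name", "description", "probability"} - category.keys()
                if missing:
                    problems.append(f"{label}: category missing {sorted(missing)}")
                probs.append(category.get("probability", 0.0))
            if any(p < 0 for p in probs):
                problems.append(f"{label}: negative probability")
            if not math.isclose(sum(probs), 1.0, abs_tol=tol):
                problems.append(f"{label}: probabilities sum to {sum(probs)}, expected 1")
    return problems


raw = """
{"user_categorizations": [{"name": "expertise", "categories": [
    {"name": "expert", "probability": 0.5, "description": "a specialized user"},
    {"name": "novice", "probability": 0.5, "description": "a regular user"}]}]}
"""
print(validate_config(json.loads(raw)))
```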

3. Diversity Metrics and Objectives

DataMorgana is designed to generate Q&A sets that are diverse along lexical, syntactic, and semantic axes.

Principal Metrics

N-Gram Diversity (NDG):

$\mathrm{NDG}(B) = \sum_{n=1}^{4} \frac{|U_n(B)|}{|T_n(B)|}$

where $U_n(B)$ is the set of unique n-grams in $B$, and $T_n(B)$ is the collection of all n-gram occurrences in $B$, so each term is the fraction of n-grams that are distinct.

Self-Repetition Score (SRS):

$\mathrm{SRS}(B)=\frac{|R|}{|B|} \quad (\downarrow\text{ preferred})$

$R$ is the set of questions with at least one repeated 4-gram.

Compression Ratio (CR):

$\mathrm{CR}(B) = \frac{|B|}{|B|_{(\mathrm{gz})}} \quad (\downarrow\text{ preferred})$

$|B|$ and $|B|_{(\mathrm{gz})}$ are the raw and gzipped file sizes; assessed for both word text and PoS-tag sequence ("word-CR" and "PoS-CR").

Homogenization Score (HS):

$\mathrm{HS}(B) = \frac{1}{|B|(|B|-1)} \sum_{q \ne q' \in B} \cos(e(q), e(q'))$

with $e(q)$ as the embedding of question $q$; the sum runs over all ordered pairs of distinct questions, and lower is better.
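
The four metrics above can be approximated with the standard library alone. The sketch below is illustrative rather than the evaluation code used in the paper, and it rests on several assumptions: whitespace tokenization on lowercased text, n-grams that do not cross question boundaries, a "repeated" 4-gram meaning one that also occurs in another question, and embeddings supplied by the caller as plain lists of floats. PoS-CR would apply `compression_ratio` to PoS-tag sequences produced by a tagger, which is not shown.

```python
import gzip
import math
from collections import Counter


def tokenize(question):
    """Assumed tokenization: lowercase and split on whitespace."""
    return question.lower().split()


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_diversity(questions, max_n=4):
    """Sum over n = 1..max_n of (unique n-grams) / (total n-grams)."""
    score = 0.0
    for n in range(1, max_n + 1):
        grams = [g for q in questions for g in ngrams(tokenize(q), n)]
        if grams:
            score += len(set(grams)) / len(grams)
    return score


def self_repetition_score(questions, n=4):
    """Fraction of questions sharing at least one n-gram with another question."""
    if not questions:
        return 0.0
    per_question = [set(ngrams(tokenize(q), n)) for q in questions]
    question_count = Counter(g for grams in per_question for g in grams)
    repeated = sum(1 for grams in per_question if any(question_count[g] > 1 for g in grams))
    return repeated / len(questions)


def compression_ratio(texts):
    """Raw byte size divided by gzip-compressed byte size of the joined texts."""
    raw = "\n".join(texts).encode("utf-8")
    return len(raw) / len(gzip.compress(raw))


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def homogenization_score(embeddings):
    """Mean cosine similarity over all ordered pairs of distinct embeddings."""
    n = len(embeddings)
    if n < 2:
        return 0.0
    total = sum(cosine(embeddings[i], embeddings[j]) for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))


questions = [
    "what is the capital of france",
    "what is the capital of spain",
    "how tall is mount everest",
]
print(ngram_diversity(questions))
print(self_repetition_score(questions))  # first two share "what is the capital" -> 2/3
print(compression_ratio(questions))
```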

A comparative evaluation on the COVID-QA setting shows that DataMorgana produces question sets with higher NDG, lower SRS, and lower PoS-CR and emb-HS than the competing baselines; on word-CR it ranks second, behind DeepEval (3.701 vs. 3.535):

Model	NDG ↑	SRS ↓	word-CR ↓	PoS-CR ↓	emb-HS ↓
Vanilla	1.517	0.920	5.576	7.861	0.301
KnowYour	2.358	0.613	3.879	6.271	0.265
DeepEval	2.415	0.644	3.535	5.885	0.251
DataMorgana	2.536	0.372	3.701	5.583	0.249

Ablations with categories disabled confirm that question categorizations account for the majority of the observed diversity gain (Filice et al., 22 Jan 2025).

4. Implementation, Efficiency, and Optimization

Dataset synthesis in DataMorgana is computationally dominated by LLM inference, since each of the $N$ turns requests $k$ completions for its prompt. This gives $O(Nk)$ LLM completions in total, plus a filtering cost of $O(kL)$ per turn (with $L$ presumably the length of a candidate).

Efficiency enhancements include:

Prompt-template caching: Static prompt regions are compiled once; only category/document interpolations are refreshed per generation.
Asynchronous batching: Multiple prompts are processed concurrently if supported by the LLM interface.
Local result caching: Optional caching of results for previously seen (document, category-tuple) pairs.
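
A rough sketch of how these three ideas could fit together with `asyncio` is shown below. Everything here is an assumption for illustration: the `generate_all` function, the job tuple layout, the prompt wording, and `fake_llm` are hypothetical, and a real deployment would substitute its own async LLM client.

```python
import asyncio
from string import Template

# The static prompt text is defined once; only the fields change per job.
PROMPT_TEMPLATE = Template("Document:\n$document\n\nCategories: $categories\nWrite a Q&A pair.")


async def generate_all(jobs, llm, k=3, max_concurrency=4):
    """jobs is a list of (doc_id, document_text, category_tuple); llm is an async callable."""
    semaphore = asyncio.Semaphore(max_concurrency)  # caps the number of in-flight requests
    cache = {}  # (doc_id, category_tuple) -> task producing the k candidates

    async def call(prompt):
        async with semaphore:
            return await llm(prompt, k)

    tasks = []
    for doc_id, document, categories in jobs:
        key = (doc_id, categories)
        if key not in cache:
            prompt = PROMPT_TEMPLATE.substitute(document=document, categories=", ".join(categories))
            cache[key] = asyncio.ensure_future(call(prompt))
        tasks.append(cache[key])  # repeated keys reuse the same pending task
    return await asyncio.gather(*tasks)


async def fake_llm(prompt, k):
    """Stand-in for a real async LLM client."""
    await asyncio.sleep(0.01)
    return [f"candidate {j} for a prompt of {len(prompt)} characters" for j in range(k)]


jobs = [
    ("d1", "text of document one", ("factoid", "novice")),
    ("d2", "text of document two", ("open-ended", "expert")),
    ("d1", "text of document one", ("factoid", "novice")),  # duplicate, served from the cache
]
results = asyncio.run(generate_all(jobs, fake_llm, k=2))
print(len(results), "result lists")
```

Storing the pending task rather than the finished result means two concurrent jobs with the same key trigger a single LLM call. Note that reusing cached results trades diversity for cost, which is presumably why the source describes this cache as optional.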

5. Configuration and Usage

Benchmarks can be tuned via YAML or JSON configuration files that specify the question and user categorizations. Example YAML:

question_categorizations:
  - name: factuality
    categories:
      - name: factoid
        probability: 0.3
        description: "A concise request for a specific fact..."
      - name: open-ended
        probability: 0.7
        description: "Invites detailed or exploratory responses..."
  - name: phrasing
    categories:
      - name: concise
        probability: 0.5
        description: "Under 10 words, natural question..."
      - name: search-query
        probability: 0.5
        description: "Keyword‐style search query..."
user_categorizations:
  - name: expertise
    categories:
      - name: novice
        probability: 0.6
        description: "No specialized background."
      - name: expert
        probability: 0.4
        description: "Deep domain knowledge."

Sample command-line invocation:

datamorgana generate \
    --config config.yaml \
    --corpus covid_corpus.jsonl \
    --num-per-doc 3 \
    --LLM-model claude-3.5-sonnet \
    --output qa_benchmark.jsonl

Hyperparameter guidelines advise $k=2\text{–}5$ candidates per turn and 1–4 questions per document; category probabilities should be tuned so that even low-probability but important categories have sufficient representation (a floor of 0.05 is suggested).
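
One way to apply the suggested floor is to pin any category below it at the floor and rescale the remaining categories so the distribution still sums to 1. The helper below is a hypothetical sketch of that idea, not a documented toolkit feature; simple clip-and-renormalize is avoided because it can push a pinned category back under the floor.

```python
def apply_floor(probabilities, floor=0.05):
    """Return probabilities where every entry is >= floor and the total is still 1."""
    if floor * len(probabilities) > 1:
        raise ValueError("floor is too high for this many categories")
    pinned = set()  # indices held at exactly `floor`
    result = list(probabilities)
    while True:
        low = {i for i, p in enumerate(result) if i not in pinned and p < floor}
        if not low:
            return result
        pinned |= low
        free = [i for i in range(len(probabilities)) if i not in pinned]
        free_mass = 1.0 - floor * len(pinned)            # mass left for unpinned entries
        free_total = sum(probabilities[i] for i in free)  # their original combined mass
        for i in pinned:
            result[i] = floor
        for i in free:
            result[i] = probabilities[i] * free_mass / free_total


adjusted = apply_floor([0.01, 0.99])
print(adjusted)
```

With a floor of 0.05, a benchmark of 200 questions would contain about 10 questions from the rarest category in expectation (0.05 × 200).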

6. Comparative Performance and Effect in RAG Benchmarking

Because one category is drawn independently from each of the $M$ question and $U$ user categorizations, the number of distinct category combinations grows multiplicatively and can reach hundreds of distinct prompt modes, driving the LLM to generate syntactically, lexically, and semantically varied Q&A instances. Compared to fixed-prompt and shallow-taxonomy baselines, this yields better diversity metrics across both domain-specific and general-knowledge corpora. Ablation studies indicate that fine-grained question categorizations produce the most significant improvements, with user categorizations providing additional, though smaller, variety.
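
The number of prompt modes is simply the product of the category counts across all categorizations. The short sketch below counts them for a configuration shaped like the YAML example in Section 5; `count_prompt_modes` and the abbreviated `section5_like` dictionary are illustrative assumptions.

```python
from math import prod


def count_prompt_modes(config):
    """Number of distinct category combinations: the product of category counts."""
    sizes = [
        len(categorization["categories"])
        for key in ("question_categorizations", "user_categorizations")
        for categorization in config.get(key, [])
    ]
    return prod(sizes)  # an empty config gives 1, i.e. a single fixed prompt


def categorization(name, *category_names):
    return {"name": name, "categories": [{"name": c} for c in category_names]}


section5_like = {
    "question_categorizations": [
        categorization("factuality", "factoid", "open-ended"),
        categorization("phrasing", "concise", "search-query"),
    ],
    "user_categorizations": [categorization("expertise", "novice", "expert")],
}
print(count_prompt_modes(section5_like))  # 2 * 2 * 2 = 8
```

A hypothetical richer setup with categorizations of sizes 4, 3, 5, and 4 would already give 240 combinations, which is the regime the "hundreds of prompt modes" figure refers to.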

As the benchmark grows (via more questions per document or a broader corpus), DataMorgana continues to provide greater topic and stylistic spread than the alternatives. This is particularly relevant for robust and realistic evaluation of RAG systems, ensuring test Q&A pairs reflect the complex, high-variance nature of actual user queries (Filice et al., 22 Jan 2025).

7. Availability and Prospects

The toolkit will be released for controlled beta-testing by select research groups, particularly in conjunction with the SIGIR 2025 LiveRAG Challenge. The lightweight, modular design is expected to facilitate rapid iteration and adaptation to diverse RAG benchmarking scenarios. Future directions may involve broader access, expanded category schemas, and integration with additional LLM ecosystems, each aimed at fostering more realistic RAG evaluation corpora with broader coverage.

References (1)

Filice et al. (22 Jan 2025). Generating Diverse Q&A Benchmarks for RAG Evaluation with DataMorgana.
