DataMorgana Tool for Synthetic Q&A Benchmarks

Updated 14 February 2026
  • DataMorgana is a lightweight, highly-configurable tool that creates synthetic Q&A benchmarks for evaluating retrieval-augmented generation systems using customizable configurations.
  • It employs a two-stage pipeline that first defines user and question categorizations via a JSON file and then generates diverse Q&A pairs through efficient LLM processing.
  • Evaluation metrics like N-Gram Diversity, Self-Repetition Score, and Embeddings Homogenization confirm its superior performance compared to static and evolutionary methods.

DataMorgana is a lightweight, highly-configurable tool designed for generating synthetic question–answer (Q&A) benchmarks targeted at Retrieval-Augmented Generation (RAG) system evaluation. The tool addresses core challenges in constructing robust and diverse evaluation datasets for RAG settings, particularly in domain-specific applications where real-world data may be scarce or imbalanced. By enabling detailed control over user and question categories and their empirical distributions, DataMorgana aims to better mirror diverse user interactions observed in deployed RAG systems, ensuring thorough assessment of RAG performance across a spectrum of query and user types (Filice et al., 22 Jan 2025).

1. System Architecture and Two-Stage Generation Pipeline

DataMorgana operates via a two-stage generation pipeline, prioritizing both customization and computational efficiency:

  • Stage A: Configuration. Practitioners define the “shape” of expected RAG-system traffic by specifying user categorizations (e.g., expertise levels such as “novice” or “expert”; occupational roles such as “researcher” or “public-health official”) and question categorizations (e.g., question type, phrasing style, degree of factuality, linguistic distance). These are encoded in a single JSON file, with mutually exclusive category definitions, each accompanied by a name, natural-language description, and target probability p_c. No pre-processing of the underlying corpus, knowledge graph construction, or heavy pipeline optimization is required.
  • Stage B: Generation. For each required Q&A pair, the tool:

    1. Samples one category from each user-level and question-level categorization according to the defined categorical distribution (P(\text{category}=c) = p_c).
    2. Samples a document d_i from the RAG corpus.
    3. Synthesizes a prompt including both the relevant category descriptions and the input document (or excerpt), and instructs the LLM (e.g., Claude-3.5 Sonnet) to generate k diverse candidate Q&A pairs in JSON format.
    4. Filters the resulting candidates for both faithfulness to the document and compliance with category constraints, and randomly selects a single valid pair to add to the benchmark.

Single-pass prompt orchestration (with optional lightweight filtering) distinguishes DataMorgana from prior multi-step or graph-building solutions, enabling efficient generation and rapid iteration (Filice et al., 22 Jan 2025).
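The per-pair generation loop above can be sketched in Python roughly as follows. This is a minimal illustration only: the function names, the stubbed LLM call, and the validity check are assumptions for the sketch, not DataMorgana's actual API.

```python
import json
import random

def sample_categories(categorizations, rng):
    """Pick one category per categorization, weighted by its target probability."""
    chosen = {}
    for name, cats in categorizations.items():
        labels = [c["name"] for c in cats]
        probs = [c["probability"] for c in cats]
        chosen[name] = rng.choices(labels, weights=probs, k=1)[0]
    return chosen

def generate_pair(corpus, categorizations, llm, k=3, rng=random):
    """Sample categories and a document, ask the LLM for k candidates,
    filter, and return one valid Q&A pair (or None if all were rejected)."""
    chosen = sample_categories(categorizations, rng)
    doc = rng.choice(corpus)
    prompt = (
        f"Document:\n{doc}\n\n"
        f"Target categories: {json.dumps(chosen)}\n"
        f"Generate {k} diverse Q&A pairs as a JSON list."
    )
    candidates = llm(prompt)  # assumed to return a list of {"question", "answer"} dicts
    # Stand-in validity check; the real tool also filters for faithfulness
    # to the document and compliance with the sampled categories.
    valid = [c for c in candidates if c.get("question") and c.get("answer")]
    return rng.choice(valid) if valid else None
```

In the real pipeline the `llm` callable would wrap an actual model call (e.g., Claude-3.5 Sonnet) and the filter would be LLM- or embedding-based rather than a simple key check.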

2. Configuration Model and Sampling Logic

The DataMorgana configuration model is centered on a set of independent categorizations for both user and question characteristics. Each categorization C is represented as a finite set \{c_1, \dots, c_m\}, where each c_j has:

  • Name n_{c_j},

  • Natural-language description \mathrm{desc}_{c_j},

  • Target probability p_{c_j}, such that \sum_{j=1}^{m} p_{c_j} = 1.

Sampling proceeds independently for each categorization using a categorical distribution:

c \sim \mathrm{Categorical}(p_{c_1}, \dots, p_{c_m})

The joint distribution over multiple categorizations is thus the product of their marginals, yielding a combinatorial space of user–question category intersections that drives benchmark diversity. The explicit configuration is specified in a structured JSON schema, directly linking design intent and synthetic data characteristics. For example, a “Phrasing” categorization may include entries such as:

```json
{
  "name": "Phrasing",
  "categories": [
    {"name": "concise-and-natural", "probability": 0.30, "description": "..."},
    ...
  ]
}
```

This approach offers fine-grained control over both categorical compositions and empirical frequency distributions within the benchmark.
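Because the categorizations are sampled independently, the joint distribution over combinations is simply the product of the per-categorization marginals. The toy sketch below (with invented category names and probabilities) enumerates that combinatorial space:

```python
import itertools

# Invented example categorizations; each maps category name -> target probability.
categorizations = {
    "Expertise": {"novice": 0.6, "expert": 0.4},
    "Phrasing":  {"concise-and-natural": 0.3, "verbose": 0.7},
}

def joint_distribution(categorizations):
    """Enumerate every cross-categorization combination with its joint
    probability, computed as the product of the marginal probabilities."""
    joint = {}
    for combo in itertools.product(*(c.items() for c in categorizations.values())):
        labels = tuple(name for name, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        joint[labels] = prob
    return joint
```

With two categorizations of two categories each, this yields four user–question combinations whose probabilities sum to 1 (e.g., ("novice", "concise-and-natural") occurs with probability 0.6 × 0.3 = 0.18).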

3. Diversity Metrics and Benchmark Evaluation

DataMorgana benchmarks are evaluated across three principal axes of diversity: lexical, syntactic, and semantic. These are quantified using established corpus-level metrics:

  • Lexical Diversity

    • N-Gram Diversity (NDG):

    \mathrm{NDG}(B) = \sum_{n=1}^{4} \frac{|\text{unique } n\text{-grams in } B|}{|\text{all } n\text{-grams in } B|}

    Higher NDG indicates greater word-sequence variability.

    • Self-Repetition Score (SRS):

    \mathrm{SRS}(B) = \frac{\#\{q \in B : q \text{ contains a 4-gram appearing elsewhere}\}}{|B|}

    Lower values are preferable.

    • Word-Compression Ratio (word-CR): The ratio of corpus size to gzip-compressed size; lower ratios reflect higher diversity and reduced redundancy.

  • Syntactic Diversity

    • PoS-Compression Ratio (PoS-CR): Analogous to word-CR, but applied to the sequence of Part-of-Speech tags. Lower PoS-CR values indicate more varied syntactic constructions.
  • Semantic Diversity

    • Embeddings-Homogenization Score (emb-HS):

    \mathrm{HS}(B) = \frac{1}{|B|(|B|-1)} \sum_{q \neq q'} \cos(\mathrm{emb}(q), \mathrm{emb}(q'))

    Calculated using all-MiniLM-L6-v2. Lower scores indicate more semantically distinct questions.
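The lexical metrics above can be reproduced with the standard library alone. The sketch below follows the NDG, SRS, and word-CR definitions as stated (tokenization by whitespace is a simplifying assumption; the paper's exact preprocessing may differ):

```python
import gzip

def ngrams(tokens, n):
    """All contiguous n-grams of a token list (empty if too short)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ndg(questions):
    """N-Gram Diversity: sum over n = 1..4 of unique/total n-gram ratios."""
    score = 0.0
    for n in range(1, 5):
        grams = [g for q in questions for g in ngrams(q.split(), n)]
        if grams:
            score += len(set(grams)) / len(grams)
    return score

def srs(questions):
    """Self-Repetition Score: fraction of questions that share at least
    one 4-gram with some other question in the benchmark."""
    per_q = [set(ngrams(q.split(), 4)) for q in questions]
    repeated = sum(
        1 for i, grams in enumerate(per_q)
        if grams & set().union(*per_q[:i], *per_q[i + 1:])
    )
    return repeated / len(questions)

def word_cr(questions):
    """Word-Compression Ratio: raw bytes / gzip bytes (lower = more diverse)."""
    raw = "\n".join(questions).encode()
    return len(raw) / len(gzip.compress(raw))
```

For example, in a set where two of three questions share the 4-gram "what is the capital", SRS is 2/3; a benchmark of identical questions scores SRS = 1.0.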

Ablation studies demonstrate that excluding question categorizations significantly degrades all diversity metrics, confirming their primary role in maximizing diversity (Filice et al., 22 Jan 2025).

4. Experimental Validation and Comparative Results

DataMorgana has been quantitatively and qualitatively benchmarked against leading baselines using both domain-specific and open-domain corpora:

  • CORD-19 (healthcare): 147 documents, yielding 2,019 Q&A pairs.
  • Wikipedia (general knowledge): 2,889 passages.

The following methods were compared, controlled for identical prompt budgets and LLM versions:

  • Vanilla: a single static prompt with no category control,
  • Know Your RAG: three fixed question types via a three-step LLM pipeline,
  • DeepEval: one-step evolutionary diversification with seven random transformations.

On CORD-19, DataMorgana achieved best or near-best results across the diversity metrics:

  • NDG = 2.536 (vs. 1.517 Vanilla, 2.415 DeepEval)
  • SRS = 0.372 (vs. 0.920 Vanilla, 0.644 DeepEval)
  • word-CR = 3.701 (vs. 5.576 Vanilla, 3.535 DeepEval)
  • PoS-CR = 5.583 (vs. 7.861 Vanilla, 5.885 DeepEval)
  • emb-HS = 0.249 (vs. 0.301 Vanilla, 0.251 DeepEval)

On Wikipedia, DataMorgana led in NDG and PoS-CR and matched or improved upon semantic diversity metrics.

These results collectively indicate DataMorgana’s ability to systematically improve the breadth of benchmark coverage compared to static and evolutionary diversification strategies (Filice et al., 22 Jan 2025).

5. Usage Workflow and Practical Customization

DataMorgana is designed for ease of integration into RAG system development and evaluation pipelines. The typical workflow consists of:

  1. Preparing the RAG corpus (plain text or JSONL).
  2. Authoring a JSON configuration that encodes user and question categorizations and their target distributions.
  3. Executing the tool from the command line:
    ```shell
    datamorgana generate \
        --config config.json \
        --corpus path/to/corpus.jsonl \
        --LLM claude-3.5 \
        --k 3 \
        --num-per-doc 1 \
        --output synthetic_benchmark.jsonl
    ```
  4. Inspecting and optionally filtering the output.
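The final inspection-and-filtering step can be done with a few lines of Python over the output JSONL. The sketch below uses a simple lexical dedup (dropping any question that repeats a 4-gram already seen) as an illustrative stand-in; the tool's own `--filter-thresholds` option applies semantic-similarity cutoffs instead.

```python
import json

def dedup_questions(jsonl_lines):
    """Keep only Q&A pairs whose question introduces no previously seen
    4-gram; a crude lexical proxy for semantic post-filtering."""
    seen = set()
    kept = []
    for line in jsonl_lines:
        pair = json.loads(line)
        toks = pair["question"].split()
        grams = {tuple(toks[i:i + 4]) for i in range(len(toks) - 3)}
        if grams & seen:
            continue  # near-duplicate phrasing; drop this pair
        seen |= grams
        kept.append(pair)
    return kept
```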

Key parameters include:

  • --k: Number of candidate Q&A pairs per prompt (default 3).
  • --num-per-doc: Number of Q&A pairs generated per document.
  • --seed: Random seed for reproducibility.
  • --filter-thresholds: Semantic similarity cutoffs for post-filtering.

Benchmark customizations are recommended for domain-specific deployments, such as organizing user roles (“doctor,” “patient,” etc.) and calibrating probability mass to reflect anticipated traffic (e.g., higher prevalence of factoid queries). Iterative refinement of category descriptions via the JSON–prompt cycle is strongly supported by DataMorgana’s lightweight design (Filice et al., 22 Jan 2025).
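A domain-specific user categorization along these lines might look as follows. This fragment mirrors the field names of the “Phrasing” example above; the descriptions and probabilities are invented for illustration, and the exact schema should be checked against the tool's documentation.

```json
{
  "name": "UserRole",
  "categories": [
    {"name": "doctor", "probability": 0.25,
     "description": "A clinician asking precise, terminology-heavy questions."},
    {"name": "patient", "probability": 0.75,
     "description": "A layperson asking simply phrased questions about a condition."}
  ]
}
```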

6. Significance and Prospects

By unifying expressive, interpretable configurations with an efficient one-pass LLM generation and filtering paradigm, DataMorgana enables rapid production of synthetic Q&A benchmarks that capture varied and realistic user–question distributions in RAG contexts. The empirical evidence supports its state-of-the-art performance in maximizing lexicosemantic and syntactic diversity while remaining computationally lightweight and flexible (Filice et al., 22 Jan 2025).

DataMorgana was made available to select research groups as beta testers (notably in the context of the SIGIR'2025 LiveRAG challenge). Its methodologically transparent design and robust empirical evaluation suggest it may serve as a foundational tool for both benchmarking and advancing model robustness and adaptation in retrieval-augmented generation research.

References

  • Filice et al., 22 Jan 2025.
