DataMorgana Tool for Synthetic Q&A Benchmarks
- DataMorgana is a lightweight, highly configurable tool that creates synthetic Q&A benchmarks for evaluating retrieval-augmented generation systems.
- It employs a two-stage pipeline that first defines user and question categorizations via a JSON file and then generates diverse Q&A pairs through efficient LLM processing.
- Evaluation metrics like N-Gram Diversity, Self-Repetition Score, and Embeddings Homogenization confirm its superior performance compared to static and evolutionary methods.
DataMorgana is a lightweight, highly configurable tool designed for generating synthetic question–answer (Q&A) benchmarks targeted at Retrieval-Augmented Generation (RAG) system evaluation. The tool addresses core challenges in constructing robust and diverse evaluation datasets for RAG settings, particularly in domain-specific applications where real-world data may be scarce or imbalanced. By enabling detailed control over user and question categories and their empirical distributions, DataMorgana aims to better mirror the diverse user interactions observed in deployed RAG systems, ensuring thorough assessment of RAG performance across a spectrum of query and user types (Filice et al., 22 Jan 2025).
1. System Architecture and Two-Stage Generation Pipeline
DataMorgana operates via a two-stage generation pipeline, prioritizing both customization and computational efficiency:
- Stage A: Configuration. Practitioners define the “shape” of expected RAG-system traffic by specifying user categorizations (e.g., expertise levels such as “novice” or “expert”; occupational roles such as “researcher” or “public-health official”) and question categorizations (e.g., question type, phrasing style, degree of factuality, linguistic distance). These are encoded in a single JSON file, with mutually exclusive category definitions, each accompanied by a name, a natural-language description, and a target probability. No pre-processing of the underlying corpus, knowledge-graph construction, or heavy pipeline optimization is required.
- Stage B: Generation. For each required Q&A pair, the tool:
- Samples one category from each user-level and question-level categorization according to its defined categorical distribution.
- Samples a document from the RAG corpus.
- Synthesizes a prompt including both the relevant category descriptions and the input document (or excerpt), and instructs the LLM (e.g., Claude-3.5 Sonnet) to generate diverse candidate Q&A pairs in JSON format.
- Filters the resulting candidates for both faithfulness to the document and compliance with category constraints, and randomly selects a single valid pair to add to the benchmark.
Single-pass prompt orchestration (with optional lightweight filtering) distinguishes DataMorgana from prior multi-step or graph-building solutions, enabling efficient generation and rapid iteration (Filice et al., 22 Jan 2025).
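The Stage-B loop can be sketched in Python as follows. This is a minimal illustration, not the actual implementation: `call_llm` and `is_valid` are hypothetical stand-ins for the LLM client and the faithfulness/compliance filter, which the paper does not specify at this level of detail.

```python
import json
import random

def is_valid(pair, doc, chosen):
    # Placeholder check: the real filter verifies faithfulness to `doc`
    # and compliance with the sampled category constraints.
    return "question" in pair and "answer" in pair

def generate_pair(config, corpus, call_llm, k=3):
    """One Stage-B iteration. `config` maps categorization names to lists
    of {"name", "description", "probability"} dicts; `call_llm` is a
    stand-in for the actual LLM client."""
    # 1. Sample one category per categorization (independent categoricals).
    chosen = {
        cat: random.choices(cats, weights=[c["probability"] for c in cats])[0]
        for cat, cats in config.items()
    }
    # 2. Sample a document from the RAG corpus.
    doc = random.choice(corpus)
    # 3. Build a single prompt combining category descriptions and the
    #    document, asking for k candidate Q&A pairs in JSON.
    prompt = (
        f"Generate {k} diverse question-answer pairs as a JSON list.\n"
        + "".join(f"- {c['name']}: {c['description']}\n" for c in chosen.values())
        + "Document:\n" + doc
    )
    candidates = json.loads(call_llm(prompt))
    # 4. Keep valid candidates and randomly select one for the benchmark.
    valid = [c for c in candidates if is_valid(c, doc, chosen)]
    return random.choice(valid) if valid else None
```

A fake `call_llm` returning canned JSON is enough to exercise the loop end to end, which is also how the sampling and filtering stages can be unit-tested in isolation.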
2. Configuration Model and Sampling Logic
The DataMorgana configuration model is centered on a set of independent categorizations for both user and question characteristics. Each categorization is represented as a finite set $C = \{c_1, \dots, c_k\}$ of mutually exclusive categories, where each $c_i$ has:
- a name $n_i$,
- a natural-language description $d_i$,
- a target probability $p_i$, such that $\sum_{i=1}^{k} p_i = 1$.

Sampling proceeds independently for each categorization using a categorical distribution:

$$P(c_i) = p_i, \quad i = 1, \dots, k.$$

The joint distribution over multiple categorizations is thus the product of their marginals, yielding a combinatorial space of user–question category intersections that drives benchmark diversity. The explicit configuration is specified in a structured JSON schema, directly linking design intent and synthetic-data characteristics. For example, a “Phrasing” categorization may include entries such as:
```json
{
  "name": "Phrasing",
  "categories": [
    {"name": "concise-and-natural", "probability": 0.30, "description": "..."},
    ...
  ]
}
```
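Because categorizations are sampled independently, the probability of any user–question combination is simply the product of the configured marginals. A short Python sketch (category names and probabilities here are toy values for illustration, not taken from the paper):

```python
import itertools

# Two toy categorizations with illustrative probabilities.
categorizations = {
    "Expertise": {"novice": 0.7, "expert": 0.3},
    "Phrasing": {"concise-and-natural": 0.3, "verbose": 0.7},
}

# Joint distribution over category combinations = product of marginals.
joint = {}
for combo in itertools.product(*(c.items() for c in categorizations.values())):
    labels = tuple(name for name, _ in combo)
    prob = 1.0
    for _, p in combo:
        prob *= p
    joint[labels] = prob

# The four combinations span the full combinatorial space and sum to 1,
# e.g. P(novice, concise-and-natural) = 0.7 * 0.3 = 0.21
```

Adding a categorization multiplies the number of combinations, which is what produces the combinatorial diversity of user–question intersections described above.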
3. Diversity Metrics and Benchmark Evaluation
DataMorgana benchmarks are evaluated across three principal axes of diversity: lexical, syntactic, and semantic. These are quantified using established corpus-level metrics:
Lexical Diversity
- N-Gram Diversity (NDG): the ratio of unique to total n-grams in the generated questions, aggregated over n-gram orders. Higher NDG indicates greater word-sequence variability.
- Self-Repetition Score (SRS): measures the extent to which n-grams from one generated question reappear across the others. Lower values are preferable.
- Word-Compression Ratio (word-CR): the ratio of corpus size to gzip-compressed size; lower ratios reflect higher diversity and reduced redundancy.
Syntactic Diversity
- PoS-Compression Ratio (PoS-CR): Analogous to word-CR, but applied to the sequence of Part-of-Speech tags. Lower PoS-CR values indicate more varied syntactic constructions.
Semantic Diversity
- Embeddings-Homogenization Score (emb-HS): the average pairwise similarity among question embeddings, calculated using all-MiniLM-L6-v2. Lower scores indicate more semantically distinct questions.
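The lexical metrics above can be approximated in a few lines of Python. This is a simplified sketch (whitespace tokenization, one common formulation of each metric); the paper's exact formulations may differ.

```python
import gzip

def ngram_diversity(texts, max_n=4):
    """Unique-to-total n-gram ratio, summed over n = 1..max_n;
    higher values mean more varied word sequences."""
    score = 0.0
    for n in range(1, max_n + 1):
        grams = [
            tuple(toks[i:i + n])
            for toks in (t.split() for t in texts)
            for i in range(len(toks) - n + 1)
        ]
        if grams:
            score += len(set(grams)) / len(grams)
    return score

def compression_ratio(texts):
    """word-CR proxy: raw byte size over gzip-compressed size;
    lower values indicate less redundancy."""
    raw = " ".join(texts).encode("utf-8")
    return len(raw) / len(gzip.compress(raw))
```

On a corpus of repeated questions, `ngram_diversity` drops sharply while `compression_ratio` rises, matching the intended direction of both metrics.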
Ablation studies demonstrate that excluding question categorizations significantly degrades all diversity metrics, confirming their primary role in diversity maximization (Filice et al., 22 Jan 2025).
4. Experimental Validation and Comparative Results
DataMorgana has been quantitatively and qualitatively benchmarked against leading baselines using both domain-specific and open-domain corpora:
- CORD-19 (healthcare): 147 documents, yielding 2,019 Q&A pairs.
- Wikipedia (general knowledge): 2,889 passages.
The following methods were compared, controlled for identical prompt budgets and LLM versions:
- Vanilla: a single static prompt with no category control.
- Know Your RAG: three fixed question types via a three-step LLM pipeline.
- DeepEval: one-step evolutionary diversification with seven random transformations.
On CORD-19, DataMorgana achieved best or near-best results across the diversity metrics:
- NDG = 2.536 (vs. 1.517 Vanilla, 2.415 DeepEval)
- SRS = 0.372 (vs. 0.920 Vanilla, 0.644 DeepEval)
- word-CR = 3.701 (vs. 5.576 Vanilla, 3.535 DeepEval)
- PoS-CR = 5.583 (vs. 7.861 Vanilla, 5.885 DeepEval)
- emb-HS = 0.249 (vs. 0.301 Vanilla, 0.251 DeepEval)
On Wikipedia, DataMorgana led in NDG and PoS-CR and matched or improved upon semantic diversity metrics.
These results collectively indicate DataMorgana’s ability to systematically improve the breadth of benchmark coverage compared to static and evolutionary diversification strategies (Filice et al., 22 Jan 2025).
5. Usage Workflow and Practical Customization
DataMorgana is designed for ease of integration into RAG system development and evaluation pipelines. The typical workflow consists of:
- Preparing the RAG corpus (plain text or JSONL).
- Authoring a JSON configuration that encodes user and question categorizations and their target distributions.
- Executing the tool from the command line:
```
datamorgana generate \
  --config config.json \
  --corpus path/to/corpus.jsonl \
  --LLM claude-3.5 \
  --k 3 \
  --num-per-doc 1 \
  --output synthetic_benchmark.jsonl
```
- Inspecting and optionally filtering the output.
Key parameters include:
- --k: number of candidate Q&A pairs per prompt (default 3).
- --num-per-doc: number of Q&A pairs generated per document.
- --seed: random seed for reproducibility.
- --filter-thresholds: semantic-similarity cutoffs for post-filtering.
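The source does not detail how similarity-based post-filtering works internally. As one plausible illustration, the sketch below drops near-duplicate questions using toy bag-of-words vectors; a real pipeline would use sentence embeddings such as those from all-MiniLM-L6-v2, and the threshold value here is arbitrary.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def filter_near_duplicates(questions, threshold=0.9):
    """Keep a question only if it is sufficiently dissimilar
    from every question kept so far (greedy de-duplication)."""
    kept, vectors = [], []
    for q in questions:
        v = Counter(q.lower().split())
        if all(cosine(v, u) < threshold for u in vectors):
            kept.append(q)
            vectors.append(v)
    return kept
```

Greedy order-dependent filtering like this is cheap and deterministic, which suits a post-processing step over a generated benchmark.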
Benchmark customizations are recommended for domain-specific deployments, such as organizing user roles (“doctor,” “patient,” etc.) and calibrating probability mass to reflect anticipated traffic (e.g., higher prevalence of factoid queries). Iterative refinement of category descriptions via the JSON–prompt cycle is strongly supported by DataMorgana’s lightweight design (Filice et al., 22 Jan 2025).
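For instance, a healthcare deployment might encode user roles in the same JSON schema as the “Phrasing” example above. The category names, descriptions, and probabilities below are hypothetical, for illustration only:

```json
{
  "name": "UserRole",
  "categories": [
    {"name": "doctor", "probability": 0.25, "description": "A clinician asking precise, terminology-heavy questions."},
    {"name": "patient", "probability": 0.60, "description": "A layperson asking about symptoms and treatments in plain language."},
    {"name": "public-health-official", "probability": 0.15, "description": "A policy-oriented user asking population-level questions."}
  ]
}
```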
6. Significance and Prospects
By unifying expressive, interpretable configurations with an efficient one-pass LLM generation and filtering paradigm, DataMorgana enables rapid production of synthetic Q&A benchmarks that capture varied and realistic user–question distributions in RAG contexts. The empirical evidence supports its state-of-the-art performance in maximizing lexicosemantic and syntactic diversity while remaining computationally lightweight and flexible (Filice et al., 22 Jan 2025).
DataMorgana was made available to select research groups as beta testers (notably in the context of the SIGIR 2025 LiveRAG challenge). Its methodologically transparent design and robust evaluation results suggest it may serve as a foundational tool both for benchmarking and for advancing model robustness and adaptation in retrieval-augmented generation research.