Synthetic Query Generation
- Synthetic Relevant Queries Generation is the automated process of creating semantically aligned queries that mirror underlying data schemas and document structures.
- It combines LLM-based methods with template-driven approaches, schema-aware enumeration, and privacy-preserving techniques to enhance query diversity and quality.
- This approach is vital for training learned cost models, benchmarking retrieval engines, and enabling robust natural language interfaces in data-sensitive environments.
Synthetic relevant queries generation refers to the automated creation of artificial yet semantically meaningful queries that are statistically and structurally aligned with a given data schema, document collection, or information need. The core objective is to supply high-quality, diverse, and controllable query workloads for training, evaluating, or stress-testing downstream systems, including learned cost models, retrieval engines, large-scale LLM pipelines, and natural language interfaces to databases. Synthetic queries are now pivotal in domains where labeled user queries are scarce, privacy-protected, or where broad coverage of the semantic and operational space is required, such as learned query cost estimation, retrieval-augmented generation (RAG), domain adaptation for IR, and benchmarking of structured data reasoning systems.
1. Generative Frameworks for Synthetic Query Creation
Recent work advances a variety of generative strategies, frequently combining LLMs, mechanical or template-driven generators, schema/component-aware enumeration, and feedback-driven coverage loops:
- Hybrid SDG (synthetic data generation) for SQL: Combines LLM-based generation (using code-capable models such as granite-3.3-8b-instruct), mechanical templates, and SQL-component snippets. Subschema enumeration identifies all connected FK subgraphs of interest (see the sketch after this list). Few-shot prompting, prompt constraints, and iterative self-instruct strategies bias coverage toward underrepresented structural patterns and maintain join/operation correctness. Validation filters enforce syntactic and semantic correctness prior to downstream use (Nidd et al., 27 Aug 2025).
- Layer-wise and Embedding-guided Query Construction: For knowledge graphs and logic forms, frameworks like TARGA employ layer-wise expansion over entities and relations, cross-layer combination of query substructures, and semantic re-ranking via embedding-based similarity to seed questions, yielding executable queries with high relevance to given NL questions (Huang et al., 27 Dec 2024).
- CoT-augmented and Contrastive Optimization: In neural IR, prompt modules built with frameworks like DSPy (chain-of-thought reasoning), or LLMs fine-tuned via contrastive preference optimization (CPO), directly optimize the signal quality of synthetic queries so that generated queries better align with fine-grained relevance or retrieval satisfaction (Krastev et al., 19 Aug 2025).
- Privacy-Preserving QGen: Differentially private LLMs (DP-LMs) are directly fine-tuned to emit synthetic queries with formal DP guarantees. This addresses privacy in sensitive domains while maintaining retrieval quality (Carranza et al., 2023).
- Domain-specific Conditioning and Persona Modeling: In recommender systems or user intent emulation, synthetic queries are conditioned on explicit persona/constraint objects, grounded via retrieval from structured knowledge bases, and subject to strict hallucination-mitigation via post-generation validation (Banerjee et al., 12 Apr 2025).
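As an illustration of the subschema-enumeration step above, here is a minimal sketch: it enumerates connected foreign-key subgraphs up to a size bound and builds an operation-biased prompt for each. The toy schema, the prompt wording, and the commented-out `call_llm` hook are illustrative assumptions, not APIs from the cited work.

```python
# Sketch: subschema enumeration over a toy FK graph, plus biased prompt construction.
from itertools import combinations

FK_EDGES = {  # hypothetical schema: table -> tables it references via FKs
    "orders": {"customers", "products"},
    "reviews": {"customers", "products"},
    "customers": set(),
    "products": set(),
}

def connected(tables: frozenset) -> bool:
    """Check that the induced FK subgraph on `tables` is connected (BFS)."""
    start = next(iter(tables))
    seen, frontier = {start}, [start]
    while frontier:
        t = frontier.pop()
        for u in tables:
            if u not in seen and (u in FK_EDGES[t] or t in FK_EDGES[u]):
                seen.add(u)
                frontier.append(u)
    return seen == tables

def enumerate_subschemas(max_size: int = 3):
    """Yield every connected FK subgraph with at most `max_size` tables."""
    tables = list(FK_EDGES)
    for k in range(1, max_size + 1):
        for combo in combinations(tables, k):
            if connected(frozenset(combo)):
                yield combo

def build_prompt(subschema, target_ops=("JOIN", "GROUP BY")) -> str:
    """Bias generation toward underrepresented operations via prompt constraints."""
    return (
        f"Generate 5 syntactically valid SQL queries over tables {', '.join(subschema)}.\n"
        f"Each query must use at least one of: {', '.join(target_ops)}.\n"
        "Only reference the listed tables and their foreign-key joins."
    )

for sub in enumerate_subschemas():
    prompt = build_prompt(sub)
    # candidate_sql = call_llm(prompt)  # placeholder for the actual LLM call
```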
2. Pipeline Components and Methodological Variants
A typical synthetic relevant queries generation workflow comprises the following standardized components, with implementation details varying by domain (a compact end-to-end sketch follows the list):
- Schema/Corpus Preprocessing: Extraction of schema-level metadata (e.g., tables, columns, PK/FK graphs), document features, or entity/attribute inventories.
- Subschema or Context Enumeration: Identification of all maximal/connected subgraphs or context slices (for databases, knowledge graphs, or documents).
- Prompt Construction: Assembly of schema or context slices, few-shot exemplars, and structural constraints (e.g., required joins or operators) into generation prompts.
- Synthetic Query Generation: Candidate queries are generated using an LLM under varying temperature or top-p settings, or via templates.
- Validation and Filtering:
- Syntax filtering: parse each candidate and discard those that fail to parse.
- Semantic validation: Domain-value checks, executability against the schema or KB, or LLM-as-judge scoring.
- Deduplication and redundancy checks to remove near-duplicates.
- Coverage Analysis and Feedback: Compute structural coverage (e.g., table/column frequency, operation distribution). Identify and address “holes” via targeted regeneration.
- Feedback Loop: Iterate prompt modification and targeted generation until diversity/coverage criteria are met.
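The following is a compact sketch of how these components compose into a generate, validate, deduplicate, coverage-feedback loop, assuming SQLite as the execution backend. The `generate_queries` stub, the coverage threshold, and the table-mention heuristic are illustrative placeholders, not details from any cited pipeline.

```python
# Sketch: generation loop with validation, dedup, and coverage-driven regeneration.
import sqlite3
from collections import Counter

def generate_queries(focus_tables):
    """Placeholder generator standing in for the LLM/template step."""
    return [f"SELECT COUNT(*) FROM {t}" for t in focus_tables]

def is_valid(sql, conn):
    """Syntax + schema filter: EXPLAIN fails on unparsable or unknown-table SQL."""
    try:
        conn.execute(f"EXPLAIN {sql}")
        return True
    except sqlite3.Error:
        return False

def coverage(queries, tables):
    """Per-table hit rate as a simple structural-coverage signal."""
    hits = Counter(t for q in queries for t in tables if t in q.lower())
    return {t: hits[t] / max(len(queries), 1) for t in tables}

def generation_loop(conn, tables, rounds=5, target=0.2):
    accepted = set()
    focus = list(tables)                                   # first round: all tables
    for _ in range(rounds):
        for sql in generate_queries(focus):
            if is_valid(sql, conn) and sql not in accepted:  # validate + dedup
                accepted.add(sql)
        cov = coverage(accepted, tables)
        focus = [t for t, c in cov.items() if c < target]    # coverage "holes"
        if not focus:                                        # criteria met
            break
    return accepted
```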
3. Mathematical Formalization and Evaluation Metrics
Synthetic relevant queries generation is assessed using both generation-side and task-specific evaluation metrics; minimal reference implementations of the two workhorse metrics follow the list:
- Learned Cost Modeling:
- Trained model: $f_\theta(q)$ predicts the cost $c(q)$ of query $q$.
- Loss: e.g., squared error on log-costs, $\mathcal{L}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\bigl(\log f_\theta(q_i) - \log c(q_i)\bigr)^2$.
- Q-error: $\mathrm{qerror}(q) = \max\bigl(f_\theta(q)/c(q),\; c(q)/f_\theta(q)\bigr)$.
- Aggregates: median, 90th/95th-percentile, and maximum q-error over the workload.
- Retrieval/QA Evaluation:
- Precision@k: $\mathrm{P@}k = |\{\text{relevant documents in top } k\}| / k$.
- Normalized DCG: $\mathrm{nDCG@}k = \mathrm{DCG@}k / \mathrm{IDCG@}k$, with $\mathrm{DCG@}k = \sum_{i=1}^{k} (2^{rel_i}-1)/\log_2(i+1)$.
- System ranking: Kendall's $\tau$ between system orderings under real vs. synthetic queries.
- Agreement: Cohen's $\kappa$ between synthetic and human relevance judgments.
- For RAG and generator fidelity: CSG.
- Distributional and Structural Coverage:
- KL divergence between real and synthetic feature distributions, distribution plots, join-count statistics, clause/operator coverage, and per-table/column frequency.
- Privacy Guarantees:
- $(\epsilon, \delta)$-DP at the query or document level.
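For concreteness, here are minimal reference implementations of q-error and nDCG@k, written directly from the standard definitions above (a sketch; note that some library implementations use a linear rather than exponential gain for nDCG).

```python
# Sketch: the two workhorse evaluation metrics, per the definitions above.
import math

def q_error(pred: float, true: float) -> float:
    """Q-error: max(pred/true, true/pred); >= 1, where 1 is a perfect estimate."""
    return max(pred / true, true / pred)

def ndcg_at_k(relevances, k):
    """nDCG@k = DCG@k / IDCG@k, with gain 2^rel - 1 and log2 position discount."""
    def dcg(rels):
        return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Aggregating q-error over a workload, as described above:
errors = sorted(q_error(p, t) for p, t in [(120, 100), (80, 100), (900, 100)])
median_q, max_q = errors[len(errors) // 2], errors[-1]   # 1.25 and 9.0
```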
4. Experimental Findings and Comparative Gains
Synthetic relevant queries generation frameworks consistently yield significant efficiency and effectiveness improvements:
| Use case | Baseline (size/method) | Synthetic pipeline | % Data saving | Q-error / nDCG gain | Downstream improvement |
|---|---|---|---|---|---|
| Learned SQL cost | 4000 mechanical | 2200 LLM+mech | 45% | Q-error ↓4–10% | 10% E2E speedup |
| IR (reranking) | MS MARCO zero-shot | InPars-V2, 10k synth | – | nDCG@10 +0.02 | Competitive with SOTA |
| Privacy-preserving IR | DP-SGD on real | DP-LM queries | – | NDCG@10 ×3–4 higher | Retains DP guarantee |
| KBQA/semantic parsing | Non-finetuned open LLM | TARGA, synth pairs | – | F1 +8–14 pts | Robust, efficient |
Key findings include:
- Subschema-guided, contextually aware prompts eliminate join errors and improve SQL semantic integrity (Nidd et al., 27 Aug 2025, Caferoğlu et al., 30 Sep 2025).
- Pairwise or contrastively optimized pipelines (using CPO or dual-label generation) produce harder negatives and boost zero-shot relevance modeling (Chaudhary et al., 2023, Krastev et al., 19 Aug 2025); a toy sketch of dual-label generation follows this list.
- CoT and modular prompt optimization (DSPy) reduce the need for aggressive filtering, yielding higher-quality queries with less postprocessing (Krastev et al., 19 Aug 2025).
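A toy illustration of the dual-label idea: prompt once for a query the passage answers and once for a same-topic query it does not, yielding (anchor, positive, hard negative) triples for contrastive or CPO-style training. The prompt wording and the `call_llm` stub are hypothetical.

```python
# Sketch: dual-label generation of contrastive (anchor, positive, negative) triples.
def call_llm(prompt: str) -> str:
    """Hypothetical generation hook; plug in any instruction-following LLM."""
    raise NotImplementedError

def make_contrastive_triple(passage: str):
    """Dual-label generation: one relevant query, one same-topic hard negative."""
    pos = call_llm(f"Write a search query that this passage fully answers:\n{passage}")
    neg = call_llm(
        "Write a search query on the same topic that this passage does NOT answer:\n"
        f"{passage}"
    )
    return passage, pos, neg   # (anchor, positive, hard negative) training triple
```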
5. Robustness, Bias, and Best Practice Recommendations
Empirical studies reveal important best practices and limitations:
- Diversity and Validation Over Volume: High-quality, coverage-valid synthetic sets (fewer, but more diverse and relevant queries) consistently outperform larger, naively generated corpora (Nidd et al., 27 Aug 2025).
- Schema/context-aware Prompting: Extracting sub-contexts (e.g., FK-closures, relevant columns) and conditioning generation on these subgraphs improves relevance and coverage.
- LLM-based Validation: LLM-as-judge, execution-based, or round-trip consistency checks are critical for semantic correctness, especially in text-to-SQL and structured data applications (Tiwari et al., 17 Dec 2024, Caferoğlu et al., 30 Sep 2025); see the validation sketch after this list.
- Hybridization: Combining template and LLM-generated queries addresses both head and long-tail entity coverage (Sannigrahi et al., 10 Jun 2024).
- Cold-start Remediation: Synthetic queries can boost exposure of cold-start items in retrieval, as shown in large-scale online experiments in production search (Palumbo et al., 8 Sep 2025).
- Privacy and Security: DP-LM-based synthetic queries circumvent limitations of direct DP training and enable utility-preserving retrieval models with formal query-level privacy (Carranza et al., 2023).
- Bias Mitigation: No significant bias observed favoring LLM-matched systems in IR evaluation when prompt and corpus diversity are enforced (Rahmani et al., 13 May 2024).
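A minimal sketch of the layered validation pattern recommended above for text-to-SQL pairs: a cheap execution check first, then an LLM-as-judge round trip. The judge prompt and `judge_llm` stub are hypothetical placeholders, not interfaces from the cited papers.

```python
# Sketch: execution check + LLM-as-judge validation for (question, SQL) pairs.
import sqlite3

JUDGE_PROMPT = (
    "Question: {question}\nSQL: {sql}\n"
    "Does this SQL correctly answer the question? Reply YES or NO."
)

def judge_llm(prompt: str) -> str:
    """Hypothetical judge-model hook; plug in any scoring LLM here."""
    raise NotImplementedError

def execution_check(sql: str, conn: sqlite3.Connection) -> bool:
    """Executability filter: the query must run end to end, not merely parse."""
    try:
        conn.execute(sql).fetchmany(1)
        return True
    except sqlite3.Error:
        return False

def semantic_check(question: str, sql: str) -> bool:
    """LLM-as-judge round trip on the (question, SQL) pair."""
    verdict = judge_llm(JUDGE_PROMPT.format(question=question, sql=sql))
    return verdict.strip().upper().startswith("YES")

def validate_pair(question: str, sql: str, conn: sqlite3.Connection) -> bool:
    # Cheap structural filter first, expensive semantic judge second.
    return execution_check(sql, conn) and semantic_check(question, sql)
```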
6. Applications and Extensions Across Domains
Synthetic relevant query generation has found broad applicability:
- Workload bootstrapping for learned cost models, system stress-testing, and SQL engine robustness (Nidd et al., 27 Aug 2025, Caferoğlu et al., 30 Sep 2025).
- Information retrieval and ranking: augmenting or substituting manual queries and relevance labels; construction of fully synthetic test collections with verified system ranking fidelity (Rahmani et al., 13 May 2024, Abonizio et al., 2023, Krastev et al., 19 Aug 2025).
- Structured reasoning and QA: producing in-domain logic-form (SQL, Cypher, SPARQL) paired data for semantic parsing and end-to-end structured QA (Tiwari et al., 17 Dec 2024, Huang et al., 27 Dec 2024).
- Domain-specific generative retrieval: LLM bootstrapping of generative retrievers, no manual queries required (Wen et al., 25 Feb 2025).
- RAG component optimization: controlled generation of queries spanning complexity, clue completeness, and citation granularity (Shen et al., 16 May 2025).
- Intent emulation and recommender benchmarking: persona/constraint-conditioned, KB-grounded query benchmarking (Banerjee et al., 12 Apr 2025).
- Privacy-preserving system training: data utility maximization under strict differential privacy (Carranza et al., 2023).
7. Limitations and Open Challenges
Despite empirical successes, several technical limitations remain:
- Persistent label leakage and duplication in nuanced, relevance-graded QGen, requiring better generation-control objectives (Chaudhary et al., 2023).
- Scaling automatic validation/repair in complex multi-table or multi-hop scenarios—high LLM or compute cost for feedback loops (Caferoğlu et al., 30 Sep 2025, Tiwari et al., 17 Dec 2024).
- Off-topic drift as the number of generated variants grows, or when prompts lack topical context (Breuer, 6 Nov 2024).
- Syntactic and semantic errors in LLM-only pipelines if not paired with schema-aware constraints or execution-based checks (Nidd et al., 27 Aug 2025).
- Risk of subtle bias or overfitting to the synthetic signal unless generation is diversified across prompts, seeds, and validation strategies (Rahmani et al., 13 May 2024).
- Latency and scalability requirements in production, motivating offline/periodic rather than real-time synthetic QGen (Palumbo et al., 8 Sep 2025).
Ongoing work emphasizes adversarial/diverse seeding, prompt composition, ensemble-of-LLM creation, soft prompt optimization, and formal adversarial filtering as promising remedies. Extending these methods to multi-modal and cross-lingual settings, as well as federated or privacy-preserving learning scenarios, remains a key frontier.