
Synthetic Query Generation

Updated 19 December 2025
  • Synthetic relevant query generation is the automated process of creating semantically aligned queries that mirror underlying data schemas and document structures.
  • It combines LLM-based methods with template-driven approaches, schema-aware enumeration, and privacy-preserving techniques to enhance query diversity and quality.
  • This approach is vital for training learned cost models, benchmarking retrieval engines, and enabling robust natural language interfaces in data-sensitive environments.

Synthetic relevant query generation refers to the automated creation of artificial yet semantically meaningful queries that are statistically and structurally aligned with a given data schema, document collection, or information need. The core objective is to supply high-quality, diverse, and controllable query workloads for training, evaluating, or stress-testing downstream systems, including learned cost models, retrieval engines, large-scale LLM pipelines, and natural language interfaces to databases. Synthetic queries are now pivotal in domains where labeled user queries are scarce, privacy-protected, or must cover an extreme range of the semantic and operational space, such as learned query cost estimation, retrieval-augmented generation (RAG), domain adaptation for IR, and benchmarking of structured data reasoning systems.

1. Generative Frameworks for Synthetic Query Creation

Recent work advances a variety of generative strategies, frequently combining LLMs, mechanical or template-driven generators, schema/component-aware enumeration, and feedback-driven coverage loops:

  • Hybrid SDG for SQL: Combines LLM-based generation (using code-capable models such as granite-3.3-8b-instruct), mechanical templates, and SQL-component snippets. Subschema enumeration identifies all connected FK subgraphs of interest (a minimal enumeration sketch follows this list). Few-shot prompting, prompt constraints, and iterative self-instruct strategies bias coverage towards underrepresented structural patterns and maintain join/operation correctness. Validation filters enforce syntactic and semantic correctness prior to downstream use (Nidd et al., 27 Aug 2025).
  • Layer-wise and Embedding-guided Query Construction: For knowledge graphs and logic forms, frameworks like TARGA employ layer-wise expansion over entities and relations, cross-layer combination of query substructures, and semantic re-ranking via embedding-based similarity to seed questions, yielding executable queries with high relevance to given NL questions (Huang et al., 27 Dec 2024).
  • CoT-augmented and Contrastive Optimization: In neural IR, prompt modules constructed via frameworks like DSPy (chain-of-thought reasoning) or contrastive preference optimization (CPO) on LLMs directly optimize the signal quality of synthetic queries so that generated queries better align with fine-grained relevance or retrieval satisfaction (Krastev et al., 19 Aug 2025).
  • Privacy-Preserving QGen: Differentially private LLMs (DP-LMs) are directly fine-tuned to emit synthetic queries with formal DP guarantees. This addresses privacy in sensitive domains while maintaining retrieval quality (Carranza et al., 2023).
  • Domain-specific Conditioning and Persona Modeling: In recommender systems or user intent emulation, synthetic queries are conditioned on explicit persona/constraint objects, grounded via retrieval from structured knowledge bases, and subject to strict hallucination-mitigation via post-generation validation (Banerjee et al., 12 Apr 2025).
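
The subschema enumeration step in the hybrid SDG bullet can be made concrete with a short sketch. This is a minimal illustration rather than the pipeline of Nidd et al.: the table names and FK edges are hypothetical, and `networkx` stands in for whatever graph utilities a production system would use.

```python
from itertools import combinations

import networkx as nx

# Hypothetical schema metadata: tables as nodes, FK references as edges.
TABLES = ["orders", "customers", "items", "products", "regions"]
FK_EDGES = [("orders", "customers"), ("items", "orders"),
            ("items", "products"), ("customers", "regions")]

def enumerate_subschemas(tables, fk_edges, max_size=3):
    """Yield every connected FK subgraph with at most max_size tables.

    Each subschema is a candidate join context for query generation.
    """
    g = nx.Graph()
    g.add_nodes_from(tables)
    g.add_edges_from(fk_edges)
    for k in range(1, max_size + 1):
        for subset in combinations(tables, k):
            sub = g.subgraph(subset)
            if nx.is_connected(sub):  # joinable purely via FK paths
                yield subset, list(sub.edges)

for tables_in_scope, joins in enumerate_subschemas(TABLES, FK_EDGES):
    print(tables_in_scope, joins)
```

Conditioning each generation prompt on one subschema at a time is what biases coverage toward otherwise underrepresented join patterns.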

2. Pipeline Components and Methodological Variants

A typical synthetic relevant query generation workflow comprises the following standardized components, with implementation details varying by domain (a condensed end-to-end sketch follows the list):

  1. Schema/Corpus Preprocessing: Extraction of schema-level metadata (e.g., tables, columns, PK/FK graphs), document features, or entity/attribute inventories.
  2. Subschema or Context Enumeration: Identification of all maximal/connected subgraphs or context slices (for databases, knowledge graphs, or documents).
  3. Prompt Construction:
    • Mechanical/n-shot: Examples are either sampled mechanically or selected to enforce diversity and steer complexity.
    • LLM-augmented: Prompts may include explicit coverage targets, constraints (e.g., clause bias), or CoT decomposition instructions.
  4. Synthetic Query Generation: Candidate queries are generated using an LLM under varying temperature or top-p settings, or via templates.
  5. Validation and Filtering:
    • Syntax filtering: parse each candidate and reject any that fail to parse.
    • Semantic validation: Domain-value checks, executability against the schema or KB, or LLM-as-judge scoring.
    • Deduplication and redundancy checks to remove near-duplicates.
  6. Coverage Analysis and Feedback: Compute structural coverage (e.g., table/column frequency, operation distribution). Identify and address “holes” via targeted regeneration.
  7. Feedback Loop: Iterate prompt modification and targeted generation until diversity/coverage criteria are met.
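
Condensing steps 4 through 7, the loop below sketches how generation, validation, deduplication, and coverage-driven re-prompting interact. Every name here (`generate`, `validate`, the table-level coverage targets, the thresholds) is a hypothetical placeholder for the domain-specific components described above.

```python
import random
from collections import Counter

def coverage_holes(queries, targets, threshold=3):
    """Return structural targets (e.g., tables) seen fewer than threshold times."""
    counts = Counter(t for q in queries for t in targets if t in q)
    return [t for t in targets if counts[t] < threshold]

def synthesize_workload(generate, validate, targets, rounds=5, batch=50):
    """Generate -> filter -> deduplicate -> measure coverage -> re-prompt at the holes."""
    accepted, seen = [], set()
    focus = list(targets)                       # round 1: aim at everything
    for _ in range(rounds):
        for q in generate(focus, n=batch):      # LLM- or template-based candidates
            key = " ".join(q.lower().split())   # cheap near-duplicate key
            if key in seen or not validate(q):  # syntax/semantic filters
                continue
            seen.add(key)
            accepted.append(q)
        focus = coverage_holes(accepted, targets)
        if not focus:                           # coverage criteria met
            break
    return accepted

# Toy stand-ins for the real generator and validator:
TARGETS = ["orders", "customers", "items"]
gen = lambda focus, n: [f"SELECT * FROM {random.choice(focus or TARGETS)}" for _ in range(n)]
ok = lambda q: q.startswith("SELECT")
print(len(synthesize_workload(gen, ok, TARGETS)))
```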

3. Mathematical Formalization and Evaluation Metrics

Synthetic relevant query generation is assessed using both generation-side and task-specific evaluation metrics (a minimal computation sketch follows the list):

  • Learned Cost Modeling:
    • Trained model: $C(q;\theta)$ predicts the cost of query $q$.
    • Loss: $L(\theta) = \frac{1}{N}\sum_{q}\left(C(q;\theta) - t_{\text{true}}(q)\right)^{2}$
    • Q-error: $Q(q) = \max\left(C(q;\theta)/t_{\text{true}}(q),\ t_{\text{true}}(q)/C(q;\theta)\right)$
    • Aggregates: $q_{\text{median}}$, $q_{\text{mean}}$, $q_{p95}$
  • Retrieval/QA Evaluation:
    • Precision@k: $P@k = \frac{1}{k}\sum_{i=1}^{k} \mathbb{I}\{\mathrm{rel}_i > 0\}$
    • Normalized DCG: $\mathrm{nDCG}@k = \frac{\mathrm{DCG}@k}{\mathrm{IDCG}@k}$
    • System ranking: Kendall's $\tau$
    • Agreement: Cohen's $\kappa$
    • For RAG and generator fidelity: CSG and $F_{\mathrm{gen}} = \frac{1}{3}\left(F_{\mathrm{comp}} + F_{\mathrm{under}} + F_{\mathrm{cite}}\right)$
  • Distributional and Structural Coverage:
    • KL-divergence, $\chi^{2}$ plots, join counts, clause/operator coverage, per-table/column frequency.
  • Privacy Guarantees:
    • $(\epsilon,\delta)$-DP at the query or document level.
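
As a minimal, self-contained sketch of the two workhorse metrics above, the snippet below computes per-query Q-error and nDCG@k; the cost estimates and graded relevance labels are invented for illustration.

```python
import math

def q_error(pred, true):
    """Q-error: the larger of the over- and under-estimation ratios (>= 1)."""
    return max(pred / true, true / pred)

def ndcg_at_k(rels, k):
    """nDCG@k from graded relevance labels listed in ranked order."""
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = sorted(rels, reverse=True)
    idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

# Toy data: predicted vs. true query costs, and one ranked relevance list.
preds, trues = [12.0, 3.1, 40.0], [10.0, 4.0, 35.0]
qs = sorted(q_error(p, t) for p, t in zip(preds, trues))
print("median q-error:", qs[len(qs) // 2])
print("nDCG@5:", ndcg_at_k([3, 2, 0, 1, 0, 2], k=5))
```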

4. Experimental Findings and Comparative Gains

Synthetic relevant query generation frameworks consistently yield significant efficiency and effectiveness improvements:

| Use case | Baseline (size/method) | Synthetic pipeline | Data saving | Q-error / nDCG gain | Downstream improvement |
| --- | --- | --- | --- | --- | --- |
| Learned SQL cost | 4,000 mechanical queries | 2,200 LLM + mechanical | 45% | Q-error ↓ 4–10% | 10% E2E speedup |
| IR (reranking) | MS MARCO zero-shot | InPars-V2, 10k synthetic | n/a | nDCG@10 +0.02 | Competitive with SOTA |
| Privacy-preserving IR | DP-SGD on real queries | DP-LM synthetic queries | n/a | nDCG@10 3–4× higher | Retains DP guarantee |
| KBQA/semantic parsing | Non-finetuned open LLM | TARGA, synthetic pairs | n/a | F1 +8–14 pts | Robust, efficient |

The key findings behind these gains are distilled into the best-practice recommendations of the next section.

5. Robustness, Bias, and Best Practice Recommendations

Empirical studies reveal important best practices and limitations:

  • Diversity and Validation Over Volume: High-quality, coverage-valid synthetic sets (fewer, but more diverse and relevant queries) consistently outperform larger, naively generated corpora (Nidd et al., 27 Aug 2025).
  • Schema/context-aware Prompting: Extracting sub-contexts (e.g., FK-closures, relevant columns) and conditioning generation on these subgraphs improves relevance and coverage.
  • LLM-based Validation: LLM-as-judge, execution-based, or round-trip consistency checks are critical for semantic correctness, especially in text-to-SQL and structured data applications (Tiwari et al., 17 Dec 2024, Caferoğlu et al., 30 Sep 2025); a minimal sketch of the cheaper syntax and executability gates follows this list.
  • Hybridization: Combining template and LLM-generated queries addresses both head and long-tail entity coverage (Sannigrahi et al., 10 Jun 2024).
  • Cold-start Remediation: Synthetic queries can boost exposure of cold-start items in retrieval, as shown in large-scale online experiments in production search (Palumbo et al., 8 Sep 2025).
  • Privacy and Security: DP-LM-based synthetic queries circumvent limitations of direct DP training and enable utility-preserving retrieval models with formal query-level privacy (Carranza et al., 2023).
  • Bias Mitigation: No significant bias favoring LLM-matched systems is observed in IR evaluation when prompt and corpus diversity are enforced (Rahmani et al., 13 May 2024).
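
A minimal sketch of the syntax and executability gates discussed above, assuming `sqlglot` is available for parsing and using an in-memory SQLite mirror of the target schema; the LLM-as-judge and round-trip consistency checks from the cited work are not reproduced here.

```python
import sqlite3

import sqlglot
from sqlglot.errors import ParseError

# Hypothetical in-memory copy of the target schema for executability checks.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER)")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT)")

def passes_filters(sql: str) -> bool:
    """Two cheap gates from the pipeline: parseability, then executability."""
    try:
        sqlglot.parse_one(sql)          # syntax filter: must parse
    except ParseError:
        return False
    try:
        conn.execute("EXPLAIN " + sql)  # semantic filter: tables/columns must exist
    except sqlite3.Error:
        return False
    return True

candidates = [
    "SELECT region, COUNT(*) FROM customers GROUP BY region",
    "SELECT o.id FROM orders o JOIN customers c ON o.customer_id = c.id",
    "SELECT nonexistent_col FROM orders",  # parses, but fails executability
    "SELECT * FROM",                       # truncated; rejected by the gates
]
print([q for q in candidates if passes_filters(q)])
```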

6. Applications and Extensions Across Domains

Synthetic relevant query generation has found broad applicability, spanning learned query cost estimation, retrieval and RAG benchmarking, recommender-system intent emulation, privacy-sensitive retrieval, and natural language interfaces to databases.

7. Limitations and Open Challenges

Despite empirical successes, several technical limitations remain.

Ongoing work emphasizes adversarial/diverse seeding, prompt composition, ensemble-of-LLM creation, soft prompt optimization, and formal adversarial filtering as promising remedies. Extending these methods to multi-modal and cross-lingual settings, as well as federated or privacy-preserving learning scenarios, remains a key frontier.
