
Synthetic Query Generation

Updated 9 September 2025
  • Synthetic Query Generation is an automated process that uses language models, templates, and algorithms to create query instances for neural information retrieval and semantic parsing.
  • The approach employs methods such as semantic parsing with sketch-based completion, LLM-driven prompting, and adaptive concept coverage to generate schema-consistent queries.
  • Evaluation focuses on retrieval metrics, computational efficiency, and privacy guarantees, demonstrating its practical impact in database querying, recommendation systems, and virtual assistants.

Synthetic query generation refers to the automated process of producing query instances—typically in natural language, SQL, or other formal query languages—using programmatic methods (e.g., LLMs, templates, or synthesis algorithms) rather than human annotation. This technique serves a central role in the development of neural information retrieval systems, cross-dialect semantic parsing, privacy-preserving training pipelines, and zero-shot/low-resource domain adaptation. The recent literature demonstrates a broad array of methodologies, spanning natural language to SQL synthesis, task-adaptive query generation for diverse search intents, large-scale synthetic data construction for ranking models, and LLM-based query augmentation for cold-start recommendation scenarios.

1. Core Methodologies in Synthetic Query Generation

The approaches to synthetic query generation can be categorized by their underlying technical principles, which include semantic parsing with sketch-based completion, template-driven generation aided by LLMs, direct preference optimization with retrieval-aware objectives, and concept coverage-driven adaptive generation.

  • Semantic Parsing and Sketch-Based Completion:

Methods such as the one in "Type- and Content-Driven Synthesis of SQL Queries from Natural Language" parse a user’s utterance into a query sketch—i.e., an incomplete program with placeholders for unknown tables, columns, or predicates. Type-directed program synthesis then fills these holes by efficiently enumerating schema-consistent completions, scored both by lexical similarity (e.g., via Word2Vec) and by the empirical evidence in the underlying database. A key feature is a synthesize–repair loop: if no high-confidence candidate matches, the system localizes faults in the sketch and performs repairs (for example, splitting composite predicates or adding join operations) before re-attempting synthesis (Yaghmazadeh et al., 2017).
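The fill-and-score step can be illustrated with a toy sketch completion: a column hole is resolved by enumerating the columns of a candidate table and ranking them by lexical similarity. The schema, function names, and scoring here are hypothetical simplifications (`SequenceMatcher` stands in for Word2Vec similarity), not Sqlizer's actual implementation.

```python
# Toy sketch completion: fill a column hole with schema-consistent
# candidates, ranked by a cheap lexical-similarity score.
from difflib import SequenceMatcher

# Hypothetical schema for illustration only.
SCHEMA = {"papers": ["title", "year", "author"],
          "venues": ["name", "city"]}

def lexical_score(hint: str, candidate: str) -> float:
    """Stand-in for a Word2Vec-style similarity between the user's hint
    word and a schema element name."""
    return SequenceMatcher(None, hint.lower(), candidate.lower()).ratio()

def complete_column_hole(hint: str, table: str, top_k: int = 2):
    """Enumerate schema-consistent completions for a column hole in a
    sketch like `SELECT ??col FROM table`, and return the top-k by score."""
    candidates = SCHEMA.get(table, [])
    ranked = sorted(candidates, key=lambda c: lexical_score(hint, c),
                    reverse=True)
    return ranked[:top_k]

print(complete_column_hole("authors", "papers"))
```

In the real system a second, content-based score (evidence from the database instance) is combined with this lexical score, and low-confidence results trigger the repair loop described above.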

  • LLM-Based Query Generation with Task-Adaptive Prompting:

Modern frameworks condition LLMs on document, schema, or metadata representations—often with task-specific or domain-specific prompting. Recent work (e.g., InPars-V2, Promptagator, AudioBoost) demonstrates both static template prompts and dynamic chain-of-thought optimization (such as via DSPy), often augmented with in-context demonstrations or behavioral constraints (Krastev et al., 19 Aug 2025, Palumbo et al., 8 Sep 2025).
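A minimal sketch of few-shot prompt construction for this kind of document-to-query generation. The template and the example pair are illustrative, not the actual prompts used by InPars-V2, Promptagator, or AudioBoost.

```python
# Build a few-shot document -> query prompt for an LLM.
# The instruction wording and demonstration pair are made up for illustration.

FEW_SHOT = [
    ("A survey of neural ranking models for text retrieval.",
     "what are neural ranking models"),
]

def build_prompt(document: str, examples=FEW_SHOT) -> str:
    lines = ["Generate a search query a user might issue to find the document."]
    for doc, query in examples:
        lines.append(f"Document: {doc}")
        lines.append(f"Query: {query}")
    lines.append(f"Document: {document}")
    lines.append("Query:")  # the LLM completes from here
    return "\n".join(lines)

prompt = build_prompt("Differentially private fine-tuning of language models.")
print(prompt)
```

Dynamic variants (e.g., DSPy-style chain-of-thought optimization) replace the static template with a program that searches over instructions and demonstrations.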

  • Preference and Ranking Signal Alignment:

Direct Preference Optimization (DPO) and Contrastive Preference Optimization (CPO) fine-tune the query generator so that query–document pairs align with desired ranking scores. This is achieved by constructing preference datasets from model-assigned or external signals, and optimizing a preference-based or contrastive loss that rewards queries yielding better retrieval performance (Coelho et al., 25 May 2025, Krastev et al., 19 Aug 2025).
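The DPO objective on a single preference pair can be written out directly. The log-probabilities below are toy numbers; in practice they come from the query-generator policy and a frozen reference model, with the preferred/dispreferred labels derived from ranker feedback.

```python
# DPO loss on one (preferred, dispreferred) query pair:
#   loss = -log sigmoid(beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l)))
import math

def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float,
             beta: float = 0.1) -> float:
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy already prefers the winning query relative to the reference
# (+1 nat) and disprefers the losing one (-1 nat) -> modest loss.
loss = dpo_loss(logp_w=-3.0, logp_l=-6.0,
                ref_logp_w=-4.0, ref_logp_l=-5.0, beta=0.5)
print(round(loss, 4))
```

CPO replaces this pairwise log-sigmoid term with a contrastive objective over candidate queries, but the shape of the pipeline (preference pairs built from retrieval signals, then generator fine-tuning) is the same.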

  • Concept Coverage and Adaptive Generation:

Approaches such as CCQGen explicitly identify the core concepts (topics and phrases) in a given document, then generate multiple queries adaptively—each query is conditioned on underrepresented concepts not well covered by prior queries, resulting in nonredundant, broad coverage of document content (Kang et al., 16 Feb 2025).
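The adaptive loop can be sketched as follows, with a stub standing in for the LLM call; the concept names and `generate` function are illustrative, not CCQGen's implementation.

```python
# Concept-coverage-driven generation in miniature: each new query is
# conditioned on the least-covered concept so far, avoiding redundancy.
from collections import Counter

def pick_undercovered(concepts, coverage: Counter) -> str:
    """Select the concept with the lowest coverage count so far."""
    return min(concepts, key=lambda c: coverage[c])

def generate(document: str, concept: str) -> str:
    # Stub: a real system would prompt an LLM conditioned on `concept`.
    return f"{concept} in {document}"

def generate_queries(document: str, concepts, n: int = 3):
    coverage, queries = Counter(), []
    for _ in range(n):
        concept = pick_undercovered(concepts, coverage)
        queries.append(generate(document, concept))
        coverage[concept] += 1
    return queries

qs = generate_queries("graph databases", ["indexing", "query planning", "storage"])
print(qs)
```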

2. Practical Implementation Modalities

Synthetic query generation spans several practical implementations:

| Modality | Key Details/Constraints | Representative Papers |
|---|---|---|
| Program Synthesis with Sketches | Tree-structured sketch, type-directed rules, confidence scoring | (Yaghmazadeh et al., 2017) |
| LLM-Driven Prompt-Based Generation | Static/dynamic prompts; CoT; domain adaptation | (Krastev et al., 19 Aug 2025, Palumbo et al., 8 Sep 2025, Sannigrahi et al., 10 Jun 2024) |
| Differential Privacy-Constrained Generation | DP-fine-tuned LMs, post-processing, query-level privacy | (Carranza et al., 2023) |
| Preference-Optimized Generation | DPO/CPO with ranker feedback, reward shaping | (Coelho et al., 25 May 2025, Krastev et al., 19 Aug 2025) |
| Adaptive/Concept-Guided Generation | Sequential conditioning on concept coverage | (Kang et al., 16 Feb 2025) |
| Template-Filling with LLM/Manual Mix | DSL or Cypher/SQL templates; LLMs fill slots; verification | (Rebei, 2023, Zhong et al., 15 Jun 2024, Tiwari et al., 17 Dec 2024) |
  • Sketch-Based and Type-Driven Synthesis: As in Sqlizer, the workflow begins with semantic parsing to produce a schema-agnostic program skeleton (sketch), recursively completes it under schema and type constraints, and integrates content-based scoring. Fault localization with subsequent repair provides robustness against mismatches between user intent and the actual schema; computational efficiency comes from early pruning via type and content checks.
  • LLM-Based Generation: Approaches leverage contextual prompts embedding document summaries, metadata, or taxonomies, and may use dynamic programmatic agents (DSPy) for constructing chain-of-thought optimized templates. Robustness comes from pairing prompt engineering with filtering strategies (consistency checks, retrievability constraints, or behavioral cloning).
  • Preference Alignment: Both DPO and CPO use retrieval metrics (cross-encoder scores, actual retrieval ranking, or even LLM-based listwise assessments) as reward signals for generator fine-tuning. The generator's output is thus optimized explicitly for the retrieval objective, reducing the need for post hoc filtering and improving query succinctness and alignment with user intent.
  • Differentially Private Synthesis: Here, the query generation phase is privatized by applying DP-SGD-style gradient clipping and noise addition during LM fine-tuning, and only post-processed outputs (synthetic queries) are used for downstream system training—ensuring query-level (ϵ, δ)-DP (Carranza et al., 2023).
  • Concept Coverage Control: Approaches such as CCQGen iteratively select under-covered concepts from the output coverage vector I, sampling queries that specifically target these topics or phrases. This reduces lexical and conceptual redundancy, especially in the “long tail” of document representations.
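The core of the DP-SGD recipe referenced above—per-example gradient clipping plus Gaussian noise—can be sketched in a few lines. This is a plain-Python illustration with toy gradients; the privacy accounting that maps the noise multiplier to an (ϵ, δ) guarantee is omitted.

```python
# DP-SGD core step: clip each per-example gradient to a fixed L2 norm,
# sum, add Gaussian noise calibrated to the clip norm, and average.
import math
import random

def clipped_noisy_mean(per_example_grads, clip_norm=1.0,
                       noise_mult=1.0, rng=None):
    rng = rng or random.Random(0)
    clipped = []
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        clipped.append([x * scale for x in g])
    n, dim = len(clipped), len(clipped[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    sigma = noise_mult * clip_norm  # noise scales with the clip norm
    return [(s + rng.gauss(0.0, sigma)) / n for s in summed]

# First gradient has norm 5 and is clipped down to norm 1.
grads = [[3.0, 4.0], [0.5, 0.0]]
print(clipped_noisy_mean(grads))
```

Because DP is closed under post-processing, the synthetic queries sampled from the DP-fine-tuned LM inherit the query-level guarantee without further noise.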

3. Evaluation Metrics and Empirical Performance

Evaluation of synthetic query generation methods commonly relies on metrics reflecting retrieval effectiveness, generation quality, computational efficiency, and practical viability:

  • Retrieval Metrics:

NDCG@k, Recall@k, MAP, and Precision@k are universally reported. Improvements of up to +8.6 NDCG@10 (pairwise generation vs. single-query generation; (Chaudhary et al., 2023)) and performance on par with or better than transfer-learned models (Promptagator, DUQGen) are claimed.

  • Quality and Consistency:

Filtering protocols include measuring the retention rate (fraction of synthetic queries retrieving their target document), round-trip consistency (retaining only queries whose intent is preserved), and reward-based selection using ranker outputs.
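Round-trip retention filtering can be illustrated with a deliberately simple term-overlap retriever; the `score` function is a stand-in for a real BM25 or dense retriever, and the corpus is made up.

```python
# Retention filtering: keep a synthetic query only if it retrieves its
# source document in the top-k of the retriever.

def score(query: str, doc: str) -> float:
    """Toy relevance: fraction of query terms present in the document."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def retained(query: str, target: str, corpus, k: int = 1) -> bool:
    ranked = sorted(corpus, key=lambda doc: score(query, doc), reverse=True)
    return target in ranked[:k]

corpus = ["neural ranking models for search",
          "differential privacy in machine learning"]
print(retained("neural ranking search", corpus[0], corpus))  # round trip holds
print(retained("privacy", corpus[0], corpus))                # round trip fails
```

The retention rate is then simply the fraction of generated queries for which `retained` is true.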

  • Computational Cost:

Reported costs include sub-2 s synthesis per query for well-optimized systems (e.g., Sqlizer (Yaghmazadeh et al., 2017)), high-quality generation on moderate resources (a single A100 GPU for LoRA-fine-tuned LLMs (Rebei, 2023)), and scalable generation over millions of document-query pairs (Krastev et al., 19 Aug 2025).

  • Privacy and Robustness:

DP-enforced methods report empirical privacy via canary exposure and theoretical (ϵ, δ) guarantees; performance is traded off with injected noise (Carranza et al., 2023). Robustness is further examined through ablation studies (e.g., effect of prompt domain, clustering strategies (Chandradevan et al., 3 Apr 2024)).

| Metric | Typical Range | System/Approach |
|---|---|---|
| NDCG@10 (zero-shot IR) | 0.23–0.37 | DPO/CPO-based (Coelho et al., 25 May 2025), InPars+ (Krastev et al., 19 Aug 2025) |
| SQL query accuracy | >80% (dialect-specific) | LoRA-fine-tuned LLMs (Rebei, 2023, Pourreza et al., 22 Aug 2024) |
| Privacy parameter ϵ | 3–5 (query-level DP) | DP-LM generation (Carranza et al., 2023) |
| Query synthesis time | 1–2 s per query | Sqlizer (Yaghmazadeh et al., 2017) |
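For reference, the NDCG@k figures above are computed from graded relevance labels in ranked order; a minimal implementation:

```python
# NDCG@k: discounted cumulative gain of the ranking, normalized by the
# DCG of the ideal (descending-relevance) ordering.
import math

def dcg(rels, k):
    return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

def ndcg(rels, k=10):
    ideal = dcg(sorted(rels, reverse=True), k)
    return dcg(rels, k) / ideal if ideal > 0 else 0.0

# Toy ranking: the most relevant document (grade 2) sits at rank 3.
print(round(ndcg([0, 1, 2, 0, 1], k=10), 4))
```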

4. Applications Across Domains

Synthetic query generation underpins advancements in a variety of application domains:

  • Information Retrieval and Ranking:

Synthetic queries supplement or replace user-logged queries for neural ranker training in general web search (Coelho et al., 25 May 2025), scientific document search (Kang et al., 16 Feb 2025), and domain adaptation settings (DUQGen (Chandradevan et al., 3 Apr 2024)). Inclusion of pairwise and concept-adaptive queries demonstrably increases ranking effectiveness, especially under cold-start or low-resource constraints.

  • Cross-Dialect Semantic Parsing and SQL Generation:

Frameworks like SQL-GEN generate seed templates and expand them with LLMs conditioned on dialect-specific tutorials (e.g., PostgreSQL, BigQuery), combined with model merging strategies (Mixture-of-Experts) to support multi-dialect Text-to-SQL (Pourreza et al., 22 Aug 2024). High accuracy and generalizability are consistently maintained.

  • Natural Language Interfaces to Databases and KGs:

Template-filling and LLM-based pipelines (SyntheT2C, Auto-Cypher (Zhong et al., 15 Jun 2024, Tiwari et al., 17 Dec 2024)) efficiently generate Text2Cypher (KG) and Text2SQL (relational) synthetic query datasets, enabling robust fine-tuning of LLMs for database-backed QA and conversational agents.
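Template-filling can be sketched in miniature: typed slots in a paired natural-language/SQL template are filled from schema metadata, yielding aligned (question, query) training pairs. The templates and schema below are illustrative, not drawn from SyntheT2C or Auto-Cypher.

```python
# Fill paired NL/SQL templates from schema metadata to produce
# (question, query) pairs; a Cypher variant would follow the same shape.
import itertools
from string import Template

SQL_T = Template("SELECT $col FROM $table WHERE $filter_col = :v")
NL_T = Template("What is the $col of the $table with a given $filter_col?")

SCHEMA = {"papers": ["title", "year"]}  # hypothetical schema

def fill(table, cols):
    for col, fcol in itertools.permutations(cols, 2):
        yield (NL_T.substitute(col=col, table=table, filter_col=fcol),
               SQL_T.substitute(col=col, table=table, filter_col=fcol))

pairs = list(fill("papers", SCHEMA["papers"]))
for nl, sql in pairs:
    print(nl, "->", sql)
```

In the cited pipelines, an LLM either fills the slots directly or paraphrases the templated question, and a verification step (execution or semantic checks) filters the output.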

  • Speech and Virtual Assistant Systems:

In domain-specific ASR and VA contexts, LLM-generated queries (often more verbose and entity-specific) complement traditional template-based coverage, enhancing recognition accuracy and supporting rare or long-tail scenarios (Sannigrahi et al., 10 Jun 2024).

  • Cold-Start and Unpopular Content Surfacing:

By augmenting both the index and the Query AutoComplete (QAC) system with LLM-generated descriptors, systems such as AudioBoost substantially increase retrievability and exploration for underrepresented content classes (e.g., Spotify audiobooks), as substantiated by production-scale A/B metrics (+1.82% exploratory query completions) (Palumbo et al., 8 Sep 2025).

5. Challenges, Limitations, and Future Research Directions

  • Ambiguity and Faithfulness:

Despite advances (e.g., label-conditioned, pairwise, and adaptive generation), ensuring that queries faithfully reflect nuanced relevance labels remains a challenge. Duplicate or insufficiently discriminative queries are a common artifact (Chaudhary et al., 2023).

  • Scalability and Efficiency:

While LLM-driven methods are effective, the computational cost of large-scale data synthesis and model fine-tuning remains substantial, prompting continued work on parameter-efficient tuning (e.g., LoRA), sample-efficient selection (DUQGen), and synthesis-time optimization (Rebei, 2023, Chandradevan et al., 3 Apr 2024).

  • Robustness and Data Quality:

Effectiveness depends critically on prompt design, on domain adaptation via clustering or domain-specific exemplars, and on proper enforcement of round-trip or retrievability-based filtering. Biases in LLMs or synthetic metadata may propagate into the query prior, demanding careful mitigation (Sannigrahi et al., 10 Jun 2024).

  • Privacy:

Scaling synthetic query generation to privacy-sensitive domains is nontrivial: strict DP budgets degrade query quality, and stronger privacy (beyond query-level) requires more complex mechanisms (Carranza et al., 2023).

  • Generalizability:

Future research is focused on extending current pipelines to non-relational/noSQL languages, broader KGs, integrating more advanced round-trip/semantic filtering, and aligning generation with diverse user search intents (as in EGG, (Lee et al., 25 Sep 2024)).

6. Summary and Prospects

Synthetic query generation has become a cornerstone for scalable, domain-adaptive, and privacy-aware neural IR systems, robust semantic parsing across diverse query languages, and content discoverability in recommendation and virtual assistant platforms. Recent progress, characterized by hybrid pipelines leveraging advances in LLM prompting, preference alignment, type-directed synthesis, and adaptive query control, demonstrates substantial improvements in data efficiency, retrieval accuracy, and real-world impact (as substantiated by both offline and online metrics). Remaining challenges in faithfulness, privacy, scalability, and generalization continue to shape the active research landscape, with prospects for further integration of concept-driven, ranking-aware, and task-adaptive techniques across new modalities and domains.
