Promptagator Methodology
- Promptagator is a synthetic data generation methodology that uses advanced LLM prompting and custom FLAN templates to create high-quality (query, document) pairs.
- It integrates consistency-based filtering to retain semantically stable queries, reaching an nDCG@10 of 0.790 on SciFact (+0.112 over BM25) and improving on earlier InPars methods.
- The pipeline fine-tunes a robust neural reranker (monoT5-3B) on filtered pairs, achieving competitive performance on benchmarks like SciFact and BEIR.
Promptagator is a synthetic data generation methodology for neural information retrieval (NIR) that leverages LLMs to create high-quality (query, document) supervision. Building upon the InPars and InPars-v2 pipelines, Promptagator introduces advanced prompting and filtering strategies, including a proprietary LLM as the generator, custom prompt templates based on the FLAN framework, and consistency-based filtering. This approach has demonstrated competitive retrieval performance in benchmark tasks while introducing distinct engineering and methodological refinements compared to its predecessors (Jeronymo et al., 2023, Abonizio et al., 2023, Krastev et al., 19 Aug 2025).
1. Origin and Motivation
Promptagator was developed in the context of synthetic data generation for neural IR systems suffering from a lack of annotated (query, document) pairs. Earlier work, such as InPars-v1, used GPT-3 with static prompt templates to generate relevant queries for given documents. Subsequently, InPars-v2 improved reproducibility and openness by adopting the open-source GPT-J-6B model and switching to reranker-based filtering, but it still suffered from inefficiencies due to aggressive filtering and continued to rely on static templates. Promptagator aimed to address these limitations by improving the quality of query generation and enhancing the flexibility and expressiveness of prompt templates (Krastev et al., 19 Aug 2025).
2. Pipeline Architecture and Distinctive Components
Promptagator adopts an LLM-centric synthetic data pipeline with several unique elements:
- LLM Generator: Uses a proprietary, large-capacity LLM (a 137B-parameter FLAN-family model) to generate queries from input documents.
- Prompt Engineering: Employs custom, task-specific prompt templates that depart from the static InPars “vanilla” and “GBQ” templates. These templates are based on FLAN-style few-shot demonstrations that condition the LLM with more varied and elaborate context (a hypothetical example is sketched after this list).
- Consistency-Based Filtering: Rather than relying solely on log-probability or reranker scores, Promptagator applies “consistency filtering,” i.e., retaining synthetic queries for which repeated generations are semantically similar or meet specific stability criteria.
- Neural Reranker: Like InPars, Promptagator fine-tunes a powerful reranker (typically monoT5-3B) on the filtered synthetic pairs.
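The exact Promptagator templates are not reproduced here; the following is a hypothetical sketch of what a FLAN-style few-shot prompt for query generation could look like, with the task instruction and demonstration slots as assumed placeholders:

```python
# Hypothetical FLAN-style few-shot prompt for query generation.
# The task instruction and demonstration pairs are illustrative placeholders,
# not the actual Promptagator templates.
FEW_SHOT_TEMPLATE = """\
Read the passage and write a question that the passage answers.

Passage: {example_doc_1}
Question: {example_query_1}

Passage: {example_doc_2}
Question: {example_query_2}

Passage: {document}
Question:"""

def build_prompt(document: str, demonstrations: list[tuple[str, str]]) -> str:
    """Fill the template with two in-domain (passage, question) demonstrations."""
    (d1, q1), (d2, q2) = demonstrations[:2]
    return FEW_SHOT_TEMPLATE.format(
        example_doc_1=d1, example_query_1=q1,
        example_doc_2=d2, example_query_2=q2,
        document=document,
    )
```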
The typical workflow is as follows (a minimal end-to-end sketch is given after the list):
- For each document $d$, a custom FLAN-template prompt is used to generate a candidate query $q$ with the proprietary LLM.
- Multiple generations per document may be performed; only queries achieving high intra-generation consistency are retained.
- The final (query, document) pairs are used to fine-tune a monoT5 reranker, which is then deployed in a two-stage pipeline: sparse retrieval using BM25, followed by re-ranking of the candidate list.
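A minimal sketch of this loop, assuming generic `prompt_fn`, `generate`, and `is_consistent` callables (placeholders, not interfaces from the original work), is:

```python
from typing import Callable

def synthesize_pairs(
    documents: list[str],
    prompt_fn: Callable[[str], str],             # doc -> FLAN-style few-shot prompt
    generate: Callable[[str], str],              # LLM call: prompt -> query text
    is_consistent: Callable[[list[str]], bool],  # stability check over repeated samples
    num_samples: int = 4,
) -> list[tuple[str, str]]:
    """Generate candidate queries per document and keep only consistent ones."""
    pairs: list[tuple[str, str]] = []
    for doc in documents:
        prompt = prompt_fn(doc)
        samples = [generate(prompt) for _ in range(num_samples)]
        if is_consistent(samples):
            pairs.append((samples[0], doc))      # one representative (query, doc) pair
    return pairs
```

The retained pairs then feed the monoT5 fine-tuning step; at inference time, BM25 produces the candidate list that the fine-tuned reranker reorders.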
3. Filtering, Scoring, and Mathematical Objectives
Promptagator introduces “consistency filtering” to augment or replace relevance or log-probability filtering. The core premise is to filter synthetic queries by their consistency across multiple generations, favoring outputs that are stable and semantically robust under prompt re-sampling. This contrasts with previous approaches:
- InPars-v1: Filtering by average per-token log-probability under the generator LLM.
- InPars-v2: Filtering by a neural reranker’s relevance score $R_\theta(q, d)$.
- Promptagator: Filtering by semantic or generation consistency using custom FLAN prompts, potentially yielding higher-quality, lower-variance synthetic queries (a sketch of such a check follows).
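One way to realize this criterion, shown as a minimal sketch that assumes an external sentence-embedding function `embed` (a placeholder, not part of the published pipeline), is to require high mean pairwise cosine similarity among repeated generations:

```python
import numpy as np

def pairwise_cosine_consistency(samples: list[str],
                                embed,                  # placeholder: str -> np.ndarray
                                threshold: float = 0.8) -> bool:
    """Keep a document's synthetic query only if its repeated generations are
    mutually similar in embedding space (mean pairwise cosine >= threshold)."""
    vecs = [embed(s) for s in samples]
    vecs = [v / (np.linalg.norm(v) + 1e-9) for v in vecs]
    sims = [float(vecs[i] @ vecs[j])
            for i in range(len(vecs)) for j in range(i + 1, len(vecs))]
    return bool(np.mean(sims) >= threshold) if sims else False
```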
The downstream model fine-tuning objective remains the same as in prior work. For a reranker $R_\theta(q, d)$, fine-tuning is performed using the standard binary cross-entropy loss

$$\mathcal{L}(\theta) = -\sum_{(q,\, d^{+})} \log \sigma\big(R_\theta(q, d^{+})\big) \;-\; \sum_{(q,\, d^{-})} \log\big(1 - \sigma\big(R_\theta(q, d^{-})\big)\big),$$

where $\sigma$ is the sigmoid function, $d^{+}$ denotes positive documents, and $d^{-}$ denotes negatives (Krastev et al., 19 Aug 2025).
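For illustration, this loss can be written over a batch of reranker logits as follows (a minimal PyTorch sketch of the generic binary cross-entropy form stated above, not the exact monoT5 training code):

```python
import torch
import torch.nn.functional as F

def reranker_bce_loss(pos_logits: torch.Tensor,
                      neg_logits: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy over reranker scores R_theta(q, d):
    positives are labeled 1, negatives 0, with the sigmoid applied to the logits."""
    logits = torch.cat([pos_logits, neg_logits])
    labels = torch.cat([torch.ones_like(pos_logits),
                        torch.zeros_like(neg_logits)])
    return F.binary_cross_entropy_with_logits(logits, labels)
```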
Evaluation employs nDCG@k (Normalized Discounted Cumulative Gain) and Recall@k:

$$\text{nDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k}, \qquad \text{DCG@}k = \sum_{i=1}^{k} \frac{2^{\mathrm{rel}_i} - 1}{\log_2(i+1)},$$

$$\text{Recall@}k = \frac{|\{\text{relevant documents in top } k\}|}{|\{\text{relevant documents}\}|},$$

where $\mathrm{rel}_i$ is the graded relevance of the document at rank $i$ and IDCG@k is the DCG@k of the ideal ranking.
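A straightforward reference implementation of both metrics for a single query, assuming graded relevance labels aligned with the ranked list, is:

```python
import math

def ndcg_at_k(gains: list[float], k: int) -> float:
    """gains[i] is the graded relevance of the document ranked at position i."""
    dcg = sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains[:k]))
    ideal = sorted(gains, reverse=True)
    idcg = sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k results."""
    return len(set(retrieved[:k]) & relevant) / len(relevant) if relevant else 0.0
```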
4. Empirical Performance and Benchmarking
Promptagator has demonstrated competitive performance on standard IR benchmarks. As reported on the SciFact dataset, Promptagator achieves nDCG@10 of 0.790 (+0.112 over BM25). This result leverages the custom FLAN prompt template and consistency filtering. The table below summarizes nDCG@10 scores for several methods on SciFact (Krastev et al., 19 Aug 2025):
| Method | nDCG@10 | Δ vs BM25 |
|---|---|---|
| BM25 (sparse) | 0.678 | – |
| InPars-V1 | 0.774 | +0.096 |
| InPars-V2 | 0.774 | +0.096 |
| Promptagator* | 0.790 | +0.112 |
*Promptagator uses a custom FLAN template and consistency filtering. Reproduction studies confirmed these results within ±0.01 nDCG.
Promptagator’s performance across other BEIR datasets is generally on par with, or exceeds, that of open-source InPars-v2—despite relying on a closed, proprietary LLM (Jeronymo et al., 2023, Abonizio et al., 2023).
5. Reproducibility, Limitations, and Extensions
Promptagator was initially limited by the absence of released code and the need for closed, high-capacity LLMs, which restricted accessibility and reproducibility in research environments (Abonizio et al., 2023). In contrast, InPars-v2 and the InPars Toolkit prioritize open-source LLMs (e.g., GPT-J-6B), modularity, and plug-and-play prompting/filtering strategies.
Promptagator-type methods typically require:
- Substantial computational resources due to large-scale generation and repeated consistency checks.
- Domain expertise to craft effective, semantically rich FLAN prompt templates.
- Continued reliance on closed, proprietary models for generation, although open-source alternatives are being explored (Krastev et al., 19 Aug 2025).
Subsequent research, including InPars+ and InPars-light, extends the Promptagator methodology by introducing techniques such as:
- Generator fine-tuning via Contrastive Preference Optimization (CPO) to improve query quality.
- Replacement of static templates with dynamic, Chain-of-Thought (CoT) prompts using frameworks like DSPy, reducing the need for aggressive filtering (Krastev et al., 19 Aug 2025).
6. Relation to Adjacent Synthetic Data Methods
Promptagator is part of a broader trend in NIR towards LLM-driven synthetic data pipelines. Its distinctive use of FLAN-style prompts and consistency-based filtering sharply contrasts with the log-prob or reranker-centric selection criteria of InPars-v1 and InPars-v2. Nevertheless, empirical comparisons on benchmarks such as BEIR and SciFact indicate only marginal performance differences between advanced Promptagator pipelines and open, reranker-filtered pipelines built on recent LLMs.
A plausible implication is that improvements driven by prompt engineering and filtering are nearing saturation for certain tasks, and further progress may depend on more advanced generator fine-tuning or dynamic prompt learning (Krastev et al., 19 Aug 2025).
7. Summary Table: Methodological Differences
| Method | LLM Generator | Prompt Strategy | Filtering Criterion |
|---|---|---|---|
| InPars-V1 | GPT-3 (closed) | Vanilla/GBQ | Log-prob per token |
| InPars-V2 | GPT-J-6B | Same as V1 | MonoT5-3B reranker score |
| Promptagator | FLAN (closed) | Custom FLAN template | Consistency-based filtering |
| InPars+ | LLM, CPO-tuned | Dynamic CoT (DSPy) | Minimized filtering needed |
This organization clarifies the evolutionary path from static, log-probability-based pipelines to Promptagator’s flexible, generative, and consistency-based approach, and finally towards dynamic, model-driven prompt optimization.