InPars-v2: Open-Source Synthetic IR Pipeline

Updated 9 December 2025
  • InPars-v2 is a modular framework that leverages open-source LLMs and neural rerankers to generate synthetic query–document pairs.
  • It features a multi-stage pipeline including generation, filtering with MonoT5-3B, and GPU-optimized processing for reproducible IR experiments.
  • The toolkit enhances transparency and performance by releasing full code, data, and pretrained models for benchmark evaluations such as BEIR.

InPars-v2 is an open-source, modular framework and dataset generator that leverages LLMs and neural rerankers to create high-quality synthetic query–document pairs for neural information retrieval (NIR) training and evaluation. Developed as an extension and refinement of the original InPars pipeline, InPars-v2 aligns the synthetic data generation process with the goals of transparency, reproducibility, and accessibility, supporting research on the BEIR benchmark and beyond by enabling GPU-centric workflows, extensive customization, and full release of code, data, and pretrained models (Jeronymo et al., 2023, Abonizio et al., 2023, Krastev et al., 19 Aug 2025).

1. Motivation and Overview

Traditional neural IR pipelines are limited by the scarcity of annotated training data, especially in low-label and domain-specific settings. Prior solutions, such as InPars-v1 and Promptagator, introduced LLM-based synthetic data generation, but their reliance on proprietary models (e.g., GPT-3, FLAN) and TPU-based infrastructure constrained reproducibility and accessibility. InPars-v2 addresses these limitations by:

  • Replacing closed-source commercial LLMs with open-source alternatives (notably GPT-J-6B).
  • Integrating a powerful neural reranker (MonoT5-3B) for post-generation filtering of synthetic pairs based on learned relevance signals.
  • Migrating all toolkit components to standard GPU platforms and supporting flexible integration with community-standard IR libraries (Pyserini, ir_datasets).
  • Consolidating synthetic data generation, filtering, reranker training, and evaluation into an end-to-end reproducible research pipeline (Jeronymo et al., 2023, Abonizio et al., 2023, Krastev et al., 19 Aug 2025).

2. Pipeline Architecture and Data Flow

InPars-v2 operationalizes synthetic query generation for IR as a multi-stage process:

  1. Synthetic Query–Document Generation: For each document $d$ in a target corpus $\mathcal{D}$, an LLM (typically GPT-J-6B) is prompted with a fixed template $t$. The default scheme is few-shot: $t$ consists of several (query, document) exemplars (usually from MS MARCO), optionally including guided negative samples ("bad questions"). Prompt configurations include:
    • "inpars" (three generic, static exemplars)
    • "inpars-gbq" (guided by labeling one example as "bad" and others as "good")
    • Dataset-specific or user-defined templates ("promptagator", "custom")

Decoding is greedy by default (temperature = 1.0), and outputs are saved together with prompt metadata and token-level probabilities (Jeronymo et al., 2023, Abonizio et al., 2023, Krastev et al., 19 Aug 2025).
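As a concrete illustration of this step, the sketch below uses the HuggingFace transformers interface with GPT-J-6B; the two-exemplar template and helper names are illustrative stand-ins (the actual "inpars" prompt holds three static MS MARCO exemplars), not the toolkit's exact code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", torch_dtype=torch.float16, device_map="auto"
)

# Illustrative two-exemplar template; the real "inpars" prompt uses
# three static (document, query) exemplars drawn from MS MARCO.
FEW_SHOT_TEMPLATE = """Example 1:
Document: {ex1_doc}
Good question: {ex1_query}

Example 2:
Document: {ex2_doc}
Good question: {ex2_query}

Example 3:
Document: {doc}
Good question:"""

def generate_query(doc_text: str, exemplars: dict) -> str:
    """Greedily decode one synthetic query for a corpus document."""
    prompt = FEW_SHOT_TEMPLATE.format(doc=doc_text, **exemplars)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=False,            # greedy decoding, the pipeline default
        return_dict_in_generate=True,
        output_scores=True,         # token scores are saved for filtering
    )
    new_tokens = out.sequences[0, inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()
```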

  2. Filtering of Synthetic Pairs: Two filtering mechanisms are offered (both are sketched in code after this step):
    • Score-based: the average per-token log-probability under the generator, as in InPars-v1:

    $$p_n(q \mid t, d) = \frac{1}{|q|} \sum_i \log G(q_i \mid t, d, q_{<i})$$

    The top $K$ pairs by $p_n$ are retained.
    • Reranker-based (default for InPars-v2): a pretrained MonoT5-3B reranker, fine-tuned on MS MARCO, scores each (query, document) pair:

    $$S(q,d) = R(q, d; \theta_0)$$

    The top $K$ pairs by $S(q,d)$ are selected.

Additional constraints eliminate overlapping queries and enforce minimum/maximum length filters (Jeronymo et al., 2023, Abonizio et al., 2023, Krastev et al., 19 Aug 2025).
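Both strategies reduce to a top-$K$ sort over scored pairs. The sketch below assumes each generated pair carries its saved token log-probabilities and that the reranker exposes a score(query, document) method; both are illustrative interfaces rather than the toolkit's exact API.

```python
import numpy as np

def filter_by_logprob(pairs: list, k: int = 10_000) -> list:
    """InPars-v1-style: keep the top-k pairs by mean token log-probability."""
    for p in pairs:
        p["score"] = float(np.mean(p["token_logprobs"]))
    return sorted(pairs, key=lambda p: p["score"], reverse=True)[:k]

def filter_by_reranker(pairs: list, reranker, k: int = 10_000) -> list:
    """InPars-v2 default: keep the top-k pairs by MonoT5-3B relevance score.
    `reranker.score(query, doc)` is an assumed interface."""
    for p in pairs:
        p["score"] = reranker.score(p["query"], p["document"])
    return sorted(pairs, key=lambda p: p["score"], reverse=True)[:k]
```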

  3. Negative Mining and Reranker Fine-tuning: For each positive (highly scored) pair, a hard negative is mined by retrieving 1,000 candidates with BM25 and randomly sampling a non-relevant document (see the Pyserini sketch below). The synthesized training dataset thus contains $K$ positives and $K$ negatives per corpus. MonoT5-3B is then fine-tuned for a single epoch over these triples using a standard pointwise loss (Jeronymo et al., 2023, Abonizio et al., 2023).
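A minimal sketch of the negative-mining step with Pyserini follows; the prebuilt index name is illustrative, and any BM25-searchable index over the target corpus would serve.

```python
import random
from pyserini.search.lucene import LuceneSearcher

# Prebuilt BEIR index for SciFact; the index name is illustrative.
searcher = LuceneSearcher.from_prebuilt_index("beir-v1.0.0-scifact.flat")

def mine_negative(query: str, positive_docid: str) -> str:
    """Retrieve 1,000 BM25 candidates and sample one as a hard negative."""
    hits = searcher.search(query, k=1000)
    candidates = [h.docid for h in hits if h.docid != positive_docid]
    return random.choice(candidates)
```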

  4. Evaluation: A two-stage retrieval (BM25 top $k$, typically 1,000, followed by reranker scoring and reordering) outputs results for standard evaluation metrics (nDCG@10, recall@100, MAP), computed via pytrec_eval on TREC-formatted runs (Abonizio et al., 2023, Krastev et al., 19 Aug 2025).
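The metric computation maps directly onto pytrec_eval. In the sketch below, run (qid → {docid: reranker score}) and qrels (qid → {docid: relevance}) are assumed to be loaded already, e.g. from TREC-formatted files.

```python
import pytrec_eval

def evaluate(qrels: dict, run: dict) -> dict:
    """Average nDCG@10, recall@100, and MAP over all queries."""
    evaluator = pytrec_eval.RelevanceEvaluator(
        qrels, {"ndcg_cut.10", "recall.100", "map"}
    )
    per_query = evaluator.evaluate(run)
    return {
        measure: sum(q[measure] for q in per_query.values()) / len(per_query)
        for measure in ("ndcg_cut_10", "recall_100", "map")
    }
```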

3. Core Algorithms and Mathematical Objectives

The central building blocks and their quantitative objectives are summarized as follows:

  • Synthetic Query Generation: A synthetic query $q$ for document $d$ is conditioned on prompt $t$ via:

$$L_{\text{gen}} = -\mathbb{E}_{d,q^*} \sum_{i=1}^{|q^*|} \log p_\theta(q^*_i \mid t, d, q^*_{<i})$$

(used for generator fine-tuning, if applicable) (Krastev et al., 19 Aug 2025).

  • Reranker Training: MonoT5-3B is fine-tuned with a pairwise objective over mined triples:

$$L_{\text{rerank}} = -\mathbb{E}_{(q,d^+,d^-)} \left[ \log \sigma\bigl( s(q,d^+) - s(q,d^-) \bigr) \right]$$

or binary cross-entropy per example:

$$\ell = -\left[ y \log \sigma(s) + (1-y) \log\bigl(1 - \sigma(s)\bigr) \right],$$

where $y \in \{0,1\}$ is the synthetic label (Abonizio et al., 2023, Krastev et al., 19 Aug 2025). A PyTorch sketch of these objectives follows.
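These objectives translate line-for-line into PyTorch. The sketch below assumes raw relevance logits from the reranker and per-token log-probabilities of gold queries from the generator; shapes and names are illustrative.

```python
import torch
import torch.nn.functional as F

def generation_nll(token_logprobs: torch.Tensor) -> torch.Tensor:
    """L_gen: NLL of gold queries; token_logprobs has shape [batch, |q*|]."""
    return token_logprobs.sum(dim=-1).neg().mean()

def pairwise_rerank_loss(s_pos: torch.Tensor, s_neg: torch.Tensor) -> torch.Tensor:
    """L_rerank = -E[log sigma(s(q,d+) - s(q,d-))]."""
    return F.logsigmoid(s_pos - s_neg).neg().mean()

def pointwise_bce_loss(s: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Per-example binary cross-entropy against the synthetic label y."""
    return F.binary_cross_entropy_with_logits(s, y.float())
```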

  • Retrieval Metrics:

    • Normalized Discounted Cumulative Gain at cutoff $k$:

    $$\text{nDCG}@k = \frac{\text{DCG}_k(q)}{\text{IDCG}_k(q)}, \qquad \text{DCG}_k(q) = \sum_{i=1}^{k} \frac{2^{\text{rel}_i} - 1}{\log_2(i+1)},$$

    where $\text{rel}_i$ is binary for SciFact and most BEIR datasets (Abonizio et al., 2023, Krastev et al., 19 Aug 2025).
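As a worked example, the nDCG@k formula above transcribes directly into a few lines of Python (binary relevances shown, matching the SciFact setting):

```python
import math

def dcg_at_k(rels: list, k: int) -> float:
    return sum((2**rel - 1) / math.log2(i + 2) for i, rel in enumerate(rels[:k]))

def ndcg_at_k(rels: list, k: int) -> float:
    ideal = dcg_at_k(sorted(rels, reverse=True), k)
    return dcg_at_k(rels, k) / ideal if ideal > 0 else 0.0

# Relevance of the top-5 ranked documents for one query:
print(round(ndcg_at_k([1, 0, 1, 0, 0], k=5), 4))  # -> 0.9197
```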

4. Empirical Results and Reproducibility

InPars-v2 achieves consistent improvements over both traditional retrieval baselines (BM25) and prior synthetic generation pipelines (InPars-v1, Promptagator) on standardized academic metrics and benchmarks:

| Method | nDCG@10 (SciFact) | nDCG@10 (BEIR avg, 18 corpora) |
|---|---|---|
| BM25 | 0.678 | 0.424 |
| InPars-v1 | 0.774 | 0.539 |
| InPars-v2 (GPT-J) | 0.774 | 0.545 |
| monoT5-3B (MS MARCO) | – | 0.533 |

Reading from the table, InPars-v2 gains roughly 12 points of average nDCG@10 over BM25 across BEIR (0.545 vs. 0.424) and edges out InPars-v1 (0.545 vs. 0.539) (Krastev et al., 19 Aug 2025, Jeronymo et al., 2023). Results are robust between GPU and TPU workflows, with per-dataset variance below 1% (Abonizio et al., 2023).

The toolkit provides scripts and pretrained models supporting full reproducibility. All synthetic data, code, and reranker checkpoints for 18 BEIR datasets are openly released (Jeronymo et al., 2023, Abonizio et al., 2023).

5. Key Improvements over InPars-v1

The technical advances provided by InPars-v2 over its predecessor are:

  • Open-source generator: Transitioning from GPT-3 (proprietary) to GPT-J-6B (fully open) with HuggingFace integration, eliminating API and cost dependencies.
  • Neural reranker-based filtering: Use of MonoT5-3B to filter synthetic pairs achieves higher selection precision compared to generator-only metrics, resulting in improved downstream NIR model performance.
  • Full GPU support: Toolkit is optimized for PyTorch/Transformers on commodity hardware, removing the TPU requirement of previous iterations.
  • Plug-and-play modularity: All LLMs and prompt templates are replaceable via a single configuration flag; reranker architectures and filtering strategies can be customized.
  • Community reproducibility: Complete code, data, and models, with community-standard IR and evaluation system integration (Pyserini, ir_datasets, Pytrec_eval) (Abonizio et al., 2023, Krastev et al., 19 Aug 2025).

6. Known Limitations and Open Challenges

Several open challenges persist in the InPars-v2 framework:

  • Data filtering inefficiency: The 90% rejection rate (100,000 generations → top 10,000 filtered pairs) is computationally expensive and may discard useful samples. There is no ablation on $K$ or exploration of generation/filtering trade-offs (Krastev et al., 19 Aug 2025).
  • Static prompts: The reliance on hard-coded prompt templates may not generalize across all domains; dynamic or dataset-specific prompt strategies show promise for task transfer.
  • Filtering bias: Exclusive reranker-based selection may bias the synthetic dataset toward the inductive preferences of the reranker, potentially narrowing diversity.
  • Scalability: Large-scale query generation with GPT-J remains computationally demanding, requiring significant GPU resources (∼30 hours per 100k docs on A100).
  • No fine-tuning of generators: In the standard pipeline, the LLM generator is not fine-tuned to the IR task, relying solely on prompt engineering.

A plausible implication is that future work could focus on reducing generation redundancy (e.g., through adaptive prompting), benchmarking across different filtering thresholds, or combining multiple filtering objectives to mitigate reranker-induced bias (Krastev et al., 19 Aug 2025).

7. Significance and Applications

InPars-v2 establishes a reproducible, open-source benchmark for neural IR system development in low-supervision scenarios. It allows for rapid prototyping, extensive cross-evaluation, and detailed ablation studies on synthetic data quality vs. system performance, facilitating advances in both IR algorithm research and the assessment of LLMs as controllable data generators. The toolkit's extensible architecture promotes further research into filtering algorithms, prompt engineering, and domain adaptation for NIR training pipelines (Jeronymo et al., 2023, Abonizio et al., 2023, Krastev et al., 19 Aug 2025).
