InPars Toolkit: Synthetic IR Data Pipeline
- InPars Toolkit is a unified pipeline that generates synthetic queries using LLMs for training neural information retrieval models.
- It offers modular components for query generation, filtering, negative mining, reranker training, and evaluation, all compatible with GPU and TPU hardware.
- The toolkit integrates with IR libraries like Pyserini and ir_datasets, enabling reproducible research and extensive benchmarking across IR tasks.
The InPars Toolkit is a unified, end-to-end pipeline facilitating the generation of synthetic training data for neural information retrieval (IR) using LLMs. Designed to overcome data scarcity in IR tasks, it enables reproducible research and flexible experimentation by providing modular components for query generation, filtering, negative mining, reranker training, and evaluation. Its architecture consolidates earlier advances, supports GPU and TPU hardware, and integrates seamlessly with widely adopted IR libraries and benchmarks.
1. Unified Architecture and Design Principles
The InPars Toolkit centralizes synthetic data generation for IR, reorganizing earlier methods—in particular, InPars (Bonifacio et al., 2022), InPars-v2 (Jeronymo et al., 2023), and InPars-Light (Boytsov et al., 2023)—into a reproducible pipeline. The toolkit harmonizes distinct research codebases, offering:
- Core modules for synthetic query generation, token-probability/batch filtering, and negative mining;
- Training and evaluation scripts for various reranker models;
- Plug-and-play configuration to swap LLMs (for generation) and reranker architectures;
- Compatibility with both TPU and GPU infrastructure, promoting wide usability;
- Tight integration with IR libraries such as Pyserini and ir_datasets, facilitating access to the BEIR benchmark (Abonizio et al., 2023).
Toolkit users configure prompt style (e.g., "Vanilla", "GBQ", or Promptagator-inspired), query generator, number of few-shot exemplars, and filtering/run parameters. This modularity enables adaptation to diverse IR domains and systematic evaluation across hardware and model scales.
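The exact flags vary by release, but a run is parameterized along these axes. As a purely illustrative sketch (the key names below are ours, not the toolkit's actual API):

```python
# Hypothetical run configuration illustrating the toolkit's main knobs.
# Key names are invented for exposition; consult the repository for the real CLI.
config = {
    "prompt_style": "gbq",               # "vanilla", "gbq", or Promptagator-inspired
    "generator": "EleutherAI/gpt-j-6b",  # any causal LM from the Hugging Face Hub
    "num_fewshot_examples": 3,           # (query, document) exemplars in the prompt
    "filtering": {"method": "token_logprob", "keep_top_k": 10_000},
    "negative_mining": {"retriever": "bm25", "depth": 1000},
    "reranker": "castorini/monot5-base-msmarco",
}
```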
2. Synthetic Data Generation Methods
At its core, the toolkit leverages LLMs for few-shot synthetic query generation. For each corpus document $d$ and a prompt template composed of few-shot (query, document) exemplars, the LLM is conditioned to produce a plausible query:

$$q \sim p_{\mathrm{LLM}}\left(\cdot \mid \mathrm{prompt}(d)\right)$$

Query generation quality is assessed via the average log token probability:

$$\bar{p}_q = \frac{1}{|q|} \sum_{i=1}^{|q|} \log p\left(q_i \mid \mathrm{prompt}, d, q_{<i}\right)$$

The top-$K$ (query, document) pairs (ranked by $\bar{p}_q$ or, in later variants, by reranker score) are retained as positive examples. Negative examples are mined by retrieving alternative documents with BM25 for each synthetic query, yielding (query, positive document, negative document) training triples.
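As a concrete illustration of both steps, the sketch below builds a few-shot prompt, samples a query, and computes $\bar{p}_q$ with Hugging Face transformers. The small model and prompt wording are stand-ins chosen for brevity, not the toolkit's exact template:

```python
# Minimal sketch of few-shot query generation plus average-log-probability
# filtering. Single-document version; batching and EOS/pad masking omitted.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-125m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/gpt-neo-125m")

def build_prompt(exemplars, document):
    """Compose a vanilla-style few-shot prompt from (document, query) pairs."""
    parts = [f"Document: {d}\nQuery: {q}\n" for d, q in exemplars]
    parts.append(f"Document: {document}\nQuery:")
    return "\n".join(parts)

@torch.no_grad()
def generate_and_score(exemplars, document, max_new_tokens=32):
    inputs = tokenizer(build_prompt(exemplars, document), return_tensors="pt")
    out = model.generate(
        **inputs,
        max_new_tokens=max_new_tokens,
        do_sample=True,
        top_p=0.9,
        pad_token_id=tokenizer.eos_token_id,
        return_dict_in_generate=True,
        output_scores=True,
    )
    # Per-token log-probabilities of the sampled continuation.
    logprobs = model.compute_transition_scores(
        out.sequences, out.scores, normalize_logits=True
    )
    new_ids = out.sequences[0, inputs["input_ids"].shape[1]:]
    query = tokenizer.decode(new_ids, skip_special_tokens=True).split("\n")[0].strip()
    return query, logprobs[0].mean().item()  # (query, mean log-prob)
```

In the original InPars recipe, queries are ranked by this score and only the highest-scoring pairs (10,000 out of 100,000 generated) survive as positives before negative mining.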
Notable variants within the toolkit (see table) reflect methodological innovations in model choice, prompting, and selection/filtering mechanisms:
| Variant | Generator Model | Filtering Method |
|---|---|---|
| InPars | Proprietary LLMs (GPT-3) | Token likelihood |
| InPars-v2 | GPT-J (open) | Pretrained monoT5-3B reranker |
| InPars-Light | BLOOM (open) | Consistency checking |
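For the InPars-v2 row, filtering replaces token likelihood with a relevance score from a pretrained monoT5 reranker. A minimal sketch of that scoring step, using the public 220M checkpoint as a stand-in for the 3B model from the paper:

```python
# Score a (query, document) pair with monoT5's "Relevant: true/false" protocol.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")  # monoT5 reuses the T5 vocabulary
model = T5ForConditionalGeneration.from_pretrained("castorini/monot5-base-msmarco")
TRUE_ID = tokenizer.convert_tokens_to_ids("▁true")
FALSE_ID = tokenizer.convert_tokens_to_ids("▁false")

@torch.no_grad()
def monot5_score(query: str, document: str) -> float:
    """Probability of 'true' under the monoT5 input template."""
    inputs = tokenizer(
        f"Query: {query} Document: {document} Relevant:",
        return_tensors="pt", truncation=True, max_length=512,
    )
    # Single decoding step from the decoder start token; compare true vs. false.
    logits = model(
        **inputs, decoder_input_ids=torch.zeros(1, 1, dtype=torch.long)
    ).logits[0, -1]
    return torch.softmax(logits[[TRUE_ID, FALSE_ID]], dim=0)[0].item()
```

Synthetic pairs are then ranked by this probability and only the top-scoring ones are kept for training.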
3. Extensions and Enhancements
The toolkit has been extended in recent work (Krastev et al., 19 Aug 2025) to reduce filtering overhead and further optimize generated queries:
- Contrastive Preference Optimization (CPO): Fine-tunes the LLM to distinguish between high- and low-quality queries for a given document, using preference pairs in which the chosen query $q^+$ outscores the rejected query $q^-$ under a composite quality score. The optimization loss combines a contrastive log-ratio term (an upper bound on the DPO loss that needs no frozen reference model) with a behavior-cloning regularizer on the preferred query (see the sketch after this list):

$$\mathcal{L}_{\text{CPO}} = -\,\mathbb{E}\!\left[\log \sigma\!\left(\beta \log \pi_\theta(q^+ \mid d) - \beta \log \pi_\theta(q^- \mid d)\right)\right] - \lambda\,\mathbb{E}\!\left[\log \pi_\theta(q^+ \mid d)\right]$$
- Dynamic Chain-of-Thought (CoT) Prompting with DSPy: Replaces static templates by dynamically optimized, agent-style prompts, instructing the LLM to decompose the document into reasoning steps before query synthesis. This reduces query noise and mitigates dependence on heavy post-generation filtering.
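A minimal sketch of the CPO loss above, assuming the sequence log-probabilities of the chosen and rejected queries have already been computed under the policy being fine-tuned (variable names are ours):

```python
# CPO objective: sigmoid contrastive log-ratio over preferred vs. rejected
# queries, plus a behavior-cloning NLL term on the preferred query.
import torch
import torch.nn.functional as F

def cpo_loss(logp_chosen: torch.Tensor, logp_rejected: torch.Tensor,
             beta: float = 0.1, lam: float = 1.0) -> torch.Tensor:
    """logp_*: per-example sequence log-probs under pi_theta (no frozen
    reference model is needed, unlike DPO)."""
    prefer = -F.logsigmoid(beta * (logp_chosen - logp_rejected)).mean()
    behavior_clone = -logp_chosen.mean()
    return prefer + lam * behavior_clone
```

And a correspondingly minimal DSPy sketch of the dynamic CoT generator; the signature and field names are illustrative, not the optimized prompts from the cited work:

```python
import dspy  # requires dspy.configure(lm=...) with a model of choice before use

class DocumentToQuery(dspy.Signature):
    """Write a search query that the given document answers."""
    document: str = dspy.InputField(desc="passage from the target corpus")
    query: str = dspy.OutputField(desc="plausible user search query")

# ChainOfThought inserts an intermediate reasoning field before the query;
# a DSPy optimizer can then tune the prompt against a retrieval metric.
generate_query = dspy.ChainOfThought(DocumentToQuery)
```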
Empirically, these enhancements reduced required candidate pools and improved downstream retrieval metrics such as nDCG, especially when paired with modern LLMs (e.g., Llama 3.1 8B) (Krastev et al., 19 Aug 2025).
4. Integration with Information Retrieval Ecosystem
The toolkit facilitates standardized, large-scale experimentation by directly interfacing with IR data and indexing tools:
- Pyserini: Provides flat indexes and first-stage BM25 retrievals essential for negative mining and pipeline benchmarking.
- ir_datasets: Enables direct loading of established IR test collections (e.g., all 18 datasets in BEIR), supporting reproducible evaluations.
- Flexible reranker support: monoT5 (220M and 3B), DeBERTa-v3 (435M), and MiniLM (30M) are supported, all configurable via CLI flags.
This integration ensures that toolkit-generated synthetic data and models are immediately compatible with widely used evaluation protocols.
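For example, a BM25 negative-mining step built directly on these libraries might look like the following sketch; the prebuilt-index and dataset identifiers follow Pyserini's and ir_datasets' public naming schemes, with TREC-COVID as a stand-in:

```python
# Mine hard negatives for synthetic queries from a prebuilt BEIR index.
import ir_datasets
from pyserini.search.lucene import LuceneSearcher

dataset = ir_datasets.load("beir/trec-covid")  # corpus, queries, and qrels
searcher = LuceneSearcher.from_prebuilt_index("beir-v1.0.0-trec-covid.flat")

def mine_negatives(synthetic_query: str, positive_doc_id: str, depth: int = 100):
    """Return BM25-retrieved doc ids (minus the positive) as hard negatives."""
    hits = searcher.search(synthetic_query, k=depth)
    return [hit.docid for hit in hits if hit.docid != positive_doc_id]
```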
5. Performance, Efficiency, and Accessibility
Across numerous benchmarks, rerankers fine-tuned on InPars synthetic data outperform classical BM25 and match or surpass alternative data augmentation/zero-shot transfer methods:
- InPars (Bonifacio et al., 2022): monoT5-220M achieves an MRR@10 of 0.2585 on MS MARCO dev (BM25 baseline: 0.1874); monoT5-3B reaches 0.2967.
- InPars-Light (Boytsov et al., 2023): a 30M-parameter MiniLM-L6 reranker with three-shot BLOOM prompting delivers mean nDCG/MRR gains of 7–30% over BM25, at low inference cost (re-ranking only the top 100 documents).
- InPars-v2 (Jeronymo et al., 2023): End-to-end BM25 + monoT5-3B pipeline finetuned on open-source synthetic data attains state-of-the-art results across BEIR.
Open-sourcing every component—including synthetic datasets for 18+ benchmarks, fine-tuned model weights, and reproducible pipeline scripts—lowers the barrier for further research and enables verification, extension, and domain adaptation at scale (Abonizio et al., 2023).
6. Future Directions and Research Implications
The toolkit’s extensibility has facilitated rapid advances and points toward several unresolved or promising avenues:
- Further scaling up synthetic data generation using ever-larger or more domain-specialized LLMs;
- Reduction of filtering expense via in-situ query optimization (as with CPO and dynamic CoT prompting);
- Application to contrastive and cross-modal retrieval paradigms by adapting negative mining and scoring strategies;
- Use of “bad questions” or adversarial pairs as negative examples or pretext tasks for robust model training;
- Democratization of IR research via open releases of both data and models applicable to low-resource topics or emerging domains (e.g., TREC-COVID) (Krastev et al., 19 Aug 2025).
A plausible implication is that modular, self-contained synthetic data pipelines like InPars will play a central role in future IR research, especially as new LLM architectures and data-hungry retrieval modes proliferate.
7. Availability and Open Science Impact
All InPars Toolkit resources are made fully public, including:
- Source code, documentation, and utilities: https://github.com/zetaalphavector/InPars
- Synthetic query sets and fine-tuned reranker checkpoints encompassing all BEIR collections (requiring >2,000 GPU hours to produce) (Abonizio et al., 2023)
- Extended pipelines, CPO-enhanced generators, and DSPy prompt code: https://github.com/danilotpnta/IR2-project (Krastev et al., 19 Aug 2025)
- Lightweight adaptations and further reproducibility scripts: https://github.com/searchivarius/inpars_light/
This transparency makes the toolkit a reference infrastructure for IR researchers seeking reproducibility, extensibility, and modular experimentation in synthetic data–driven neural IR.