InPars-v2: Large Language Models as Efficient Dataset Generators for Information Retrieval (2301.01820v4)

Published 4 Jan 2023 in cs.IR and cs.AI

Abstract: Recently, InPars introduced a method to efficiently use LLMs in information retrieval tasks: via few-shot examples, an LLM is induced to generate relevant queries for documents. These synthetic query-document pairs can then be used to train a retriever. However, InPars and, more recently, Promptagator, rely on proprietary LLMs such as GPT-3 and FLAN to generate such datasets. In this work we introduce InPars-v2, a dataset generator that uses open-source LLMs and existing powerful rerankers to select synthetic query-document pairs for training. A simple BM25 retrieval pipeline followed by a monoT5 reranker finetuned on InPars-v2 data achieves new state-of-the-art results on the BEIR benchmark. To allow researchers to further improve our method, we open source the code, synthetic data, and finetuned models: https://github.com/zetaalphavector/inPars/tree/master/tpu

Authors (7)
  1. Vitor Jeronymo (11 papers)
  2. Luiz Bonifacio (9 papers)
  3. Hugo Abonizio (12 papers)
  4. Marzieh Fadaee (40 papers)
  5. Roberto Lotufo (41 papers)
  6. Jakub Zavrel (5 papers)
  7. Rodrigo Nogueira (70 papers)
Citations (78)

Summary

  • The paper introduces a novel open-source method using LLMs and a reranking system to efficiently generate synthetic query-document pairs.
  • The study demonstrates that InPars-v2 consistently outperforms its predecessor across challenging datasets like TREC-News and Climate-FEVER on the BEIR benchmark.
  • The research promotes a cost-effective and scalable dataset generation approach, reducing reliance on proprietary models while enhancing information retrieval performance.

InPars-v2: LLMs as Efficient Dataset Generators for Information Retrieval

The paper "InPars-v2: LLMs as Efficient Dataset Generators for Information Retrieval" presents a method for generating training data with LLMs to improve information retrieval systems. It shows that open-source LLMs, combined with a strong reranker used as a filter, can produce synthetic query-document pairs for training retrieval models, establishing new state-of-the-art results on the BEIR benchmark.
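To make the retrieval pipeline concrete, the sketch below reranks BM25 candidates with a monoT5-style scorer. It is a minimal illustration under stated assumptions, not the released implementation: the index path and the `castorini/monot5-base-msmarco` checkpoint (a small stand-in for the monoT5-3B reranker used in the paper) are illustrative, and the paper's own fine-tuned checkpoints are distributed through the linked repository.

```python
# Minimal sketch: BM25 retrieval followed by monoT5-style reranking.
# Index path and model checkpoint are illustrative assumptions.
import torch
from pyserini.search.lucene import LuceneSearcher
from transformers import T5ForConditionalGeneration, T5Tokenizer

searcher = LuceneSearcher("indexes/my-beir-corpus")  # assumed local Lucene index
tok = T5Tokenizer.from_pretrained("castorini/monot5-base-msmarco")
model = T5ForConditionalGeneration.from_pretrained("castorini/monot5-base-msmarco").eval()

TRUE_ID = tok.encode("true")[0]    # id of the "true" token
FALSE_ID = tok.encode("false")[0]  # id of the "false" token

@torch.no_grad()
def monot5_score(query: str, document: str) -> float:
    """P('true') from the first decoded token, the standard monoT5 relevance score."""
    prompt = f"Query: {query} Document: {document} Relevant:"
    enc = tok(prompt, return_tensors="pt", truncation=True, max_length=512)
    start = torch.full((1, 1), model.config.decoder_start_token_id)
    logits = model(**enc, decoder_input_ids=start).logits[0, 0]
    return torch.softmax(logits[[FALSE_ID, TRUE_ID]], dim=0)[1].item()

def search(query: str, k: int = 1000, rerank_top: int = 100):
    """BM25 retrieval, then monoT5 reranking of the top candidates."""
    hits = searcher.search(query, k=k)
    # raw() returns the stored document; depending on the index it may be JSON
    # rather than plain text, so real code would extract the text field first.
    scored = [(h.docid, monot5_score(query, searcher.doc(h.docid).raw()))
              for h in hits[:rerank_top]]
    return sorted(scored, key=lambda x: x[1], reverse=True)
```

Reranking only a slice of the BM25 hits is a common way to keep a large cross-encoder affordable; the exact cutoff used in the paper's experiments is not assumed here.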

The original InPars, proposed by Bonifacio et al., used the few-shot capabilities of LLMs to generate queries from documents and then trained retrieval models on the resulting pairs. Both InPars and the later Promptagator relied on proprietary LLMs such as GPT-3 and FLAN for dataset generation. InPars-v2 replaces these with open-source alternatives and uses a powerful reranker to select the best query-document pairs. This switch broadens access to the underlying tools and data while still delivering state-of-the-art performance on standard benchmarks.

InPars-v2 generates synthetic queries with GPT-J, an open-source 6B-parameter model. The generated pairs are then filtered with a pre-trained reranker, monoT5-3B, which estimates the relevance of each query-document pair and discards low-quality examples. Negative examples are drawn from BM25 retrieval results, supplying non-relevant query-document pairs for training. Fine-tuning rerankers on this synthetic data substantially improves retrieval performance; a sketch of the generation and filtering steps follows below. The comparative analysis in the paper shows that InPars-v2 consistently outperforms the earlier InPars-v1 across datasets in the BEIR benchmark, with notable gains on challenging datasets such as TREC-News and Climate-FEVER.
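The sketch below illustrates the generation step under stated assumptions: the few-shot prompt is a simplified stand-in for the templates shipped with the released code, decoding settings are illustrative, and the filtering and negative-mining steps are outlined in comments rather than implemented against a real corpus.

```python
# Sketch of InPars-v2-style synthetic query generation with GPT-J.
# Prompt template and decoding settings are simplified assumptions; the exact
# templates and filtering code are in the open-sourced repository.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", torch_dtype=torch.float16, device_map="auto"
)

# Toy few-shot prefix; InPars prompts contain a handful of document/query examples.
FEW_SHOT = (
    "Example 1:\n"
    "Document: The Eiffel Tower is a wrought-iron lattice tower in Paris, "
    "completed in 1889 for the World's Fair.\n"
    "Relevant Query: when was the eiffel tower built\n\n"
)

@torch.no_grad()
def generate_query(document: str, max_new_tokens: int = 32) -> str:
    """Greedy-decode one synthetic query for a corpus document."""
    prompt = f"{FEW_SHOT}Example 2:\nDocument: {document}\nRelevant Query:"
    enc = tok(prompt, return_tensors="pt", truncation=True,
              max_length=1536).to(model.device)
    out = model.generate(**enc, max_new_tokens=max_new_tokens,
                         do_sample=False, pad_token_id=tok.eos_token_id)
    completion = tok.decode(out[0, enc["input_ids"].shape[1]:],
                            skip_special_tokens=True)
    return completion.split("\n")[0].strip()

# Downstream steps (outline only, paraphrasing the paper's pipeline):
# 1. Generate one query per sampled document from the target corpus.
# 2. Score each (query, source document) pair with monoT5-3B and keep only the
#    highest-scoring pairs (the paper reports keeping the top 10k per dataset).
# 3. Pair each kept query with a BM25-retrieved document other than its source
#    as the negative, yielding (query, positive, negative) training triples.
```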

The paper also confronts the problem of relying on proprietary models, showing that open-source alternatives can generate data of comparable quality. Beyond reducing cost at scale, this shift is evidence of how capable open-source models have become at sophisticated data generation.

In terms of practical and theoretical implications, InPars-v2 demonstrates the viability of integrating and optimizing open-source LLMs for dataset generation, expanding the potential for their use in various domains of natural language processing and information systems. This innovation represents a significant step towards reducing dependency on costly, proprietary LLMs and underscores the necessity for continual enhancement of open-source LLM capabilities.

Future work on applications like InPars-v2 could fine-tune LLMs on increasingly sophisticated query-document pairs, yielding even more precise retrieval systems. As the methodology gains traction, community-driven efforts are likely to further refine the open-source models involved, propelling the field forward.

In conclusion, InPars-v2 provides an accessible, efficient, and effective way to employ LLMs for dataset generation, setting a new precedent in information retrieval research. With open access to the code, synthetic data, and fine-tuned models, academic and industrial work can build on these findings, potentially fostering a new era of information retrieval advances grounded in open-source innovation.
