Re-Ranking Step by Step: Investigating Pre-Filtering for Re-Ranking with Large Language Models (2406.18740v1)

Published 26 Jun 2024 in cs.CL and cs.IR

Abstract: LLMs have been revolutionizing a myriad of natural language processing tasks with their diverse zero-shot capabilities. Indeed, existing work has shown that LLMs can be used to great effect for many tasks, such as information retrieval (IR), and passage ranking. However, current state-of-the-art results heavily lean on the capabilities of the LLM being used. Currently, proprietary, and very large LLMs such as GPT-4 are the highest performing passage re-rankers. Hence, users without the resources to leverage top of the line LLMs, or ones that are closed source, are at a disadvantage. In this paper, we investigate the use of a pre-filtering step before passage re-ranking in IR. Our experiments show that by using a small number of human generated relevance scores, coupled with LLM relevance scoring, it is effectively possible to filter out irrelevant passages before re-ranking. Our experiments also show that this pre-filtering then allows the LLM to perform significantly better at the re-ranking task. Indeed, our results show that smaller models such as Mixtral can become competitive with much larger proprietary models (e.g., ChatGPT and GPT-4).

Investigating Pre-Filtering for Re-Ranking with LLMs

Introduction

The paper "Re-Ranking Step by Step: Investigating Pre-Filtering for Re-Ranking with Large Language Models" investigates whether information retrieval (IR) systems improve when a pre-filtering step is applied before an LLM re-ranks the retrieved passages. The authors, Baharan Nouriinanloo and Maxime Lamothe, examine whether smaller, open-source LLMs can perform competitively with larger, proprietary models such as GPT-4 once irrelevant passages are filtered out before the re-ranking process.

Methodology

Pre-Filtering Approach

The core concept introduced is a pre-filtering step that leverages LLMs to assign relevance scores to passages retrieved by an initial retrieval stage, typically constructed with methods like BM25. This pre-filtering step is designed to discard irrelevant passages based on a threshold relevance score, thereby reducing the noise and the number of passages passed to the re-ranking stage.
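The filtering step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `ScoredPassage` type and the `pre_filter` function are hypothetical names, and keeping passages whose score is greater than or equal to the threshold (rather than strictly greater) is an assumption.

```python
from typing import List, NamedTuple


class ScoredPassage(NamedTuple):
    """A retrieved passage with an LLM-assigned relevance score (hypothetical type)."""
    passage_id: str
    text: str
    relevance: float  # assumed to lie in [0, 1]


def pre_filter(passages: List[ScoredPassage], threshold: float) -> List[ScoredPassage]:
    """Drop passages scored below `threshold`, preserving the first-stage order.

    Assumption: a score exactly equal to the threshold is kept.
    """
    return [p for p in passages if p.relevance >= threshold]


if __name__ == "__main__":
    candidates = [
        ScoredPassage("p1", "passage about the query topic", 0.9),
        ScoredPassage("p2", "unrelated passage", 0.1),
        ScoredPassage("p3", "partially related passage", 0.3),
    ]
    # Only the surviving passages would be handed to the re-ranking stage.
    for passage in pre_filter(candidates, threshold=0.3):
        print(passage.passage_id, passage.relevance)
```

Because the filter preserves the order produced by the first-stage retriever, the re-ranker sees the same candidates as before, only fewer of them.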

The authors employ Mixtral-8x7B-Instruct, a smaller open-source LLM, for both the pre-filtering and re-ranking processes. The pre-filtering uses a prompting strategy that incorporates Chain-of-Thought (CoT) and Plan-and-Solve (PS) reasoning methods to ensure a thorough understanding of queries and passages before generating relevance scores.
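To make the scoring step concrete, the sketch below builds a reasoning-then-score prompt and extracts the numeric score from a model response. The prompt wording is purely illustrative and is not the prompt used in the paper; `PROMPT_TEMPLATE`, `build_prompt`, and `parse_score` are hypothetical names, and the convention of ending the response with a `Score:` line is an assumption made so the output can be parsed reliably.

```python
import re
from typing import Optional

# Illustrative wording only; the paper's actual prompt is not reproduced here.
PROMPT_TEMPLATE = (
    "Query: {query}\n"
    "Passage: {passage}\n\n"
    "First, understand the query and the passage. Then devise a plan to judge "
    "how well the passage answers the query, and carry it out step by step.\n"
    "Finish with a single line of the form 'Score: <number between 0 and 1>'."
)


def build_prompt(query: str, passage: str) -> str:
    """Fill the (hypothetical) template for one query-passage pair."""
    return PROMPT_TEMPLATE.format(query=query, passage=passage)


def parse_score(response: str) -> Optional[float]:
    """Return the last 'Score: x' value in the response, or None if absent or out of range."""
    matches = re.findall(r"Score:\s*(\d+(?:\.\d+)?)", response)
    if not matches:
        return None
    score = float(matches[-1])
    return score if 0.0 <= score <= 1.0 else None


if __name__ == "__main__":
    print(build_prompt("what causes tides", "Tides are caused by the moon's gravity."))
    print(parse_score("Step 1: the passage addresses the query directly.\nScore: 0.9"))
```

Returning `None` for a missing or out-of-range score leaves it to the caller to decide whether to retry the model or to treat the passage as unscored.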

Prompt Design and Threshold Setting

The designed prompt instructs the LLM first to comprehend the query and passages and then to assign a relevance score within the range of 0 to 1. The threshold for relevance is determined using a small, human-generated subset of expert relevance scores (qrels) from the datasets. The threshold is set to maximize the F1 score, balancing precision and recall to ensure optimal filtering performance.
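Threshold selection by F1 maximization can be sketched as a simple sweep over candidate cut-offs. This is a minimal example under stated assumptions: the candidate grid (0.1 to 0.9 in steps of 0.1), the binary relevance labels, and the function names `f1_at_threshold` and `best_threshold` are illustrative choices, not details taken from the paper.

```python
from typing import List, Optional, Sequence


def f1_at_threshold(scores: Sequence[float], labels: Sequence[bool], threshold: float) -> float:
    """F1 of the rule 'keep the passage if score >= threshold' against binary labels."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y)
    fp = sum(1 for p, y in zip(predicted, labels) if p and not y)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y)
    if tp == 0:
        return 0.0  # precision and recall are both zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(
    scores: Sequence[float],
    labels: Sequence[bool],
    candidates: Optional[List[float]] = None,
) -> float:
    """Return the candidate threshold with the highest F1 (lowest one wins ties)."""
    if candidates is None:
        candidates = [i / 10 for i in range(1, 10)]  # assumed grid: 0.1 .. 0.9
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))


if __name__ == "__main__":
    llm_scores = [0.9, 0.8, 0.4, 0.2, 0.1]
    human_labels = [True, True, True, False, False]  # from a small labeled sample
    print(best_threshold(llm_scores, human_labels))
```

Ties are resolved in favor of the lowest threshold because `max` returns the first maximal candidate; a lower threshold discards fewer passages, which is the more conservative choice for a filter.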

Experimental Evaluation

Datasets

The methodology was tested on two widely used IR benchmarks: TREC-DL (2019 and 2020) and BEIR. Within BEIR, the evaluation focuses on four tasks: Covid, Touche, Signal, and News. These datasets cover a diverse range of queries and text types, making them suitable for testing the robustness of the proposed methodology.

Metrics and Baselines

Performance was evaluated using the nDCG@10 metric, which is standard for these datasets. The results were compared against both supervised and unsupervised state-of-the-art re-ranking methods, including monoBERT, monoT5, RankT5, and RankGPT. Additionally, the paper compares the re-ranking pipeline with and without the pre-filtering step to isolate the improvement that pre-filtering contributes.
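For readers unfamiliar with the metric, the following is a minimal sketch of nDCG@k using one common formulation (exponential gain with a log2 rank discount). Evaluation tools differ in their gain function and tie handling, so values from this sketch may not match those of the tooling used in the paper; the function names and the optional `all_gains` parameter are illustrative.

```python
import math
from typing import Optional, Sequence


def dcg_at_k(gains: Sequence[int], k: int = 10) -> float:
    """Discounted cumulative gain over the top-k graded relevance labels."""
    return sum((2 ** g - 1) / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(
    ranked_gains: Sequence[int],
    k: int = 10,
    all_gains: Optional[Sequence[int]] = None,
) -> float:
    """nDCG@k for one query.

    `ranked_gains` holds the relevance labels of the returned passages in ranked order.
    `all_gains` optionally holds the labels of every judged passage for the query, so
    that relevant passages missing from the ranking still lower the score.
    """
    pool = all_gains if all_gains is not None else ranked_gains
    ideal_dcg = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


if __name__ == "__main__":
    # A ranking that places a non-relevant passage first scores below a perfect ranking.
    print(ndcg_at_k([0, 3, 2, 1]))
    print(ndcg_at_k([3, 2, 1, 0]))
```

Passing `all_gains` matters when a filter is in play: if a relevant passage is discarded before re-ranking, it no longer appears in `ranked_gains`, and only the ideal ranking built from the full judged pool reflects that loss.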

Results

The pre-filtering approach improves re-ranking performance across the evaluated datasets, performing particularly well on two BEIR tasks: Touche and Signal. The optimal thresholds identified were 0.3 for the BEIR tasks and 0.6/0.7 for the TREC tasks, which suggests that the best cut-off varies by dataset and should be tuned on a small labeled sample. With these thresholds, the smaller LLM reaches performance levels competitive with much larger, resource-intensive models.

For instance, the method improved Mixtral-8x7B-Instruct's nDCG@10 scores from 60.88 to 69.39 on TREC-DL2019 (a gain of 8.51 points) and from 55.85 to 64.42 on TREC-DL2020 (a gain of 8.57 points). These results support the method's efficacy and the broader implication that well-tuned, smaller LLMs can achieve high performance in resource-constrained environments.

Implications and Future Work

This research contributes to the field of IR by showing that pre-filtering can enable smaller LLMs to be competitive with larger, proprietary models, making LLM-based re-ranking more widely accessible. In practical terms, more organizations and researchers could use state-of-the-art re-ranking methods without depending on expensive, resource-heavy LLMs.

Furthermore, because the threshold-setting methodology relies on only a small number of expert-generated qrels, it may reduce the need for extensive labeling in future research.

For future work, expanding the testing to a broader range of datasets and models could provide further insights and refinements. Also, exploring adaptive thresholds that dynamically adjust based on query characteristics might offer an additional performance boost.

Conclusion

The paper presents an innovative, effective strategy for enhancing passage re-ranking in IR systems using LLMs. By introducing a pre-filtering step, the method significantly reduces irrelevant passage noise, thereby enhancing the overall effectiveness of the re-ranking process. This approach not only highlights possibilities for resource-efficient application of advanced LLM technologies but also opens avenues for further research into optimizing IR pipelines.
