Investigating Pre-Filtering for Re-Ranking with LLMs
Introduction
The paper "Re-Ranking Step by Step: Investigating Pre-Filtering for Re-Ranking with LLMs" examines whether information retrieval (IR) systems can be improved by adding a pre-filtering step before LLMs are used for passage re-ranking. The authors, Baharan Nouriinanloo and Maxime Lamothe, investigate whether smaller, open-source LLMs can perform competitively with larger, proprietary models such as GPT-4 once irrelevant passages are filtered out before re-ranking.
Methodology
Pre-Filtering Approach
The core concept is a pre-filtering step in which an LLM assigns a relevance score to each passage returned by the initial retrieval stage, typically a first-stage retriever such as BM25. Passages whose score falls below a threshold are discarded, which reduces both the noise and the number of passages passed to the re-ranking stage.
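The filtering step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names prefilter and toy_score are hypothetical, toy_score is a term-overlap stand-in for the LLM scorer, and keeping passages whose score is greater than or equal to the threshold (rather than strictly greater) is an assumption.

```python
from typing import Callable, List, Tuple


def prefilter(
    query: str,
    passages: List[str],
    score_fn: Callable[[str, str], float],
    threshold: float,
) -> List[Tuple[str, float]]:
    """Return (passage, score) pairs whose score is at least `threshold`.

    The original retrieval order is preserved; re-ranking happens later.
    """
    kept = []
    for passage in passages:
        score = score_fn(query, passage)
        if score >= threshold:  # assumption: inclusive comparison
            kept.append((passage, score))
    return kept


def toy_score(query: str, passage: str) -> float:
    """Hypothetical stand-in for the LLM: fraction of query terms in the passage."""
    terms = set(query.lower().split())
    words = set(passage.lower().split())
    return len(terms & words) / len(terms) if terms else 0.0


if __name__ == "__main__":
    candidates = [
        "covid vaccine trial results",
        "football scores today",
        "new vaccine approved",
    ]
    for passage, score in prefilter("covid vaccine", candidates, toy_score, 0.3):
        print(f"{score:.2f}  {passage}")
```

In a real pipeline, score_fn would wrap a call to the LLM and parse the score from its response; only the surviving passages would then be handed to the re-ranker.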
The authors employ Mixtral-8x7B-Instruct, a smaller open-source LLM, for both the pre-filtering and re-ranking processes. The pre-filtering uses a prompting strategy that incorporates Chain-of-Thought (CoT) and Plan-and-Solve (PS) reasoning methods to ensure a thorough understanding of queries and passages before generating relevance scores.
Prompt Design and Threshold Setting
The prompt instructs the LLM first to comprehend the query and passages and then to assign each passage a relevance score between 0 and 1. The relevance threshold is determined from a small subset of the datasets' human-generated expert relevance judgments (qrels): it is set to the value that maximizes the F1 score on that subset, balancing precision (few irrelevant passages kept) against recall (few relevant passages discarded).
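One way to picture the threshold selection is a simple sweep over candidate values, keeping the one with the highest F1 on the labeled subset. The sketch below makes several assumptions that the summary does not confirm: qrels are binarized to 0/1 labels, candidates lie on a 0.1 grid, ties go to the lowest threshold, and the names f1_at_threshold and best_threshold are hypothetical.

```python
from typing import List, Sequence, Tuple


def f1_at_threshold(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> float:
    """F1 of the rule 'keep if score >= threshold' against binary labels."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    if tp == 0:
        return 0.0  # precision or recall is zero (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(
    scores: Sequence[float], labels: Sequence[int], candidates: List[float]
) -> Tuple[float, float]:
    """Return (threshold, f1) for the candidate with the highest F1.

    Ties are broken in favor of the earliest candidate in the list.
    """
    best_t, best_f1 = candidates[0], -1.0
    for t in candidates:
        f1 = f1_at_threshold(scores, labels, t)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


if __name__ == "__main__":
    llm_scores = [0.9, 0.8, 0.4, 0.2, 0.1]  # made-up scores for illustration
    qrel_labels = [1, 1, 1, 0, 0]           # made-up binary judgments
    grid = [i / 10 for i in range(1, 10)]   # 0.1, 0.2, ..., 0.9
    print(best_threshold(llm_scores, qrel_labels, grid))
```

The data in the usage example is invented purely to exercise the code and does not come from the paper.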
Experimental Evaluation
Datasets
The methodology was tested on two widely used IR benchmarks, TREC-DL (2019 and 2020) and BEIR, with four BEIR tasks selected: Covid, Touche, Signal, and News. Together these cover a diverse range of queries and text types, making them suitable for testing the robustness of the proposed methodology.
Metrics and Baselines
Performance was evaluated using the nDCG@10 metric, which is standard for these datasets. The results were compared against both supervised and unsupervised state-of-the-art re-ranking methods, including monoBERT, monoT5, RankT5, and RankGPT. Additionally, the paper compares the re-ranking pipeline with and without the pre-filtering step to isolate the improvement that filtering provides.
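For readers unfamiliar with the metric, nDCG@10 divides the discounted cumulative gain of the top 10 ranked passages by that of an ideal ordering. The sketch below uses one common formulation (linear gain with a log2 rank discount); some tools use an exponential gain of 2^rel - 1 instead, and the summary does not say which variant the paper's evaluation relies on. The function names and the optional judged_gains parameter are hypothetical.

```python
import math
from typing import Optional, Sequence


def dcg_at_k(gains: Sequence[float], k: int = 10) -> float:
    """Discounted cumulative gain over the first k graded relevance values."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(
    ranked_gains: Sequence[float],
    judged_gains: Optional[Sequence[float]] = None,
    k: int = 10,
) -> float:
    """nDCG@k for one query.

    ranked_gains: relevance grades of the passages in system order.
    judged_gains: grades of all judged passages for the query, used to build
        the ideal ranking; defaults to ranked_gains when omitted.
    """
    pool = ranked_gains if judged_gains is None else judged_gains
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # A relevant passage placed second instead of first lowers the score.
    print(ndcg_at_k([0, 1]))
```

Per-query values lie between 0 and 1 and are averaged over queries; the figures quoted in this summary appear to be reported on a 0 to 100 scale.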
Results
The pre-filtering approach notably improves re-ranking performance across the datasets, with particularly strong results on two BEIR tasks: Touche and Signal. The optimal thresholds identified were 0.3 for the BEIR tasks and 0.6/0.7 for the TREC tasks. With these thresholds, the smaller LLM reaches performance competitive with much larger, resource-intensive models.
For instance, the method improved Mixtral-8x7B-Instruct’s nDCG@10 from 60.88 to 69.39 on TREC-DL 2019 (a gain of 8.51 points) and from 55.85 to 64.42 on TREC-DL 2020 (a gain of 8.57 points). These results support the method's efficacy and suggest that well-tuned, smaller LLMs can achieve high performance in resource-constrained environments.
Implications and Future Work
This research contributes significantly to the field of IR by showing that pre-filtering can enable smaller LLMs to be competitive with larger, proprietary models, thus democratizing the application of AI in IR tasks. In practical terms, this could mean that more organizations and researchers could afford to use state-of-the-art re-ranking methods without being tied to expensive and resource-heavy LLMs.
Furthermore, because the threshold is set from only a small subset of expert-generated qrels, the threshold-setting methodology may scale to new datasets without the burden of extensive labeling.
For future work, testing on a broader range of datasets and models could yield further insights and refinements. Adaptive thresholds that adjust dynamically to query characteristics might also offer an additional performance boost.
Conclusion
The paper presents an innovative, effective strategy for enhancing passage re-ranking in IR systems using LLMs. By introducing a pre-filtering step, the method significantly reduces irrelevant passage noise, thereby enhancing the overall effectiveness of the re-ranking process. This approach not only highlights possibilities for resource-efficient application of advanced LLM technologies but also opens avenues for further research into optimizing IR pipelines.