
FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions (2403.15246v3)

Published 22 Mar 2024 in cs.IR, cs.CL, and cs.LG

Abstract: Modern Language Models (LMs) are capable of following long and complex instructions that enable a large and diverse set of user requests. While Information Retrieval (IR) models use these LMs as the backbone of their architectures, virtually none of them allow users to provide detailed instructions alongside queries, thus limiting their ability to satisfy complex information needs. In this work, we study the use of instructions in IR systems. First, we introduce our dataset FollowIR, which contains a rigorous instruction evaluation benchmark as well as a training set for helping IR models learn to better follow real-world instructions. FollowIR repurposes detailed instructions -- also known as narratives -- developed for professional assessors to evaluate retrieval systems. In particular, we build our benchmark from three collections curated for shared tasks at the Text REtrieval Conference (TREC). These collections contain hundreds to thousands of labeled documents per query, making them suitable for our exploration. Through this process, we can measure how well IR models follow instructions through a new pairwise evaluation framework. Our results indicate that existing retrieval models fail to correctly use instructions, using them for basic keywords and struggling to understand long-form information. However, we show that it is possible for IR models to learn to follow complex instructions: our new FollowIR-7B model has significant improvements after fine-tuning on our training set.

Evaluating Instruction Following in Information Retrieval with the FollowIR Dataset

Introduction

Information Retrieval (IR) models, despite increasingly being built on LLMs, largely remain unable to adapt to user-specified instructions that refine queries. This gap between the capacity of LLMs to understand complex instructions and the limited use of that capacity in IR models marks a significant opportunity for semantic search. The FollowIR paper introduces a comprehensive dataset and benchmark that aim to bridge this gap by evaluating and enhancing the ability of IR models to follow nuanced user instructions. The FollowIR dataset leverages the robust foundation of TREC (Text REtrieval Conference) annotations, providing a rigorous mechanism for benchmarking instruction following in IR through an adapted evaluation framework.

Dataset Construction

The FollowIR dataset is carefully built from deeply judged TREC collections, covering a diverse range of queries and their corresponding human-annotator instructions. Construction involves re-annotating documents under slightly altered instructions to assess how sensitive IR models are to instruction changes. Focusing on the differential impact of instruction modifications on document relevance isolates the specific ability to follow instructions, yielding a rigorous benchmark. The dataset draws on three TREC collections, namely TREC News 2021, TREC Common Core 2017, and TREC Robust 2004, providing a rich foundation for evaluating IR models across varied domains and information needs.
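
To make the benchmark unit concrete, the sketch below shows one plausible way to represent a query, its original and altered instructions, and the two sets of relevance judgments in Python. The class and field names are illustrative assumptions, not the dataset's actual schema.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class InstructionPair:
    """One hypothetical benchmark unit: a TREC query with its original
    narrative-style instruction, a slightly altered instruction, and the
    relevance judgments (qrels) under each version."""
    query_id: str
    query: str
    original_instruction: str   # narrative given to TREC assessors
    modified_instruction: str   # minimally altered so some documents change relevance
    qrels_original: dict[str, int] = field(default_factory=dict)  # doc_id -> label
    qrels_modified: dict[str, int] = field(default_factory=dict)  # doc_id -> re-annotated label

    def changed_documents(self) -> list[str]:
        """Documents whose label flips when the instruction changes;
        these are the ones that isolate instruction following."""
        return [
            doc_id
            for doc_id, label in self.qrels_original.items()
            if self.qrels_modified.get(doc_id, 0) != label
        ]
```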

Evaluation Framework

The evaluation framework introduced alongside the FollowIR dataset, termed p-MRR (pairwise Mean Reciprocal Rank), provides a specialized metric for assessing instruction following by comparing model performance across pairs of original and modified instructions for the same query. It is designed to reflect how document rankings change in response to instruction alterations, thereby directly measuring the capability of IR models to adapt to nuanced instruction changes. This paradigm keeps the assessment of instruction sensitivity distinct from traditional IR evaluation metrics.
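
The exact p-MRR formula is given in the paper; as a rough illustration of the idea, the sketch below scores how a document's rank shifts between the ranking produced under the original instruction and the one produced under the modified instruction. It assumes 1-indexed ranks and that the scored documents are those whose relevance was revoked by the modified instruction, so moving them down the ranking is rewarded; this is a simplified stand-in, not necessarily the paper's exact definition.

```python
def pairwise_rank_shift(rank_original: int, rank_modified: int) -> float:
    """Signed score in (-1, 1): positive when the document drops in the
    ranking under the modified instruction (desired, since its relevance
    was revoked), negative when it rises, and 0 when its rank is unchanged.
    Illustrative only; see the paper for the exact p-MRR definition."""
    if rank_modified > rank_original:        # model demoted the document: good
        return 1.0 - rank_original / rank_modified
    if rank_modified < rank_original:        # model promoted it instead: bad
        return rank_modified / rank_original - 1.0
    return 0.0


def mean_pairwise_score(rank_pairs: list[tuple[int, int]]) -> float:
    """Average the per-document scores over all (rank_original, rank_modified)
    pairs for documents whose relevance changed across the query pair."""
    if not rank_pairs:
        return 0.0
    return sum(pairwise_rank_shift(a, b) for a, b in rank_pairs) / len(rank_pairs)


# Two documents correctly demoted, one incorrectly promoted.
print(mean_pairwise_score([(3, 10), (1, 4), (5, 2)]))  # ~0.28
```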

Findings and Implications

Evaluating a broad spectrum of IR models on FollowIR reveals a notable deficiency in current models' ability to incorporate and follow detailed instructions. The paper highlights particular difficulty with long-form instructions and with using instructions for more than simple keyword extraction. However, the research also identifies a path toward improving this capability: training on a dataset of real-world, complex instructions yields significant gains in instruction following. The new FollowIR-7B model, fine-tuned on this training set, shows improved performance both on traditional IR metrics and on the newly proposed p-MRR metric.
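
As a usage sketch, the snippet below shows how such an instruction-following reranker could be queried through Hugging Face Transformers to score a document given a query and its instruction. The model identifier and the prompt wording are assumptions for illustration; consult the released model card for the exact checkpoint name and template.

```python
# Hedged sketch: instruction-conditioned relevance scoring with a causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "jhu-clsp/FollowIR-7B"  # assumed identifier; verify on the model card

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def build_prompt(query: str, instruction: str, document: str) -> str:
    # Hypothetical prompt layout: query plus its narrative-style instruction,
    # then the candidate document and a true/false relevance question.
    return (
        f"Query: {query}\n"
        f"Instruction: {instruction}\n"
        f"Document: {document}\n"
        "Is the document relevant to the query given the instruction? Answer true or false:"
    )

def relevance_score(query: str, instruction: str, document: str) -> float:
    """Probability mass the model places on answering 'true', usable as a
    reranking score for the (query, instruction, document) triple."""
    inputs = tokenizer(
        build_prompt(query, instruction, document), return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    true_id = tokenizer(" true", add_special_tokens=False).input_ids[-1]
    false_id = tokenizer(" false", add_special_tokens=False).input_ids[-1]
    probs = torch.softmax(next_token_logits[[true_id, false_id]], dim=-1)
    return probs[0].item()
```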

Future Directions

The findings underscore the need for continued advances in integrating instruction-following capabilities into IR models. Given the foundational nature of the FollowIR dataset and evaluation framework, many research directions remain open: fine-tuning techniques targeted at instruction sensitivity, models designed from the ground up to interpret and adapt to complex instructions, and extensions of the FollowIR dataset to broader instruction and query domains. This line of research could also benefit from integrating insights from human-computer interaction studies to better understand how end users formulate instructions.

Conclusion

The work presented in the paper sheds light on a critical yet underexplored facet of information retrieval—the ability of IR models to follow user-provided instructions effectively. By introducing the FollowIR dataset and a specialized evaluation framework, the research provides valuable tools for advancing the state-of-the-art in instruction-sensitive IR models. The demonstration of tangible improvements through targeted training sets a precedent for future efforts aimed at making IR systems more adaptable and responsive to the intricate needs of users.

Authors (8)
  1. Orion Weller (30 papers)
  2. Benjamin Chang (3 papers)
  3. Sean MacAvaney (75 papers)
  4. Kyle Lo (73 papers)
  5. Arman Cohan (121 papers)
  6. Benjamin Van Durme (173 papers)
  7. Dawn Lawrie (30 papers)
  8. Luca Soldaini (62 papers)
Citations (15)