
FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions (2403.15246v3)

Published 22 Mar 2024 in cs.IR, cs.CL, and cs.LG

Abstract: Modern Language Models (LMs) are capable of following long and complex instructions that enable a large and diverse set of user requests. While Information Retrieval (IR) models use these LMs as the backbone of their architectures, virtually none of them allow users to provide detailed instructions alongside queries, thus limiting their ability to satisfy complex information needs. In this work, we study the use of instructions in IR systems. First, we introduce our dataset FollowIR, which contains a rigorous instruction evaluation benchmark as well as a training set for helping IR models learn to better follow real-world instructions. FollowIR repurposes detailed instructions -- also known as narratives -- developed for professional assessors to evaluate retrieval systems. In particular, we build our benchmark from three collections curated for shared tasks at the Text REtrieval Conference (TREC). These collections contain hundreds to thousands of labeled documents per query, making them suitable for our exploration. Through this process, we can measure how well IR models follow instructions through a new pairwise evaluation framework. Our results indicate that existing retrieval models fail to correctly use instructions, using them for basic keywords and struggling to understand long-form information. However, we show that it is possible for IR models to learn to follow complex instructions: our new FollowIR-7B model has significant improvements after fine-tuning on our training set.

Evaluating Instruction Following in Information Retrieval with the FollowIR Dataset

Introduction

Information Retrieval (IR) models, despite increasingly being built on LLMs, largely remain unable to adapt to user-specified instructions that refine queries. This gap between the capacity of LLMs to understand complex instructions and the limited use of that capacity in IR models marks a significant opportunity for semantic search. The FollowIR paper introduces a comprehensive dataset and benchmark that aim to bridge this gap by evaluating and enhancing the ability of IR models to follow nuanced user instructions. The FollowIR dataset leverages the robust foundation of TREC (Text REtrieval Conference) annotations, providing a rigorous mechanism for benchmarking instruction following in IR through an adapted evaluation framework.

Dataset Construction

The FollowIR dataset is carefully built from deeply judged TREC collections, covering a diverse range of queries and their corresponding human-annotator instructions. Construction involves re-annotating documents under slightly altered instructions to assess how sensitive IR models are to instruction changes. Focusing on the differential impact of instruction modifications on document relevance isolates the specific ability to follow instructions, yielding a rigorous benchmark. The dataset draws on three TREC collections, namely TREC News 2021, TREC Common Core 2017, and TREC Robust 2004, providing a rich foundation for evaluating IR models across varied domains and information needs.
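
To make the benchmark unit concrete, the sketch below shows one plausible way to represent a query, its original and altered instructions, and the two sets of relevance judgments in Python. The class and field names are illustrative assumptions, not the dataset's actual schema.

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class InstructionPair:
    """One hypothetical benchmark unit: a TREC query with its original
    narrative-style instruction, a slightly altered instruction, and the
    relevance judgments (qrels) under each version."""
    query_id: str
    query: str
    original_instruction: str   # narrative given to TREC assessors
    modified_instruction: str   # minimally altered so some documents change relevance
    qrels_original: dict[str, int] = field(default_factory=dict)  # doc_id -> label
    qrels_modified: dict[str, int] = field(default_factory=dict)  # doc_id -> re-annotated label

    def changed_documents(self) -> list[str]:
        """Documents whose label flips when the instruction changes;
        these are the ones that isolate instruction following."""
        return [
            doc_id
            for doc_id, label in self.qrels_original.items()
            if self.qrels_modified.get(doc_id, 0) != label
        ]
```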

Evaluation Framework

The evaluation framework introduced alongside the FollowIR dataset, termed p-MRR (pairwise Mean Reciprocal Rank), provides a specialized metric for assessing instruction following by comparing model performance across pairs of original and modified instructions for the same query. It is designed to reflect how document rankings change in response to instruction alterations, thereby directly measuring the capability of IR models to adapt to nuanced instruction changes. This paradigm keeps the assessment of instruction sensitivity distinct from traditional IR evaluation metrics.
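
The exact p-MRR formula is given in the paper; as a rough illustration of the idea, the sketch below scores how a document's rank shifts between the ranking produced under the original instruction and the one produced under the modified instruction. It assumes 1-indexed ranks and that the scored documents are those whose relevance was revoked by the modified instruction, so moving them down the ranking is rewarded; this is a simplified stand-in, not necessarily the paper's exact definition.

```python
def pairwise_rank_shift(rank_original: int, rank_modified: int) -> float:
    """Signed score in (-1, 1): positive when the document drops in the
    ranking under the modified instruction (desired, since its relevance
    was revoked), negative when it rises, and 0 when its rank is unchanged.
    Illustrative only; see the paper for the exact p-MRR definition."""
    if rank_modified > rank_original:        # model demoted the document: good
        return 1.0 - rank_original / rank_modified
    if rank_modified < rank_original:        # model promoted it instead: bad
        return rank_modified / rank_original - 1.0
    return 0.0


def mean_pairwise_score(rank_pairs: list[tuple[int, int]]) -> float:
    """Average the per-document scores over all (rank_original, rank_modified)
    pairs for documents whose relevance changed across the query pair."""
    if not rank_pairs:
        return 0.0
    return sum(pairwise_rank_shift(a, b) for a, b in rank_pairs) / len(rank_pairs)


# Two documents correctly demoted, one incorrectly promoted.
print(mean_pairwise_score([(3, 10), (1, 4), (5, 2)]))  # ~0.28
```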

Findings and Implications

Evaluating a broad spectrum of IR models on FollowIR reveals a notable deficiency in current models' ability to incorporate and follow detailed instructions. The paper highlights particular difficulty with long-form instructions and with using instructions for more than simple keyword extraction. However, the research also identifies a path toward improving this capability: training on a dataset of real-world, complex instructions yields significant gains in instruction following. The new FollowIR-7B model, fine-tuned on this training set, shows improved performance both on traditional IR metrics and on the newly proposed p-MRR metric.
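
As a usage sketch, the snippet below shows how such an instruction-following reranker could be queried through Hugging Face Transformers to score a document given a query and its instruction. The model identifier and the prompt wording are assumptions for illustration; consult the released model card for the exact checkpoint name and template.

```python
# Hedged sketch: instruction-conditioned relevance scoring with a causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "jhu-clsp/FollowIR-7B"  # assumed identifier; verify on the model card

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def build_prompt(query: str, instruction: str, document: str) -> str:
    # Hypothetical prompt layout: query plus its narrative-style instruction,
    # then the candidate document and a true/false relevance question.
    return (
        f"Query: {query}\n"
        f"Instruction: {instruction}\n"
        f"Document: {document}\n"
        "Is the document relevant to the query given the instruction? Answer true or false:"
    )

def relevance_score(query: str, instruction: str, document: str) -> float:
    """Probability mass the model places on answering 'true', usable as a
    reranking score for the (query, instruction, document) triple."""
    inputs = tokenizer(
        build_prompt(query, instruction, document), return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    true_id = tokenizer(" true", add_special_tokens=False).input_ids[-1]
    false_id = tokenizer(" false", add_special_tokens=False).input_ids[-1]
    probs = torch.softmax(next_token_logits[[true_id, false_id]], dim=-1)
    return probs[0].item()
```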

Future Directions

The findings underscore the need for continued advances in integrating instruction-following capabilities into IR models. Given the foundational nature of the FollowIR dataset and evaluation framework, many research directions remain open: fine-tuning techniques targeted at instruction sensitivity, models designed from the ground up to interpret and adapt to complex instructions, and extensions of the FollowIR dataset to broader instruction and query domains. This line of research could also benefit from integrating insights from human-computer interaction studies to better understand how end users formulate instructions.

Conclusion

The work presented in the paper sheds light on a critical yet underexplored facet of information retrieval—the ability of IR models to follow user-provided instructions effectively. By introducing the FollowIR dataset and a specialized evaluation framework, the research provides valuable tools for advancing the state-of-the-art in instruction-sensitive IR models. The demonstration of tangible improvements through targeted training sets a precedent for future efforts aimed at making IR systems more adaptable and responsive to the intricate needs of users.

Authors (8)
  1. Orion Weller (30 papers)
  2. Benjamin Chang (3 papers)
  3. Sean MacAvaney (75 papers)
  4. Kyle Lo (73 papers)
  5. Arman Cohan (121 papers)
  6. Benjamin Van Durme (173 papers)
  7. Dawn Lawrie (30 papers)
  8. Luca Soldaini (62 papers)
Citations (15)