Beyond Content Relevance: Evaluating Instruction Following in Retrieval Models (2410.23841v2)

Published 31 Oct 2024 in cs.IR

Abstract: Instruction-following capabilities in LLMs have progressed significantly, enabling more complex user interactions through detailed prompts. However, retrieval systems have not matched these advances; most still rely on traditional lexical and semantic matching techniques that fail to fully capture user intent. Recent efforts have introduced instruction-aware retrieval models, but these primarily focus on intrinsic content relevance, which neglects the importance of customized preferences for broader document-level attributes. This study evaluates the instruction-following capabilities of various retrieval models beyond content relevance, including LLM-based dense retrieval and reranking models. We develop InfoSearch, a novel retrieval evaluation benchmark spanning six document-level attributes: Audience, Keyword, Format, Language, Length, and Source, and introduce novel metrics -- Strict Instruction Compliance Ratio (SICR) and Weighted Instruction Sensitivity Evaluation (WISE) to accurately assess the models' responsiveness to instructions. Our findings indicate that although fine-tuning models on instruction-aware retrieval datasets and increasing model size enhance performance, most models still fall short of instruction compliance.

Summary

  • The paper introduces InfoSearch, a novel benchmark assessing retrieval models on instruction following beyond traditional content relevance.
  • It introduces two metrics, SICR and WISE, to measure strict instruction compliance and finer-grained sensitivity to instruction-driven changes in document rankings.
  • Experiments show that while reranking and large models improve instruction adherence, significant challenges remain for complex attributes.

Evaluating Instruction Following in Retrieval Models

The advancement of LLMs, particularly in instruction-following capabilities, has opened new avenues for user interactions with generative models. However, the progress in retrieval models has not kept pace with these capabilities, as they often rely on traditional methods focused primarily on content relevance. This paper, "Beyond Content Relevance: Evaluating Instruction Following in Retrieval Models," proposes a framework and evaluation metrics to assess the proficiency of retrieval models in following complex, customized instructions beyond content relevance.

Framework and Benchmark Introduction

The authors present InfoSearch, an evaluation benchmark designed to probe the instruction-following capabilities of retrieval models across six document-level attributes: Audience, Keyword, Format, Language, Length, and Source. These dimensions capture user preferences that extend beyond simple content matching. InfoSearch evaluates each query in both an instructed and a reversely instructed mode, testing a model's ability to comprehend instructions in both affirmative and negated forms, as sketched below.
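To make the two evaluation modes concrete, the sketch below pairs a single query with an affirmative and a negated instruction for the Audience attribute. The wording and the simple string-concatenation format are illustrative assumptions, not the benchmark's actual data layout.

```python
# Illustrative only: pairing a query with a document-level instruction in the
# "instructed" and "reversely instructed" modes described in the paper. The
# phrasing and concatenation format are assumptions for demonstration.
base_query = "recent advances in solar panel efficiency"

modes = {
    "original": base_query,
    "instructed": f"{base_query} The document should be written for a general audience.",
    "reversed": f"{base_query} The document should not be written for a general audience.",
}

for mode, query in modes.items():
    print(f"{mode}: {query}")
```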

New Metrics for Assessment

To evaluate these capabilities, the paper introduces two novel evaluation metrics: Strict Instruction Compliance Ratio (SICR) and Weighted Instruction Sensitivity Evaluation (WISE). SICR provides a strict criterion for instruction adherence by checking compliance across different retrieval modes. Meanwhile, WISE offers a more nuanced evaluation of the depth of instruction-following capability by accounting for changes in document rankings following instructions.
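The paper's exact formulas are not reproduced here, but the idea of checking compliance across retrieval modes can be sketched as follows. This is a minimal, assumed interpretation rather than the actual SICR or WISE definitions: a query counts as compliant only if documents satisfying the instructed condition rank higher under the affirmative instruction and lower under the negated one, measured with a simple binary-gain nDCG. The function names and data layout are hypothetical.

```python
import math

def ndcg_at_k(ranking, condition_docs, k=10):
    """Binary-gain nDCG@k over documents that satisfy the instructed condition."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(ranking[:k]) if doc in condition_docs)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(condition_docs))))
    return dcg / ideal if ideal else 0.0

def strictly_compliant(original, instructed, reversed_run, condition_docs, k=10):
    """A query is compliant only if the instruction improves the ranking of
    condition-satisfying documents and the negated instruction degrades it."""
    base = ndcg_at_k(original, condition_docs, k)
    return (ndcg_at_k(instructed, condition_docs, k) > base and
            ndcg_at_k(reversed_run, condition_docs, k) < base)

def compliance_ratio(runs, conditions, k=10):
    """Fraction of queries that pass the strict check.
    runs: query id -> {"original": [...], "instructed": [...], "reversed": [...]}
    conditions: query id -> set of docs satisfying the instructed condition."""
    flags = [strictly_compliant(r["original"], r["instructed"], r["reversed"],
                                conditions[qid], k)
             for qid, r in runs.items()]
    return sum(flags) / len(flags) if flags else 0.0
```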

Experimental Evaluation

The paper conducts extensive experiments across 15 retrieval models spanning both dense and reranking architectures, including dense retrievers such as E5-Mistral, NV-Embed-v2, and GritLM, as well as LLM-based rerankers such as GPT-4o. The findings show that while larger models and reranking techniques generally follow instructions better than traditional dense retrieval methods, substantial room for improvement remains, particularly for complex attributes such as Format and Audience.
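As a rough illustration of how a dense retriever can be probed for instruction sensitivity, the sketch below re-ranks two toy documents with and without an appended audience instruction. It assumes the sentence-transformers library and uses the small all-MiniLM-L6-v2 checkpoint purely as a placeholder; none of the paper's evaluated models, prompts, or data are used here.

```python
# Minimal sketch: compare a dense retriever's ranking with and without an
# instruction appended to the query. Model, documents, and instruction
# phrasing are placeholders, not the paper's experimental setup.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder

docs = [
    "A peer-reviewed survey of solar cell efficiency aimed at researchers.",
    "A beginner-friendly blog post explaining how solar panels work.",
]
doc_emb = model.encode(docs, normalize_embeddings=True)

for query in [
    "solar panel efficiency",
    "solar panel efficiency. The document should target a general audience.",
]:
    q_emb = model.encode(query, normalize_embeddings=True)
    scores = util.cos_sim(q_emb, doc_emb)[0]          # cosine similarity to each doc
    order = scores.argsort(descending=True).tolist()  # ranking by similarity
    print(query, "->", [docs[i][:40] for i in order])
```

A model that follows the instruction would be expected to promote the general-audience document in the second run; comparing the two orderings is the kind of mode-to-mode contrast the benchmark formalizes.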

Implications and Future Directions

The research underscores a significant gap between current retrieval models and the sophisticated instruction-following capabilities demanded in practice. The shift towards accommodating document-level features in retrieval models necessitates further research and potentially new training paradigms tailored to encompass these diverse attributes. Future developments in this context could involve richer pre-training datasets and hybrid architectures combining both retrieval and generative modeling techniques.

By highlighting the inadequacies and setting a new standard for evaluating retrieval systems, this paper takes a critical step toward aligning retrieval models more closely with the rich, nuanced requirements users demonstrate in querying contexts. As research progresses, we can expect more refined and instruction-sensitive retrieval capabilities that align with users' sophisticated expectations in diverse real-world scenarios.
