Beyond Content Relevance: Evaluating Instruction Following in Retrieval Models (2410.23841v2)
Abstract: Instruction-following capabilities in LLMs have progressed significantly, enabling more complex user interactions through detailed prompts. However, retrieval systems have not matched these advances; most still rely on traditional lexical and semantic matching techniques that fail to fully capture user intent. Recent efforts have introduced instruction-aware retrieval models, but these focus primarily on intrinsic content relevance and neglect customized preferences for broader document-level attributes. This study evaluates the instruction-following capabilities of various retrieval models beyond content relevance, including LLM-based dense retrieval and reranking models. We develop InfoSearch, a novel retrieval evaluation benchmark spanning six document-level attributes: Audience, Keyword, Format, Language, Length, and Source. We also introduce two novel metrics, the Strict Instruction Compliance Ratio (SICR) and the Weighted Instruction Sensitivity Evaluation (WISE), to accurately assess the models' responsiveness to instructions. Our findings indicate that although fine-tuning models on instruction-aware retrieval datasets and increasing model size improve performance, most models still fall short of full instruction compliance.
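To make the idea of evaluating "beyond content relevance" concrete, here is a minimal sketch of checking whether top-ranked documents satisfy a document-level attribute requested in the instruction (language, in this toy case). This is not the paper's SICR or WISE metric, whose definitions are not given in the abstract; the `Doc` fields, the `satisfies` predicate, and the compliance-at-k aggregation are all illustrative assumptions.

```python
# Hypothetical sketch: document-level instruction compliance for a retrieval system.
# NOT the paper's SICR/WISE metrics; just illustrates checking an attribute
# (here: language) rather than topical relevance.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Doc:
    text: str
    language: str  # a document-level attribute, e.g. "en" or "fr"


def compliance_at_k(
    ranked_lists: List[List[Doc]],          # one ranked list per query
    satisfies: Callable[[Doc], bool],       # attribute check derived from the instruction
    k: int = 10,
) -> float:
    """Fraction of queries whose entire top-k satisfies the instructed attribute."""
    if not ranked_lists:
        return 0.0
    hits = 0
    for docs in ranked_lists:
        top_k = docs[:k]
        if top_k and all(satisfies(d) for d in top_k):
            hits += 1
    return hits / len(ranked_lists)


# Example: the instruction asks for French documents only.
ranked = [
    [Doc("Bonjour...", "fr"), Doc("Hello...", "en")],
    [Doc("Salut...", "fr"), Doc("Merci...", "fr")],
]
print(compliance_at_k(ranked, lambda d: d.language == "fr", k=2))  # -> 0.5
```

A real evaluation along these lines would pair each instruction with attribute annotations on the corpus and compare rankings produced with and without the instruction, which is the kind of sensitivity the abstract's metrics are described as capturing.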