Beyond Content Relevance: Evaluating Instruction Following in Retrieval Models (2410.23841v2)

Published 31 Oct 2024 in cs.IR

Abstract: Instruction-following capabilities in LLMs have progressed significantly, enabling more complex user interactions through detailed prompts. However, retrieval systems have not matched these advances; most still rely on traditional lexical and semantic matching techniques that fail to fully capture user intent. Recent efforts have introduced instruction-aware retrieval models, but these primarily focus on intrinsic content relevance, which neglects the importance of customized preferences for broader document-level attributes. This study evaluates the instruction-following capabilities of various retrieval models beyond content relevance, including LLM-based dense retrieval and reranking models. We develop InfoSearch, a novel retrieval evaluation benchmark spanning six document-level attributes: Audience, Keyword, Format, Language, Length, and Source, and introduce novel metrics -- Strict Instruction Compliance Ratio (SICR) and Weighted Instruction Sensitivity Evaluation (WISE) to accurately assess the models' responsiveness to instructions. Our findings indicate that although fine-tuning models on instruction-aware retrieval datasets and increasing model size enhance performance, most models still fall short of instruction compliance.

Summary

  • The paper introduces InfoSearch, a novel benchmark assessing retrieval models on instruction following beyond traditional content relevance.
  • It introduces two metrics, SICR and WISE, to measure strict instruction compliance and finer-grained sensitivity to instruction-driven changes in document rankings.
  • Experiments show that while reranking and large models improve instruction adherence, significant challenges remain for complex attributes.

Evaluating Instruction Following in Retrieval Models

The advancement of LLMs, particularly in instruction-following capabilities, has opened new avenues for user interactions with generative models. However, the progress in retrieval models has not kept pace with these capabilities, as they often rely on traditional methods focused primarily on content relevance. This paper, "Beyond Content Relevance: Evaluating Instruction Following in Retrieval Models," proposes a framework and evaluation metrics to assess the proficiency of retrieval models in following complex, customized instructions beyond content relevance.

Framework and Benchmark Introduction

The authors present InfoSearch, an evaluation benchmark designed to probe the instruction-following capabilities of retrieval models across six document-level attributes: Audience, Keyword, Format, Language, Length, and Source. These dimensions capture user preferences that extend beyond simple content matching. InfoSearch evaluates each query in both an instructed and a reversely instructed mode, testing a model's ability to comprehend instructions in both affirmative and negated forms, as sketched below.
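To make the two evaluation modes concrete, the sketch below pairs a single query with an affirmative and a negated instruction for the Audience attribute. The wording and the simple string-concatenation format are illustrative assumptions, not the benchmark's actual data layout.

```python
# Illustrative only: pairing a query with a document-level instruction in the
# "instructed" and "reversely instructed" modes described in the paper. The
# phrasing and concatenation format are assumptions for demonstration.
base_query = "recent advances in solar panel efficiency"

modes = {
    "original": base_query,
    "instructed": f"{base_query} The document should be written for a general audience.",
    "reversed": f"{base_query} The document should not be written for a general audience.",
}

for mode, query in modes.items():
    print(f"{mode}: {query}")
```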

New Metrics for Assessment

To evaluate these capabilities, the paper introduces two novel evaluation metrics: Strict Instruction Compliance Ratio (SICR) and Weighted Instruction Sensitivity Evaluation (WISE). SICR provides a strict criterion for instruction adherence by checking compliance across different retrieval modes. Meanwhile, WISE offers a more nuanced evaluation of the depth of instruction-following capability by accounting for changes in document rankings following instructions.
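The paper's exact formulas are not reproduced here, but the idea of checking compliance across retrieval modes can be sketched as follows. This is a minimal, assumed interpretation rather than the actual SICR or WISE definitions: a query counts as compliant only if documents satisfying the instructed condition rank higher under the affirmative instruction and lower under the negated one, measured with a simple binary-gain nDCG. The function names and data layout are hypothetical.

```python
import math

def ndcg_at_k(ranking, condition_docs, k=10):
    """Binary-gain nDCG@k over documents that satisfy the instructed condition."""
    dcg = sum(1.0 / math.log2(i + 2)
              for i, doc in enumerate(ranking[:k]) if doc in condition_docs)
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(condition_docs))))
    return dcg / ideal if ideal else 0.0

def strictly_compliant(original, instructed, reversed_run, condition_docs, k=10):
    """A query is compliant only if the instruction improves the ranking of
    condition-satisfying documents and the negated instruction degrades it."""
    base = ndcg_at_k(original, condition_docs, k)
    return (ndcg_at_k(instructed, condition_docs, k) > base and
            ndcg_at_k(reversed_run, condition_docs, k) < base)

def compliance_ratio(runs, conditions, k=10):
    """Fraction of queries that pass the strict check.
    runs: query id -> {"original": [...], "instructed": [...], "reversed": [...]}
    conditions: query id -> set of docs satisfying the instructed condition."""
    flags = [strictly_compliant(r["original"], r["instructed"], r["reversed"],
                                conditions[qid], k)
             for qid, r in runs.items()]
    return sum(flags) / len(flags) if flags else 0.0
```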

Experimental Evaluation

The paper conducts extensive experiments across 15 retrieval models spanning both dense and reranking architectures, including dense retrievers such as E5-Mistral, NV-Embed-v2, and GritLM, as well as LLM-based rerankers such as GPT-4o. The findings show that while larger models and reranking techniques generally follow instructions better than traditional dense retrieval methods, substantial room for improvement remains, particularly for complex attributes such as Format and Audience.
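As a rough illustration of how a dense retriever can be probed for instruction sensitivity, the sketch below re-ranks two toy documents with and without an appended audience instruction. It assumes the sentence-transformers library and uses the small all-MiniLM-L6-v2 checkpoint purely as a placeholder; none of the paper's evaluated models, prompts, or data are used here.

```python
# Minimal sketch: compare a dense retriever's ranking with and without an
# instruction appended to the query. Model, documents, and instruction
# phrasing are placeholders, not the paper's experimental setup.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder encoder

docs = [
    "A peer-reviewed survey of solar cell efficiency aimed at researchers.",
    "A beginner-friendly blog post explaining how solar panels work.",
]
doc_emb = model.encode(docs, normalize_embeddings=True)

for query in [
    "solar panel efficiency",
    "solar panel efficiency. The document should target a general audience.",
]:
    q_emb = model.encode(query, normalize_embeddings=True)
    scores = util.cos_sim(q_emb, doc_emb)[0]          # cosine similarity to each doc
    order = scores.argsort(descending=True).tolist()  # ranking by similarity
    print(query, "->", [docs[i][:40] for i in order])
```

A model that follows the instruction would be expected to promote the general-audience document in the second run; comparing the two orderings is the kind of mode-to-mode contrast the benchmark formalizes.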

Implications and Future Directions

The research underscores a significant gap between current retrieval models and the sophisticated instruction-following capabilities demanded in practice. The shift towards accommodating document-level features in retrieval models necessitates further research and potentially new training paradigms tailored to encompass these diverse attributes. Future developments in this context could involve richer pre-training datasets and hybrid architectures combining both retrieval and generative modeling techniques.

By highlighting the inadequacies and setting a new standard for evaluating retrieval systems, this paper takes a critical step toward aligning retrieval models more closely with the rich, nuanced requirements users demonstrate in querying contexts. As research progresses, we can expect more refined and instruction-sensitive retrieval capabilities that align with users' sophisticated expectations in diverse real-world scenarios.
