Evaluation of Instruction-Tuned Retrieval Models with the MAIR Benchmark
The paper presents "MAIR: A Massive Benchmark for Evaluating Instructed Retrieval," a framework for assessing the capabilities of modern information retrieval (IR) models. The benchmark comprises 126 distinct IR tasks spanning six domains, designed to rigorously test how well instruction-tuned models generalize.
Motivation and Scope
Recent IR models are typically pre-trained and then instruction-tuned so that a single model can handle many retrieval tasks. However, existing benchmarks such as BEIR, KILT, and MTEB offer limited task diversity, making a holistic evaluation of these models difficult. MAIR addresses this gap with a more heterogeneous benchmark whose tasks are collected from established datasets and publicly shared IR challenges, including TREC tracks and LLM evaluation datasets.
Benchmark Composition
MAIR encompasses 126 tasks with 805 annotated instructions, over 10,000 queries, and roughly 4 million documents. The tasks span six domains (web, academic, code, medical, legal, and finance), each presenting its own retrieval challenges. This coverage supports a broad evaluation of IR models' ability to follow detailed instructions and generalize across domains.
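To make the per-task format concrete, the sketch below shows one plausible way to represent a single evaluation example: a query paired with a task-level instruction, candidate documents, and graded relevance labels. The field names and values are illustrative assumptions, not MAIR's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class RetrievalExample:
    """One MAIR-style evaluation example; field names are illustrative assumptions."""
    task_id: str                 # e.g. a TREC track or a code-search task
    domain: str                  # one of: web, academic, code, medical, legal, finance
    instruction: str             # natural-language description of what should be retrieved
    query: str                   # the query text for this example
    candidates: Dict[str, str] = field(default_factory=dict)  # doc_id -> document text
    qrels: Dict[str, int] = field(default_factory=dict)       # doc_id -> graded relevance label


example = RetrievalExample(
    task_id="hypothetical-legal-search",
    domain="legal",
    instruction="Retrieve statutes that help answer the legal question.",
    query="What is the limitation period for breach of contract claims?",
    candidates={"d1": "Contract claims must be brought within six years ...",
                "d2": "Unrelated filing guidance ..."},
    qrels={"d1": 2, "d2": 0},
)
```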
Experimental Setup and Model Evaluation
The experiments conducted with MAIR involve various retrieval models:
- Sparse retrieval models like BM25.
- Embedding models, including single-task and multi-task variants.
- Instruction-tuned embedding models, exemplified by e5-mistral-7b-instruct (see the encoding sketch after this list).
- Advanced re-ranking models and LLM-based re-rankers.
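As a sketch of how the with-instruction and without-instruction conditions differ for an embedding model, the snippet below prepends the task instruction to the query before encoding and scores documents by cosine similarity. The model checkpoint, prompt template, and example texts are assumptions for illustration; instruction-tuned models such as e5-mistral-7b-instruct each define their own prompt format, and this is not MAIR's exact evaluation pipeline.

```python
from sentence_transformers import SentenceTransformer, util

# Any bi-encoder works for this sketch; the chosen checkpoint is just a small public model.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

instruction = "Retrieve statutes that help answer the legal question."
query = "What is the limitation period for breach of contract claims?"
docs = [
    "Contract claims must be brought within six years of the breach ...",
    "A guide to filing travel expense reports ...",
]

doc_emb = model.encode(docs, convert_to_tensor=True)

# Without instruction: encode the raw query.
q_plain = model.encode(query, convert_to_tensor=True)
# With instruction: a simple concatenation template (an assumption, not a fixed standard).
q_inst = model.encode(f"Instruct: {instruction}\nQuery: {query}", convert_to_tensor=True)

print("scores without instruction:", util.cos_sim(q_plain, doc_emb))
print("scores with instruction:   ", util.cos_sim(q_inst, doc_emb))
```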
Evaluations use the nDCG@10 metric, and each model is assessed both with and without instruction inputs. Results show that instruction-tuned models, particularly GritLM-7B, perform better when instructions are included, indicating an improved ability to understand and follow instructions.
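For reference, nDCG@10 discounts each document's graded relevance by its rank position and normalizes by the score of the ideal ranking. The minimal implementation below uses the linear-gain form of the standard formula; it is an illustration, not the benchmark's evaluation code.

```python
import math


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for one query: `ranked_doc_ids` is the system ranking, `qrels` maps doc_id -> graded relevance."""
    def dcg(gains):
        # Positions are 1-based in the formula; enumerate is 0-based, hence rank + 2.
        return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    return dcg(gains) / dcg(ideal) if any(ideal) else 0.0


# Example: the most relevant document (grade 2) is ranked first, a grade-1 document third.
print(round(ndcg_at_k(["d1", "d2", "d3"], {"d1": 2, "d3": 1}), 4))  # ~0.9502
```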
Numerical Results and Observations
Instruction-tuned models consistently outperform their non-instruction-tuned counterparts across most tasks, with gains in instruction adherence. They show a measurable increase in nDCG@10 when instructions are supplied; for instance, GritLM-7B's average nDCG@10 rises to 55.20 with instruction inputs, underscoring the efficacy of instruction tuning.
Comparative Analysis
A comparison with MTEB shows broadly similar performance trends, but MAIR additionally challenges models with tasks that demand broader generalization. Single-task models perform noticeably worse on MAIR, reaffirming the benchmark's demanding nature and the need for versatile IR capabilities.
Theoretical and Practical Implications
MAIR serves a dual purpose: it evaluates current model performance and illuminates areas needing further development, particularly in interpreting complex instructions. The consistency of instruction-tuned models across tasks suggests promising directions for future research, including enhancing training data diversity and refining fine-tuning strategies.
Conclusion and Future Directions
MAIR provides a robust benchmark for evaluating instruction-tuned IR models and yields insights that can inform future development. While it represents a significant step forward, open directions include extending the benchmark to multilingual settings and examining how sensitive models are to the phrasing of instructions. Future work along these lines would continue to refine the evaluation landscape for retrieval models.
By offering a diverse and comprehensive evaluation platform, MAIR raises the bar for examining IR models' generalization and instruction following, making it a valuable resource for researchers in the field.