Evaluation of Instruction-Tuned Retrieval Models with the MAIR Benchmark
The paper presents "MAIR: A Massive Benchmark for Evaluating Instructed Retrieval," a framework for assessing the capabilities of modern information retrieval (IR) models. The benchmark comprises 126 distinct IR tasks spanning six domains, designed to rigorously test how well instruction-tuned models generalize.
Motivation and Scope
Recent IR models are typically pre-trained and then instruction-tuned so that a single model can handle many retrieval tasks. However, existing benchmarks such as BEIR, KILT, and MTEB offer limited task diversity, making a holistic evaluation of these models difficult. MAIR addresses this gap with a more heterogeneous benchmark whose tasks are collected from established datasets and publicly shared IR challenges, including TREC tracks and LLM evaluation datasets.
Benchmark Composition
MAIR encompasses 126 tasks with 805 annotated instructions, over 10,000 queries, and roughly 4 million documents. The tasks span six domains (web, academic, code, medical, legal, and finance), each presenting its own retrieval challenges. This coverage supports a broad evaluation of IR models' ability to follow detailed instructions and generalize across domains.
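To make the per-task format concrete, the sketch below shows one plausible way to represent a single evaluation example: a query paired with a task-level instruction, candidate documents, and graded relevance labels. The field names and values are illustrative assumptions, not MAIR's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class RetrievalExample:
    """One MAIR-style evaluation example; field names are illustrative assumptions."""
    task_id: str                 # e.g. a TREC track or a code-search task
    domain: str                  # one of: web, academic, code, medical, legal, finance
    instruction: str             # natural-language description of what should be retrieved
    query: str                   # the query text for this example
    candidates: Dict[str, str] = field(default_factory=dict)  # doc_id -> document text
    qrels: Dict[str, int] = field(default_factory=dict)       # doc_id -> graded relevance label


example = RetrievalExample(
    task_id="hypothetical-legal-search",
    domain="legal",
    instruction="Retrieve statutes that help answer the legal question.",
    query="What is the limitation period for breach of contract claims?",
    candidates={"d1": "Contract claims must be brought within six years ...",
                "d2": "Unrelated filing guidance ..."},
    qrels={"d1": 2, "d2": 0},
)
```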
Experimental Setup and Model Evaluation
The experiments conducted with MAIR involve various retrieval models:
- Sparse retrieval models like BM25.
- Embedding models, including single-task and multi-task variants.
- Instruction-tuned embedding models, exemplified by e5-mistral-7b-instruct (see the encoding sketch after this list).
- Advanced re-ranking models and LLM-based re-rankers.
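As a sketch of how the with-instruction and without-instruction conditions differ for an embedding model, the snippet below prepends the task instruction to the query before encoding and scores documents by cosine similarity. The model checkpoint, prompt template, and example texts are assumptions for illustration; instruction-tuned models such as e5-mistral-7b-instruct each define their own prompt format, and this is not MAIR's exact evaluation pipeline.

```python
from sentence_transformers import SentenceTransformer, util

# Any bi-encoder works for this sketch; the chosen checkpoint is just a small public model.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

instruction = "Retrieve statutes that help answer the legal question."
query = "What is the limitation period for breach of contract claims?"
docs = [
    "Contract claims must be brought within six years of the breach ...",
    "A guide to filing travel expense reports ...",
]

doc_emb = model.encode(docs, convert_to_tensor=True)

# Without instruction: encode the raw query.
q_plain = model.encode(query, convert_to_tensor=True)
# With instruction: a simple concatenation template (an assumption, not a fixed standard).
q_inst = model.encode(f"Instruct: {instruction}\nQuery: {query}", convert_to_tensor=True)

print("scores without instruction:", util.cos_sim(q_plain, doc_emb))
print("scores with instruction:   ", util.cos_sim(q_inst, doc_emb))
```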
Evaluations use the nDCG@10 metric, and each model is assessed both with and without instruction inputs. Results show that instruction-tuned models, particularly GritLM-7B, perform better when instructions are included, indicating an improved ability to understand and follow instructions.
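For reference, nDCG@10 discounts each document's graded relevance by its rank position and normalizes by the score of the ideal ranking. The minimal implementation below uses the linear-gain form of the standard formula; it is an illustration, not the benchmark's evaluation code.

```python
import math


def ndcg_at_k(ranked_doc_ids, qrels, k=10):
    """nDCG@k for one query: `ranked_doc_ids` is the system ranking, `qrels` maps doc_id -> graded relevance."""
    def dcg(gains):
        # Positions are 1-based in the formula; enumerate is 0-based, hence rank + 2.
        return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains))

    gains = [qrels.get(doc_id, 0) for doc_id in ranked_doc_ids[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    return dcg(gains) / dcg(ideal) if any(ideal) else 0.0


# Example: the most relevant document (grade 2) is ranked first, a grade-1 document third.
print(round(ndcg_at_k(["d1", "d2", "d3"], {"d1": 2, "d3": 1}), 4))  # ~0.9502
```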
Numerical Results and Observations
Instruction-tuned models consistently outperform their non-instruction-tuned counterparts across most tasks, with gains in instruction adherence. They show a measurable increase in nDCG@10 when instructions are supplied; for instance, GritLM-7B's average nDCG@10 rises to 55.20 with instruction inputs, underscoring the efficacy of instruction tuning.
Comparative Analysis
A comparison with MTEB shows broadly similar performance trends, but MAIR additionally challenges models with tasks that demand broader generalization. Single-task models perform noticeably worse on MAIR, reaffirming the benchmark's demanding nature and the need for versatile IR capabilities.
Theoretical and Practical Implications
MAIR serves a dual purpose: it evaluates current model performance and illuminates areas needing further development, particularly in interpreting complex instructions. The consistency of instruction-tuned models across tasks suggests promising directions for future research, including enhancing training data diversity and refining fine-tuning strategies.
Conclusion and Future Directions
MAIR provides a robust benchmark for evaluating instruction-tuned IR models and yields insights that can inform future development. While it represents a significant step forward, open directions include extending the benchmark to multilingual settings and examining how sensitive models are to the phrasing of instructions. Future work along these lines would continue to refine the evaluation landscape for retrieval models.
By offering a diverse and comprehensive evaluation platform, MAIR raises the bar for examining IR models' generalization and instruction following, making it a valuable resource for researchers in the field.