INSTRUCTIR: A Benchmark for Instruction Following of Information Retrieval Models (2402.14334v1)

Published 22 Feb 2024 in cs.CL

Abstract: Despite the critical need to align search targets with users' intentions, retrievers often only prioritize query information without delving into the users' intended search context. Enhancing the capability of retrievers to understand the intentions and preferences of users, akin to LLM instructions, has the potential to yield more aligned search targets. Prior studies restrict the application of instructions in information retrieval to a task description format, neglecting the broader context of diverse and evolving search scenarios. Furthermore, the prevailing benchmarks utilized for evaluation lack explicit tailoring to assess instruction-following ability, thereby hindering progress in this field. In response to these limitations, we propose a novel benchmark, INSTRUCTIR, specifically designed to evaluate instruction-following ability in information retrieval tasks. Our approach focuses on user-aligned instructions tailored to each query instance, reflecting the diverse characteristics inherent in real-world search scenarios. Through experimental analysis, we observe that retrievers fine-tuned to follow task-style instructions, such as INSTRUCTOR, can underperform compared to their non-instruction-tuned counterparts. This underscores potential overfitting issues inherent in constructing retrievers trained on existing instruction-aware retrieval datasets.
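
The benchmark's core setup is to pair each query with a user-aligned instruction and test whether the retriever's ranking changes accordingly. The sketch below illustrates that idea in miniature with an off-the-shelf dense retriever: encode the query with and without its instruction and compare which document each variant ranks first. This is a minimal sketch, not the authors' evaluation code; the sentence-transformers dependency, the model name, the toy corpus, and the rank helper are illustrative assumptions.

```python
# Minimal sketch (not the INSTRUCTIR evaluation code): check whether a dense
# retriever's ranking changes when a per-query instruction is supplied.
# Assumes the sentence-transformers package; the model name, toy corpus,
# query, and instruction below are illustrative placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

query = "best laptop for programming"
instruction = "I am a student on a tight budget looking for a refurbished machine."

corpus = [
    "Top-rated refurbished budget laptops for computer science students.",        # aligned with the instruction
    "This year's premium flagship laptops reviewed for professional developers.",  # relevant to the query alone
]

def rank(text: str) -> list[int]:
    """Return corpus indices ordered by cosine similarity to `text`."""
    q_emb = model.encode(text, normalize_embeddings=True)
    c_emb = model.encode(corpus, normalize_embeddings=True)
    sims = util.cos_sim(q_emb, c_emb)[0]
    return sims.argsort(descending=True).tolist()

print("query only:         ", rank(query))
print("instruction + query:", rank(f"{instruction} {query}"))
# A retriever that follows instructions should promote document 0 once the
# instruction is included; one that ignores instructions ranks both the same.
```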

References (29)
  1. Task-aware retrieval with instructions. In Findings of the ACL.
  2. Jacob Cohen. 1960. A coefficient of agreement for nominal scales. Educational and Psychological Measurement.
  3. AlpacaFarm: A simulation framework for methods that learn from human feedback. arXiv.
  4. LMentry: A language model benchmark of elementary language tasks. arXiv.
  5. Unsupervised dense information retrieval with contrastive learning. Trans. Mach. Learn. Res.
  6. FollowBench: A multi-level fine-grained constraints following benchmark for large language models. arXiv.
  7. J. Richard Landis and Gary G. Koch. 1977. The measurement of observer agreement for categorical data. Biometrics.
  8. Holistic evaluation of language models. arXiv.
  9. Query rewriting for retrieval-augmented large language models. arXiv.
  10. Fine-tuning LLaMA for multi-stage text retrieval. arXiv.
  11. MTEB: Massive text embedding benchmark. In EACL.
  12. MS MARCO: A human-generated machine reading comprehension dataset.
  13. Large dual encoders are generalizable retrievers. In EMNLP.
  14. KTRL+F: Knowledge-augmented in-document search. arXiv.
  15. OpenAI. 2023. GPT-4 technical report.
  16. Training language models to follow instructions with human feedback. arXiv.
  17. Justus J. Randolph. 2005. Free-marginal multirater kappa (multirater k[free]): An alternative to Fleiss' fixed-marginal multirater kappa.
  18. The probabilistic relevance framework: BM25 and beyond. Foundations and Trends® in Information Retrieval, 3(4):333–389.
  19. ColBERTv2: Effective and efficient retrieval via lightweight late interaction. In NAACL.
  20. One embedder, any task: Instruction-finetuned text embeddings. arXiv.
  21. BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. arXiv.
  22. Improving text embeddings with large language models. arXiv.
  23. Self-Instruct: Aligning language models with self-generated instructions. arXiv.
  24. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. arXiv.
  25. UniIR: Training and benchmarking universal multimodal information retrievers. arXiv.
  26. FLASK: Fine-grained language model evaluation based on alignment skill sets. arXiv.
  27. Instruction tuning for large language models: A survey. arXiv.
  28. RoMQA: A benchmark for robust, multi-evidence, multi-answer question answering. arXiv.
  29. Instruction-following evaluation for large language models. arXiv.
Authors (7)
  1. Hanseok Oh (8 papers)
  2. Hyunji Lee (19 papers)
  3. Seonghyeon Ye (25 papers)
  4. Haebin Shin (6 papers)
  5. Hansol Jang (5 papers)
  6. Changwook Jun (4 papers)
  7. Minjoon Seo (82 papers)
Citations (13)