
INTERS: Unlocking the Power of Large Language Models in Search with Instruction Tuning (2401.06532v3)

Published 12 Jan 2024 in cs.CL and cs.IR

Abstract: LLMs have demonstrated impressive capabilities in various natural language processing tasks. Despite this, their application to information retrieval (IR) tasks is still challenging due to the infrequent occurrence of many IR-specific concepts in natural language. While prompt-based methods can provide task descriptions to LLMs, they often fall short in facilitating a comprehensive understanding and execution of IR tasks, thereby limiting LLMs' applicability. To address this gap, in this work, we explore the potential of instruction tuning to enhance LLMs' proficiency in IR tasks. We introduce a novel instruction tuning dataset, INTERS, encompassing 20 tasks across three fundamental IR categories: query understanding, document understanding, and query-document relationship understanding. The data are derived from 43 distinct datasets with manually written templates. Our empirical results reveal that INTERS significantly boosts the performance of various publicly available LLMs, such as LLaMA, Mistral, and Phi, in IR tasks. Furthermore, we conduct extensive experiments to analyze the effects of instruction design, template diversity, few-shot demonstrations, and the volume of instructions on performance. We make our dataset and the fine-tuned models publicly accessible at https://github.com/DaoD/INTERS.

Introduction to INTERS

In NLP, integrating LLMs into information retrieval (IR) has opened promising directions. Yet while LLMs perform well across many tasks, their performance on IR-specific tasks has been inconsistent, sometimes falling behind much smaller specialized models. The paper attributes this gap to the complexity of IR-specific concepts and their rarity in natural-language training data, which makes them difficult for LLMs to grasp. Instruction tuning is emerging as a way to overcome these challenges and strengthen the IR capabilities of LLMs.

Enhancing LLMs' IR Performance

To address these challenges, the authors created a novel dataset called INTERS, short for INstruction Tuning datasEt foR Search, designed to refine the search abilities of LLMs. The dataset covers 20 tasks derived from 43 distinct datasets, spanning query understanding, document understanding, and query-document relationship understanding. The ultimate goal of INTERS is to give LLMs a foundation for being instruction-tuned specifically for search-related tasks, thus unlocking their potential in this domain.
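To make the construction concrete, the sketch below assembles a single instruction-tuning example from a manually written natural-language template, in the spirit of INTERS's template-based pipeline. The template text, field names, and the query-rewriting example are illustrative assumptions, not taken from the released dataset.

```python
# Sketch of building one instruction-tuning example from a template.
# The template wording and field names below are hypothetical.

def render_example(template: str, fields: dict) -> str:
    """Fill a natural-language task template with instance-specific fields."""
    return template.format(**fields)

# A hypothetical template for a query-rewriting task
# (part of the "query understanding" category in INTERS).
QUERY_REWRITE_TEMPLATE = (
    "Task: rewrite the user's search query so it is self-contained.\n"
    "Conversation context: {context}\n"
    "Current query: {query}\n"
    "Rewritten query:"
)

prompt = render_example(
    QUERY_REWRITE_TEMPLATE,
    {
        "context": "The user asked about the Eiffel Tower's height.",
        "query": "when was it built",
    },
)
completion = "When was the Eiffel Tower built?"  # target output for tuning
print(prompt)
```

During fine-tuning, each (prompt, completion) pair becomes one training instance; writing several templates per task, as INTERS does, varies the surface wording the model sees for the same underlying task.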

Empirical Results and Dataset Accessibility

INTERS not only enables LLMs to perform search tasks more effectively but also helps them generalize to tasks they have not been directly trained on. Publicly available LLMs such as LLaMA, Mistral, and Phi show significant performance gains when fine-tuned on INTERS. To support transparency and further research, the INTERS dataset and the fine-tuned models are publicly released, enabling replication and extension by the research community.

Deep Dive into Experimentation

Through rigorous experiments, the researchers analyzed several factors: the design of instructions, the diversity of templates, the use of few-shot demonstrations, and the volume of training data. They found that detailed task descriptions and diverse instructional data are vital for effective instruction tuning. Few-shot demonstrations, in which the model is given a handful of worked examples of a task, proved especially helpful for adapting to new tasks. Together, these experiments deepen the understanding of how to optimize LLMs for IR tasks.
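The few-shot setup described above can be sketched as prepending worked input/output pairs to a zero-shot instruction before the test input. The "Input:"/"Output:" formatting and the query-intent example here are assumptions for illustration; the paper's exact demonstration format may differ.

```python
# Sketch of turning a zero-shot instruction into a few-shot prompt.
# The demonstration formatting below is a hypothetical convention.

def build_few_shot_prompt(instruction: str,
                          demos: list[tuple[str, str]],
                          test_input: str) -> str:
    """Prepend (input, output) demonstrations to an instruction."""
    parts = [instruction]
    for demo_in, demo_out in demos:
        parts.append(f"Input: {demo_in}\nOutput: {demo_out}")
    # The test instance ends with an empty "Output:" for the model to fill.
    parts.append(f"Input: {test_input}\nOutput:")
    return "\n\n".join(parts)

demos = [
    ("query: cheap flights to nyc", "intent: transactional"),
    ("query: who invented the radio", "intent: informational"),
]
prompt = build_few_shot_prompt(
    "Classify the intent of each search query.",
    demos,
    "query: facebook login",
)
print(prompt)
```

Varying the number of pairs in `demos` is the knob the few-shot experiments turn: the model sees the same instruction but with more or fewer worked examples to anchor the expected output format.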

In conclusion, INTERS is a robust, specialized instruction-tuning dataset distinguished by its comprehensive design tailored to search tasks. It improves performance across a wide range of search-related tasks and clarifies the factors that influence model optimization for IR. The release of INTERS and the resulting models should prove invaluable to researchers and practitioners pushing the boundaries of LLM applications in search.

Authors (8)
  1. Yutao Zhu
  2. Peitian Zhang
  3. Chenghao Zhang
  4. Yifei Chen
  5. Binyu Xie
  6. Zhicheng Dou
  7. Zheng Liu
  8. Ji-Rong Wen