
RAR-b: Reasoning as Retrieval Benchmark (2404.06347v2)

Published 9 Apr 2024 in cs.CL and cs.IR

Abstract: Semantic textual similarity (STS) and information retrieval (IR) tasks have been the two major avenues for recording the progress of embedding models in the past few years. Under the emerging Retrieval-Augmented Generation (RAG) paradigm, we envision the need to evaluate next-level language understanding abilities of embedding models, and take a conscious look at the reasoning abilities stored in them. Addressing this, we pose the question: Can retrievers solve reasoning problems? By transforming reasoning tasks into retrieval tasks, we find that, without being specifically trained for reasoning-level language understanding, current state-of-the-art retriever models may still be far from competent at assisting LLMs, especially in reasoning-intensive tasks. Moreover, although trained to be aware of instructions, instruction-aware IR models are often better off without instructions at inference time for reasoning tasks, posing an overlooked retriever-LLM behavioral gap for the research community to align. However, recent decoder-based embedding models show great promise in narrowing the gap, highlighting the pathway for embedding models to achieve reasoning-level language understanding. We also show that, although current off-the-shelf re-ranker models fail on these tasks, injecting reasoning abilities into them through fine-tuning still appears easier than doing so for bi-encoders, and we are able to achieve state-of-the-art performance across all tasks by fine-tuning a reranking model. We release the Reasoning as Retrieval Benchmark (RAR-b), a holistic suite of tasks and settings to evaluate the reasoning abilities stored in retriever models. RAR-b is available at https://github.com/gowitheflow-1998/RAR-b.
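The core idea of "reasoning as retrieval" is to recast a reasoning item (e.g., a multiple-choice commonsense question) as a query over a candidate pool and check whether a retriever ranks the gold answer first. Below is a minimal sketch of that setup, not the official RAR-b harness (which lives in the repository above); the model name and the example item are illustrative assumptions, and any off-the-shelf bi-encoder could be substituted.

```python
# Minimal sketch (not the official RAR-b evaluation code): casting a
# multiple-choice reasoning question as a retrieval problem and scoring it
# with an off-the-shelf bi-encoder via cosine similarity.
from sentence_transformers import SentenceTransformer, util

# Illustrative model choice; any dense bi-encoder could be used here.
model = SentenceTransformer("all-MiniLM-L6-v2")

# A commonsense-reasoning item rewritten as query + candidate pool:
# the gold answer should be retrieved ahead of the distractors.
query = "To keep a laptop from overheating, you should"
candidates = [
    "place it on a hard, flat surface so the vents stay clear",  # gold answer
    "wrap it in a thick blanket while it is running",
    "cover the fan vents with tape",
]

# Embed the query and candidates, then rank candidates by cosine similarity.
query_emb = model.encode(query, convert_to_tensor=True)
cand_embs = model.encode(candidates, convert_to_tensor=True)
scores = util.cos_sim(query_emb, cand_embs)[0]

# Top-1 accuracy over a dataset of such items is one way to read off
# "reasoning as retrieval" performance for a given embedding model.
ranked = sorted(zip(candidates, scores.tolist()), key=lambda x: -x[1])
for text, score in ranked:
    print(f"{score:.3f}  {text}")
```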

Authors (3)
  1. Chenghao Xiao (21 papers)
  2. G Thomas Hudson (8 papers)
  3. Noura Al Moubayed (40 papers)
Citations (4)

HackerNews

  1. RAR-B: Reasoning as Retrieval Benchmark (11 points, 0 comments)