
RankMamba: Benchmarking Mamba's Document Ranking Performance in the Era of Transformers (2403.18276v2)

Published 27 Mar 2024 in cs.IR and cs.CL

Abstract: Transformer structure has achieved great success in multiple applied machine learning communities, such as NLP, computer vision (CV) and information retrieval (IR). Transformer architecture's core mechanism -- attention requires $O(n^2)$ time complexity in training and $O(n)$ time complexity in inference. Many works have been proposed to improve the attention mechanism's scalability, such as Flash Attention and Multi-query Attention. A different line of work aims to design new mechanisms to replace attention. Recently, a notable model structure -- Mamba, which is based on state space models, has achieved transformer-equivalent performance in multiple sequence modeling tasks. In this work, we examine Mamba's efficacy through the lens of a classical IR task -- document ranking. A reranker model takes a query and a document as input, and predicts a scalar relevance score. This task demands the LLM's ability to comprehend lengthy contextual inputs and to capture the interaction between query and document tokens. We find that (1) Mamba models achieve competitive performance compared to transformer-based models with the same training recipe; (2) but also have a lower training throughput in comparison to efficient transformer implementations such as flash attention. We hope this study can serve as a starting point to explore Mamba models in other classical IR tasks. Our code implementation and trained checkpoints are made public to facilitate reproducibility (https://github.com/zhichaoxu-shufe/RankMamba).

Benchmarking Mamba's Document Ranking Performance Against Transformers

Comparative Evaluation of Document Ranking Models

In information retrieval (IR), transformer-based LLMs have significantly reshaped how natural language data is understood and processed. The paper by Zhichao Xu evaluates a recent model architecture, Mamba, on the classical IR task of document ranking. The results offer a nuanced view of how Mamba compares to transformer models in both effectiveness and efficiency.

Background and Model Overview

Transformer architectures have driven advances across machine learning applications, notably through their capacity to capture long-range dependencies within sequences. Despite this success, the quadratic computational complexity of the attention mechanism has prompted efforts to devise more scalable alternatives. A noteworthy development in this line of work is the Mamba model, which is built on selective state space models (SSMs) and aims for transformer-equivalent modeling quality with linear-time sequence processing.
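
For background, the recurrence below is the standard discretized state space formulation used in the S4/Mamba line of work; it is included here for orientation and follows that literature's notation rather than anything specific to the paper under review:

$$ h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t, \qquad y_t = C\,h_t, \qquad \bar{A} = \exp(\Delta A), \quad \bar{B} = (\Delta A)^{-1}\bigl(\exp(\Delta A) - I\bigr)\,\Delta B $$

Because the recurrence is linear in the hidden state, it can be evaluated with a parallel scan in $O(n)$ time for a length-$n$ sequence, in contrast to the $O(n^2)$ cost of full attention; Mamba additionally makes $\Delta$, $B$, and $C$ functions of the input, which is the "selective" part.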

Research Questions and Methodology

The core question of the paper is whether Mamba models can match or exceed transformer-based models on document ranking. The investigation benchmarks Mamba against a range of transformer-based models spanning encoder-only, decoder-only, and encoder-decoder architectures at different scales, with varying pre-training objectives and attention implementations, all trained with the same established recipe. Document ranking requires the model to read a query and a (potentially long) document jointly and output a scalar relevance score, so it tests both long-context comprehension and the modeling of query-document interactions.
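
To make the reranking setup concrete, the sketch below scores a query-document pair with a cross-encoder style model and a scalar regression head. It is an illustrative sketch, not the paper's training recipe; the backbone name, truncation length, and example strings are placeholder assumptions.

```python
# Minimal cross-encoder reranking sketch (illustrative; not the paper's exact recipe).
# Assumes the `transformers` and `torch` packages are installed.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "roberta-large"  # placeholder backbone; the paper compares several architectures
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels=1 turns the classification head into a scalar relevance scorer
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
model.eval()

def relevance_score(query: str, document: str) -> float:
    """Encode the (query, document) pair jointly and return a scalar relevance score."""
    inputs = tokenizer(query, document, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, 1)
    return logits.squeeze().item()

# Rerank candidate documents for one query by descending score.
query = "what is a selective state space model"
candidates = [
    "Mamba is built on selective state space models ...",
    "BM25 is a classical lexical ranking function ...",
]
ranked = sorted(candidates, key=lambda d: relevance_score(query, d), reverse=True)
```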

Key Findings

The empirical analysis revealed several critical findings:

  • Encoder-only transformer models performed strongly on document ranking, with roberta-large notably outperforming its counterparts on the MRR metric on the MS MARCO Dev set (see the MRR sketch after this list).
  • Mamba models achieved competitive performance, in some cases matching or surpassing transformer-based models trained with the same recipe, which underscores Mamba's potential for complex IR tasks.
  • However, Mamba models showed lower training throughput than transformer implementations that use efficient attention kernels such as Flash Attention.
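
To clarify the metric referenced in the first finding, the sketch below computes mean reciprocal rank at a cutoff from ranked document lists; the cutoff value and data layout are illustrative assumptions rather than the paper's evaluation code.

```python
def mean_reciprocal_rank(ranked_doc_ids, relevant_doc_ids, cutoff=10):
    """MRR@cutoff averaged over queries.

    ranked_doc_ids: dict mapping query_id -> list of doc_ids ordered by descending score
    relevant_doc_ids: dict mapping query_id -> set of relevant doc_ids
    """
    total, n_queries = 0.0, 0
    for qid, ranking in ranked_doc_ids.items():
        n_queries += 1
        for rank, doc_id in enumerate(ranking[:cutoff], start=1):
            if doc_id in relevant_doc_ids.get(qid, set()):
                total += 1.0 / rank  # credit only the first relevant document
                break
    return total / max(n_queries, 1)

# Example: relevant document at rank 2 for q1 and rank 1 for q2 -> MRR = (1/2 + 1) / 2 = 0.75
mrr = mean_reciprocal_rank(
    {"q1": ["d3", "d7", "d1"], "q2": ["d5", "d2"]},
    {"q1": {"d7"}, "q2": {"d5"}},
)
```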

Implications and Future Directions

The findings position Mamba as a viable alternative to transformer-based models for document ranking and suggest broader applicability across classical IR tasks. Nonetheless, the lower training throughput relative to efficient transformer implementations marks a clear target for future optimization: it does not diminish Mamba's effectiveness results, but it does indicate that the implementation needs further engineering to realize the efficiency advantages promised by its linear-time design.
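
Throughput comparisons of this kind are typically reported as tokens processed per second over training steps. The sketch below is a generic timing harness under that assumption; the optimizer, step counts, and the expectation that the batch includes labels (so the model returns a loss) are placeholder choices, not the paper's benchmarking code.

```python
import time
import torch

def training_throughput(model, batch, n_steps=20, warmup=5):
    """Rough tokens/second for forward + backward passes.

    batch: dict of tensors including input_ids and labels, so model(**batch) returns a loss.
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
    tokens_per_step = batch["input_ids"].numel()
    for step in range(n_steps + warmup):
        if step == warmup:  # exclude warmup steps from the timed window
            if torch.cuda.is_available():
                torch.cuda.synchronize()
            start = time.perf_counter()
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return n_steps * tokens_per_step / elapsed
```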

Improving Mamba's training efficiency without sacrificing ranking quality could make SSM-based models a practical option for LLM deployments in IR. As the field evolves, continued study of architectures like Mamba that challenge attention's computational costs remains important for building more capable, scalable, and efficient language processing systems.

Concluding Remarks

The paper's study of Mamba for document ranking opens a promising avenue for future research. The competitive effectiveness of Mamba models, set against their current training-throughput limitations, gives a balanced picture of the potential and challenges of deploying SSM-based models in IR. Refining these models and addressing their efficiency gaps will be key to applying them more broadly across IR tasks.

Authors (1)
  1. Zhichao Xu