Scaling Laws For Dense Retrieval (2403.18684v2)

Published 27 Mar 2024 in cs.IR and cs.CL

Abstract: Scaling up neural models has yielded significant advancements in a wide array of tasks, particularly in language generation. Previous studies have found that the performance of neural models frequently adheres to predictable scaling laws, correlated with factors such as training set size and model size. This insight is invaluable, especially as large-scale experiments grow increasingly resource-intensive. Yet, such scaling law has not been fully explored in dense retrieval due to the discrete nature of retrieval metrics and complex relationships between training data and model sizes in retrieval tasks. In this study, we investigate whether the performance of dense retrieval models follows the scaling law as other neural models. We propose to use contrastive log-likelihood as the evaluation metric and conduct extensive experiments with dense retrieval models implemented with different numbers of parameters and trained with different amounts of annotated data. Results indicate that, under our settings, the performance of dense retrieval models follows a precise power-law scaling related to the model size and the number of annotations. Additionally, we examine scaling with prevalent data augmentation methods to assess the impact of annotation quality, and apply the scaling law to find the best resource allocation strategy under a budget constraint. We believe that these insights will significantly contribute to understanding the scaling effect of dense retrieval models and offer meaningful guidance for future research endeavors.

Exploring Scaling Laws for Dense Retrieval

Introduction to Scaling Laws in Dense Retrieval

Recent advances in neural network research, particularly in NLP and information retrieval (IR), have underscored the value of scaling up neural models. Across many tasks involving LLMs, performance follows predictable scaling laws: relationships that forecast model quality from factors such as model size and training data volume. These laws have proven invaluable, especially given the resource-intensive nature of large-scale experiments. Yet such scaling laws remain largely unexplored in dense retrieval, a notable gap given the pivotal role dense retrieval models play in improving semantic search over conventional retrieval methods.

The paper by Yan Fang, Jingtao Zhan, Qingyao Ai, and colleagues takes up this question, investigating whether the performance of dense retrieval models scales in the same way as that of other neural models. They examine the balance between model size and training data volume and ask whether scaling laws hold in dense retrieval at all, despite its inherently discrete evaluation metrics and the complex interplay between model and data sizes.

Key Findings and Methodology

A significant contribution of the work is the proposal to use contrastive log-likelihood as the evaluation metric. Unlike discrete ranking metrics, it varies continuously and therefore mirrors the loss-based measures under which scaling laws have been observed for LLMs. Extensive experiments across different model sizes and volumes of annotated data reveal a precise power-law relationship in dense retrieval performance. This result is not only theoretically interesting but also has practical implications for allocating resources when training dense retrieval models.
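
To make the metric concrete, here is a minimal sketch of how a contrastive log-likelihood could be computed for a dense retriever. The function name, the dot-product scoring, and the use of sampled negatives are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_log_likelihood(query_emb, pos_doc_emb, neg_doc_embs):
    """
    Contrastive log-likelihood of the annotated relevant document for one query.

    query_emb:    (d,)   query embedding
    pos_doc_emb:  (d,)   embedding of the relevant document
    neg_doc_embs: (n, d) embeddings of sampled negative documents
    """
    # Dot-product similarities between the query and candidate documents.
    pos_score = query_emb @ pos_doc_emb                        # scalar
    neg_scores = neg_doc_embs @ query_emb                      # (n,)
    scores = torch.cat([pos_score.unsqueeze(0), neg_scores])   # (n + 1,)
    # Log-probability assigned to the positive document (index 0).
    return F.log_softmax(scores, dim=0)[0]

# Toy usage with random embeddings; a corpus-level metric averages the
# negated log-likelihood over queries, giving a continuous quantity
# analogous to perplexity.
q, p, negs = torch.randn(128), torch.randn(128), torch.randn(8, 128)
print(contrastive_log_likelihood(q, p, negs))
```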

Model and Data Size Scaling

The paper effectively disentangles the effects of model and data sizes, offering insights into their individual and combined impacts on dense retrieval:

  • Model Size: The research observes a clear, predictable relationship between model size and retrieval performance, quantified through contrastive perplexity. Larger models achieve better retrieval quality, albeit with diminishing returns.
  • Data Size: A similar power law emerges when varying the amount of annotated data: larger training sets generally improve performance, again at a diminishing rate. This highlights the critical role of data volume in training effective dense retrieval systems (a rough curve-fitting sketch follows this list).
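
The reported power-law behavior can be illustrated with a small curve-fitting exercise. The functional form below (a scale-dependent term plus an irreducible floor) follows the general neural scaling-law literature; the parameterization and the data points are made-up placeholders, not numbers from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, alpha, c):
    # L(n) = a * n^(-alpha) + c : loss decays as a power of scale n
    # (model parameters or number of annotations), with irreducible floor c.
    return a * np.power(n, -alpha) + c

# Hypothetical (model size, contrastive-entropy) observations for illustration.
sizes = np.array([3.5e7, 1.1e8, 3.3e8, 7.5e8])
losses = np.array([1.92, 1.61, 1.38, 1.27])

params, _ = curve_fit(power_law, sizes, losses, p0=[1000.0, 0.4, 1.0], maxfev=10000)
a, alpha, c = params
print(f"fitted exponent alpha = {alpha:.3f}, irreducible loss = {c:.3f}")

# Extrapolate to a larger model size to predict performance before training it.
print("predicted loss at 3B params:", power_law(3e9, *params))
```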

Annotation Quality and Scaling Laws

Examining data annotation quality sheds light on its influence on the scaling behavior. The paper shows that annotation quality, ranging from weak supervision to high-quality labels, notably affects the scaling effect, with higher-quality annotations yielding steeper improvements in model performance. This underscores the value of high-quality annotations and the potential of using capable LLMs to generate them.

Implications and Future Directions

The identification of scaling laws in dense retrieval models extends beyond academic curiosity, offering tangible benefits for guiding future research and developments in IR. It provides a framework for predicting model performance under various configurations, enabling more efficient allocation of computational resources and informed decisions on data annotation strategies.
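
As a rough illustration of how such a framework could drive budgeting decisions, the sketch below grid-searches the split of a fixed budget between model parameters and annotated training pairs under an assumed additive power-law loss surface and a linear cost model. Every constant here (exponents, coefficients, unit costs) is a placeholder, not a value from the paper.

```python
import numpy as np

def predicted_loss(model_size, num_annotations,
                   a=120.0, alpha=0.30, b=45.0, beta=0.35, floor=1.0):
    # Assumed additive power-law surface:
    # L(N, D) = a * N^(-alpha) + b * D^(-beta) + floor
    return a * model_size**-alpha + b * num_annotations**-beta + floor

def best_allocation(budget, cost_per_param=1e-6, cost_per_label=0.05, grid=200):
    """Grid-search the split of a fixed budget between model parameters
    and annotated training pairs that minimizes the predicted loss."""
    best = None
    for f in np.linspace(0.05, 0.95, grid):
        n_params = (f * budget) / cost_per_param
        n_labels = ((1 - f) * budget) / cost_per_label
        loss = predicted_loss(n_params, n_labels)
        if best is None or loss < best[0]:
            best = (loss, n_params, n_labels)
    return best

loss, n_params, n_labels = best_allocation(budget=10_000.0)
print(f"predicted loss {loss:.3f} with ~{n_params:.2e} params and ~{n_labels:.0f} labels")
```

In practice, the exponents and the cost model would be replaced with values fitted from pilot runs before being used to guide allocation.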

Looking ahead, this work opens several avenues for further exploration:

  • Extending Model and Data Size Ranges: Investigating scaling laws over a broader spectrum of model sizes and data volumes could provide deeper insights into their limitations and applicability.
  • Diverse Architectures and Tasks: Exploring scaling laws across different neural architectures and retrieval tasks could uncover task-specific scaling behavior, enriching our understanding of dense retrieval systems.
  • Practical Applications: The paper's findings on optimal resource allocation under budget constraints have immediate practical applications in designing efficient and scalable dense retrieval systems, paving the way for more cost-effective implementations in commercial search engines.

In conclusion, this pioneering exploration of scaling laws in dense retrieval marks a significant step forward in our understanding of neural information retrieval systems. It lays the groundwork for future research aimed at optimizing the design and training of dense retrieval models, ultimately advancing the state-of-the-art in semantic search technologies.

Authors (7)
  1. Yan Fang
  2. Jingtao Zhan
  3. Qingyao Ai
  4. Jiaxin Mao
  5. Weihang Su
  6. Jia Chen
  7. Yiqun Liu