Scaling Retrieval-Based Language Models with a Trillion-Token Datastore (2407.12854v1)

Published 9 Jul 2024 in cs.CL, cs.AI, cs.IR, and cs.LG

Abstract: Scaling laws with respect to the amount of training data and the number of parameters allow us to predict the cost-benefit trade-offs of pretraining language models (LMs) in different configurations. In this paper, we consider another dimension of scaling: the amount of data available at inference time. Specifically, we find that increasing the size of the datastore used by a retrieval-based LM monotonically improves language modeling and several downstream tasks without obvious saturation, such that a smaller model augmented with a large datastore outperforms a larger LM-only model on knowledge-intensive tasks. By plotting compute-optimal scaling curves with varied datastore, model, and pretraining data sizes, we show that using larger datastores can significantly improve model performance for the same training compute budget. We carry out our study by constructing a 1.4 trillion-token datastore named MassiveDS, which is the largest and the most diverse open-sourced datastore for retrieval-based LMs to date, and designing an efficient pipeline for studying datastore scaling in a computationally accessible manner. Finally, we analyze the effect of improving the retriever, datastore quality filtering, and other design choices on our observed scaling trends. Overall, our results show that datastore size should be considered as an integral part of LM efficiency and performance trade-offs. To facilitate future research, we open-source our datastore and code at https://github.com/RulinShao/retrieval-scaling.

Scaling Retrieval-Based Language Models with a Trillion-Token Datastore

The research presented in this paper explores an additional dimension in the development of language models (LMs): the amount of data available to retrieval-based models at inference time. The central premise is that increasing the size of the datastore from which an LM retrieves information improves its performance across a range of tasks. To study this, the authors construct MassiveDS, a 1.4 trillion-token datastore that is the largest and most diverse open-source datastore for retrieval-based LMs to date.
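To make the retrieval-at-inference setup concrete, the sketch below builds a small dense index over a toy datastore, retrieves the top-k passages for a query, and prepends them to the LM prompt. It is a minimal illustration under stated assumptions, not the paper's pipeline: the encoder choice (`all-MiniLM-L6-v2`), the flat FAISS index, and the prompt format are placeholders chosen for brevity, and MassiveDS itself is orders of magnitude larger with its own retriever and sharded indexing.

```python
# Minimal sketch of retrieval-augmented inference: embed a (toy) datastore,
# retrieve the top-k passages for a query, and prepend them to the LM prompt.
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Toy stand-in for a datastore; MassiveDS holds ~1.4T tokens across 8 domains.
passages = [
    "The Eiffel Tower was completed in 1889 for the World's Fair in Paris.",
    "Photosynthesis converts light energy into chemical energy in plants.",
    "The Pythagorean theorem relates the sides of a right triangle.",
]

# Assumed encoder; the paper's retriever may differ.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(passages, normalize_embeddings=True)

# Exact inner-product index (cosine similarity, since embeddings are normalized).
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(np.asarray(embeddings, dtype="float32"))

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k most similar passages from the datastore."""
    q = encoder.encode([query], normalize_embeddings=True)
    _, ids = index.search(np.asarray(q, dtype="float32"), k)
    return [passages[i] for i in ids[0]]

def build_prompt(query: str, k: int = 2) -> str:
    """Prepend retrieved passages to the query before calling the LM."""
    context = "\n".join(retrieve(query, k))
    return f"{context}\n\nQuestion: {query}\nAnswer:"

print(build_prompt("When was the Eiffel Tower built?"))
```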

Key Findings

  1. Performance Gains from Scale: Increasing datastore size significantly improves LM performance on both language modeling and several downstream tasks, with no sign of saturation at the scales tested. Smaller models paired with large datastores can outperform larger models without retrieval augmentation, particularly on knowledge-intensive tasks.
  2. Compute-Optimal Scaling: Scaling the retrieval datastore yields better results for the same training compute budget. The paper emphasizes that indexing a datastore is substantially cheaper than training the LM on additional tokens: building the index requires only forward passes of a comparatively small retriever over the datastore, whereas adding training tokens requires full forward and backward passes of the LM itself. This makes datastore scaling a cost-effective way to improve model performance.
  3. Impact of Diverse Data: MassiveDS comprises over 1.4 trillion tokens spanning eight domains, providing broad knowledge coverage. Models retrieving from MassiveDS perform as well as or better than models using single-domain datastores, highlighting the benefit of a multi-domain datastore from which the retriever can automatically surface domain-relevant content.
  4. Effect of Reranking and Filtering: Additional experiments show that stronger reranking methods further improve retrieval quality, indicating ample room for improvement in this component. Data deduplication and quality-based filtering were also important for keeping the datastore efficient and relevant as its size scales up (a minimal filtering sketch follows this list).
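As a rough illustration of the preprocessing referred to in item 4, the sketch below applies exact deduplication via normalized-text hashing plus two simple quality heuristics (minimum length and alphabetic-character ratio). The specific heuristics and thresholds are illustrative assumptions, not the filters used to build MassiveDS.

```python
# Illustrative datastore preprocessing: exact deduplication plus simple
# quality heuristics. Thresholds are arbitrary placeholders.
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def passes_quality(text: str, min_chars: int = 200, min_alpha_ratio: float = 0.6) -> bool:
    """Drop very short passages and those dominated by non-alphabetic noise."""
    if len(text) < min_chars:
        return False
    alpha = sum(c.isalpha() for c in text)
    return alpha / max(len(text), 1) >= min_alpha_ratio

def dedup_and_filter(passages):
    """Yield passages that are unique (after normalization) and pass quality checks."""
    seen = set()
    for p in passages:
        key = hashlib.sha1(normalize(p).encode("utf-8")).hexdigest()
        if key in seen or not passes_quality(p):
            continue
        seen.add(key)
        yield p
```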

Practical and Theoretical Implications

  • Performance and Efficiency: The findings point to a shift in how performance is obtained: it does not depend solely on increasing model size and training data, but also on using expansive datastores effectively at inference time. This suggests more resource-efficient strategies for deploying large-scale LMs, particularly in settings with limited compute.
  • Generalization Across Domains: Because retrieval-based models can draw on large, diverse datastores, they can generalize across varied domains, a step toward general-purpose models that handle a broad range of tasks with a single configuration.
  • Future Directions in Retrieval-Based LMs: The research opens pathways for further work on advanced retrieval techniques, including better retrievers and rerankers that complement datastore scaling (a reranking sketch follows this list). Extending the evaluation to more complex tasks such as long-form generation and deep mathematical reasoning could provide further insights.
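As one concrete direction along these lines, first-stage retrieval can be followed by a stronger reranker that scores each query-passage pair jointly. The sketch below uses a cross-encoder from `sentence-transformers` as that second stage; the particular checkpoint and the two-stage setup are assumptions for illustration, not the paper's exact configuration.

```python
# Illustrative two-stage retrieval: cheap first-stage candidates are rescored
# by a cross-encoder that reads the query and passage together.
from sentence_transformers import CrossEncoder

# Assumed reranker checkpoint; any cross-encoder trained for relevance works.
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 3) -> list[str]:
    """Rescore first-stage candidates and keep the top_k passages."""
    scores = reranker.predict([(query, passage) for passage in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_k]]
```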

Conclusion

This paper contributes significantly to understanding how datastore scaling shapes the efficacy and efficiency of retrieval-based LLMs. By integrating large-scale, multi-domain datastores at inference time, retrieval-augmented LMs show promise in achieving superior performance without the prohibitive costs associated with parameter and data scaling alone. This new dimension of scaling presents exciting opportunities for future advances and practical applications in AI.

Authors (8)
  1. Rulin Shao
  2. Jacqueline He
  3. Akari Asai
  4. Weijia Shi
  5. Tim Dettmers
  6. Sewon Min
  7. Luke Zettlemoyer
  8. Pang Wei Koh