Scaling Retrieval-Based LLMs with a Trillion-Token Datastore
This paper explores an additional scaling dimension for large language models (LMs): the amount of data available to retrieval-based models at inference time. The central premise is that enlarging the datastore from which an LM retrieves information improves its performance across a range of tasks. To test this, the authors construct MassiveDS, a 1.4-trillion-token datastore that is the largest and most diverse open-source repository available for retrieval-based LMs to date. The sketch below illustrates the inference-time retrieval loop such a system uses.
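To make the setup concrete, here is a minimal sketch of retrieval-augmented inference. Everything in it is illustrative: `embed` is a hypothetical stand-in for a dense retriever encoder, and the brute-force dot-product search stands in for the approximate nearest-neighbor index a trillion-token datastore would actually require.

```python
import numpy as np

def embed(texts):
    """Hypothetical stand-in for a dense retriever's encoder
    (a real system would use a trained model mapping text -> unit vectors)."""
    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(len(texts), 768)).astype("float32")
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

datastore = ["passage one ...", "passage two ...", "passage three ..."]
doc_vecs = embed(datastore)          # built once, offline

def retrieve(query, k=2):
    q = embed([query])[0]
    scores = doc_vecs @ q            # cosine similarity (unit vectors)
    top = np.argsort(-scores)[:k]
    return [datastore[i] for i in top]

query = "When was the Eiffel Tower built?"
context = "\n".join(retrieve(query))
prompt = f"{context}\n\nQuestion: {query}\nAnswer:"
# `prompt` is then fed to the frozen LM; no model weights change.
```

The key point for what follows: the datastore is built offline, once, and the LM itself is never retrained.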
Key Findings
- Performance Over Scale: Increasing datastore size significantly improves LM capabilities on both language modeling and several downstream tasks, with no sign of saturation at the scales tested. Notably, smaller models paired with large datastores can outperform larger models that lack retrieval augmentation, particularly on knowledge-intensive tasks.
- Compute-Optimal Scaling: For a fixed compute budget, scaling the retrieval datastore yields better results than spending the same budget on training. The paper emphasizes that indexing a datastore costs far less compute than training an LM on additional tokens, making datastore scaling a cost-effective way to improve performance (see the back-of-the-envelope comparison after this list).
- Diverse Data Impacts: MassiveDS spans more than 1.4 trillion tokens across eight domains, ensuring broad knowledge coverage. Models using MassiveDS perform better than, or on par with, models using single-domain datastores, showing that a multi-domain datastore lets retrieval select the relevant domain automatically (sketched after this list).
- Effect of Reranking and Filtering: Additional experiments show that stronger reranking methods further improve retrieval quality, indicating ample headroom in this area. Data deduplication and quality-based filtering are likewise vital for keeping the datastore efficient and relevant as it scales (a two-stage pipeline is sketched after this list).
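To see why indexing is cheap relative to extra training, consider a back-of-the-envelope FLOPs comparison using the standard approximations (training ≈ 6·N·D FLOPs, one forward pass ≈ 2·N·D). The model and datastore sizes below are illustrative assumptions, not figures from the paper.

```python
# Rough FLOPs comparison: training ~ 6*N*D, a forward pass ~ 2*N*D.
# All numbers below are illustrative assumptions.

lm_params      = 7e9      # a 7B-parameter LM
extra_tokens   = 1e12     # option A: train on 1T additional tokens
train_flops    = 6 * lm_params * extra_tokens

enc_params     = 110e6    # a BERT-base-sized retriever encoder
datastore_toks = 1e12     # option B: embed a 1T-token datastore once
index_flops    = 2 * enc_params * datastore_toks

print(f"extra training:     {train_flops:.2e} FLOPs")   # 4.20e+22
print(f"datastore indexing: {index_flops:.2e} FLOPs")   # 2.20e+20
print(f"ratio: {train_flops / index_flops:.0f}x")       # 191x
```

Under these assumptions, embedding the datastore costs roughly two orders of magnitude less compute than training on the same number of tokens, because the retriever encoder is tiny relative to the LM and each token is processed only in a forward pass.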
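The multi-domain finding suggests a simple mechanism: if each domain has its own index, a global top-k over all domains automatically favors passages from whichever domain is relevant. A sketch of that merge, under the same hypothetical embedding setup as above:

```python
import numpy as np

def global_top_k(query_vec, domain_indexes, k=5):
    """Merge per-domain results into a single global top-k.
    `domain_indexes` maps a domain name to (doc_vecs, passages);
    the highest-scoring passages win regardless of domain, so the
    relevant domain is selected automatically by the scores."""
    candidates = []
    for domain, (vecs, passages) in domain_indexes.items():
        scores = vecs @ query_vec
        for i in np.argsort(-scores)[:k]:        # local top-k per domain
            candidates.append((float(scores[i]), domain, passages[i]))
    return sorted(candidates, reverse=True)[:k]  # global top-k
```

Taking a local top-k from each domain before merging bounds the candidate set without losing anything, since every member of the true global top-k necessarily appears in its own domain's top-k.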
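Finally, a sketch of the two-stage pipeline the reranking and filtering results point to: deduplicate before indexing, over-fetch with a fast first stage, then reorder a shortlist with a stronger scorer. `cross_encoder_score` here is a hypothetical word-overlap stand-in for a real cross-encoder reranker, and `first_stage` is any cheap retriever like the one sketched earlier.

```python
import hashlib
import re

def _norm(text):
    return re.sub(r"\s+", " ", text.lower()).strip()

def dedup(passages):
    """Exact dedup after cheap normalization; real pipelines add
    fuzzy dedup (e.g., MinHash) and quality filters before indexing."""
    seen, kept = set(), []
    for p in passages:
        h = hashlib.sha1(_norm(p).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(p)
    return kept

def cross_encoder_score(query, passage):
    """Hypothetical stand-in for a cross-encoder reranker that jointly
    reads query and passage; here, simple word overlap."""
    return float(len(set(_norm(query).split()) & set(_norm(passage).split())))

def retrieve_and_rerank(query, first_stage, k=3, candidates=20):
    """Two-stage retrieval: a cheap, recall-oriented first stage
    over-fetches, then a slower, more accurate reranker orders it."""
    shortlist = first_stage(query, candidates)
    scored = [(cross_encoder_score(query, p), p) for p in shortlist]
    return [p for _, p in sorted(scored, reverse=True)[:k]]
```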
Practical and Theoretical Implications
- Performance and Efficiency: The findings suggest that optimal performance does not rely solely on increasing model size and training data; effectively exploiting large datastores at inference time matters as well. This points toward more resource-efficient deployment strategies for large-scale LMs, particularly where training compute is limited.
- Generalization Across Domains: The ability of retrieval-based models to leverage large, diverse datastores implies broader generalization across domains, a meaningful step toward universal models that handle a wide range of tasks with a single configuration.
- Future Directions in Retrieval-Based LMs: The work opens pathways for more advanced retrieval techniques, including better retrievers and rerankers that complement datastore scaling. Extending the evaluation to more complex tasks, such as long-form generation and deep mathematical reasoning, could yield further insights.
Conclusion
This paper significantly advances our understanding of how datastore scaling shapes the efficacy and efficiency of retrieval-based large language models. By integrating large-scale, multi-domain datastores at inference time, retrieval-augmented LMs show promise for superior performance without the prohibitive costs of parameter and training-data scaling alone. This new scaling dimension opens exciting opportunities for future advances and practical applications in AI.