An Overview of RETRO: Enhancing LLMs with Retrieval-Augmented Transformer Blocks
This paper focuses on augmenting large-scale LLMs by restructuring their architecture around Retrieval-Enhanced Transformer (RETRO) blocks. It examines RETRO's impact on model performance and parameter efficiency, and it introduces new methodology for integrating retrieval mechanisms into transformer models. Detailed numerical evaluations across multiple datasets substantiate the findings.
RETRO Architecture and Methodology
The RETRO model diverges from conventional transformer-based models by incorporating retrieval mechanisms to access relevant document chunks during the forward pass. This process is driven by the following integral components:
- Frozen kNN Retriever: A frozen, pre-trained encoder embeds input chunks, and approximate k-nearest-neighbour search returns the most relevant document chunks from the retrieval database; the retriever is not fine-tuned while the LLM is trained (see the retrieval sketch after this list).
- Chunked Cross-Attention (CCA): The input sequence is split into chunks, and each chunk attends to the encoded neighbours retrieved for it, letting the model fold the retrieved context into its representations (a minimal implementation sketch follows this list).
- RETRO Blocks: Interleaved with standard transformer layers, these blocks combine the input representations with the retrieved context via CCA and then pass the result through a feed-forward network. This design allows model quality to scale with the size of the retrieval database.
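To make the retrieval step concrete, here is a minimal sketch of the frozen-embedder-plus-kNN pipeline described above. It is an illustration under stated assumptions rather than the paper's implementation: FAISS is used only as one familiar nearest-neighbour library, and `embed_chunks` is a placeholder for a frozen pre-trained encoder.

```python
# Sketch of the retrieval side of a RETRO-style pipeline: embed corpus chunks
# with a frozen encoder, then fetch nearest-neighbour chunks for each query
# chunk. FAISS is used as one common nearest-neighbour library; the paper's
# own retrieval infrastructure differs, and embed_chunks() stands in for a
# frozen pre-trained encoder.
import numpy as np
import faiss


def embed_chunks(chunks, dim=768):
    # Placeholder: a real system would run a frozen encoder (e.g. a
    # pre-trained BERT) and return one float32 vector per chunk.
    rng = np.random.default_rng(0)
    return rng.standard_normal((len(chunks), dim)).astype("float32")


corpus_chunks = [
    "first chunk of the retrieval database ...",
    "second chunk of the retrieval database ...",
    "third chunk of the retrieval database ...",
]
keys = embed_chunks(corpus_chunks)
faiss.normalize_L2(keys)                   # cosine similarity via inner product

index = faiss.IndexFlatIP(keys.shape[1])   # exact search; swap in an ANN index at scale
index.add(keys)

queries = embed_chunks(["current input chunk ..."])
faiss.normalize_L2(queries)
scores, ids = index.search(queries, 2)     # 2 nearest neighbours per query chunk
neighbours = [corpus_chunks[i] for i in ids[0]]
```

In RETRO the retrieved chunks (together with their continuations in the source documents) are then encoded and passed to the decoder's chunked cross-attention layers, sketched next.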
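The following PyTorch sketch shows how chunked cross-attention and a RETRO-style block might fit together. All names (`ChunkedCrossAttention`, `RetroBlock`), sizes, and defaults are assumptions made for clarity, and details from the paper such as relative positional encodings, the causal chunk offsets, and the neighbour encoder are omitted.

```python
# Minimal RETRO-style block: causal self-attention, then chunked
# cross-attention (CCA) over encoded retrieved neighbours, then a
# feed-forward network. Illustrative simplification, not the paper's code.
import torch
import torch.nn as nn


class ChunkedCrossAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, chunk_len: int):
        super().__init__()
        self.chunk_len = chunk_len
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, hidden: torch.Tensor, neighbours: torch.Tensor) -> torch.Tensor:
        # hidden:     (batch, seq_len, d_model), seq_len divisible by chunk_len
        # neighbours: (batch, n_chunks, k, r_len, d_model), encoded retrieved chunks
        b, seq_len, d = hidden.shape
        n_chunks = seq_len // self.chunk_len
        k, r_len = neighbours.shape[2], neighbours.shape[3]

        # Queries: each input chunk attends only to its own retrieved neighbours.
        q = hidden.reshape(b * n_chunks, self.chunk_len, d)
        # Keys/values: flatten the k neighbours of each chunk into one sequence.
        kv = neighbours.reshape(b * n_chunks, k * r_len, d)

        attended, _ = self.cross_attn(q, kv, kv)
        return attended.reshape(b, seq_len, d)


class RetroBlock(nn.Module):
    """Self-attention -> chunked cross-attention over neighbours -> feed-forward."""

    def __init__(self, d_model=512, n_heads=8, chunk_len=64, ffn_mult=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cca = ChunkedCrossAttention(d_model, n_heads, chunk_len)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, ffn_mult * d_model),
            nn.GELU(),
            nn.Linear(ffn_mult * d_model, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor, neighbours: torch.Tensor) -> torch.Tensor:
        # Causal self-attention over the input sequence.
        h = self.norm1(x)
        causal = torch.triu(
            torch.ones(x.size(1), x.size(1), dtype=torch.bool, device=x.device), 1
        )
        sa, _ = self.self_attn(h, h, h, attn_mask=causal)
        x = x + sa
        # Chunked cross-attention into the encoded retrieved neighbours.
        x = x + self.cca(self.norm2(x), neighbours)
        # Position-wise feed-forward network.
        return x + self.ffn(self.norm3(x))
```

As a shape check: with `chunk_len=64`, a 256-token input, and 2 retrieved neighbours of length 128 per chunk, `x` would have shape (1, 256, 512) and `neighbours` shape (1, 4, 2, 128, 512).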
Empirical Performance Analysis
The evaluation of RETRO spans multiple datasets, including Wikipedia and OpenWebText as well as more domain-specific corpora such as arXiv and PubMed abstracts.
Key Numerical Results:
- LAMBADA Accuracy: Consistently high accuracy was observed across different model sizes (172M, 425M, 1.5B, 7.5B parameters), indicating effective context retrieval mechanisms.
- Perplexity Metrics: Notable improvements in perplexity were reported on corpora such as Wikitext103:
  - 172M model: 0.70 (RETRO [ON]) vs 0.50 (baseline)
  - 1.5B model: 0.65 (RETRO [ON]) vs 0.60 (baseline)
- Bits-Per-Byte (bpb) Reduction: Noticeable bpb reductions were reported when RETRO was applied to large datasets, reflecting more efficient compression of the evaluation text (see the conversion sketch after this list):
  - On the Wikipedia September 2021 evaluation set, the larger models reached bpb values between 0.60 and 0.85 depending on the retrieval configuration.
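For reference, bits-per-byte is a tokenizer-agnostic restatement of the language-modelling loss. The snippet below shows the usual conversion from a cross-entropy loss summed over an evaluation set (in nats) to bpb; the numbers are illustrative and not taken from the paper.

```python
import math


def bits_per_byte(total_loss_nats: float, total_bytes: int) -> float:
    """Convert a cross-entropy loss summed over a corpus (in nats)
    into bits per byte of the underlying evaluation text."""
    return total_loss_nats / (math.log(2) * total_bytes)


# Illustrative example: a summed loss of 1.2e6 nats over 2,000,000 bytes.
print(bits_per_byte(1.2e6, 2_000_000))  # ~0.87 bpb
```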
Implications and Future Work
Theoretical Implications:
The RETRO architecture demonstrates that retrieval-augmented approaches can mitigate some of the scaling limitations faced by traditional transformers. The chunked cross-attention mechanism provides dynamic context integration that could pave the way for more adaptive LLMs.
Practical Implications:
On a practical level, integrating RETRO blocks could improve real-world applications such as conversational agents, question-answering systems, and text summarization tools. This enhancement is particularly relevant for domains requiring access to large, dynamic knowledge bases.
Future Developments:
Future research could further optimize the retrieval mechanisms, focusing on faster kNN retrieval and refined chunk selection strategies. Exploring RETRO's application to multitask learning and to low-resource languages also offers promising directions for the continued evolution of LLMs.
In conclusion, the paper positions RETRO as a substantial improvement over traditional transformer models: by integrating retrieval mechanisms, it delivers gains in model performance and parameter efficiency. Retrieval-augmented architectures such as RETRO hold considerable promise for future advances in natural language processing.