
Generalization through Memorization: Nearest Neighbor Language Models (1911.00172v2)

Published 1 Nov 2019 in cs.CL

Abstract: We introduce $k$NN-LMs, which extend a pre-trained neural language model (LM) by linearly interpolating it with a $k$-nearest neighbors ($k$NN) model. The nearest neighbors are computed according to distance in the pre-trained LM embedding space, and can be drawn from any text collection, including the original LM training data. Applying this augmentation to a strong Wikitext-103 LM, with neighbors drawn from the original training set, our $k$NN-LM achieves a new state-of-the-art perplexity of 15.79 - a 2.9 point improvement with no additional training. We also show that this approach has implications for efficiently scaling up to larger training sets and allows for effective domain adaptation, by simply varying the nearest neighbor datastore, again without further training. Qualitatively, the model is particularly helpful in predicting rare patterns, such as factual knowledge. Together, these results strongly suggest that learning similarity between sequences of text is easier than predicting the next word, and that nearest neighbor search is an effective approach for language modeling in the long tail.

Generalization through Memorization: Nearest Neighbor Language Models

The paper "Generalization through Memorization: Nearest Neighbor Language Models" by Khandelwal et al. explores the integration of k-nearest neighbor (k-NN) retrieval into language models (LMs) to improve their ability to generalize. The authors present a hybrid approach that leverages the strengths of both neural language models and memory-based methods.

Summary of Contributions

The primary contribution of this work is the introduction of k-NN-LMs, which augment a pre-trained neural language model with a k-nearest neighbors retrieval mechanism. Specifically, the approach stores the training data in a key-value memory (datastore), where keys are context embeddings produced by the LM and values are the corresponding next tokens. During inference, the model retrieves the nearest neighbors from this memory to inform its predictions.

Methodology

The k-NN-LM operates in two main stages:

  1. Memory Augmentation: The memory consists of key-value pairs constructed from the training dataset. Keys are high-dimensional vectors derived from the context representations, and values are the corresponding next tokens (see the sketch that follows this list).
  2. Inference via Retrieval: At inference time, the model employs a retrieval mechanism to find the closest matching context embeddings in the memory. The retrieved tokens (values) are combined with the probability distribution of the base neural LLM to generate the final prediction.
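As a rough illustration of the first stage, the sketch below builds such a key-value datastore. It is a minimal sketch, assuming a Hugging Face-style causal LM whose final hidden states serve as the context embeddings; `model`, `tokenizer`, and `texts` are hypothetical stand-ins rather than the authors' code.

```python
import numpy as np
import torch

def build_datastore(model, tokenizer, texts, device="cpu"):
    """Collect (context embedding, next token) pairs from a corpus.

    Hypothetical sketch: the key for position i is the hidden state of
    the context ending at token i, and the value is token i + 1.
    """
    keys, values = [], []
    model.eval()
    with torch.no_grad():
        for text in texts:
            ids = tokenizer(text, return_tensors="pt")["input_ids"].to(device)
            out = model(ids, output_hidden_states=True)
            hidden = out.hidden_states[-1][0]        # (seq_len, dim)
            keys.append(hidden[:-1].cpu().numpy())   # contexts 0 .. n-2
            values.append(ids[0, 1:].cpu().numpy())  # next tokens 1 .. n-1
    return np.concatenate(keys), np.concatenate(values)
```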

Formally, the prediction of the k-NN-LM for the next token $w_i$ given context $h_i$ is a linear interpolation of the base LM probability $P_{LM}(w_i \mid h_i)$ and the probability $P_{kNN}(w_i \mid h_i)$ derived from the retrieved neighbors: $P(w_i \mid h_i) = \lambda \, P_{kNN}(w_i \mid h_i) + (1 - \lambda) \, P_{LM}(w_i \mid h_i)$, where $\lambda$ is an interpolation weight tuned on validation data.
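A minimal sketch of this interpolation, under the same assumptions as the datastore sketch above, is shown below: brute-force squared-L2 search over the datastore, a softmax over negative distances to form the neighbor distribution (as in the paper's formulation), and a linear mix with the base LM distribution. The default `k` and `lambda_` values are placeholders that would be tuned on held-out data.

```python
import numpy as np

def knn_lm_probs(query, keys, values, p_lm, k=8, lambda_=0.25):
    """Interpolate the base LM distribution with a kNN distribution.

    query:  (dim,) context embedding for the current position
    keys:   (N, dim) datastore keys; values: (N,) next-token ids
    p_lm:   (vocab,) base LM next-token probabilities
    Sketch only: exact brute-force search; large datastores need an
    approximate nearest-neighbor index instead.
    """
    dists = ((keys - query) ** 2).sum(axis=1)     # squared L2 to every key
    nn = np.argpartition(dists, k)[:k]            # indices of the k nearest
    weights = np.exp(-dists[nn])
    weights /= weights.sum()                      # softmax over -distance

    p_knn = np.zeros_like(p_lm)
    np.add.at(p_knn, values[nn], weights)         # aggregate mass per token id

    return lambda_ * p_knn + (1.0 - lambda_) * p_lm
```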

Experimental Results

The authors conducted extensive experiments on standard language modeling benchmarks, most notably Wikitext-103, demonstrating the efficacy of the proposed k-NN-LM. Key findings include:

  • Perplexity Reduction: The k-NN-LM achieves significant reductions in perplexity compared to strong baseline models. On the Wikitext-103 dataset, it reaches a perplexity of 15.79, a 2.9 point improvement over the base LM and a new state of the art, achieved with no additional training.
  • Adaptive Generalization: The model effectively adapts to novel contexts by leveraging the memory component, providing a robust mechanism for generalization through memorization. This is particularly evident for rare patterns, such as factual knowledge.

Implications

The integration of k-NN retrieval mechanisms into language models has several noteworthy implications:

  • Enhanced Memory Capacity: By storing comprehensive representations of the training data, k-NN-LMs can recall and utilize specific contexts more effectively than traditional LMs.
  • Dynamic Adaptation: The model's ability to dynamically incorporate nearest neighbors during inference enables it to adapt to changes in data distribution without the need for retraining.
  • Scalability Concerns: While memory-based models show promise, they also pose challenges related to memory storage and retrieval efficiency, particularly for large-scale datasets.
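On the scalability point, retrieval over a very large datastore is typically made practical with approximate nearest-neighbor search (the paper uses FAISS for this). The sketch below shows one plausible way such an index could be built; the parameters are illustrative assumptions, not the authors' settings.

```python
import faiss
import numpy as np

def build_index(keys, n_centroids=4096, n_subquantizers=64, n_bits=8):
    """Index datastore keys for approximate kNN search (illustrative).

    An inverted-file index with product quantization (IVFPQ) stores
    compressed keys, trading a little recall for far lower memory use
    and query latency than exact search.
    """
    keys = keys.astype(np.float32)
    dim = keys.shape[1]              # must be divisible by n_subquantizers
    quantizer = faiss.IndexFlatL2(dim)
    index = faiss.IndexIVFPQ(quantizer, dim, n_centroids, n_subquantizers, n_bits)
    index.train(keys)                # learn coarse centroids and PQ codebooks
    index.add(keys)
    return index

# Domain adaptation amounts to swapping the datastore: build a new index
# (and value array) from in-domain text and keep the LM itself unchanged.
# distances, neighbor_ids = index.search(query.reshape(1, -1).astype(np.float32), 8)
```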

Theoretical and Practical Considerations

From a theoretical perspective, the k-NN-LM represents a significant step toward bridging the gap between memory-based models and neural approaches. The model underscores the importance of balancing memorization with generalization in the design of language models.

Practically, the integration of k-NN mechanisms could inspire new directions in machine learning research, particularly in enhancing the adaptability and robustness of AI systems. Future work may explore more efficient memory retrieval techniques, as well as the application of k-NN-LMs across different domains and tasks.

Future Directions

Potential avenues for future research based on the findings of this paper include:

  • Memory Compression: Investigating techniques for compressing the memory storage to manage scalability issues.
  • Hybrid Architectures: Combining k-NN retrieval with advanced neural architectures such as transformers to further improve performance.
  • Transfer Learning: Assessing the effectiveness of k-NN-LMs in transfer learning scenarios where models are fine-tuned on different but related tasks.

In conclusion, the paper by Khandelwal et al. contributes a novel perspective to language modeling by incorporating k-NN methods, yielding impressive empirical results and opening new paths for research in generalization and memory integration in AI systems.

Authors (5)
  1. Luke Zettlemoyer (225 papers)
  2. Mike Lewis (78 papers)
  3. Urvashi Khandelwal (12 papers)
  4. Omer Levy (70 papers)
  5. Dan Jurafsky (118 papers)
Citations (736)