
Training Language Models with Memory Augmentation (2205.12674v3)

Published 25 May 2022 in cs.CL and cs.LG

Abstract: Recent work has improved language models (LMs) remarkably by equipping them with a non-parametric memory component. However, most existing approaches only introduce memories at testing time or represent them using a separately trained encoder, resulting in suboptimal training of the language model. In this work, we present TRIME, a novel yet simple training approach designed for training LMs with memory augmentation. Our approach uses a training objective that directly takes in-batch examples as accessible memory. We also present new methods for memory construction and data batching, which are used for adapting to different sets of memories (local, long-term, and external memory) at testing time. We evaluate TRIME on multiple language modeling and machine translation benchmarks and show that it is able to achieve significant improvements across all the settings. Concretely, TRIME reduces the perplexity from 18.70 to 15.37 on WIKITEXT-103 by effectively leveraging a large memory set from the training corpus. Compared to standard LM training, TRIME adds negligible computational overhead and is compatible with different neural architectures, making it a versatile solution for training memory-augmented LMs.

Citations (121)

Summary

  • The paper introduces TRIME, a memory augmentation method that leverages in-batch memories to improve the contextual utilization of language models.
  • It details the integration of three memory types—local, long-term, and external—using innovative batching strategies and contrastive learning techniques.
  • Empirical results, such as reducing WikiText-103 perplexity from 18.70 to 15.37, demonstrate TRIME’s efficiency and scalability with minimal computational overhead.

Training Language Models with Memory Augmentation: A Comprehensive Study

Introduction to Memory Augmentation in Language Models

Recent advancements in language models (LMs) have focused on integrating non-parametric memory components, enhancing a model's ability to capture and leverage contextual information from large datasets. This paper details TRIME (Training with In-batch Memories), a novel approach designed to optimize language models by integrating memory augmentation directly into the training process. Unlike traditional methods that incorporate memory units only at the testing phase or use a separately trained encoder for memory representation, TRIME introduces a training objective and methods for memory construction and data batching that improve the model's interaction with local, long-term, and external memories during both training and testing.

Core Contributions and Methodology

The paper's central contribution is a training objective that treats in-batch examples as accessible memory units. Inspired by contrastive representation learning, the objective aligns the hidden representation of the target token with both its token embedding and the in-batch contextualized representations that share the same target. This handles rare words gracefully, since the model falls back to word embeddings whenever the in-batch memories do not contain the target token, while increasing the model's capacity to exploit contextual information relative to standard LM training.
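
To make this concrete, the sketch below shows one possible PyTorch formulation of such an in-batch memory objective. It is a minimal illustration under simplifying assumptions: every other position in the batch is treated as a candidate memory, a single assumed temperature scales the memory similarities, and the function name, arguments, and masking scheme are illustrative rather than taken from the paper's implementation.

```python
import torch


def trime_loss(hidden, targets, embedding, temperature=1.0):
    """In-batch memory objective, written as a contrastive loss (sketch).

    hidden:    (N, d) hidden states for the N prediction positions in a batch
    targets:   (N,)   gold next-token id at each position
    embedding: (V, d) output token embedding matrix
    """
    n = hidden.shape[0]

    # standard vocabulary scores: h_t . e_w
    vocab_logits = hidden @ embedding.t()                          # (N, V)

    # in-batch memory scores: similarity to every other position's hidden
    # state, scaled by an (assumed) temperature; a position may not use
    # itself as memory, so the diagonal is masked out
    self_mask = torch.eye(n, dtype=torch.bool, device=hidden.device)
    mem_logits = (hidden @ hidden.t()) / temperature               # (N, N)
    mem_logits = mem_logits.masked_fill(self_mask, float("-inf"))

    # positives: the target's own embedding score plus every in-batch
    # memory whose stored next token equals the target
    pos_mask = targets.unsqueeze(0).eq(targets.unsqueeze(1)) & ~self_mask

    log_denom = torch.logsumexp(
        torch.cat([vocab_logits, mem_logits], dim=-1), dim=-1)    # (N,)
    log_num = torch.logsumexp(
        torch.cat([vocab_logits.gather(1, targets.unsqueeze(1)),
                   mem_logits.masked_fill(~pos_mask, float("-inf"))],
                  dim=-1), dim=-1)                                 # (N,)

    return (log_denom - log_num).mean()
```

In the paper's full setup, the set of admissible memories would additionally depend on the memory type in use (for example, only earlier positions of the same segment or document), which this sketch ignores for brevity.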

Particularly notable is the paper's introduction of three memory types:

  • Local Memory: Reflects immediate past words modeled using attention mechanisms.
  • Long-term Memory: Captures context from the same document but outside the direct reach of attention due to input length constraints.
  • External Memory: Used to store vast amounts of data from the training corpus or additional datasets.

For each memory type, TRIME proposes innovative data batching strategies to efficiently construct and leverage these memories during training. The use of consecutive segments within a single batch allows the model to access long-term memories beyond its immediate context. Simultaneously, the batching of lexically similar segments from different documents as a proxy for external memory enhances the model's generalization capabilities.
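
As a rough illustration of these batching strategies, the following sketch packs consecutive segments of each document into the same batch, so earlier segments can serve as long-term memory for later ones, and optionally orders documents by a caller-supplied key that stands in for the paper's BM25-style lexical-similarity grouping. The function name, parameters, and grouping mechanism are hypothetical stand-ins, not the paper's implementation.

```python
import random
from typing import Callable, Dict, List, Optional


def make_memory_batches(
    doc_segments: Dict[str, List[str]],
    consecutive: int = 4,
    docs_per_batch: int = 4,
    group_key: Optional[Callable[[str], str]] = None,
    seed: int = 0,
) -> List[List[str]]:
    """Pack segments into batches so that in-batch memories are useful.

    doc_segments: doc_id -> ordered list of text segments from that document.
    consecutive:  number of consecutive segments from one document kept in
                  the same batch, so earlier ones act as long-term memory.
    group_key:    optional bucketing function standing in for a lexical
                  similarity (BM25-style) grouping across documents.
    """
    rng = random.Random(seed)

    # 1) split each document into runs of consecutive segments
    runs = []
    for doc_id, segs in doc_segments.items():
        for i in range(0, len(segs), consecutive):
            runs.append((doc_id, segs[i:i + consecutive]))

    # 2) order runs so that similar documents land in the same batch;
    #    without a grouping key, fall back to a plain shuffle
    if group_key is not None:
        runs.sort(key=lambda run: group_key(run[0]))
    else:
        rng.shuffle(runs)

    # 3) pack runs into batches: inside a batch, a segment can score
    #    against earlier segments of its own run (long-term memory) and
    #    against segments from other runs (a proxy for external memory)
    batches = []
    for i in range(0, len(runs), docs_per_batch):
        batch = [seg for _, run in runs[i:i + docs_per_batch] for seg in run]
        batches.append(batch)
    return batches
```

The intuition behind grouping similar documents is that the in-batch "external" memories seen during training then resemble the retrieved memories the model will face at test time, which aligns the training and testing conditions.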

Empirical Evaluations and Results

The TRIME model underwent extensive evaluation across multiple benchmarks, including language modeling and machine translation tasks, and significantly outperformed baseline models and existing approaches. For instance, on the WikiText-103 dataset, TRIME reduced perplexity from 18.70 to 15.37 by efficiently utilizing large memory sets drawn from the training corpus. This improvement was achieved with negligible computational overhead, underscoring TRIME's efficiency and scalability.
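
At test time, the memory can grow far beyond a training batch, for example by precomputing context representations over the training corpus. The sketch below shows one plausible way to score the next token against both the output embeddings and such a precomputed memory with a single softmax, mirroring the form of the training objective. The function signature and datastore layout are assumptions; a system operating at corpus scale would typically retrieve only the top-k nearest memories with an approximate nearest-neighbor index rather than scoring all of them.

```python
import torch


@torch.no_grad()
def predict_with_memory(hidden, embedding, mem_keys, mem_values, temperature=1.0):
    """Next-token distribution from output embeddings plus a large memory (sketch).

    hidden:     (d,)   context representation at the current prediction step
    embedding:  (V, d) output token embedding matrix
    mem_keys:   (M, d) stored context representations from the training corpus
    mem_values: (M,)   token id that followed each stored context
    """
    V = embedding.shape[0]

    # one softmax over vocabulary scores and memory scores, mirroring the
    # form of the training objective
    logits = torch.cat([embedding @ hidden,
                        (mem_keys @ hidden) / temperature])       # (V + M,)
    weights = torch.softmax(logits, dim=0)

    # each memory entry contributes its probability mass to its stored token
    probs = weights[:V].clone()
    probs.index_add_(0, mem_values, weights[V:])
    return probs
```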

Theoretical Implications and Future Perspectives

Beyond its immediate performance gains, TRIME's approach opens new avenues for research into memory-augmented language models. By integrating memory mechanisms directly into the training process, TRIME advances our understanding of how models can effectively leverage large amounts of contextual data. It challenges the prevailing focus on post-hoc memory integration and standalone memory encoders, suggesting a more holistic approach to memory utilization in language models.

Conclusion

The TRIME model redefines the landscape of memory-augmented language modeling by embedding memory mechanisms directly into the training process. Its ability to harness local, long-term, and external memories without significant computational penalties marks a substantial step forward in the development of more efficient, context-aware language models. As such, TRIME not only achieves state-of-the-art results across several benchmarks but also lays the groundwork for future exploration of memory integration techniques in AI.