MemoryPrompt: A Light Wrapper to Improve Context Tracking in Pre-trained Language Models
Abstract: Transformer-based LMs track contextual information through large, hard-coded input windows. We introduce MemoryPrompt, a leaner approach in which the LM is complemented by a small auxiliary recurrent network that passes information to the LM by prefixing its regular input with a sequence of vectors, akin to soft prompts, without requiring LM finetuning. Tested on a task designed to probe an LM's ability to keep track of multiple fact updates, a MemoryPrompt-augmented LM outperforms much larger LMs that have access to the full input history. We also test MemoryPrompt on a long-distance dialogue dataset, where its performance is comparable to that of a model conditioned on the entire conversation history. In both experiments we also observe that, unlike full-finetuning approaches, MemoryPrompt does not suffer from catastrophic forgetting when adapted to new tasks, and thus does not disrupt the generalist capabilities of the underlying LM.
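The abstract describes the mechanism only at a high level, so the following is a minimal sketch of how such a wrapper could be wired up, not the paper's actual implementation. It assumes GPT-2 as the frozen base LM, a single-layer GRU as the auxiliary recurrent network, a fixed number of soft-prompt vectors per segment, and a linear projection from the recurrent state into embedding space; the paper's exact module sizes, update rule, and training objective may differ.

```python
# Hedged sketch of a MemoryPrompt-style wrapper (assumptions: GPT-2 frozen LM,
# GRU memory, k soft-prompt vectors per segment; details are illustrative only).
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel


class MemoryPromptWrapper(nn.Module):
    def __init__(self, lm_name="gpt2", num_memory_vectors=8):
        super().__init__()
        self.lm = GPT2LMHeadModel.from_pretrained(lm_name)
        for p in self.lm.parameters():        # the base LM stays frozen
            p.requires_grad = False
        d = self.lm.config.n_embd
        self.k = num_memory_vectors
        # Small recurrent module that carries information across input segments.
        self.rnn = nn.GRU(input_size=d, hidden_size=d, batch_first=True)
        # Maps the recurrent state to k soft-prompt vectors in embedding space.
        self.to_prefix = nn.Linear(d, self.k * d)

    def forward(self, input_ids, attention_mask, memory_state=None):
        emb = self.lm.get_input_embeddings()(input_ids)        # (B, T, d)
        batch, _, d = emb.shape
        if memory_state is None:
            memory_state = torch.zeros(1, batch, d, device=emb.device)
        # Soft prefix computed from the memory of previous segments.
        prefix = self.to_prefix(memory_state[-1]).view(batch, self.k, d)
        inputs_embeds = torch.cat([prefix, emb], dim=1)
        prefix_mask = torch.ones(batch, self.k, device=emb.device,
                                 dtype=attention_mask.dtype)
        full_mask = torch.cat([prefix_mask, attention_mask], dim=1)
        out = self.lm(inputs_embeds=inputs_embeds,
                      attention_mask=full_mask,
                      output_hidden_states=True)
        # Update the memory by reading the segment's final hidden states
        # (prefix positions are sliced off before the recurrent update).
        _, memory_state = self.rnn(out.hidden_states[-1][:, self.k:, :],
                                   memory_state)
        return out.logits[:, self.k:, :], memory_state
```

In this sketch only the GRU and the projection are trained; gradients flow through the frozen LM into the prefix vectors, and the memory state is threaded from one segment to the next so that information from earlier context can survive beyond the LM's input window.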