HMT: Hierarchical Memory Transformer for Long Context Language Processing (2405.06067v2)

Published 9 May 2024 in cs.CL and cs.LG

Abstract: Transformer-based large language models (LLMs) have been widely used in language processing applications. However, most of them restrict the context window that permits the model to attend to every token in the inputs. Previous works in recurrent models can memorize past tokens to enable unlimited context and maintain effectiveness. However, they have "flat" memory architectures, which have limitations in selecting and filtering information. Since humans are good at learning and self-adjustment, we speculate that imitating brain memory hierarchy is beneficial for model memorization. We propose the Hierarchical Memory Transformer (HMT), a novel framework that enables and improves models' long-context processing ability by imitating human memorization behavior. Leveraging memory-augmented segment-level recurrence, we organize the memory hierarchy by preserving tokens from early input token segments, passing memory embeddings along the sequence, and recalling relevant information from history. Evaluating general language modeling (Wikitext-103, PG-19) and question-answering tasks (PubMedQA), we show that HMT steadily improves the long-context processing ability of context-constrained and long-context models. With an additional 0.5% - 2% of parameters, HMT can easily plug in and augment future LLMs to handle long context effectively. Our code is open-sourced on GitHub: https://github.com/OswaldHe/HMT-pytorch.

Hierarchical Memory Transformer (HMT) for Long Context Processing

Introduction

Transformers have revolutionized NLP, but they have a fundamental limitation: the maximum context length they can handle. Typical transformer models, including popular LLMs such as Llama 2, process a fixed number of tokens at a time and are not well suited for tasks that require very long contexts, such as book summarization or document-based question answering.

The Hierarchical Memory Transformer (HMT) is a novel framework that extends transformers to long-context scenarios. It does so by mimicking how human memory works, using memory-augmented segment-level recurrence to handle longer contexts more effectively.

Hierarchical Memorization in HMT

HMT is designed to imitate the hierarchical structure of human memory, which consists of sensory, short-term, and long-term memory (a code sketch of these three levels follows the list):

  • Sensory Memory: HMT uses the last few token embeddings from the previous segment, allowing it to process information that is immediately relevant.
  • Short-term Memory: Each segment is summarized into a single embedding. This summarized embedding is then used to recall relevant information from previously processed segments.
  • Long-term Memory: HMT maintains a cache of the most recent memory embeddings, effectively transforming it into a long-term memory bank. This cached memory is utilized to recall and integrate information from distant past segments.
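
The three levels above can be captured in a small container object. Below is a minimal sketch, assuming a generic segment-level recurrence; the class name, default sizes, and tensor shapes are illustrative assumptions rather than the layout used in the official HMT-pytorch code.

```python
from collections import deque

import torch


class HierarchicalMemory:
    """Toy container for HMT-style sensory, short-term, and long-term memory."""

    def __init__(self, sensory_len: int = 32, cache_size: int = 300):
        self.sensory_len = sensory_len         # how many tail embeddings to keep as sensory memory
        self.sensory = None                    # sensory memory: last few token embeddings of the previous segment
        self.cache = deque(maxlen=cache_size)  # long-term memory: bounded cache of past memory embeddings

    def update(self, segment_embeds: torch.Tensor, summary_embed: torch.Tensor) -> None:
        """After processing a segment, store its tail and its summary embedding."""
        # Sensory memory: the last few token embeddings, carried into the next segment.
        self.sensory = segment_embeds[-self.sensory_len:].detach()
        # The short-term summary of this segment becomes a long-term cache entry.
        self.cache.append(summary_embed.detach())

    def cached_memories(self) -> torch.Tensor:
        """Stack cached memory embeddings so they can be searched with cross-attention."""
        return torch.stack(list(self.cache), dim=0)  # (num_cached, hidden_dim)
```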

Memory Recall Mechanism

The memory recall mechanism is one of the key innovations in HMT. It involves three main steps, sketched in code after the list:

  1. Representation Extraction: The initial part of a segment is used to generate an embedding that summarizes the segment.
  2. Memory Search: This summary embedding is then used as a query to find the most relevant information from the cache of previous memory embeddings using a cross-attention mechanism.
  3. Augmenting Current Segment: The current segment is augmented with the recalled memory before being processed by the transformer model.
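
The following is a rough sketch of these three steps, using a single-head dot-product cross-attention for the memory search; the function names, tensor shapes, and the choice of a learned summarizer module are assumptions made for illustration, not the paper's exact parameterization.

```python
import torch
import torch.nn.functional as F


def recall_memory(segment_head: torch.Tensor,   # (head_len, d): embeddings of the segment's first tokens
                  memory_cache: torch.Tensor,   # (num_cached, d): cached memory embeddings
                  summarizer: torch.nn.Module   # any module mapping (head_len, d) -> (d,)
                  ) -> torch.Tensor:
    # 1. Representation extraction: summarize the initial part of the segment into a query.
    query = summarizer(segment_head)                          # (d,)

    # 2. Memory search: scaled dot-product cross-attention over the cached memories.
    scores = memory_cache @ query / (query.shape[-1] ** 0.5)  # (num_cached,)
    weights = F.softmax(scores, dim=-1)
    return weights @ memory_cache                             # (d,): recalled memory embedding


def augment_segment(segment_embeds: torch.Tensor, recalled: torch.Tensor) -> torch.Tensor:
    # 3. Augmentation: prepend the recalled memory to the current segment's embeddings
    #    before the backbone transformer processes it.
    return torch.cat([recalled.unsqueeze(0), segment_embeds], dim=0)
```

In HMT the sensory memory (the tail embeddings of the previous segment) is also prepended to the segment; the sketch above omits that step for brevity.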

Training and Fine-tuning

The training process of HMT is divided into two stages to enhance efficiency; a schematic training loop follows the list:

  1. Initial Training: The model is trained to handle a few unrolled segments without memory recall.
  2. Extended Training: The pre-trained model is then extended with the memory recall mechanism and trained with a larger number of segments.

This multi-stage strategy allows HMT to train faster and achieve better performance on long-context tasks compared to single-stage training.
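
A schematic of this schedule is shown below. It assumes a wrapper object `hmt` that exposes a `use_memory_recall` flag and a `forward_segments` method returning a language-modeling loss; both names, the segment counts, and the learning rate are hypothetical placeholders rather than the interface of the released code.

```python
import torch


def multi_stage_train(hmt, dataloader, lr: float = 1e-5,
                      stage1_segments: int = 2, stage2_segments: int = 15) -> None:
    """Hypothetical two-stage schedule mirroring the description above."""
    optimizer = torch.optim.AdamW(hmt.parameters(), lr=lr)

    stages = [
        (False, stage1_segments),  # Stage 1: few unrolled segments, memory recall disabled
        (True, stage2_segments),   # Stage 2: recall enabled, more segments unrolled via BPTT
    ]
    for use_recall, num_segments in stages:
        hmt.use_memory_recall = use_recall
        for batch in dataloader:
            loss = hmt.forward_segments(batch, num_segments=num_segments)  # hypothetical API
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```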

Experimental Results

HMT was tested using various datasets and transformer models to validate its effectiveness:

  • General Language Modeling: In tests with models such as OPT 2.7B and OpenLlamaV2 3B on Wikitext-103 and PG-19, HMT showed significant improvements. For OPT 2.7B, for example, HMT achieved a 25.5% decrease in perplexity on Wikitext-103, indicating much better language modeling performance over long contexts.
  • Question-Answering Tasks: With the PubMedQA dataset, HMT not only improved long-answer contextual reasoning by 9.81%, but also increased short-answer prediction accuracy by 1.0%.

Practical Implications

HMT offers several practical benefits:

  • Model Independence: HMT can be applied to any pre-trained model without altering the core architecture. This makes it a versatile enhancement for various transformer-based models.
  • Efficiency in Handling Long Contexts: By effectively managing long contexts with minimal additional parameters (0.5% to 2%), HMT is suitable for a wide range of applications, from book summarization to legal document processing (see the rough overhead estimate after this list).
  • Scalability: HMT can be scaled to even larger models and longer contexts with efficient GPU memory management techniques.
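
As a back-of-the-envelope check on the quoted 0.5%-2% parameter overhead, suppose the added memory modules amount to a handful of dense projections over the backbone's hidden size; the layer count below is an assumption for estimation only, not the exact composition of HMT's modules.

```python
backbone_params = 2_700_000_000     # e.g. OPT 2.7B
hidden = 2560                       # OPT 2.7B hidden size
extra_params = 4 * hidden * hidden  # assume ~4 dense (hidden x hidden) projections for summarization/recall
print(f"extra parameters: {extra_params:,} ({extra_params / backbone_params:.2%} of the backbone)")
# -> extra parameters: 26,214,400 (0.97% of the backbone)
```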

Speculations on Future Development

HMT opens the door for further innovations in memory-augmented neural networks:

  • Integrated Memory Hierarchies: Future developments could explore even more sophisticated memory hierarchies or adaptive memory management systems.
  • Enhancing Retrieval-Augmented Models: Combining HMT with other retrieval-augmented techniques may yield even more powerful models for long-context understanding and generation tasks.
  • Edge Device Deployment: Optimizations for deploying HMT on edge devices could unlock its potential for real-time applications in resource-constrained environments.

Conclusion

HMT represents a step forward in the handling of long contexts by LLMs, leveraging a memory system inspired by human cognition. It blends the strengths of recurrent models and transformers to robustly process long documents and text sequences, providing a valuable tool for a broad range of NLP applications.

Authors
  1. Zifan He
  2. Zongyue Qin
  3. Neha Prakriya
  4. Yizhou Sun
  5. Jason Cong