Compressive Transformers for Long-Range Sequence Modelling (1911.05507v1)

Published 13 Nov 2019 in cs.LG and stat.ML

Abstract: We present the Compressive Transformer, an attentive sequence model which compresses past memories for long-range sequence learning. We find the Compressive Transformer obtains state-of-the-art language modelling results in the WikiText-103 and Enwik8 benchmarks, achieving 17.1 ppl and 0.97 bpc respectively. We also find it can model high-frequency speech effectively and can be used as a memory mechanism for RL, demonstrated on an object matching task. To promote the domain of long-range sequence learning, we propose a new open-vocabulary language modelling benchmark derived from books, PG-19.

Compressive Transformers for Long-Range Sequence Modelling

This paper introduces the Compressive Transformer, a sequence model designed to enhance long-range sequence learning by compressing past memories into a coarser representation. The authors demonstrate the model's capability with state-of-the-art results on language modelling benchmarks, achieving a perplexity of 17.1 on WikiText-103 and 0.97 bits per character on Enwik8. The Compressive Transformer also shows promise in modelling high-frequency speech and as a memory mechanism in reinforcement learning (RL).

Overview

The Compressive Transformer addresses the limitations of contemporary models such as the TransformerXL. While recurrent models like LSTMs compress the entire past into a single state vector, the TransformerXL extends memory capacity by caching hidden states from previous segments, which increases computational and storage costs as the memory grows. Rather than discarding the oldest memories once this cache is full, the proposed model compresses them into a coarser secondary memory, balancing memory retention with computational efficiency.
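
Concretely, each time a new segment of activations is processed, the oldest slots of the primary memory are evicted and compressed at some rate c into a secondary, compressed-memory FIFO that the model can still attend to. The sketch below illustrates this bookkeeping in NumPy, using mean pooling as a stand-in for the paper's learned compression functions; the function and variable names are illustrative, not taken from the authors' code.

```python
import numpy as np

def update_memories(memory, comp_memory, new_hidden, c=3):
    """One per-layer memory update of a Compressive Transformer (sketch).

    memory:      [n_mem, d]   FIFO of recent hidden states
    comp_memory: [n_cmem, d]  FIFO of compressed, coarser memories
    new_hidden:  [n_seq, d]   activations produced for the current segment
    c:           compression rate (c old states -> 1 compressed state)
    """
    n_seq, d = new_hidden.shape

    # Evict the oldest n_seq states from the primary memory and
    # append the current segment's activations.
    evicted, memory = memory[:n_seq], memory[n_seq:]
    memory = np.concatenate([memory, new_hidden], axis=0)

    # Compression function f_c: here, mean pooling with rate c. (The paper
    # also studies max pooling, 1D convolutions, and most-used selection,
    # with learned variants trained via an attention-reconstruction loss.)
    usable = (n_seq // c) * c
    compressed = evicted[:usable].reshape(-1, c, d).mean(axis=1)

    # Push the compressed states into the secondary FIFO, dropping its
    # oldest entries to keep a fixed size.
    comp_memory = np.concatenate([comp_memory[len(compressed):], compressed], axis=0)
    return memory, comp_memory
```

With a primary memory of size n_m, a compressed memory of size n_cm, and compression rate c, each layer can attend over roughly n_m + c·n_cm past steps, while the attention cost stays comparable to that of a TransformerXL with memory of size n_m + n_cm.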

Experimental Results

Language Modelling:

  • WikiText-103: The Compressive Transformer achieved a perplexity of 17.1, improving on the TransformerXL baseline’s 18.1 and demonstrating a clear gain in modelling long-form text.
  • Enwik8: Achieving a new state-of-the-art 0.97 bits per character, the model outperformed prior work by combining a primary memory with a compressed memory, highlighting its efficiency in character-level language modelling. (A short note on how these two metrics relate follows the list.)
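
Perplexity and bits per character are both monotone transforms of the model's average cross-entropy, measured per word-level token and per character respectively; the snippet below is an illustrative reminder of the conversion, not a result from the paper.

```python
import math

def perplexity(nats_per_token: float) -> float:
    # Word-level perplexity from average cross-entropy in nats per token.
    return math.exp(nats_per_token)

def bits_per_character(nats_per_char: float) -> float:
    # Character-level BPC from average cross-entropy in nats per character.
    return nats_per_char / math.log(2)

# A cross-entropy of ~2.84 nats/token corresponds to ~17.1 perplexity,
# and ~0.67 nats/char corresponds to ~0.97 BPC.
print(perplexity(math.log(17.1)))              # 17.1
print(bits_per_character(0.97 * math.log(2)))  # 0.97
```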

Speech Modelling:

  • The model was benchmarked against WaveNet and TransformerXL on unconditioned high-frequency speech, where it maintained competitive performance. This underscores its applicability beyond text-based tasks.

Reinforcement Learning:

  • In the object matching task within IMPALA’s RL framework, the Compressive Transformer demonstrated improved learning speed and stability compared to traditional memory mechanisms, emphasizing its potential in environments requiring long-term memory integration.

Implications and Future Work

This work paves the way for more efficient transformer models capable of handling tasks characterized by long-range dependencies. The introduction of compressed memory is an effective way to extend the attended-to history while mitigating computational overhead. The proposed PG-19 dataset, built from full-length Project Gutenberg books, provides a more challenging benchmark for future long-range sequence models, with far longer contexts for evaluating model capabilities.

Future research directions may include exploring adaptive compression rates, integrating shallow memory layers for more nuanced memory management, and testing the applicability of compressive strategies in diverse domains like video processing. The concept of compressive memory is foundational for scaling memory in artificial neural networks and may inspire future innovations in designing memory-efficient architectures.

Conclusion

The Compressive Transformer marks a notable contribution to sequence modelling by demonstrating how compressive techniques can enhance memory management in transformers. The model’s results across varied modalities illustrate its versatility and potential for broad applicability, setting a new standard for long-range sequence performance and efficiency.

Authors (4)
  1. Jack W. Rae (15 papers)
  2. Anna Potapenko (4 papers)
  3. Siddhant M. Jayakumar (13 papers)
  4. Timothy P. Lillicrap (19 papers)
Citations (563)