Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting (1907.00235v3)

Published 29 Jun 2019 in cs.LG and stat.ML

Abstract: Time series forecasting is an important problem across many domains, including predictions of solar plant energy output, electricity consumption, and traffic jam situation. In this paper, we propose to tackle such forecasting problem with Transformer [1]. Although impressed by its performance in our preliminary study, we found its two major weaknesses: (1) locality-agnostics: the point-wise dot-product self-attention in canonical Transformer architecture is insensitive to local context, which can make the model prone to anomalies in time series; (2) memory bottleneck: space complexity of canonical Transformer grows quadratically with sequence length $L$, making directly modeling long time series infeasible. In order to solve these two issues, we first propose convolutional self-attention by producing queries and keys with causal convolution so that local context can be better incorporated into attention mechanism. Then, we propose LogSparse Transformer with only $O(L(\log L){2})$ memory cost, improving forecasting accuracy for time series with fine granularity and strong long-term dependencies under constrained memory budget. Our experiments on both synthetic data and real-world datasets show that it compares favorably to the state-of-the-art.

Enhancing the Locality and Breaking the Memory Bottleneck of Transformer on Time Series Forecasting

The paper, authored by Shiyang Li et al. from the University of California, Santa Barbara, tackles two inherent weaknesses of the canonical Transformer architecture in the context of time series forecasting: locality-agnostics and memory bottleneck. The objective is to improve forecasting accuracy under constrained memory conditions while maintaining the Transformer’s strength in capturing long-term dependencies.

Time series forecasting is crucial across various domains, including energy production, electricity consumption, and traffic management. Traditional approaches such as State Space Models (SSMs) and Autoregressive (AR) models have long been the mainstay, but they scale poorly and require manual selection of trend and seasonality components. Deep neural networks, particularly Recurrent Neural Networks (RNNs) such as LSTM and GRU, address some of the scalability issues but still struggle with long-term dependencies because of training difficulties such as vanishing and exploding gradients.

Canonical Transformer Issues and Proposed Solutions

Canonical Transformer Limitations:

  1. Locality-Agnostics: The point-wise dot-product self-attention in the canonical Transformer does not exploit local context, making the model prone to anomalies in the time series data.
  2. Memory Bottleneck: The space complexity of self-attention grows quadratically with the sequence length $L$, making it infeasible to model long time series directly (a back-of-envelope illustration follows this list).
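
To make the quadratic growth concrete, here is a rough back-of-envelope sketch (not from the paper; the sequence lengths and float32 precision are illustrative assumptions) of how quickly a full attention score matrix outgrows memory:

```python
# Rough illustration (not from the paper): memory for a single full self-attention
# score matrix at float32 precision, ignoring activations, heads, and layers.
def attention_matrix_bytes(seq_len: int, bytes_per_value: int = 4) -> int:
    """Full attention stores an L x L score matrix, so memory grows as L^2."""
    return seq_len * seq_len * bytes_per_value

for L in (1_000, 10_000, 100_000):
    gib = attention_matrix_bytes(L) / 2**30
    print(f"L = {L:>7,d}: ~{gib:8.2f} GiB per attention matrix")
```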

Enhancements:

  1. Convolutional Self-Attention:
    • The authors integrate causal convolutions into the self-attention mechanism so that queries and keys incorporate local context. Attending to local patterns and shapes in the time series mitigates the risk of isolated anomalies derailing the model’s performance (a minimal sketch follows this list).
    • Empirical results indicate that this modification leads to lower training losses and better forecasting accuracy, particularly in challenging datasets with strong seasonal and recurrent patterns.
  2. LogSparse Transformer:
    • The proposed LogSparse Transformer reduces memory complexity to $O(L(\log L)^2)$ by enforcing sparse attention patterns: instead of attending to all previous time steps, each cell attends to a logarithmically spaced subset, and long-term dependencies are retained by stacking sparse attention layers more deeply (a sketch of such an index pattern follows this list).
    • This sparse structure is consistent with the pattern-dependent sparsity the authors observe in attention maps learned by the canonical Transformer, suggesting little to no performance degradation while significantly reducing memory usage.
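
A minimal sketch of the convolutional self-attention idea, assuming a PyTorch-style single-head module in which queries and keys come from a causal 1-D convolution over the input rather than a point-wise projection. Kernel size, dimensions, and layer names are illustrative assumptions, not the authors' code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvSelfAttention(nn.Module):
    """Sketch: queries/keys from causal 1-D convolutions, values point-wise."""

    def __init__(self, d_model: int, kernel_size: int = 3):
        super().__init__()
        self.kernel_size = kernel_size
        self.q_conv = nn.Conv1d(d_model, d_model, kernel_size)
        self.k_conv = nn.Conv1d(d_model, d_model, kernel_size)
        self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, d_model)
        b, t, d = x.shape
        # Left-pad in time so the convolution is causal: step t never sees t+1, ...
        x_c = F.pad(x.transpose(1, 2), (self.kernel_size - 1, 0))
        q = self.q_conv(x_c).transpose(1, 2)            # (b, t, d), local context baked in
        k = self.k_conv(x_c).transpose(1, 2)
        v = self.v_proj(x)
        scores = q @ k.transpose(1, 2) / d ** 0.5       # (b, t, t)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool, device=x.device), 1)
        scores = scores.masked_fill(causal, float("-inf"))   # mask future positions
        return torch.softmax(scores, dim=-1) @ v
```

With `kernel_size = 1` this reduces to the usual point-wise query/key projection, which is how the paper frames canonical self-attention as a special case of the convolutional variant.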
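
And a sketch of how a logarithmically spaced attention pattern might be constructed, assuming each cell attends to itself plus earlier cells at exponentially growing offsets (the paper's exact index-set construction and its windowed variants may differ):

```python
def logsparse_indices(t: int) -> list[int]:
    """Positions that the cell at time step t attends to (0-indexed): itself plus
    cells at offsets 1, 2, 4, ... back in time, i.e. O(log t) positions per cell
    instead of all t previous ones."""
    idx = {t}
    offset = 1
    while t - offset >= 0:
        idx.add(t - offset)
        offset *= 2
    return sorted(idx)

# Example: time step 13 attends to only ~log2(13) earlier cells.
print(logsparse_indices(13))   # [5, 9, 11, 12, 13]
```

Roughly speaking, each layer then stores $O(L \log L)$ attention weights, and stacking on the order of $\log L$ such layers lets information propagate between any pair of cells, which is where the quoted $O(L(\log L)^2)$ memory cost comes from.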

Experimental Validation

Synthetic Data Experiments:

  • The authors constructed piecewise sinusoidal signals to demonstrate that the Transformer, especially with convolutional self-attention, captures the long-term dependencies essential for accurate forecasting. Comparisons with DeepAR showed that as the look-back window $t_0$ grows, DeepAR's performance degrades significantly, while the Transformer maintains its accuracy (a toy generator for such signals is sketched below).
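
A toy generator for a piecewise sinusoidal series of the kind described above; the segment lengths, amplitude and frequency ranges here are illustrative assumptions, not the paper's exact construction:

```python
import numpy as np

def piecewise_sinusoid(n_segments: int = 4, seg_len: int = 60, seed: int = 0) -> np.ndarray:
    """Concatenate sinusoidal segments with randomly drawn amplitude and frequency,
    so accurate forecasts require remembering behaviour from much earlier segments."""
    rng = np.random.default_rng(seed)
    segments = []
    for _ in range(n_segments):
        amp = rng.uniform(0.5, 3.0)       # random amplitude per segment
        freq = rng.uniform(0.02, 0.1)     # random frequency per segment
        t = np.arange(seg_len)
        segments.append(amp * np.sin(2 * np.pi * freq * t))
    return np.concatenate(segments)

series = piecewise_sinusoid()
print(series.shape)   # (240,)
```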

Real-World Datasets Performance:

  • Extensive experiments on datasets such as electricity consumption (both coarse and fine granularities) and traffic data showcased that convolutional self-attention improves forecasting accuracy by better handling local dependencies.
  • The LogSparse Transformer, when evaluated under equivalent memory constraints compared to the canonical Transformer, outperformed its counterpart, particularly in traffic datasets exhibiting strong long-term dependencies.

Implications and Future Directions

The proposed advancements in Transformer architectures yield substantial improvements for time series forecasting, balancing the capture of long-term dependencies against efficient use of computational resources. Practically, these enhancements could lead to more accurate and resource-efficient forecasting systems in domains requiring high temporal granularity, such as energy load balancing and urban traffic management.

Theoretically, the integration of locality-aware mechanisms like convolutional self-attention and memory-efficient sparse attention patterns could inspire further developments in sequence modeling beyond time series forecasting, potentially benefiting fields like natural language processing and speech recognition.

Future work could explore optimizing the sparsity patterns further and extending these methods to smaller datasets or online learning scenarios where data availability evolves over time. The proposed methodologies open new avenues for efficiently tackling the scalability and locality issues inherent in deep learning models for sequential data.

Authors (7)
  1. Shiyang Li (24 papers)
  2. Xiaoyong Jin (9 papers)
  3. Yao Xuan (7 papers)
  4. Xiyou Zhou (6 papers)
  5. Wenhu Chen (134 papers)
  6. Yu-Xiang Wang (124 papers)
  7. Xifeng Yan (52 papers)
Citations (1,243)